Commit
Merge branch 'ar/prep-040' into 'master'
Prep docs and loose ends for 0.4.0
See merge request machine-learning/modkit!224
Showing 55 changed files with 2,597 additions and 2,033 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
[package] | ||
name = "mod_kit" | ||
-version = "0.3.3" | ||
+version = "0.4.0" | ||
edition = "2021" | ||
|
||
[[bin]] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# Evaluate a table of known motifs | ||
|
||
The `modkit motif search` command accepts any number of known motifs with `--known-motif`. | ||
If you already have a list of candidate motifs (e.g. from a previous run of `modkit motif search`), you can check them quickly against a bedMethyl table with `modkit motif evaluate`. | ||
|
||
```bash | ||
modkit motif evaluate -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref} | ||
``` | ||
|
||
Similarly, the search [algorithm](./intro_find_motifs.md#simple-description-of-the-search-algorithm) can be run using known motifs as seeds: | ||
|
||
```bash | ||
modkit motif refine -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref} | ||
``` | ||
|
||
The output tables of both commands have the same schema: | ||
|
||
| column | name | description | type | | ||
|--------|------------|-------------------------------------------------------------------------------------------------|-------| | ||
| 1 | mod_code | code specifying the modification found in the motif | str | | ||
| 2 | motif | sequence of identified motif using [IUPAC](https://www.bioinformatics.org/sms/iupac.html) codes | str | | ||
| 3 | offset | 0-based offset into the motif sequence of the modified base | int | | ||
| 4 | frac_mod | fraction of occurrences of this sequence found in the _high-modified_ set: col-5 / (col-5 + col-6) | float | | ||
| 5 | high_count | number of occurrences of this sequence in the _high-modified_ set | int | | ||
| 6 | low_count | number of occurrences of this sequence in the _low-modified_ set | int | | ||
| 7 | mid_count | number of occurrences of this sequence in the _mid-modified_ set | int | | ||
| 8 | log_odds | log2 odds of the motif being in the high-modified set | int | | ||
|
||
In the human-readable table, columns (1) and (2) are merged to show the modification code in the motif sequence context. The remaining columns are the same as in the machine-readable table. | ||
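As a quick sanity check, `frac_mod` can be recomputed from columns 5 and 6. The sketch below assumes a tab-separated machine-readable table in the column order listed above; the helper name `recompute_frac_mod` and the two rows (motifs, counts and log-odds values) are invented for illustration and are not modkit output.

```bash
# Hypothetical rows in the schema's column order:
# mod_code, motif, offset, frac_mod, high_count, low_count, mid_count, log_odds
recompute_frac_mod() {
  # For each row on stdin, print the motif and high_count / (high_count + low_count).
  # Rows where the two counts sum to zero (including a text header line) are skipped.
  awk -F'\t' '($5 + $6) > 0 { printf "%s\t%.3f\n", $2, $5 / ($5 + $6) }'
}

frac_table=$(printf 'a\tGATC\t1\t0.9\t90\t10\t5\t3.1\nm\tCCWGG\t1\t0.75\t30\t10\t2\t1.5\n' | recompute_frac_mod)
echo "${frac_table}"
```

Note that `mid_count` (column 7) does not enter the fraction, matching the formula given in the table.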
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Investigating patterns with localise | ||
|
||
Once a bedMethyl table has been created, `modkit localise` will use the pileup to calculate per-base aggregate modification information around genomic features of interest. | ||
For example, we can investigate base modification patterns around CTCF binding sites. | ||
|
||
<p align="center"> | ||
<img src="./images/modkit_localise_ctcf_5mC.png" alt="5mC patterns at CTCF sites" width="500" /> | ||
</p> | ||
|
||
The input requirements to `modkit localise` are simple: | ||
1. BedMethyl table that has been bgzf-compressed and tabix-indexed. | ||
1. Regions file in BED format (plaintext). | ||
1. Genome sizes tab-separated file: `<chrom>\t<size_in_bp>` | ||
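The genome sizes file can often be derived from a FASTA index, since the first two columns of a `.fai` file are the sequence name and its length. The following is a minimal sketch under that assumption; the helper name `make_genome_sizes`, the file names and the toy index line are illustrative only. A real index would come from a tool such as `samtools faidx`.

```bash
# Write a "<chrom>\t<size_in_bp>" file from a FASTA index (.fai).
# $1: path to the .fai file, $2: path of the sizes file to write
make_genome_sizes() {
  # Columns 1 and 2 of a .fai file are sequence name and sequence length.
  cut -f1,2 "$1" > "$2"
}

# Toy index with a single sequence, standing in for a real reference index.
tmpdir=$(mktemp -d)
printf 'chr20\t64444167\t7\t60\t61\n' > "${tmpdir}/ref.fa.fai"
make_genome_sizes "${tmpdir}/ref.fa.fai" "${tmpdir}/sizes.tsv"
cat "${tmpdir}/sizes.tsv"
```

The bedMethyl input is typically prepared separately with `bgzip` and `tabix -p bed`, which are part of htslib rather than modkit.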
|
||
An example command: | ||
|
||
```bash | ||
modkit localise ${bedmethyl} --regions ${ctcf} --genome-sizes ${sizes} | ||
``` | ||
|
||
The output table has the following schema: | ||
|
||
| column | Name | Description | type | | ||
|--------|------------------|---------------------------------------------------------------------------------------------------------------------|-------| | ||
| 1 | mod code | modification code as present in the bedmethyl | str | | ||
| 2 | offset | distance in base pairs from the center of the genomic feature; negative values lie toward the 5' end of the genome | int | | ||
| 3 | n_valid | number of valid calls at this offset for this modification code | int | | ||
| 4 | n_mod | number of calls for this modification code at this offset | int | | ||
| 5 | percent_modified | `n_mod` / `n_valid` * 100 | float | | ||
|
||
Optionally the `--chart` argument can be used to create HTML charts of the modification patterns. |
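To illustrate how the columns relate, the sketch below recomputes `n_mod` / `n_valid` * 100 for each row and reports the offset with the highest value for one modification code. It assumes a tab-separated table in the column order above; the helper name `most_modified_offset` and the three rows are invented for illustration.

```bash
# Rows follow the schema: mod code, offset, n_valid, n_mod, percent_modified.
# $1: modification code to inspect (e.g. m); the table is read from stdin.
most_modified_offset() {
  awk -F'\t' -v code="$1" '
    # Only rows for the requested code with at least one valid call are used.
    $1 == code && ($3 + 0) > 0 {
      pct = $4 / $3 * 100
      if (!seen || pct > best) { best = pct; off = $2; seen = 1 }
    }
    END { if (seen) printf "%s\t%.1f\n", off, best }
  '
}

peak=$(printf 'm\t-1\t200\t150\t75.0\nm\t0\t200\t20\t10.0\nm\t1\t100\t80\t80.0\n' | most_modified_offset m)
echo "${peak}"
```

Offsets with few valid calls can produce noisy percentages, so a minimum `n_valid` threshold may be worth adding in practice.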
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# Working with sequence motifs | ||
|
||
The `modkit motif` suite contains tools for discovery and exploration of short degenerate sequences (motifs) that may be enriched in a sample. | ||
A common use case is to discover the motifs enriched for modification in a native bacterial sample, which can indicate which methyltransferase enzymes are encoded by the genomes in the sample. | ||
|
||
The following tools are available: | ||
|
||
1. [Find enriched motifs _de novo_ from a bedMethyl with `search`.](./intro_find_motifs.md) | ||
1. [`evaluate` or `refine` a table of known motifs](./evaluate_motif.md) | ||
1. [Making a motif BED file with `motif bed`](./intro_motif_bed.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# Calculating modification statistics in regions | ||
|
||
There are many analysis operations available in `modkit` once you've generated a bedMethyl table. | ||
One such operation is to calculate aggregation statistics on specific regions, for example in CpG islands or gene promoters. | ||
The `modkit stats` command is designed for this purpose. | ||
|
||
```bash | ||
# these files can be found in the modkit repository | ||
cpgs=tests/resources/cpg_chr20_with_orig_names_selection.bed | ||
sample=tests/resources/lung_00733-m_adjacent-normal_5mc-5hmc_chr20_cpg_pileup.bed.gz | ||
# the --mod-codes argument is optional | ||
modkit stats ${sample} --regions ${cpgs} -o ./stats.tsv --mod-codes "h,m" | ||
``` | ||
|
||
> Note that `--mod-codes` can alternatively be passed multiple times; for example, this is equivalent to `--mod-codes "h,m"`: <br /> | ||
> `--mod-codes h --mod-codes m` | ||
The output TSV has the following schema: | ||
|
||
| column | Name | Description | type | | ||
|--------|----------------|-------------------------------------------------------------------------------|-------| | ||
| 1 | chrom | name of reference sequence from BAM header | str | | ||
| 2 | start position | 0-based start position | int | | ||
| 3 | end position | 0-based exclusive end position | int | | ||
| 4 | name | name of the region from input BED (`.` if not provided) | str | | ||
| 5 | strand | strand (`+`, `-`, `.`) from the input BED (`.` assumed when not provided) | str | | ||
| 6+ | count_x | total number of `x` base modification codes in the region | int | | ||
| 7+ | count_valid_x | total valid calls for the primary base modified by code `x` | int | | ||
| 8+ | percent_x | `count_x` / `count_valid_x` * 100 | float | | ||
|
||
Columns 6, 7, and 8 are repeated for each modification code found in the bedMethyl file or provided with the `--mod-codes` argument. | ||
|
||
An example output: | ||
|
||
```text | ||
chrom start end name strand count_h count_valid_h percent_h count_m count_valid_m percent_m | ||
chr20 9838623 9839213 CpG: 47 . 12 1777 0.6752954 45 1777 2.532358 | ||
chr20 10034962 10035266 CpG: 35 . 7 1513 0.46265697 0 1513 0 | ||
chr20 10172120 10172545 CpG: 35 . 15 1229 1.2205045 28 1229 2.278275 | ||
chr20 10217487 10218336 CpG: 59 . 29 2339 1.2398461 108 2339 4.617358 | ||
chr20 10433628 10434345 CpG: 71 . 29 2750 1.0545455 2 2750 0.07272727 | ||
chr20 10671925 10674963 CpG: 255 . 43 9461 0.45449743 24 9461 0.25367296 | ||
``` | ||
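Because the count and percent columns repeat once per modification code, it can be more robust to locate a column by its header name than by a fixed position. The sketch below assumes a tab-separated file with a header row like the one in the example output; the helper name `stats_percent_for_code` is hypothetical, and the single data row is copied from the first row of the example.

```bash
# Print the region name and percent_<code> for each region in a stats TSV.
# $1: path to the stats TSV, $2: modification code (e.g. m)
stats_percent_for_code() {
  awk -F'\t' -v col="percent_$2" '
    # Header row: remember which column carries the requested percentage.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    # Data rows: print name (column 4) and the located column, if it was found.
    idx     { print $4 "\t" $idx }
  ' "$1"
}

stats_tmp=$(mktemp)
printf 'chrom\tstart\tend\tname\tstrand\tcount_h\tcount_valid_h\tpercent_h\tcount_m\tcount_valid_m\tpercent_m\n' > "${stats_tmp}"
printf 'chr20\t9838623\t9839213\tCpG: 47\t.\t12\t1777\t0.6752954\t45\t1777\t2.532358\n' >> "${stats_tmp}"
stats_percent_for_code "${stats_tmp}" m
```

Splitting on tabs rather than whitespace matters here because region names such as `CpG: 47` contain a space.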
|