Skip to content

Commit

Permalink
Merge branch 'ar/prep-040' into 'master'
Browse files Browse the repository at this point in the history
Prep docs and loose ends for 0.4.0

See merge request machine-learning/modkit!224
  • Loading branch information
ArtRand committed Sep 17, 2024
2 parents 3d4f2ec + eca57ca commit f882a43
Show file tree
Hide file tree
Showing 55 changed files with 2,597 additions and 2,033 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,22 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [v0.4.0]
### Adds
- [motif] Add `search` and `evaluate` subcommands under `motif` command hierarchy.
- [stats, localize] Add `stats` and `localize` commands, see documentation for details.
- [dmr, multi] Combine samples when they have the same name.
- [extract] Add option to emit bgzf-compressed output.
### Fixes
- [validate] Only consider modification codes attributed to the primary sequence base being validated.
- [pileup, extract] Improve iteration over regions when `--include-bed` is provided.
### Changes
- [dmr] Use `htslib` `tbx` module for reading tabix index instead of `noodles`.
- [pileup] Require `.fai` FASTA index when a reference is provided, load only sections of reference that are necessary.
- [extract] Separate `calls` and `full` commands to produce the read calls, and full tables, respectively.
- [validate] Change "any-mod" code from {`A`, `C`, `G`, `T`} to `*`.


## [v0.3.3]
### Fixes
- [sample-probs, summary, pileup] Refactor sampling algorithm so that it will not over-sample reads leading to excessive memory usage.
Expand Down
2 changes: 1 addition & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "mod_kit"
version = "0.3.3"
version = "0.4.0"
edition = "2021"

[[bin]]
Expand Down
12 changes: 8 additions & 4 deletions book/src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,18 +2,22 @@

- [Quick Start guides](./quick_start.md)
- [Constructing bedMethyl tables](./intro_bedmethyl.md)
- [Make hemi-methylation bedMethyl tables](./intro_pileup_hemi.md)
- [Updating and adjusting MM tags](./intro_adjust.md)
- [Inspecting base modification probabilities](./intro_sample_probs.md)
- [Summarizing a modBAM](./intro_summary.md)
- [Making a motif BED file](./intro_motif_bed.md)
- [Extracting read information to a table](./intro_extract.md)
- [Calculating modification statistics in regions](./intro_stats.md)
- [Calling mods in a modBAM](./intro_call_mods.md)
- [Removing modification calls at the ends of reads](./intro_edge_filter.md)
- [Repair MM/ML tags on trimmed reads](./intro_repair.md)
- [Make hemi-methylation bedMethyl tables](./intro_pileup_hemi.md)
- [Working with sequence motifs](./intro_motif.md)
- [Making a motif BED file](./intro_motif_bed.md)
- [Find highly modified motif sequences](./intro_find_motifs.md)
- [Evaluate and refine a table of known motifs](./evaluate_motif.md)
- [Extracting read information to a table](./intro_extract.md)
- [Investigating patterns with `localise`](./intro_localize.md)
- [Perform differential methylation scoring](./intro_dmr.md)
- [Validate ground truth results](./intro_validate.md)
- [Find highly modified motif sequences](./intro_find_motifs.md)
- [Calculating methylation entropy](./intro_entropy.md)
- [Narrow output to specific positions](./intro_include_bed.md)
- [Extended subcommand help](./advanced_usage.md)
Expand Down
1,096 changes: 597 additions & 499 deletions book/src/advanced_usage.md

Large diffs are not rendered by default.

30 changes: 30 additions & 0 deletions book/src/evaluate_motif.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Evaluate a table of known motifs

The `modkit search` command has an option to provide any number of known motifs with `--know-motif`.
If you already have a list of candidate motifs (e.f. from a previous run of `modkit motif search`) you can check these motifs quickly against a bedMethyl table with `modkit motif evaluate`.

```bash
modkit motif evaluate -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref}
```

Similarly, the search [algorithm](./intro_find_motifs.md#simple-description-of-the-search-algorithm) can be run using known motifs as seeds:

```bash
modkit motif refine -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref}
```

The output tables to both of these commands have the same schema:

| column | name | description | type |
|--------|------------|-------------------------------------------------------------------------------------------------|-------|
| 1 | mod_code | code specifying the modification found in the motif | str |
| 2 | motif | sequence of identified motif using [IUPAC](https://www.bioinformatics.org/sms/iupac.html) codes | str |
| 3 | offset | 0-based offset into the motif sequence of the modified base | int |
| 4 | frac_mod | fraction of time this sequence is found in the _high modified_ set col-5 / (col-5 + col-6) | float |
| 5 | high_count | number of occurances of this sequence in the _high-modified_ set | int |
| 6 | low_count | number of occurances of this sequence in the _low-modified_ set | int |
| 7 | mid_count | number of occurances of this sequence in the _mid-modified_ set | int |
| 8 | log_odds | log2 odds of the motif being in the high-modified set | int |

In the human-readable table columns (1) and (2) are merged to show the modification code in the motif sequence context, the rest of the columns are the same as the machine-readable table.

Binary file added book/src/images/modkit_localise_ctcf_5mC.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
9 changes: 5 additions & 4 deletions book/src/intro_bedmethyl.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ for details.

## Basic usage

In its simplest form `modkit` creates a bedMethyl file using the following:
In its simplest form `modkit pileup` creates a bedMethyl file using the following:

```text
modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
Expand All @@ -36,6 +36,8 @@ in the reference:
modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta
```

**Note** that when passing a reference with `--ref` a FASTA index `.fai` file is required to be at `path/to/reference.fasta.fai`.

To restrict output to only certain CpGs, pass the `--include-bed` option with the CpGs to be used,
see [this page](./intro_include_bed.md) for more details.

Expand Down Expand Up @@ -180,6 +182,5 @@ CG->CH substitution such that no modification call was produced by the basecalle

## Performance considerations

The `--interval-size`, `--threads`, `--chunk-size`, and `--max-depth` parameters can be used to tweak the parallelism and
memory consumption of `modkit pileup`. The defaults should be suitable for most use cases, for more details see the
[advanced usage](./advanced_usage.md) and [performance considerations](./perf_considerations.md) sections.
The `--interval-size`, `--threads`, `--chunk-size`, and `--max-depth` parameters can be used to tweak the parallelism and memory consumption of `modkit pileup`.
The defaults should be suitable for most use cases, for more details see [performance considerations](./perf_considerations.md) sections.
3 changes: 3 additions & 0 deletions book/src/intro_dmr.md
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,9 @@ The full schema is described [below](#differential-methylation-output-format).
## 2. Perform differential methylation detection on all pairs of samples over regions from the genome.
The `modkit dmr multi` command runs all pairwise comparisons for more than two samples for all regions provided in the regions BED file.
The preparation of the data is identical to that for the [previous section](#preparing-the-input-data) (for each sample, of course).

**Note** that if multiple samples are given the same name, they will be combined.

An example command could be:

```bash
Expand Down
42 changes: 18 additions & 24 deletions book/src/intro_extract.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,7 @@
# Extracting base modification information

The `modkit extract` sub-command will produce a table containing the base modification probabilities,
the read sequence context, and optionally aligned reference information.
For `extract`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag.
The `modkit extract full` sub-commands will produce a table containing the base modification probabilities, the read sequence context, and optionally aligned reference information.
For `extract full` and `extract calls`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag.
See [troubleshooting](./troubleshooting.md) for details.

The table will by default contain unmapped sections of the read (soft-clipped sections, for example).
Expand All @@ -13,7 +12,7 @@ the size of the BAM). You may want to either use the `--num-reads` option, the `
pre-filter the modBAM ahead of time. You can also stream the output to stdout by setting the output to `-`
or `stdout` and filter the columns before writing to disk.

## Description of output table
## Description of output table for `extract full`

| column | name | description | type |
|--------|-----------------------|---------------------------------------------------------------------------------|------|
Expand All @@ -38,11 +37,10 @@ or `stdout` and filter the columns before writing to disk.
| 19 | flag | FLAG from alignment record | str |


# Tabulating base modification _calls_ for each read position
Passing `--read-calls <file-path>` option will generate a table of read-level base modification calls using the
same [thresholding](./filtering.md) algorithm employed by `modkit pileup`. The resultant table has, for each read,
one row for each base modification call in that read. If a base is called as modified then `call_code` will be the
code in the `MM` tag. If the base is called as canonical the `call_code` will be `-` (`A`, `C`, `G`, and `T` are
# Tabulating base modification _calls_ for each read position with `extract calls`
The `modkit extract calls` command will generate a table of read-level base modification calls using the same [thresholding](./filtering.md) algorithm employed by `modkit pileup`.
The resultant table has, for each read, one row for each base modification call in that read.
If a base is called as modified then `call_code` will be the code in the `MM` tag. If the base is called as canonical the `call_code` will be `-` (`A`, `C`, `G`, and `T` are
reserved for "any modification"). The full schema of the table is below:

| column | name | description | type |
Expand Down Expand Up @@ -88,44 +86,40 @@ For secondary and supplementary alignments, soft-clipped positions are not repea

## Example usages:

### Extract a table from an aligned and indexed BAM
### Extract a table of base modification probabilities from an aligned and indexed BAM
```
modkit extract <input.bam> <output.tsv>
modkit extract full <input.bam> <output.tsv>
```
If the index `input.bam.bai` can be found, intervals along the aligned genome can be performed
in parallel.

### Extract a table from a region of a large modBAM
The below example will extract reads from only chr20, and include reference sequence context
```
modkit extract <intput.bam> <output.tsv> --region chr20 --ref <ref.fasta>
modkit extract full <intput.bam> <output.tsv> --region chr20 --ref <ref.fasta>
```

### Extract only sites aligned to a CG motif
```
modkit motif-bed <reference.fasta> CG 0 > CG_motifs.bed
modkit extract <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed
modkit motif bed <reference.fasta> CG 0 > CG_motifs.bed
modkit extract full <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed
```

### Extract only sites that are at least 50 bases from the ends of the reads
```
modkit extract <in.bam> <out.tsv> --edge-filter 50
modkit extract full <in.bam> <out.tsv> --edge-filter 50
```

### Extract read-level base modification calls
```
modkit extract <input.bam> null --read-calls <calls.tsv>
```
Using "null" in the place of the normal output will direct the normal extract output
to /dev/null, to keep this output specify a file or `-` for standard out.

```
modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv>
modkit extract calls <input.bam> <calls.tsv>
```

Use `--allow-non-primary` to get secondary and supplementary mappings in the output.

```
modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv> --allow-non-primary
modkit extract calls <input.bam> <output.tsv> --allow-non-primary
```


See the help string and/or [advanced_usage](./advanced_usage.md) for more details.
See the help string and/or [advanced_usage](./advanced_usage.md) for more details and [performace considerations](./perf_considerations.m) if you encounter issues with memory usage.
2 changes: 1 addition & 1 deletion book/src/intro_find_motifs.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ For example, to run the command with default settings (recommended):
bedmethyl=/path/to/pileup.bed
ref=/path/to/reference.fasta

modkit find-motifs \
modkit motif search \
-i ${bedmethyl} \
-r ${ref} \
-o ./motifs.tsv \
Expand Down
31 changes: 31 additions & 0 deletions book/src/intro_localize.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# Investigating patterns with localise

One a bedMethyl table has been created, `modkit localise` will use the pileup and calculate per-base modification aggregate information around genomic features of interest.
For example, we can investigate base modification patterns around CTCF binding sites.

<p align="center">
<img src="./images/modkit_localise_ctcf_5mC.png" alt="5mC patterns at CTCF sites" width="500" />
</p>

The input requirements to `modkit localise` are simple:
1. BedMethyl table that has been bgzf-compressed and tabix-indexed
1. Regions file in BED format (plaintext).
1. Genome sizes tab-separated file: `<chrom>\t<size_in_bp>`

an example command:

```bash
modkit localise ${bedmethyl} --regions ${ctcf} --genome-sizes ${sizes}
```

The output table has the following schema:

| column | Name | Description | type |
|--------|------------------|---------------------------------------------------------------------------------------------------------------------|-------|
| 1 | mod code | modification code as present in the bedmethyl | str |
| 2 | offset | distance in base pairs from the center of the genome features, negative values reflect towards the 5' of the genome | int |
| 3 | n_valid | number of valid calls at this offset for this modification code | int |
| 4 | n_mod | number of calls for this modification code at this offset | int |
| 5 | percent_modified | `n_mod` / `n_valid` * 100 | float |

Optionally the `--chart` argument can be used to create HTML charts of the modification patterns.
10 changes: 10 additions & 0 deletions book/src/intro_motif.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# Working with sequence motifs

The `modkit motif` suite contains tools for discovery and exploration of short degenerate sequences (motifs) that may be enriched in a sample.
A common use case is to discover the motifs enriched for modification in a native bacterial sample which can give indication of methyltransferase enzymes present in the genomes present in the sample.

The following tools are available:

1. [Find enriched motifs _de novo_ from a bedMethyl with `search`.](,/intro_find_motifs.md)
1. [`evaluate` or `refine` a table of known motifs](./evaluate_motif.md)
4. [Making a motif BED file with `motif bed`](./intro_motif_bed.md)
43 changes: 43 additions & 0 deletions book/src/intro_stats.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# Calculating modification statistics in regions

There are many analysis operations available in `modkit` once you've generated a bedMethyl table.
One such operation is to calculate aggregation statistics on specific regions, for example in CpG islands or gene promoters.
The `modkit stats` command is designed for this purpose.

```bash
# these files can be found in the modkit repository
cpgs=tests/resources/cpg_chr20_with_orig_names_selection.bed
sample=tests/resources/lung_00733-m_adjacent-normal_5mc-5hmc_chr20_cpg_pileup.bed.gz
modkit stats ${sample} --regions ${cpgs} -o ./stats.tsv [--mod-codes "h,m"]
```

> Note that the argument `--mod-codes` can alternatively be passed multiple times, e.g. this is equivalent: <br />
> `--mod-codes c --mod-codes h`
The output TSV has the following schema:

| column | Name | Description | type |
|--------|----------------|-------------------------------------------------------------------------------|-------|
| 1 | chrom | name of reference sequence from BAM header | str |
| 2 | start position | 0-based start position | int |
| 3 | end position | 0-based exclusive end position | int |
| 4 | name | name of the region from input BED (`.` if not provided) | str |
| 5 | strand | Strand (`+`, `-`, `.`) from the input BED (`.` assumed for when not provided) | str |
| 6+ | count_x | total number of `x` base modification codes in the region | int |
| 7+ | count_valid_x | total valid calls for the primary base modified by code `x` | int |
| 8+ | percent_x | `count_x` / `count_vali_x` * 100 | float |

Columns 6, 7, and 8 are repeated for each modification code found in the bedMethyl file or provided with `--mod-codes` argument.

An example output:

```text
chrom start end name strand count_h count_valid_h percent_h count_m count_valid_m percent_m
chr20 9838623 9839213 CpG: 47 . 12 1777 0.6752954 45 1777 2.532358
chr20 10034962 10035266 CpG: 35 . 7 1513 0.46265697 0 1513 0
chr20 10172120 10172545 CpG: 35 . 15 1229 1.2205045 28 1229 2.278275
chr20 10217487 10218336 CpG: 59 . 29 2339 1.2398461 108 2339 4.617358
chr20 10433628 10434345 CpG: 71 . 29 2750 1.0545455 2 2750 0.07272727
chr20 10671925 10674963 CpG: 255 . 43 9461 0.45449743 24 9461 0.25367296
```

Loading

0 comments on commit f882a43

Please sign in to comment.