Merge branch 'ar/prep-040' into 'master'

Prep docs and loose ends for 0.4.0 See merge request machine-learning/modkit!224
nanoporetech · Sep 17, 2024 · f882a43 · f882a43
2 parents 3d4f2ec + eca57ca
commit f882a43
Show file tree

Hide file tree

Showing 55 changed files with 2,597 additions and 2,033 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,22 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [v0.4.0]
+### Adds
+- [motif] Add `search` and `evaluate` subcommands under `motif` command hierarchy.
+- [stats, localize] Add `stats` and `localize` commands, see documentation for details.
+- [dmr, multi] Combine samples when they have the same name.
+- [extract] Add option to emit bgzf-compressed output.
+### Fixes
+- [validate] Only consider modification codes attributed to the primary sequence base being validated.
+- [pileup, extract] Improve iteration over regions when `--include-bed` is provided.
+### Changes
+- [dmr] Use `htslib` `tbx` module for reading tabix index instead of `noodles`.
+- [pileup] Require `.fai` FASTA index when a reference is provided, load only sections of reference that are necessary.
+- [extract] Separate `calls` and `full` commands to produce the read calls, and full tables, respectively.
+- [validate] Change "any-mod" code from {`A`, `C`, `G`, `T`} to `*`.
+
+
 ## [v0.3.3]
 ### Fixes
 - [sample-probs, summary, pileup] Refactor sampling algorithm so that it will not over-sample reads leading to excessive memory usage.

diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "mod_kit"
-version = "0.3.3"
+version = "0.4.0"
 edition = "2021"
 
 [[bin]]

diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md
@@ -2,18 +2,22 @@
 
 - [Quick Start guides](./quick_start.md)
     - [Constructing bedMethyl tables](./intro_bedmethyl.md)
+    - [Make hemi-methylation bedMethyl tables](./intro_pileup_hemi.md)
     - [Updating and adjusting MM tags](./intro_adjust.md)
     - [Inspecting base modification probabilities](./intro_sample_probs.md)
     - [Summarizing a modBAM](./intro_summary.md)
-    - [Making a motif BED file](./intro_motif_bed.md)
-    - [Extracting read information to a table](./intro_extract.md)
+    - [Calculating modification statistics in regions](./intro_stats.md)
     - [Calling mods in a modBAM](./intro_call_mods.md)
     - [Removing modification calls at the ends of reads](./intro_edge_filter.md)
     - [Repair MM/ML tags on trimmed reads](./intro_repair.md)
-    - [Make hemi-methylation bedMethyl tables](./intro_pileup_hemi.md)
+    - [Working with sequence motifs](./intro_motif.md)
+        - [Making a motif BED file](./intro_motif_bed.md)
+        - [Find highly modified motif sequences](./intro_find_motifs.md)
+        - [Evaluate and refine a table of known motifs](./evaluate_motif.md)
+    - [Extracting read information to a table](./intro_extract.md)
+    - [Investigating patterns with `localise`](./intro_localize.md)
     - [Perform differential methylation scoring](./intro_dmr.md)
     - [Validate ground truth results](./intro_validate.md)
-    - [Find highly modified motif sequences](./intro_find_motifs.md)
     - [Calculating methylation entropy](./intro_entropy.md)
     - [Narrow output to specific positions](./intro_include_bed.md)
 - [Extended subcommand help](./advanced_usage.md)

diff --git a/book/src/advanced_usage.md b/book/src/advanced_usage.md
diff --git a/book/src/evaluate_motif.md b/book/src/evaluate_motif.md
@@ -0,0 +1,30 @@
+# Evaluate a table of known motifs
+
+The `modkit search` command has an option to provide any number of known motifs with `--know-motif`.
+If you already have a list of candidate motifs (e.f. from a previous run of `modkit motif search`) you can check these motifs quickly against a bedMethyl table with `modkit motif evaluate`.
+
+```bash
+modkit motif evaluate -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref}
+```
+
+Similarly, the search [algorithm](./intro_find_motifs.md#simple-description-of-the-search-algorithm) can be run using known motifs as seeds:
+
+```bash
+modkit motif refine -i ${bedmethyl} --known-motifs-table motifs.tsv -r ${ref}
+```
+
+The output tables to both of these commands have the same schema:
+
+| column | name       | description                                                                                     | type  |
+|--------|------------|-------------------------------------------------------------------------------------------------|-------|
+| 1      | mod_code   | code specifying the modification found in the motif                                             | str   |
+| 2      | motif      | sequence of identified motif using [IUPAC](https://www.bioinformatics.org/sms/iupac.html) codes | str   |
+| 3      | offset     | 0-based offset into the motif sequence of the modified base                                     | int   |
+| 4      | frac_mod   | fraction of time this sequence is found in the _high modified_ set col-5 / (col-5 + col-6)      | float |
+| 5      | high_count | number of occurances of this sequence in the _high-modified_ set                                | int   |
+| 6      | low_count  | number of occurances of this sequence in the _low-modified_ set                                 | int   |
+| 7      | mid_count  | number of occurances of this sequence in the _mid-modified_ set                                 | int   |
+| 8      | log_odds   | log2 odds of the motif being in the high-modified set                                           | int   |
+
+In the human-readable table columns (1) and (2) are merged to show the modification code in the motif sequence context, the rest of the columns are the same as the machine-readable table.
+
diff --git a/book/src/images/modkit_localise_ctcf_5mC.png b/book/src/images/modkit_localise_ctcf_5mC.png
diff --git a/book/src/intro_bedmethyl.md b/book/src/intro_bedmethyl.md
@@ -12,7 +12,7 @@ for details.
 
 ## Basic usage
 
-In its simplest form `modkit` creates a bedMethyl file using the following:
+In its simplest form `modkit pileup` creates a bedMethyl file using the following:
 
 ```text
 modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
@@ -36,6 +36,8 @@ in the reference:
 modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta
 ```
 
+**Note** that when passing a reference with `--ref` a FASTA index `.fai` file is required to be at `path/to/reference.fasta.fai`.
+
 To restrict output to only certain CpGs, pass the `--include-bed` option with the CpGs to be used, 
 see [this page](./intro_include_bed.md) for more details.
 
@@ -180,6 +182,5 @@ CG->CH substitution such that no modification call was produced by the basecalle
 
 ## Performance considerations
 
-The `--interval-size`, `--threads`, `--chunk-size`, and `--max-depth` parameters can be used to tweak the parallelism and 
-memory consumption of `modkit pileup`. The defaults should be suitable for most use cases, for more details see the
-[advanced usage](./advanced_usage.md) and [performance considerations](./perf_considerations.md) sections.
+The `--interval-size`, `--threads`, `--chunk-size`, and `--max-depth` parameters can be used to tweak the parallelism and memory consumption of `modkit pileup`.
+The defaults should be suitable for most use cases, for more details see [performance considerations](./perf_considerations.md) sections.
diff --git a/book/src/intro_dmr.md b/book/src/intro_dmr.md
@@ -96,6 +96,9 @@ The full schema is described [below](#differential-methylation-output-format).
 ## 2. Perform differential methylation detection on all pairs of samples over regions from the genome.
 The `modkit dmr multi` command runs all pairwise comparisons for more than two samples for all regions provided in the regions BED file.
 The preparation of the data is identical to that for the [previous section](#preparing-the-input-data) (for each sample, of course).
+
+**Note** that if multiple samples are given the same name, they will be combined.
+
 An example command could be:
 
 ```bash

diff --git a/book/src/intro_extract.md b/book/src/intro_extract.md
@@ -1,8 +1,7 @@
 # Extracting base modification information
 
-The `modkit extract` sub-command will produce a table containing the base modification probabilities, 
-the read sequence context, and optionally aligned reference information.
-For `extract`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag. 
+The `modkit extract full` sub-commands will produce a table containing the base modification probabilities, the read sequence context, and optionally aligned reference information.
+For `extract full` and `extract calls`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag. 
 See [troubleshooting](./troubleshooting.md) for details.
 
 The table will by default contain unmapped sections of the read (soft-clipped sections, for example). 
@@ -13,7 +12,7 @@ the size of the BAM). You may want to either use the `--num-reads` option, the `
 pre-filter the modBAM ahead of time. You can also stream the output to stdout by setting the output to `-`
 or `stdout` and filter the columns before writing to disk.
 
-## Description of output table
+## Description of output table for `extract full`
 
 | column | name                  | description                                                                     | type |
 |--------|-----------------------|---------------------------------------------------------------------------------|------|
@@ -38,11 +37,10 @@ or `stdout` and filter the columns before writing to disk.
 | 19     | flag                  | FLAG from alignment record                                                      | str  |
 
 
-# Tabulating base modification _calls_ for each read position
-Passing `--read-calls <file-path>` option will generate a table of read-level base modification calls using the 
-same [thresholding](./filtering.md) algorithm employed by `modkit pileup`. The resultant table has, for each read,
-one row for each base modification call in that read. If a base is called as modified then `call_code` will be the 
-code in the `MM` tag. If the base is called as canonical the `call_code` will be `-` (`A`, `C`, `G`, and `T` are
+# Tabulating base modification _calls_ for each read position with `extract calls`
+The `modkit extract calls` command will generate a table of read-level base modification calls using the same [thresholding](./filtering.md) algorithm employed by `modkit pileup`.
+The resultant table has, for each read, one row for each base modification call in that read.
+If a base is called as modified then `call_code` will be the code in the `MM` tag. If the base is called as canonical the `call_code` will be `-` (`A`, `C`, `G`, and `T` are
 reserved for "any modification"). The full schema of the table is below:
 
 | column | name                  | description                                                                     | type |
@@ -88,44 +86,40 @@ For secondary and supplementary alignments, soft-clipped positions are not repea
 
 ## Example usages:
 
-### Extract a table from an aligned and indexed BAM 
+### Extract a table of base modification probabilities from an aligned and indexed BAM 
 ```
-modkit extract <input.bam> <output.tsv> 
+modkit extract full <input.bam> <output.tsv> 
 ```
 If the index `input.bam.bai` can be found, intervals along the aligned genome can be performed
 in parallel.
 
 ### Extract a table from a region of a large modBAM
 The below example will extract reads from only chr20, and include reference sequence context
 ```
-modkit extract <intput.bam> <output.tsv> --region chr20 --ref <ref.fasta>
+modkit extract full <intput.bam> <output.tsv> --region chr20 --ref <ref.fasta>
 ```
 
 ### Extract only sites aligned to a CG motif
 ```
-modkit motif-bed <reference.fasta> CG 0 > CG_motifs.bed
-modkit extract <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed
+modkit motif bed <reference.fasta> CG 0 > CG_motifs.bed
+modkit extract full <in.bam> <out.tsv> --ref <ref.fasta> --include-bed CG_motifs.bed
 ```
 
 ### Extract only sites that are at least 50 bases from the ends of the reads
 ```
-modkit extract <in.bam> <out.tsv> --edge-filter 50
+modkit extract full <in.bam> <out.tsv> --edge-filter 50
 ```
 
 ### Extract read-level base modification calls
-```
-modkit extract <input.bam> null --read-calls <calls.tsv>
-```
-Using "null" in the place of the normal output will direct the normal extract output
-to /dev/null, to keep this output specify a file or `-` for standard out.
 
 ```
-modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv>
+modkit extract calls <input.bam> <calls.tsv>
 ```
+
 Use `--allow-non-primary` to get secondary and supplementary mappings in the output.
+
 ```
-modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv> --allow-non-primary
+modkit extract calls <input.bam> <output.tsv> --allow-non-primary
 ```
 
-
-See the help string and/or [advanced_usage](./advanced_usage.md) for more details.
+See the help string and/or [advanced_usage](./advanced_usage.md) for more details and [performace considerations](./perf_considerations.m) if you encounter issues with memory usage.
diff --git a/book/src/intro_find_motifs.md b/book/src/intro_find_motifs.md
@@ -8,7 +8,7 @@ For example, to run the command with default settings (recommended):
 bedmethyl=/path/to/pileup.bed
 ref=/path/to/reference.fasta
 
-modkit find-motifs \
+modkit motif search \
   -i ${bedmethyl} \
   -r ${ref} \
   -o ./motifs.tsv \

diff --git a/book/src/intro_localize.md b/book/src/intro_localize.md
@@ -0,0 +1,31 @@
+# Investigating patterns with localise
+
+One a bedMethyl table has been created, `modkit localise` will use the pileup and calculate per-base modification aggregate information around genomic features of interest.
+For example, we can investigate base modification patterns around CTCF binding sites.
+
+<p align="center">
+  <img src="./images/modkit_localise_ctcf_5mC.png" alt="5mC patterns at CTCF sites" width="500" />
+</p>
+
+The input requirements to `modkit localise` are simple: 
+1. BedMethyl table that has been bgzf-compressed and tabix-indexed
+1. Regions file in BED format (plaintext).
+1. Genome sizes tab-separated file: `<chrom>\t<size_in_bp>`
+
+an example command:
+
+```bash
+modkit localise ${bedmethyl} --regions ${ctcf} --genome-sizes ${sizes}
+```
+
+The output table has the following schema:
+
+| column | Name             | Description                                                                                                         | type  |
+|--------|------------------|---------------------------------------------------------------------------------------------------------------------|-------|
+| 1      | mod code         | modification code as present in the bedmethyl                                                                       | str   |
+| 2      | offset           | distance in base pairs from the center of the genome features, negative values reflect towards the 5' of the genome | int   |
+| 3      | n_valid          | number of valid calls at this offset for this modification code                                                     | int   |
+| 4      | n_mod            | number of calls for this modification code at this offset                                                           | int   |
+| 5      | percent_modified | `n_mod` / `n_valid`  * 100                                                                                          | float |
+
+Optionally the `--chart` argument can be used to create HTML charts of the modification patterns.
diff --git a/book/src/intro_motif.md b/book/src/intro_motif.md
@@ -0,0 +1,10 @@
+# Working with sequence motifs
+
+The `modkit motif` suite contains tools for discovery and exploration of short degenerate sequences (motifs) that may be enriched in a sample. 
+A common use case is to discover the motifs enriched for modification in a native bacterial sample which can give indication of methyltransferase enzymes present in the genomes present in the sample.
+
+The following tools are available:
+
+1. [Find enriched motifs _de novo_ from a bedMethyl with `search`.](,/intro_find_motifs.md)
+1. [`evaluate` or `refine` a table of known motifs](./evaluate_motif.md)
+4. [Making a motif BED file with `motif bed`](./intro_motif_bed.md)
diff --git a/book/src/intro_stats.md b/book/src/intro_stats.md
@@ -0,0 +1,43 @@
+# Calculating modification statistics in regions
+
+There are many analysis operations available in `modkit` once you've generated a bedMethyl table.
+One such operation is to calculate aggregation statistics on specific regions, for example in CpG islands or gene promoters.
+The `modkit stats` command is designed for this purpose. 
+
+```bash
+# these files can be found in the modkit repository
+cpgs=tests/resources/cpg_chr20_with_orig_names_selection.bed
+sample=tests/resources/lung_00733-m_adjacent-normal_5mc-5hmc_chr20_cpg_pileup.bed.gz
+modkit stats ${sample} --regions ${cpgs} -o ./stats.tsv [--mod-codes "h,m"]
+```
+
+> Note that the argument `--mod-codes` can alternatively be passed multiple times, e.g. this is equivalent: <br />
+> `--mod-codes c --mod-codes h`
+
+The output TSV has the following schema:
+
+| column | Name           | Description                                                                   | type  |
+|--------|----------------|-------------------------------------------------------------------------------|-------|
+| 1      | chrom          | name of reference sequence from BAM header                                    | str   |
+| 2      | start position | 0-based start position                                                        | int   |
+| 3      | end position   | 0-based exclusive end position                                                | int   |
+| 4      | name           | name of the region from input BED (`.` if not provided)                       | str   |
+| 5      | strand         | Strand (`+`, `-`, `.`) from the input BED (`.` assumed for when not provided) | str   |
+| 6+     | count_x        | total number of `x` base modification codes in the region                     | int   |
+| 7+     | count_valid_x  | total valid calls for the primary base modified by code `x`                   | int   |
+| 8+     | percent_x      | `count_x` / `count_vali_x` * 100                                              | float |
+
+Columns 6, 7, and 8 are repeated for each modification code found in the bedMethyl file or provided with `--mod-codes` argument.
+
+An example output:
+
+```text
+chrom  start     end       name     strand   count_h        count_valid_h  percent_h   count_m        count_valid_m  percent_m
+chr20  9838623   9839213   CpG: 47  .        12             1777           0.6752954   45             1777           2.532358
+chr20  10034962  10035266  CpG: 35  .        7              1513           0.46265697  0              1513           0
+chr20  10172120  10172545  CpG: 35  .        15             1229           1.2205045   28             1229           2.278275
+chr20  10217487  10218336  CpG: 59  .        29             2339           1.2398461   108            2339           4.617358
+chr20  10433628  10434345  CpG: 71  .        29             2750           1.0545455   2              2750           0.07272727
+chr20  10671925  10674963  CpG: 255 .        43             9461           0.45449743  24             9461           0.25367296
+```
+