How to interpret failed reads during modkit extract #220

lkwhite · 2024-06-29T15:02:45Z

I have two datasets:

1. SUP-rebasecalled direct RNA with pseudouridine calling, aligned to a reference
2. The same data as above, but put through reference-anchored inference with remora for pseU

When I run modkit extract on these, I get two very different results:

processed 151548 reads, 7390161 rows, skipped ~36 reads, failed ~0 reads
processed 3859 reads, 110317 rows, skipped ~0 reads, failed ~147524 reads

Is this expected behavior? How should I interpret the large number of failed reads in the second run?


ArtRand · 2024-06-29T16:50:11Z

Hello @lkwhite,

Could you run modkit extract with --log-filepath $log_file and attach it? It should tell you why reads are failing.

lkwhite · 2024-06-29T18:39:32Z

Looks like we have a whole bunch of errors of the form "record XXX has improper data, MN tag length X and seq length Y don't match".

These are tRNA sequencing data so those lengths are 85-135 nt and we use BWA MEM instead of mm2 for alignment.

lkwhite · 2024-06-29T18:46:52Z

The log was a little too big to attach so I've split it in two parts.
mkextract_postremora_part1.txt
mkextract_postremora_part2.txt

ArtRand · 2024-06-30T14:19:11Z

Hello @lkwhite,

It is possible that remora reference anchoring doesn't emit the correct MN tag or doesn't update it when the sequence length changes. Could you try removing the MN tags and seeing if modkit extract will parse the base modifications? As a reminder you can remove the tags with samtools:

samtools view -bhx MN ${bam} | modkit extract - extract.tsv --log-filepath test_tags.log

If the base modification tags are actually incorrect, you'll get different errors. If this works and you want the --read-calls output you'll have to write the output of samtools view to a file.

Let me know,

A

lkwhite · 2024-06-30T15:01:36Z

That reduces the % of reads failing, and the ones that fail now say "record has improper data, malformed MM delta list".

I couldn't find MN in the SAM spec; how is Remora using this tag?

test_tags.log

ArtRand · 2024-07-01T15:24:05Z

Hello @lkwhite,

The MN tag isn't in the spec yet, and remora doesn't use it. But dorado does, so we need to update our recommendation to say that these tags should be removed when using remora infer.

Looks like quite a few reads are failing with the "improper data" error. I've extracted the read ids and attached them to this thread:

grep -Ei 'record [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} has improper data' ${fp} | grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' > malformed_read_ids.txt
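For a quick overview of why reads fail, a rough sketch along these lines should tally the distinct failure reasons in the log. It assumes each failure line contains "record <read-id> <reason>", as in the grep above; `tally_failures` is a hypothetical helper name, and digit runs are collapsed to `N` so that differing lengths group together.

```shell
# Hypothetical helper: count distinct failure reasons in a modkit log file.
# Assumes failure lines look like "... record <read-id> <reason>".
tally_failures() {
    log_file="$1"
    # Keep only the part of each line starting at "record <read-id> ...".
    grep -Eio 'record [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} .*' "$log_file" |
        # Drop "record" and the read id, then replace every number with N.
        sed -E 's/^[^ ]+ [^ ]+ //; s/[0-9]+/N/g' |
        # Count identical reasons, most frequent first.
        sort | uniq -c | sort -rn
}
```

Usage would be something like `tally_failures test_tags.log`.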

Could you send me a few of these BAM records? Preferably both before and after reference-anchored base modification inference. If the files are too large (or you don't want them on github) you can email me at art.rand[at]nanoporetech.com and we can work out a way to share them.

Thanks.

malformed_read_ids.txt

ArtRand · 2024-07-16T00:38:47Z

Just in case anyone else encounters a large number of skipped or failed reads following "reference-anchored" remora base modification calling, here is what we found.

If you have previously used base modification calling with dorado, it seems there is a bug in remora where the output will have multiple MM tags (the original and the reference-anchored one). The parser in modkit will throw an error on these reads since the MN tag will not match the SEQ length, which is the correct behavior. If you remove the MN tag, you will get around this error, but now the basecall-anchored base modification calls will be used incorrectly or the read will fail completely. A modkit command to fix these tags and a remora fix are in progress.
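To check whether a BAM is affected, something like the following sketch could list the reads that carry more than one MM tag. It is an assumption that the duplicate shows up as two `MM:Z:` fields in the SAM text of the record; `dup_mm_reads` is a hypothetical helper name.

```shell
# Hypothetical helper: read SAM records on stdin and print the ids of reads
# that have more than one MM:Z: tag among the optional fields (column 12 onward).
dup_mm_reads() {
    awk -F'\t' '
        !/^@/ {                      # skip header lines if present
            n = 0
            for (i = 12; i <= NF; i++)
                if ($i ~ /^MM:Z:/) n++
            if (n > 1) print $1      # column 1 is the read id
        }'
}
```

Usage would be `samtools view ${bam} | dup_mm_reads`.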

The correct work-around is to either not use base modification calling in the original dorado command (but include --emit-moves) or make sure to completely remove the MM/ML/MN tags prior to remora reference-anchored inference.
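For the second option, a minimal sketch of stripping the tags with samtools might look like this. The function name and file arguments are hypothetical; `-x` is the same tag-removal option used earlier in this thread, repeated once per tag.

```shell
# Hypothetical helper: write a copy of a BAM without MM/ML/MN tags, intended
# to be run before remora reference-anchored inference.
strip_mod_tags() {
    in_bam="$1"
    out_bam="$2"
    # -b writes BAM (the header is always included in BAM output);
    # each -x removes one tag from every record; -o names the output file.
    samtools view -b -x MM -x ML -x MN -o "$out_bam" "$in_bam"
}
```

Usage would be `strip_mod_tags basecalled.bam basecalled.no_mods.bam`, with the stripped file then passed to remora. Other tags, including the move table, are left untouched.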

ArtRand added the troubleshooting workflow and data preparation questions label Jun 29, 2024
