Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to interpret failed reads during modkit extract #220

Open
lkwhite opened this issue Jun 29, 2024 · 7 comments
Open

How to interpret failed reads during modkit extract #220

lkwhite opened this issue Jun 29, 2024 · 7 comments
Labels
troubleshooting workflow and data preparation questions

Comments

@lkwhite
Copy link

lkwhite commented Jun 29, 2024

I have two datasets:

  1. SUP-rebasecalled direct RNA with pseudouridine calling, aligned to a reference
  2. The same data as above, but now put through reference-anchored inference with remora for pseU

When I run modkit extract on these, I get two very different results:

  1. processed 151548 reads, 7390161 rows, skipped ~36 reads, failed ~0 reads
  2. processed 3859 reads, 110317 rows, skipped ~0 reads, failed ~147524 reads

Is this an expected behavior? How should I interpret this?

@ArtRand
Copy link
Contributor

ArtRand commented Jun 29, 2024

Hello @lkwhite,

Could you run modkit extract with --log-filepath $log_file and attach it? It should tell you why reads are failing.

@ArtRand ArtRand added the troubleshooting workflow and data preparation questions label Jun 29, 2024
@lkwhite
Copy link
Author

lkwhite commented Jun 29, 2024

Looks like we have a whole bunch of record XXX has improper data, MN tag length X and seq length Y don't match.

These are tRNA sequencing data so those lengths are 85-135 nt and we use BWA MEM instead of mm2 for alignment.

@lkwhite
Copy link
Author

lkwhite commented Jun 29, 2024

The log was a little too big to attach so I've split it in two parts.
mkextract_postremora_part1.txt
mkextract_postremora_part2.txt

@ArtRand
Copy link
Contributor

ArtRand commented Jun 30, 2024

Hello @lkwhite,

It is possible that remora reference anchoring doesn't emit the correct MN tag or doesn't update it when the sequence length changes. Could you try removing the MN tags and seeing if modkit extract will parse the base modifications? As a reminder you can remove the tags with samtools:

samtools view -bhx MN ${bam} | modkit extract - extract.tsv --log-filepath test_tags.log

If the base modification tags are actually incorrect, you'll get different errors. If this works and you want the --read-calls output you'll have to write the output of samtools view to a file.

Let me know,

A

@lkwhite
Copy link
Author

lkwhite commented Jun 30, 2024

That reduces the % of reads failing, and the ones that fail now say record has improper data, malformed MM delta list.

I couldn't find MN in the sam spec, how is Remora using this tag?

test_tags.log

@ArtRand
Copy link
Contributor

ArtRand commented Jul 1, 2024

Hello @lkwhite,

The MN tag isn't in the spec yet, and remora doesn't use it. But dorado does, so we need to update the recommendation to remove the tags when using remora infer.

Looks like quite a few reads are failing with the "improper data" error. I've extracted the read ids and attached them to this thread:

grep -Ei 'record [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} has improper data' ${fp} | grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' > malformed_read_ids.txt

Could you send me a few of these BAM records? Preferably both before and after reference-anchored base modification inference. If the files are too large (or you don't want them on github) you can email me at art.rand[at]nanoporetech.com and we can work out a way to share them.

Thanks.

malformed_read_ids.txt

@ArtRand
Copy link
Contributor

ArtRand commented Jul 16, 2024

Just in case anyone else encounters an issue with large number of skipped or error reads following "reference-anchored" remora base modification calling.

If you have previously used base modification calling with dorado, it seems there is a bug in remora where the output will have multiple MM tags (the original and the reference-anchored one). The parser in modkit will throw an error on these reads since the MN tag will not match the SEQ length - which is correct. If you remove the MN tag, you will get around this error, but now the basecall-anchroed base modification call will be used incorrectly or the read will fail completely. A modkit command to fix these tags and remora fix are in progress.

The correct work-around is to either not use base modification calling in the original dorado command (but include --emit-moves) or make sure to completely remove the MM/ML/MN tags prior to remora reference-anchored inference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
troubleshooting workflow and data preparation questions
Projects
None yet
Development

No branches or pull requests

2 participants