-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MM tag and hard clipped alignments #646
Comments
The topic of hard clipping came up during the original PR discussion, in #418 (comment) and the following comments. I believe we concluded that using
However I don't think we ever added such a sentence. That would be a more generous alternative to your suggested sentence disallowing MM+hard clipping entirely. |
Thanks @jmarshall, I missed that when reviewing the original discussion before posting this issue. I agree using Is the mapper allowed to modify/fix |
You're right that — either way — this needs more details in the spec. The N notation enables non-MM-aware mappers to hard clip reads and leave MM as is, such that downstream tools that do understand MM can parse MM-with-Ns correctly (on the understanding that the N counts include the bases within What MM-aware mappers should do is another question. For either exact-base-MM or MM-with-Ns, they would be capable of adjusting the counts, so that downstream tools could parse either correctly by considering the These two are inconsistent… What to do?
|
I've been mulling this over for a few days and leaning towards adding another flag. Without a flag the counts can be relative to the original (possibly unknown) sequence, and when the flag is set the counts are relative to SEQ instead. |
This is one of those things which just becomes akward once hard clipping is used. We could state that if hard clipping is to be used then the I think it's best to recommend that hard-clipping isn't used. We can and should also be explicit with how to interpret the data when something wrong happens. So either something like this:
Or if we wish to permit the possibility of working around with the N code:
I prefer the former as it's simple, and the This does however mean there's no way for an MM aware aligner to work properly, which I assume is the reason behind @jmarshall 's comment above about an additional flag. If we add this, then we'd need a similar language to above to describe the flag-less state (ie it's invalid so ignore), as well as a description of the flag so we can document how to do things properly. So I could get behind this additional flag proposal too. Care to suggest any recommended code? It could be something inline like Edit: or (ghastly!) actually the only additional information we need when hard clipping is how many of each base type were clipped. I don't like where that's going though, and just starting so-many bases into the original unclipped fragment is a cleaner approach. Either way the presence of such data, as is the presence of an additional character flag, is what distinguishs the hard-clipped record with MM tag from being valid or invalid. |
Correct me if I am wrong: ML tag can’t show the correct number of how many seq bases of the stated base type. In other words, ML tag DOESN'T know whether there are skipped base sites after the last modified base. So We need improvement for the ML tag I will simplify the question with the assumption that the modified base is C. So there should be a relationship between 1) The number of skipped C sites Based on my observation, there is a potential issue here: ML tag can’t show whether there are skipped C sites after the last modified C. For instance, if there is only 1 single C is modified and it is the 3rd C that is modified, I think the sequence So currently the current relationship is: If that’s the case, I think an extra number should be introduced as the last element of the MM tag to label the number of skipped C sites to distinguish the circumstances. In thise case, the So the question will become: How to solve the current inconsistency due to the MM tag issue? Is there a way to use ML tag to show the correct number of skipped C in the prediction? Thank you so much for your help! |
You are incorrect in this interpretation. The length of items in the ML tag should match the length of items in the MM tag, if both are present. Basically MM defines the bases while ML defines their confidence, just like traditional SEQ and QUAL. Where things get a bit different is whether the MM tag is generated by checking all sites on the read (of a specific base type) or just a subset / specific sites. If specific sites: we record in MM the bases we inspected, and use ML to indicate whether or not we believe they are modified. All sites: we inspect all bases. We could do as per the specific-sites strategy above and record all these bases (essentially MM of 0,0,0,0...) and use ML to distinguish which are modified and which are not. However it's usually preferable to simple record in MM which of these we believe has a significant likelihood of being modified. This distinction is where the implicit vs explicit notation came from. It wasn't initially in the specification as at the time of drafting it we weren't aware that people would only do base modification assessment in a subset of the read, believing instead it to be an inherent part of the base-calling and either always-on or always-off. I'm sorry but I don't understand your second point. Perhaps it's related to orientation however. We use MM to mark a particular count of bases, say 3rd and 8th C. However this is from the 5' end, even when reverse complemented. We don't need to know how many additional bases there are between the last call and the end of the sequence. See the "specific" vs "all" above for why. |
Thanks for the reply. Let me refine the question: I want to know the coordination of called specific site in the original sequence. It seems that the called site is not necessarily the last site in the sequence. I will still use specific sites C as an example. I indeed observed that the length of items in the MM tag matches the length of items in the ML tag when they are both present. Meantime, MM/ML tag can both mark the number of called C (length of items in the ML tag) and the number of skipped C (The sum of items saved in the MM tag) before the last call. In other words, If the sequence is
|
Ah yes, that's true there may be additional Cs. The intention of the tag is not to act as a counter for the base types, but to identify specific bases as being methylated. There is nothing inconsistent here, just a different intention. What is the reason for wanting to know the remaining C count? Is it to act as an error check? I can see, had we included the remainder somewhere, then we could spot disparities between MM and SEQ, eg due to hard clipping. |
I am trying to align the multiple alignments with the primary alignment for downstream analysis, and I utilized the MM tag as an error check as you mentioned, and clipping will also be involved. But previously I am not sure 1) whether the MM tag count from the 1st C 2) whether there are additional Cs after the last called C. Now after your clarification, I think I know how to do it. Thank you so much for your help! |
Mulling over ideas on twitter with others, and in the htslib PR to reject MM on hardclipped reads (samtools/htslib#1567), I had the following thought process:
Which leads me to a final conclusion of:
Thoughts? |
I think the new tag might be a good idea to check whether a read is hard-clipped or not, but it might also increase the storage. - Not sure if it will be a concern or not because the bam format is highly condensed, but I assume that introducing one more tag might not be a big issue. |
A new tag is just a single number, so even uncompressed it's just 5-7 bytes per read, and less when compressed. I think that's largely an irrelevance for long-read data. |
I think adding
or
On the balance though MZ is probably the best way to solve the immediate problem while we think of a longer term solution. |
Adding seq to MH is basically the same as switching hard-clips to soft-clips. There's really no point to doing the hard clipping if we're just going to store it somewhere else. Remember the reason for hard clipping is that sub-string secondary alignments on long-read data can be vast. It's not uncommon to see 10s of KB of hard-clipped data. Adding all that back again isn't really an option IMO. Or if it is, the option is to soft-clip not hard-clip. We perhaps just record ACGT base counts for the trimmed data as that's all you really need to correct MM. However, if we have a M*-tag aware hard-clipping tool then it should just hard-clip MM/ML too and trim out the modifications that fall outside the clipped range, making MH/MS totally redundant. |
If a sequence is hard-clipped after calling the base modifications, then the tool may, or may not, update the MM and ML tags accordingly. We have no way of distinguishing these two cases. While the base modification parsing code already detects overflows where the coordinates go beyond the sequence end, this isn't fool proof, especially if the clipping is short. So instead we have an (as yet unwritten) proposal of MZ:i tag holding the sequence length, to be written at the same time as the MM and ML tags. This can then be used as a sanity check later on, to detect cases where the sequence has changed length via a tool that is unaware of base modifications. TODO: as a separate PR, we should add a new API that can trim bases off the start/end of MM/ML strings to make it trivial for tools that are doing hard clipping via htslib. (Indeed we don't even have an API for SEQ/QUAL either, so it can do all together). This would make it far easier for people to keep everything in sync, and this code could then also update MZ while it's at it. That's new API though so it can arrive as a separate commit. See samtools/hts-specs#646
This is used as a sanity check on the validity of the MM and ML tags. It holds the length of SEQ at the time MM and ML were produced and/or updated. The intention is to provide a mechanism to detect hard-clipping has been performed with a tool that is not MM/ML aware. Fixes samtools#646
If a sequence is hard-clipped after calling the base modifications, then the tool may, or may not, update the MM and ML tags accordingly. We have no way of distinguishing these two cases. While the base modification parsing code already detects overflows where the coordinates go beyond the sequence end, this isn't fool proof, especially if the clipping is short. So instead we have an (as yet unwritten) proposal of MZ:i tag holding the sequence length, to be written at the same time as the MM and ML tags. This can then be used as a sanity check later on, to detect cases where the sequence has changed length via a tool that is unaware of base modifications. TODO: as a separate PR, we should add a new API that can trim bases off the start/end of MM/ML strings to make it trivial for tools that are doing hard clipping via htslib. (Indeed we don't even have an API for SEQ/QUAL either, so it can do all together). This would make it far easier for people to keep everything in sync, and this code could then also update MZ while it's at it. That's new API though so it can arrive as a separate commit. See samtools/hts-specs#646
This is used as a sanity check on the validity of the MM and ML tags. It holds the length of SEQ at the time MM and ML were produced and/or updated. The intention is to provide a mechanism to detect hard-clipping has been performed with a tool that is not MM/ML aware. Fixes samtools#646
If a sequence is hard-clipped after calling the base modifications, then the tool may, or may not, update the MM and ML tags accordingly. We have no way of distinguishing these two cases. While the base modification parsing code already detects overflows where the coordinates go beyond the sequence end, this isn't fool proof, especially if the clipping is short. So instead we have an (as yet unwritten) proposal of MZ:i tag holding the sequence length, to be written at the same time as the MM and ML tags. This can then be used as a sanity check later on, to detect cases where the sequence has changed length via a tool that is unaware of base modifications. TODO: as a separate PR, we should add a new API that can trim bases off the start/end of MM/ML strings to make it trivial for tools that are doing hard clipping via htslib. (Indeed we don't even have an API for SEQ/QUAL either, so it can do all together). This would make it far easier for people to keep everything in sync, and this code could then also update MZ while it's at it. That's new API though so it can arrive as a separate commit. See samtools/hts-specs#646
This is used as a sanity check on the validity of the MM and ML tags. It holds the length of SEQ at the time MM and ML were produced and/or updated. The intention is to provide a mechanism to detect hard-clipping has been performed with a tool that is not MM/ML aware. Fixes samtools#646
If a sequence is hard-clipped after calling the base modifications, then the tool may, or may not, update the MM and ML tags accordingly. We have no way of distinguishing these two cases. While the base modification parsing code already detects overflows where the coordinates go beyond the sequence end, this isn't fool proof, especially if the clipping is short. So instead we have an (as yet unwritten) proposal of MZ:i tag holding the sequence length, to be written at the same time as the MM and ML tags. This can then be used as a sanity check later on, to detect cases where the sequence has changed length via a tool that is unaware of base modifications. TODO: as a separate PR, we should add a new API that can trim bases off the start/end of MM/ML strings to make it trivial for tools that are doing hard clipping via htslib. (Indeed we don't even have an API for SEQ/QUAL either, so it can do all together). This would make it far easier for people to keep everything in sync, and this code could then also update MZ while it's at it. That's new API though so it can arrive as a separate commit. See samtools/hts-specs#646
Carrying conversation over from related IGV thread. From the ONT side, modified base calls will always be made along side basecalling. So aligners would have to either pass through or treat MM/ML tags. I would not expect that we will end up in a place where aligners would be expected to make modified base calls. We have research commands to make modified base calls after canonical basecalling is completed, but this requires some extra tags at basecalling time as well as a lot of large and random file access and thus would not recommend this workflow in general. These commands are generally only recommended for research settings (e.g. applying several different or advanced research modified base models to the same set of basecalls). Overall, I think the MN tag is a nice solution to validating MM/ML tags. If we can get good adoption across the board I think it makes sense for downstream tools (IGV etc) to require the MN tag. We can consider adding a command to |
I was directed from igvteam/igv#1435 as well. I will explain my preference first. I propose to add a new The root cause of the problem is that the spec doesn't say how to interpret MM in the presence hard clippings. Minimap2 thinks MM uses offsets on the original read, while IGV assumes MM uses offsets on the SEQ coordinate. To address the issue, the spec shall clearly specify what MM means when there are no additional tags. I see two possibilities:
We are discussing solution 2 only because IGV makes that assumption. I see no other benefit. It instead incurs several problems: a) Imposing 2 invalidates minimap2 which currently doesn't violate the spec. It turns makes existing minimap2, which conforms to the spec at present, non-conforming and thus breaks the backward compatibility of the spec. b) Due to the co-existence of new and old versions, we will see MM produced by both rules in coming years and we can't tell which rule is in action. This is bad for everyone. c) Existing tools that clip MM for minimap2 alignment assume rule 1. Imposing 2 will break them. Imposing rule 1 will affect IGV and existing tools that clip MM. These are easy to fix. For IGV, subtract the hard clip length. For tools that clip MM, output The MN tag in #714 could work if it defaults to the original read length (this adopts rule 1). Nonetheless, MN brings a new complication: what if the original read length, SEQ length, and MN length are all different? We only have two scenarios here. My proposed |
I agree that we should look again at #714's handling of the missing-MN case and revisit or at least expand on the recommended behaviour. |
I'm sorry I still don't get it. What use is knowing that MM applies to the original sequence when the recorded SEQ differs, other than the knowledge that MM Is now invalid. I did think quite a lot on this before proposing MN, and realised that your MC idea doesn't work. Imagine this scenario.
The file is now fibbing. We claim MM refers to SEQ, but it clearly does not. The same would apply if the aligner knew about MM/ML, modified it, and set MC:A:Y, but then a later stage subsequently modifies SEQ again. The benefit of MN is that it encodes the one thing that really matters - has the coordinate system changed? Changing SEQ will lead to a mismatch in MN. Also the idea of it being the aligned sequencing strand is fundamentally flawed again, for exactly the same reasons. It only works if all tools are updated to modify these tags. We did think though most of it before making the proposal. :) |
The spec does sort of hint at MM referring to the original sequence as-reported by the sequencing instrument because it categorically states that it is in the original sequence orientation and the "fundamental" base type reported for the modification is on that strand too. However I agree it could be clearer with respect to editing SEQ. Had we thought of that (beyond merely reverse complementing) then obviously we'd have thought about MN at the time too. This isn't just a knee jerk reaction to appease IGV though. The system was designed around storing evidence and data, not around alignment viewers, and both option 1 and 2 were assumed to be the same thing simply because we didn't think about tools editing the sequence. (Thinking of the things missing from a spec is much harder than discussing the wording that is there.) Sadly both 1 and 2 in your example are not working well, as some tools think it's 1 and some tools thing it's 2. As you say going forward this becomes bad for everyone unless there is a clear way to disambiguate these cases, hence the MN proposal. Please if you have improvements for wording or something is missing, do please comment on that PR. Also, I'm hesitant to suggest hacks like if the MM positions refer to the original sequence (detected through some boolean tag) then we can just adjust them based on hard-clipping. It may work, but it's a blunt instrument. We don't know it is valid because we're just assuming that the only thing that modified SEQ post MM-calling was the aligner. It could be UMI processing, sequencing vector removal without hard clipping, etc. I don't want us to be in a position of making assumptions only to find them bust apart again in 6 months time. Yet another reason why a boolean "yes / no" tag is just an accident waiting to happen. We have a fighting chance of compensating for the original vs editing SEQ coordinates if we know both the original length and the new modified lengths. It's not perfect, but was the best I could come up with. |
You only pass MM/ML to an aligner. MC encodes alignment information. You don't pass MC just like you don't pass NM.
SAM is designed to allow the retrieval of the original sequences such that they can be realigned. As long as SEQ and MM are intact on the primary record, these tools don't matter. If you can't reconstruct the original sequence and its base modifications, other solutions will be problematic as well.
I don't have that idea. I think everyone agrees MM is relative to the read strand.
We need to rewrite it. Let's replace MC with MH. In the text describing MM, add:
Then describe MH:
I could send a PR with these several sentences but that would create another thread.
There is no perfection solution right now. I have explained the problems with forcing MM to describe SEQ. Those are worse. If we had followed the first rule, we wouldn't have had this thread and everything would have naturally worked together. |
Apologies for somehow hitting edit instead of quote reply and wrecking your comment Heng! Hopefully it represents what you originally wrote now. Try 2 on reply...
Says who? Aligners get the whole kit and caboodle normally - all the tags they're meant to spit out into the SAM file. You seem to be punting the problem onto someone else by saying the aligner should get a subset of the tags only, and then presumably we have to lift-over the ones we didn't send to the aligner to add them back into the SAM file? I don't get it, sorry. Or are you saying MC is something produced by the aligner and not something we have currently? If so, we have a problem with realignment potentially, as then we need software that knows how to convert from SAM back to FASTQ but dropping certain tag types. Although... hopefully that's only primary records and hopefully they have only been soft-clipped.
I'm assuming people are wanting MM to work correctly on secondary alignments, which is the whole reason for this problem. Primary alignments hopefully are already fine and aligners (again hopefully) aren't hard clipping them. However you still haven't thought about non-alignment tools. Something else that takes an MM/ML tag and the SEQ and does something (it doesn't matter what - anything will do) and isn't aware of MM/ML still needs a way of signally that the SEQ and MM are now out of sync. When that software isn't being modified, the ONLY way that they can be detected as out of sync is some additional tag (or embedded MM data, but that ship has sailed now) that permits detection. Hence MN. I don't think we can get around it with a boolean aligner-specific thing. We need the detection of edited SEQ to be evident without modifying any existing software.
Sorry I strongly disagree. Maybe I'm not putting my point over correctly, so let's tackle your suggestion and work through a scenario so the problem is evident. We start with a fastq containing SEQ, MM and ML. "If the MH tag is absent, offsets in MM are relative to the original read sequence." So can we think of any process that could take a FASTQ, modify the SEQ/QUAL to spit out one or more reads of differing lengths, while faithfully keeping all the aux tags it's aware of unchanged? I don't know of such tools, but I could imagine one - our own custom barcode splitting. (This is quite possibly how this sort of thing first happened, before instrument manufacturers started supporting these protocols themselves.) We ligate a 6mer to the start of every DNA fragment as part of the library prep to indicate where it came from - maybe some pooling strategy. We then apply the standard instrument protocol, sequence it and get the data out. The instrument is totally unaware of our own barcode shenanigans, so we get a bunch of reads, all starting with our ligated barcode. We wish to strip that barcode off and, let's say, split the data up into barcode specific files. Eg maybe we're inventing a pooling or multiplexing strategy to put several samples into a single lane. So we then have a tool which reads the first 6 bases of each seq and splits up the fastq into a set of new FASTQs, one per barcoded pool, with the first 6 bases of seq and qual removed. We may, or may not, copy those 6 bases + qual into their own tag, but that's an irrelevance here. What we now have is an offset of 6bp for each seq, breaking MM coordinates. This tool isn't aware of MM, so it doesn't know to set MH. Do you see the problem now? Any strategy you think of that requires a flag to be set when the sequence is modified inherently requires us to edit existing software or to dictate that all future software is MM aware. We need the reverse - a strategy that can detect when SEQ has been modified but MM/ML have not even when software is totally unaware of the new aux tags. (We also need a mechanism where new software that is aware of the new tags can indicate that they're still valid and have been updated - this is the bit of the problem that you are solving I think.)
I'll repeat this again - we are not forcing MM to describe SEQ. Instead we are enabling a mechanism where we can detect they are not in sync so we don't make mistakes. I also don't get what you mean by "followed the first rule". (Don't mention fight club?) Fundamentally we wouldn't want to start from here, but here we are (largely because of lack of engagement earlier on when MM was first proposed). The best way forward is simply a way that can A) detect when there are problems and SEQ / MM are not in sync and possibly B) describe by how much they are out of sync so we can check whether it matches known hypothesis as to the cause, such as hard-clipping amounts. Your MH tag doesn't fufill either of these while the existing MN proposal does. I'm open to other ideas that do fulfil these requirements though. |
We are going around the circle. I understand what you said but most of them are not relevant.
That is the fundamental difference between us. You assume MM and SEQ should be in sync, while I think MM and CIGAR should be in sync; as long as MM and CIGAR are synced, you can get correct methylation on SEQ. Anyway, note that I have said "The MN in #714 could work if it defaults to the original read length". I can live with MN provided that the spec defines MM to be on the original sequence coordinate in the absence of MN/MH. |
Back from a meeting. If you want to go with MN, I would modify the spec to the following. In the text describing MM:
Then describe MN:
Consequence of this proposal: minimap2 stays the same. IGV needs to be updated. Users no longer need to clip MM. If a pipeline prefers to clip MM to reduce storage, which will be minor, it can implement a step to clip MM and add MN. Samtools can provide one as well. EDIT again: hmm... Looking at what I wrote above, I think the right solution is to just let MM defined on the original sequence without introducing new tags. This is not optimal storage-wise, but it is simple and clear. There will be confusion during the transition to this MM definition, but the confusion is unavoidable anyway. |
> detect when there are problems and SEQ / MM are not in sync
That is the fundamental difference between us. You assume MM and SEQ should be in sync, while I think MM and CIGAR should be in sync; as long as MM and CIGAR are synced, you can get correct methylation on SEQ.
That is incorrect. If MM uses a fundamental base of A, C, G or T then we need to know more than the sequence length. We need to know how many of that fundamental base was trimmed. This is why your proposal does not work, sorry.
|
The first statement is already in there. Some of the other bits are too (eg explicitly stating that the absence of MN means MM is considered to be valid - which is basically that SEQ is the same as the original produce), but it doesn't harm to rephrase the same thing in multiple ways. It's good to be explicit about where MM is invalid too.
I can live with that, although I'd say "at the time MM/ML were produced or updated" as we won't to promote the idea that MM/ML aren't immutable. I'd also ditch the CIGAR comment. It's simply not helpful IMO as in most (but not all) cases we need the sequence too to process MM. If we wish to include such a statement, then it should be fully explained. Eg "When a fundamental base of N is in use, the MN length may be validated against the sequence length computed from CIGAR". That permits us to keep MM working, in a restricted sense.
No one has ever claimed minimap2 needs to be updated. Whether or not you update it is entirely up to you, or to anyone else who happens to fork it. This proposal is simply a way of distinguishing software that does hard clipping within updating MM/ML from software that does hard clipping and understands MM/ML and updates them accordingly. For sorted data, such as emitted by minimap2, it should be quite easy to modify MM/ML/MN on hard-clipped data and to trim them, as we'll also have the unclipped data along side it. Thank you for the idea of adding a tool to do this in Samtools. It fits naturally in with the other fixes applied during
Not introducing any new tag and stating that it simply matches the original sequence (essentially what is in the spec already) is just tantamount to saying "it's sometimes ok and sometimes broke; shrug!" as we offer no way for the user to work out when it is valid and when it isn't. MN Is a solution to this and we'll just have to agree to disagree here. I'm not going to drop it unless a better alternative is proposed. |
Ah, I see your point. I agree to add MN. It would be better to have a new tag that defines offsets on all bases, but the ship has sailed. |
I did think about having that as a tag, but felt it probably wasn't justified. It's still better IMO if we can get a way to trim the MM tag back, especially when dealing with long sequences with potentially several short secondary matches. A post-processing tool while the data is still name collated would definitely be possible. |
I suspect @ArtRand will provide that in nanoporetech/modkit, but also seems like a natural thing to be in samtools (like fixmate, calmd, and friends). On the broader theme, I have not been able to come up with a more compelling solution than MN as recording the length of SEQ when MM was produced --- short of tearing it all down and starting from scratch. |
I'm also in favour of the MN tag, and agree that a tool that does streaming correction of the MM tag for hardclipped reads prior to position sorting is a good idea. |
What is the status of this tag? Our tools at ONT are now supporting this tag and I believe IGV is using this tag for validation now. |
The status is the committee is all in favour of it (even if we weren't as you say it's pretty much deployed already), and we recently realised that #714 has not actually been merged yet — and it's at the top of my review 'n' merge list. |
Sounds great. Thank you! Just wanted to make sure this one hadn't dropped off the radar. |
This is used as a sanity check on the validity of the MM and ML tags. It holds the length of SEQ at the time MM and ML were produced and/or updated. The intention is to provide a mechanism to detect hard-clipping has been performed with a tool that is not MM/ML aware. Fixes samtools#646
The
MM
tag describes the modification string relative to the sequencing read as generated by the instrument. For instance, consider this read, where only the 3rd C is modified:If the read is aligned end-to-end, the CIGAR, SEQ and modification tag would be
9M
,ACCTCGCCA
andMM:Z:C+m,2
. If the mapper has hard clipped the first two bases of the read however then the modification tag is the same but the CIGAR and SEQ will be2H7M
,CTCGCCA
. A naive interpretation of theMM
tag will indicate that the wrong base is modified (in this case, the 4th C from the original read).This situation is not currently handled by the MM tag specification. I would suggest discouraging the use of hard clips (e.g. suggest the
-Y
flag is used for minimap2) and saying something like "MM tags are invalid when the alignment contains hard clips".The text was updated successfully, but these errors were encountered: