Add option to maintain N in umi #907
Looking good so far. Thank-you for contributing!
…ct the specific implementation.
…th counting of N's
Thanks for the feedback! Sorry I missed the usage doc.
One minor tweak. @tfenne can you also take a look?
@tfenne sorry to bother, but could I get this reviewed soon? Thanks in advance.
@tfenne since I have your attention, could we wrap this one up too? =)
So ... I think there's an issue with this. It's a bit of an edge case, but when running with [...]. Similarly, if you have a bunch of failed UMI reads that are all Ns, it will treat the Ns as matches to each other and create groups when you have no evidence of what the UMI actually was. What do you think about: [...]
I think I like (2) the best, as it's explicit, and (3) might cause a performance problem when we have lots of reads at a given position.
Yes, that case 100% exists. Glad you're bringing this up.

1.) I think UMIs which are fully N need filtering regardless (they are just nonsense).
2.) I personally think that your option 3 is the most internally consistent option. While 2 is explicit, it's a bit of an odd usage and puts a fair amount of error possibility into the equation (it precludes my point number 1, and being explicit would still allow the issue you enumerated).

Alternatively (which is what I did in this PR), assume that if a case like this occurs one has very bad data and it's "use at your own risk". In my testing with good (high-input germline) data and hard (lower-input somatic-type) data this case hasn't occurred (of course it could).

I'd be happy to do any of these (other than your 1, I think that's an odd conflation of parameters), but do prefer what I laid out above, assuming it's not performance encumbering. Does that sound good to you?
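To make the edge case concrete, here is a minimal sketch in Python. It is not the project's Scala code and not what this PR implements. The function names `is_all_n` and `mismatches`, the `n_is_mismatch` parameter, and the use of `-` as a separator for paired UMIs are all assumptions made for illustration. It shows two ideas from the discussion: dropping UMIs that are entirely N, and optionally counting any position involving an N as a mismatch so that two all-N UMIs are never treated as identical.

```python
def is_all_n(umi: str) -> bool:
    """Return True when every base of the UMI is N, i.e. it carries no information.

    Assumption: a '-' may separate the two halves of a paired UMI and is ignored.
    """
    bases = umi.replace("-", "")
    return len(bases) > 0 and all(b in "Nn" for b in bases)


def mismatches(a: str, b: str, n_is_mismatch: bool) -> int:
    """Count mismatching positions between two equal-length UMIs.

    With n_is_mismatch=False this is a plain character comparison, so N vs N
    counts as a match (the behaviour the edge case is about).
    With n_is_mismatch=True any position where either base is N is a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    count = 0
    for x, y in zip(a.upper(), b.upper()):
        if n_is_mismatch and (x == "N" or y == "N"):
            count += 1
        elif x != y:
            count += 1
    return count


# Hypothetical UMIs observed at one position.
umis = ["ACGT", "ACGN", "NNNN", "NNNN"]
kept = [u for u in umis if not is_all_n(u)]  # all-N UMIs are filtered out
```

Under the plain comparison the two `NNNN` reads are zero mismatches apart and would be grouped together; filtering them first, or counting N as a mismatch, avoids that at the cost of discarding or splitting reads with partially called UMIs.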
@tfenne updated per ideas above - lmk what you think.
Add flag N to disable filtering of reads whose UMIs contain non-ATCG bases. Created a test case as well.
This was discussed in issue #906