Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to maintain N in umi #907

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

JoeVieira
Copy link
Contributor

@JoeVieira JoeVieira commented Apr 7, 2023

Add flag N to disable filtering of reads which contain non-ATCG containing UMIs
Created test case as well.

This was discussed in issue #906

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far. Thank-you for contributing!

@JoeVieira JoeVieira requested a review from nh13 April 11, 2023 19:48
@JoeVieira
Copy link
Contributor Author

JoeVieira commented Apr 11, 2023

Thanks for the feedback! Sorry I missed the usage doc.
Please let me know if the docs are not clear enough.

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor tweak. @tfenne can you also take a look?

@nh13 nh13 requested a review from tfenne April 11, 2023 23:15
@JoeVieira JoeVieira changed the title Add option to maintain non-ATCG in umi Add option to maintain N in umi Apr 12, 2023
@JoeVieira
Copy link
Contributor Author

@tfenne sorry to bother - but could I get this reviewed soon? Thanks in advance.

@JoeVieira
Copy link
Contributor Author

JoeVieira commented Sep 29, 2023

@tfenne since i've your attention - could we wrap this one up too =)?

@tfenne
Copy link
Member

tfenne commented Sep 29, 2023

So ... I think there's an issue with this. It's a bit of an edge case, but when running with --strategy paired I think you could in theory create a graph of UMIs that are e.g. ACGT -> ACGN -> ACNN -> ANNN -> NNNN, and that doesn't seem good.

Similarly if you have a bunch of failed UMI reads that are all Ns it will treat the Ns as matches to each other and create groups, when you have no evidence of what the UMI actually was.

What do you think about:

  1. When allowing Ns still filter out UMIs that contain more than edits Ns?
  2. Changing it from includeNs: boolean to maxNsInUmis: Int = 0, so that users can decide how many Ns are ok?
  3. Change the matching logic to say that N mismatches everything (including other Ns) so that the edit threshold is breached when there are too many Ns

I think I like (2) the best as it's explicit and (3) might cause a performance problem when we have lots of reads at a given position.

@JoeVieira
Copy link
Contributor Author

JoeVieira commented Sep 29, 2023

Yes, that case 100% exists. Glad you're bringing this up.
A couple additional thoughts.

1.) I think umi's which are fully N need filtering regardless ( they are just non-sense )

2.) I personally think that your option 3 is the most internally consistent option, while 2 is explicit it's a bit of an odd usage & puts a fair amount of error possibility into the equation ( it precludes my point number 1 & also explicit would allow the issue you enumerated )

Alternatively ( which is what i did in this PR ) assume that if a case like this occurs one has very bad data & it's a "use at your own risk" In my testing with good ( high input germline data ) and hard ( lower input somatic type data ) this case hasn't occured ( of course it could )

i'd be happy to do any of these ( other than your 1, i think that's an odd conflation of parameters ), but do prefer what i laid out above assuming it's not performance encumbering.

Does that sound good to you?

@JoeVieira
Copy link
Contributor Author

@tfenne updated per ideas above - lmk what you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants