Add option to maintain N in umi #907

JoeVieira · 2023-04-07T23:01:52Z

Add flag N to disable filtering of reads whose UMIs contain non-ATCG bases.
Added a test case as well.

This was discussed in issue #906

nh13

Looking good so far. Thank you for contributing!

src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala

JoeVieira · 2023-04-11T19:49:36Z

Thanks for the feedback! Sorry I missed the usage doc.
Please let me know if the docs are not clear enough.

nh13

One minor tweak. @tfenne can you also take a look?

src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala

JoeVieira · 2023-05-22T15:20:42Z

@tfenne sorry to bother - but could I get this reviewed soon? Thanks in advance.

JoeVieira · 2023-09-29T16:13:48Z

@tfenne since I have your attention, could we wrap this one up too? =)

tfenne · 2023-09-29T19:52:24Z

So ... I think there's an issue with this. It's a bit of an edge case, but when running with --strategy paired I think you could in theory create a graph of UMIs that are e.g. ACGT -> ACGN -> ACNN -> ANNN -> NNNN, and that doesn't seem good.

Similarly if you have a bunch of failed UMI reads that are all Ns it will treat the Ns as matches to each other and create groups, when you have no evidence of what the UMI actually was.

What do you think about:

1. When allowing Ns still filter out UMIs that contain more than edits Ns?
2. Changing it from includeNs: boolean to maxNsInUmis: Int = 0, so that users can decide how many Ns are ok?
3. Change the matching logic to say that N mismatches everything (including other Ns) so that the edit threshold is breached when there are too many Ns

I think I like (2) the best as it's explicit and (3) might cause a performance problem when we have lots of reads at a given position.
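To illustrate the chaining concern and option (3), here is a minimal sketch. It is not fgbio's actual matching code: the object and method names (`UmiNChaining`, `plainMismatches`, `nAwareMismatches`, `linkDistances`) are hypothetical, and it assumes equal-length, upper-case UMIs compared position by position.

```scala
object UmiNChaining {
  /** Plain mismatch count: N is treated as an ordinary character, so N matches N. */
  def plainMismatches(a: String, b: String): Int = {
    require(a.length == b.length, "UMIs must be the same length")
    a.zip(b).count { case (x, y) => x != y }
  }

  /** Option (3): an N in either UMI counts as a mismatch, even against another N. */
  def nAwareMismatches(a: String, b: String): Int = {
    require(a.length == b.length, "UMIs must be the same length")
    a.zip(b).count { case (x, y) => x != y || x == 'N' || y == 'N' }
  }

  /** Distance between each consecutive pair of UMIs in a chain. */
  def linkDistances(chain: Seq[String], dist: (String, String) => Int): Seq[Int] =
    chain.zip(chain.tail).map { case (a, b) => dist(a, b) }

  def main(args: Array[String]): Unit = {
    val chain = Seq("ACGT", "ACGN", "ACNN", "ANNN", "NNNN")
    println(linkDistances(chain, plainMismatches))  // List(1, 1, 1, 1)
    println(linkDistances(chain, nAwareMismatches)) // List(1, 2, 3, 4)
  }
}
```

With an edit threshold of 1, every link in the chain is within the threshold under plain counting, so ACGT can end up connected to NNNN. When N mismatches everything, only the first link survives, and two all-N UMIs are a full UMI length apart instead of identical.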

JoeVieira · 2023-09-29T20:24:29Z

Yes, that case 100% exists. Glad you're bringing this up.
A couple additional thoughts.

1.) I think UMIs that are entirely N need filtering regardless (they are just nonsense).

2.) I personally think that your option 3 is the most internally consistent option. While option 2 is explicit, it's a bit of an odd usage and leaves a fair amount of room for error (it precludes my point 1, and an explicit setting would still allow the issue you enumerated).

Alternatively (which is what I did in this PR), assume that if a case like this occurs, one has very bad data and it's "use at your own risk". In my testing with good (high-input germline) data and hard (lower-input somatic-type) data, this case hasn't occurred (of course it could).

I'd be happy to do any of these (other than your option 1; I think that's an odd conflation of parameters), but I do prefer what I laid out above, assuming it doesn't hurt performance.

Does that sound good to you?
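The two filtering ideas discussed above can be contrasted in a short sketch. This is an illustration only, not the code in this PR: `UmiNFilters`, `isAllNs`, and `exceedsMaxNs` are hypothetical names, and it assumes upper-case UMIs with `-` separating the two halves of a paired UMI.

```scala
object UmiNFilters {
  /** True when every base is N; '-' separators between paired UMIs are ignored.
    * An empty UMI (or one holding only separators) is not treated as all-N here. */
  def isAllNs(umi: String): Boolean = {
    val bases = umi.filter(_ != '-')
    bases.nonEmpty && bases.forall(_ == 'N')
  }

  /** Option (2): reject UMIs that contain more than maxNs N bases. */
  def exceedsMaxNs(umi: String, maxNs: Int): Boolean =
    umi.count(_ == 'N') > maxNs

  def main(args: Array[String]): Unit = {
    println(isAllNs("NNNN-NNNN"))      // true
    println(isAllNs("ACGN-NNNN"))      // false
    println(exceedsMaxNs("NNNN", 4))   // false: a permissive maxNs lets an all-N UMI through
  }
}
```

The last call shows why an explicit count does not by itself cover the all-N case: a user who sets the limit to the UMI length would keep reads whose UMI carries no information.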

JoeVieira · 2023-09-29T23:03:58Z

@tfenne updated per ideas above - lmk what you think.

JoeVieira added 3 commits March 28, 2023 19:01

Simple flag for allowing non-atcg base chars to be present in UMI strings.

10f5fe4

Include UMIs which contain non-ATCG bases (N) in output. documentation for 'N' argument

5be3e22

More comprehensive tests for UMIs containing Ns

d6b314f

nh13 requested changes Apr 11, 2023

src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala (outdated)

JoeVieira added 2 commits April 11, 2023 15:37

Adjust docstring & argument value for inclusion of Ns to better reflect the specific implementation.

746a3c6

update usage docs to include relevant information about N flag & length counting of N's

685e4d4

JoeVieira requested a review from nh13 April 11, 2023 19:48

nh13 approved these changes Apr 11, 2023

src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala (outdated)

nh13 requested a review from tfenne April 11, 2023 23:15

formatting fix for docstrings

cf5f1bc

JoeVieira changed the title ~~Add option to maintain non-ATCG in umi~~ Add option to maintain N in umi Apr 12, 2023

nh13 added the needs review label Aug 16, 2023

JoeVieira added 5 commits September 29, 2023 17:32

filter reads which have UMIs only containing N's

e28bd7b

cover "-" cases also

efda09e

If UMI is all N's ( and - ) filter it out & log that.

a57c902

add test to ensure N is treated as mismatch

3a10e7b

count N in either position as a mismatch

e2e2049
