-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "P" special Number character for FORMAT #442
Conversation
Specifies that the number of values is the sample ploidy (i.e. the number of GT entries).
I agree that any field with "ploidy cardinality" at the moment needs And the problem with putting it in INFO is that not all samples may have the same ploidy, for instance human male/female samples on chrX. Can we come up with other real-world use cases for this? |
The Allele/haplotype frequencies that I'm suggesting here are a property of the sample. This totally makes sense in the case of tumour samples (as suggested in the example that I gave) where the allele frequencies are usually unknown (e.g. due to subclonality, tumour impurity, copy-number variation), or other such samples where unknown allele frequencies are unknown and vary between samples (e.g. polyclonal bacterial or viral samples). |
This seems like a good idea although I'm not sure it will see a lot of use. The example you gave is a bit strange because the numbers you specified would be better represented with an A count. I see the point about haplotype frequency in somatic samples though. Are there any tools that can generate that? We do know of at least one additional use case as well. The proposed PSL and PSQ tags in #421 are haplotype specific phasing information and have this count but they didn't define a new ordering for it. This would be a breaking change, existing parsers would not understand this. So I think it would probably have to go into 4.4 instead of 4.3. |
@lbergelson Allele count doesn't work if you want to refer to each unique haplotype. For example, you might have something like
Now there are three unique haplotypes in sample Octopus generates genotypes like this for somatic samples and infers frequencies for each haplotype. Hence my interest in seeing this added. |
@dancooke Right. I agree allele doesn't work if you are referring to haplotypes. I was just saying the example wasn't that convincing because it seems like it could be better represented with just the allele information. I'm in favor of this idea. We need a way of representing per haplotype information and supporting this is not a difficult change. I'm just not certain if it should go into 4.3. |
Rather than reporting a single statistic (or pair), report statistics for each called haplotype. This addresses the issue faced in issue #81. TODO: update header rows once the VCF spec adds special Number "P" (see samtools/hts-specs#442).
I agree that if added, it should go into the next version. I am not prinicipally opposed, but unsure if another special |
@pd3 It won't gain momentum if it's not in the spec... I'd like to have Octopus generating VCFs using this but am currently forced to use |
@dancooke It sometimes takes surprisingly long for libraries and programs to adopt new versions of VCF, in fact, some never do. I witness it firsthand, it's a problem for both users and developers. To explain better what I meant: adopting a breaking feature like this may not be such a big win for Octopus because it may end up generating VCFs that other programs refuse to work with. There must be a strong support to implement the feature. On the other hand, if you use the Hence my cautious support for this addition - if requests like this really become more frequent as you are predicting. |
@pd3 I understand your concern, but I think if people get used to using |
@dancooke My suggestion was to give it more time to get some feeling for how widely this would be used and what benefits In general, the more unnecessary breaking changes are introduced, the less likely it is that the new version of the format gets adopted. In past I forced changes to BCF which made life more difficult for everyone, and although I still believe they were necessary, it certainly took a toll on adoption of the format. |
I agree with @pd3. It would be good to have the feature at the beginning of VCF, but adding it now brings more troubles than benefit. Furthermore, if I were an Octopus developer, I wouldn't consider "P" anyway. Doing that would only make most existing tools reject Octopus VCF. |
I think this is a simple enough change that there's no reason to keep it out of a new version of vcf due to difficulty of adoption. t would need to go in vcf 4.4 and as a developer I probably wouldn't start emitting it until 4.4 exists, but I think it would be worthwhile. We haven't had much need for something like this because support for phasing has been very limited and spotty. We already have at least 1 other active specs PR that would benefit from this right now though. With new long read technology we're going to be seeing more tools generate phased data and needing a way to associate values with each haplotype so it seems likely that this will be increasingly used if we ratify it. |
Features shouldn't be added to the specification just because they are easy to add, but because they add new functionality with benefits. In this thread, I've been hoping to hear more about the latter. So far, nothing has been said that couldn't be addressed with the existing I can imagine that with the proposed convention, phasing tools can automatically rearrange |
I agree that we shouldn't add features that have no use just because they are easy, but I think moderately useful features with low implementation cost should be a much easier sell than moderately useful features with very difficult implementation cost. |
This is a good point. However, phasing in the long read era will be very different from what we are thinking now. It is not worth breaking forward compatibility for hypothetical operations we haven’t implemented in htslib or htsjdk. I don’t hold a strong opinion on this, though, mostly because bcftools silently ignores unrecognized Number anyway. |
NOT true...the length of the PSL is the number of phased alleles in the GT field...not the length of the GT field. |
@yfarjoun Would it make sense to have PSL correspond to the length of GT and explicitly specify |
yes...that's a good idea. moving it to that thread. |
If this feature is designed to support somatic samples, shouldn't we include support for subclone at the same time? Currently everything is designed around pure samples with integer copy numbers for everything. |
Sure. In what way does it not? Perhaps ploidy should just mean the number
of entries in the gt field and then domestic entries can use it regardless
of the copy number associated with the allele
…On Mon, Feb 24, 2020 at 11:01 PM Daniel Cameron ***@***.***> wrote:
If this feature is designed to support somatic samples, shouldn't we
include support for subclone at the same time? Currently everything is
designed around pure samples with integer copy numbers for everything.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#442?email_source=notifications&email_token=AAU6JUXJNPLRBKXCKT4XMPDREQYT5A5CNFSM4ISLUH2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMZQNHA#issuecomment-590546588>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAU6JUSTPTTFASSDZPAC5DDREQYT5ANCNFSM4ISLUH2A>
.
|
Seems reasonable. We'd have to make CN a float instead of a int but that's forr aanother PR. |
I'm confused where this conversation stands, but I'd like to add another vote in favor of Number=P for format fields. I'm from the lab which supports the GangSTR tandem repeat caller. Currently we output the length of the corresponding allele for each called haplotype, and can't represent the number of integers in this format field appropriately without Number=P. That use-case is for convenience, and could be handled by just Number=R fields and some lookups. However, we may want to emit a quality metric for each haplotype call. I don't see how the current system can support that without Number=. and the exact same thing, just by less-than-clear convention instead of specification. Also, a lot of the talk on this page is with regards to somatic variation. The use-case we want this for is simpler: 2/1 calls of X chromosomes and 0/1 calls of Y chromosomes in women/men. That seems like a fundamental use-case that should be well supported. |
Unfortunately, even human sex chromosome ploidy is messier than you describe with ~0.5% of individuals neither XX nor XY. When running across thousands of samples, you need to be able to handle this. Similarly, if/when GangSTR decides to expand it's scope to call somatic STRs (since MSI is important in cancer genomics), then you'll likely consider somatic analysis to be a common use case. A well designed specifications makes common uses cases easy, and uncommon use cases possible. If we don't consider all potential use cases when designing a feature, we end up in the situation in which yet another file format is required for the scenarios that VCF doesn't handle. We're in that situation right now for aneuploidy & subclonality - many CNV callers write their own custom file format. |
Number=P has been added to VCF 4.5 |
Specifies that the number of values is the sample ploidy (i.e. the number of GT entries).
One possible example use case is to specify the frequency of alleles (or haplotypes):