Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phasing of first genotype #1113

Open
lczech opened this issue Jul 28, 2020 · 5 comments
Open

Phasing of first genotype #1113

lczech opened this issue Jul 28, 2020 · 5 comments

Comments

@lczech
Copy link

lczech commented Jul 28, 2020

Hi there,

according to the VCF Spec 4.2,

For one individual, each integer in the vector is organized as (allele + 1) << 1 | phased [...]

(emphasis mine)

However, looking at the implementation as well as some quick tests, it seems that the first genotype (e.g., the 0 in 0|1) never has the phased bit set, so that bcf_gt_is_phased() always returns false. It seems to me that at least for diploids, that does not make all too much sense, as knowing that the second genotype is phased also determines this for the first one, doesn't it?!

The VCF Spec however also has the example

0/1|2 is tetraploid with a single phased allele

In that case, it is not clear to me how to interpret phasing. Only the 2 is phased here, while the order of the 0 and 1 is not determined?

So, my questions here are:

  • Is it an oversight in the VCF standard that the first genotype cannot be marked as phased, or is that an oversight/bug in the implementation?
  • How would the poly-ploid case from above have to be handled/interpreted correctly?
  • Is there any "best practice" to test for phasing? Should I iterate all values for a given sample and output "phased" if all or if any of the bcf_gt_is_phased() calls return true?

Thanks and all the best
Lucas

@jkbonfield
Copy link
Contributor

Thank you for raising this issue. I am not a VCF maintainer and admit that I found this confusing too when I looked into it. However I see this very topic is in discussion already, with a proposal to fix it (possibly wrong? or exposing a difference in VCF and BCF?). I'm out of my depth really, but I'd recommend adding a users perspective to the samtools/hts-specs#421 PR.

@jkbonfield
Copy link
Contributor

Also, I note that "0 / 1 | 2" comment was there in the initial checkin of the initial VCF spec, so we have no more history to know where it came from and who authored it and whether this is what was intended. It was committed by @amarcket but maybe @pd3 or @lh3 know the origin of this too. Can the above named people cast their eyes over the PR mentioned above too?

@pd3
Copy link
Member

pd3 commented Jul 29, 2020

This was never explicitly defined. I am open to modifying the existing specification and htslib as mixed ploidy is not a widely used feature.

@jkbonfield
Copy link
Contributor

It does appear to be implicitly defined in BCF though as the spec indicates the | vs / affects the next value, not the previous one.

@pd3
Copy link
Member

pd3 commented Jul 29, 2020

Yes, and the examples 0|1 and 0/1|2 given in the specification are the only mention of this. The "phase follows the allele" notation seems more intuitive and VCF is the major format, an argument in favor of changing the specification. I am happy either way though, as long as the ambiguity is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants