Expanding the Phasing notation for triploid+ genotypes #421
@d-cameron could you review this? |
VCFv4.3.tex
Outdated
@@ -503,6 +507,10 @@ \subsubsection{Genotype fields} | |||
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. | |||
If the genotype in the GT field is unphased, the corresponding PS field is ignored. | |||
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). | |||
\item PQL (List of integers): The list of PQs one for each phase set in PSL (encoded like PQ) | |||
\item PSL (List of non-negative 32-bit Integer): The list of PSs one for each pipe ($\mid$) in the GT field, specifying the phase set for the allele prior to the pipe. | |||
A given sample-genotype should not have values for both PS and PSL. |
It might be worth discussing what implications the "should not have" has for rows with multiple samples, some having PSL and some PS. I'd go with a stronger "must not have".
thanks. I'll update.
I've had some time to think about this and the phasing issue isn't as intractable as I'd initially thought.
Take the simplest case of a simple duplication:
1234567890 (ref position)
ATAGGTTCGC haplotype 1 (reference)
TTAGGTTCGGTTCGC haplotype 2 (variant)
The symbolic allele notation for this is trivial:
contig 1 snp A T . . GT:PS 0|1:1
contig 4 dup G <DUP> . . SVTYPE=DUP;SVLEN=5;END=8 GT:PS 0|1:1
Just switching to BND notation introduces aneuploidy:
contig 1 snp A T . . GT:PSL 0|1:1,2
contig 4 bnd1 G ]contig:8]G . . SVTYPE=BND;PARID=bnd2 GT:PSL 0|0|1|:1,2,2
contig 8 bnd2 C C[contig:4[ . . SVTYPE=BND;PARID=bnd1 GT:PSL 0|1|0|:1,2,2
At a conceptual level, we can interpret a given PSL as a path through the derivative chromosome.
In the above trivial example, to traverse the duplication we take the ref path at bnd1, then
take the alt path at bnd2, return to bnd1 via the alt path, then pass through bnd2 via
the ref path.
I believe this approach fully generalises to arbitrary paths through the graph if you follow the
convention that path traversal follows the allele for the given PSL according to their ordinal.
If we explicitly add modulo wrap-around, SNVs in amplified regions can be represented as 0|1|:1,2 instead of having to write 0|1|1|1|1|1|1|1|1|:1,2,2,2,2,2,2,2,2 for every single CNV in an 8x amplified region.
Subclonality is still not addressed by this PR, but we can keep that as an independent issue.
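The path interpretation and the modulo wrap-around described above can be sketched as follows. This is a minimal illustration, not part of the proposal: the function names `alleles_by_phase_set` and `allele_on_visit` are hypothetical, the trailing separator is assumed to be written explicitly, and PSL is assumed to carry one entry per pipe as in the draft text.

```python
import re
from collections import defaultdict


def alleles_by_phase_set(gt, psl):
    """Group the phased alleles of a GT string by their PSL entry.

    Assumes every allele is followed by its separator (e.g. "0|0|1|")
    and that PSL holds one phase set per pipe, in GT order.
    """
    tokens = re.findall(r"(\d+|\.)([|/])", gt)
    phase_sets = iter(psl.split(","))
    grouped = defaultdict(list)
    for allele, separator in tokens:
        if separator == "|":  # only alleles followed by a pipe consume a PSL entry
            grouped[next(phase_sets)].append(allele)
    return dict(grouped)


def allele_on_visit(gt, psl, phase_set, visit):
    """Allele taken the visit-th time (0-based) the path of phase_set reaches this record.

    The modulo implements the wrap-around: once the listed alleles are
    exhausted, traversal starts again from the first one.
    """
    alleles = alleles_by_phase_set(gt, psl)[phase_set]
    return alleles[visit % len(alleles)]


# Duplication example: phase set 2 passes bnd1 and bnd2 twice each.
bnd1 = [allele_on_visit("0|0|1|", "1,2,2", "2", k) for k in range(2)]
bnd2 = [allele_on_visit("0|1|0|", "1,2,2", "2", k) for k in range(2)]

# SNV inside an 8x amplified region, written compactly as 0|1|:1,2.
snv = [allele_on_visit("0|1|", "1,2", "2", k) for k in range(8)]
```

Under these assumptions, phase set 2 takes ref then alt at bnd1 and alt then ref at bnd2, matching the traversal described above, and the compact SNV genotype yields the alt allele on all eight visits.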
VCFv4.3.tex
Outdated
@@ -424,16 +427,17 @@ \subsubsection{Genotype fields} | |||
No white-space or semi-colons permitted. | |||
\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant). | |||
\item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities. | |||
\item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$. | |||
\item GT (String): Genotype, encoded as allele values followed by either of $/$ or $\mid$. | |||
The last separator may be ommitted if un-needed. |
We need to explicitly state what the implied value of the last separator is. 1/1 could be 1/1/ or it could be 1/1|. IMO it should default to the type of the final separator that is actually specified.
Hmmm. I think that 0/1|2 means that the 0 and 2 are not phased and that only the 1 is phased. This is in agreement with 0|1/2, in which the 0 would be phased and the 1 and 2 are not. But I agree that 0|1|2 means that they are phased, so perhaps we say that, if unspecified, the last haplotype is phased iff all the separators are | (which would also mean that, in a mathematical sense, a haploid is always phased... which is of course true).
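The "phased iff all the separators are |" rule suggested here can be sketched as a small normalisation step. The function name `add_implied_separator` is hypothetical, and the sketch assumes the trailing-separator notation from this PR, where each allele is followed by its separator.

```python
def add_implied_separator(gt):
    """Append the implied trailing separator to a GT string.

    Rule sketched here: the last allele is phased ("|") only if every
    separator that is actually written is "|"; otherwise it is "/".
    A haploid genotype has no separators, so it comes out as phased.
    """
    if gt.endswith(("|", "/")):
        return gt  # trailing separator already explicit
    separators = [c for c in gt if c in "|/"]
    implied = "|" if all(s == "|" for s in separators) else "/"
    return gt + implied


examples = {gt: add_implied_separator(gt) for gt in ["1/1", "0|1|2", "0/1|2", "1"]}
```

With this rule, 0/1|2 becomes 0/1|2/, whereas the alternative suggested earlier in this thread (defaulting to the type of the final separator that is actually written) would give 0/1|2| instead.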
VCFv4.3.tex
Outdated
\item PQL (List of integers): The list of PQs one for each phase set in PSL (encoded like PQ) | ||
\item PSL (List of non-negative 32-bit Integer): The list of PSs one for each pipe ($\mid$) in the GT field, specifying the phase set for the allele prior to the pipe. | ||
A given sample-genotype must not have values for both PS and PSL. | ||
However, they are interoperable, in that a PS mentioned in one variant can be references in a PSL in another. |
They are not interoperable and can't be mixed. Take the following example:
contig 1 snp1 A T . . GT:PS 0|1:1
contig 2 snp2 A G . . GT:PSL 0/1|:1
The G at snp2 is phased with snp1, but we don't know if it's phased with the A or the T allele since PS applies to the set of all alleles whereas PSL applies to just an individual one.
I agree with the conclusion that PS and PSL are not interoperable and can't be mixed. However, I believe the example is incorrect in showing PSL having a single value. According to the proposed definition, it must have two values.
VCFv4.3.tex
Outdated
@@ -503,6 +507,10 @@ \subsubsection{Genotype fields} | |||
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. | |||
If the genotype in the GT field is unphased, the corresponding PS field is ignored. | |||
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). |
This convention will not work for PSL since you can be starting multiple PSL 'phase sets' at the same position.
VCFv4.3.tex
Outdated
@@ -503,6 +507,10 @@ \subsubsection{Genotype fields} | |||
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. | |||
If the genotype in the GT field is unphased, the corresponding PS field is ignored. | |||
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). |
This recommendation will also cause naming collisions since PSLs can span multiple chromosomes.
@d-cameron can you take over this PR? |
Nope...I'd need the number of the phased alleles/haplotypes, not the ploidy. |
due to the breaking change in the genotype field, I'm wondering if this should actually go into 4.4. what do people think? |
No-one cares about breaking 4.3? I'm happy to add this to 4.3 in that case... |
It looks like you've commented out the entire |
Let's move this to 4.4 so we don't have the breaking change. Better to not rush it and not break anything. |
There was a suggestion to make PSL have the same length as the GT field, and have "." to non-phased alleles. @d-cameron mentioned in one of the conversations that there's a need to have a notation for NOT phase-set, meaning that one can tell that a particular allele is NOT phased with a phase-set but not specifying which phase set it is phased with. This feels like a can of worms...what if someone can tell that an allele is either setA or setB, should we also have a notation for that? what about a full language that can express dependencies between alleles ("allele 1 is setA or setB unless allele 2 is setC in which case allele1 is actually setD"....) That said, if there's an urgent need for this kind of expressivity, then we should consider providing it. |
0a99872
to
0dd0712
Compare
VCFv4.4.tex
Outdated
\begin{tabular}{ l l l l l l l l l l} | ||
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & SAMPLE1\\ | ||
chr19 & $5$ & . & T & G & . & PASS & DP=100 & GT:PSL & \tt{0|1:.,chr9*5*1}\\ | ||
chr20 & $10$ & . & A & T,G & . & PASS & DP=100 & GT:PSL & \tt{1|2/3|:chr20*10*1,.,chr9*5*1} \\ |
Using an example of "1|2/3|" implies the binding of / and | is to the left. However the BCF spec indicates it is to the right so you'd need the extra symbol at the start and not the end.
From the table on page 32 of https://github.com/samtools/hts-specs/blob/master/VCFv4.3.pdf it shows "0 / 1 | 2" as 0x02 04 07 via the formula (allele+1)<<1 | phased. The third byte is the phased one, indicating the "|" is affecting the value to its right.
This really doesn't seem to be clearly defined in the spec, and is causing confusion. See samtools/htslib#1113 for an example of this very problem.
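The right-binding behaviour of the BCF encoding can be checked with a short sketch of the formula quoted above. The function name `encode_gt_bcf` is hypothetical, and only numeric alleles are handled here; missing alleles and the vector-end padding used by BCF are left out.

```python
import re


def encode_gt_bcf(gt):
    """Encode a textual GT as BCF genotype values using (allele+1)<<1 | phased.

    Each allele is paired with the separator that precedes it, so a "|"
    sets the phase bit of the allele to its right. The first allele has
    no preceding separator and is encoded with phased = 0.
    """
    tokens = re.findall(r"([|/]?)(\d+)", gt.replace(" ", ""))
    return [((int(allele) + 1) << 1) | (1 if separator == "|" else 0)
            for separator, allele in tokens]


encoded = encode_gt_bcf("0 / 1 | 2")
as_hex = " ".join(f"{value:02x}" for value in encoded)
```

For "0 / 1 | 2" this gives the values 2, 4 and 7, i.e. 02 04 07 as in the table: the phase bit lands on the allele that follows the pipe, not the one before it.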
This is a good point and also the behavior of the reference implementation in htslib. I'd be open to modifying the existing specification and htslib since mixed ploidy is not a widely used feature.
there's no need...I agree that it's best to keep with the current example.
so, problem here is that the example vcf table ends up being too wide for the page....any advice? |
Maybe try making the example Although what we really need is someone with some major tex-fu to come up with something a bit like |
Change font size is easiest; There's also the tabto package which permits equal spaces or explicitly defined tab stops. It uses \tab though, but it's perhaps easier than table. |
It is indeed quite messy, but I think we need to work out our design options for doing so. LINX is currently using non-standard fields from GRIDSS2 to actually do derivative chromosome reconstruction. Notably, during the construction of breakage-fusion-bridge rearrangements, SV phasing information is used, including simultaneous cis and trans phasing of adjacent (amplified) SVs. I'm looking at adding similar capabilities for Dragen SV/manta, so it would be great to have this capability standardised, especially since long read sequencing contains exactly this sort of long range phasing information. It still counts as two independent implementations even if they're both by me, right? I'll try to come up with a design that's sufficiently expressive to handle my use cases, yet simple enough that the simple cases aren't an absolute mess to interpret. Design goals:
Doing this might also require addressing the issues that the current genotype fields have with aneuploidy. |
Resurrecting this issue to incorporate it into 4.4, since the fact that bundles aren't well defined has finally come up (#643). The plan is:
-- We can use
|
Added PSO field to remove traversal ambiguity
Using preceding GT notation to match BCF
Added BCF clarification what to do with the missing first allele GT separator
Defined implicit GT separator based on the other separators
Removed obsolete definition of bundles #643
In 4.4 |
This is for one of the user stories in the Future of VCF.