Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expanding the Phasing notation for triploid+ genotypes #421

Closed

Conversation

yfarjoun
Copy link
Contributor

This is for one of the user stories in the Future of VCF.

@hts-specs-bot
Copy link

Changed PDFs as of 5d4080e: VCFv4.3 (diff).

@yfarjoun
Copy link
Contributor Author

@d-cameron could you review this?

VCFv4.3.tex Outdated
@@ -503,6 +507,10 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\item PQL (List of integers): The list of PQs one for each phase set in PSL (encoded like PQ)
\item PSL (List of non-negative 32-bit Integer): The list of PSs one for each pipe ($\mid$) in the GT field, specifying the phase set for the allele prior to the pipe.
A given sample-genotype should not have values for both PS and PSL.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth discussing what implications the "should not have" has for rows with multiple samples, some having PSL and some PS. I'd go with a stronger "must not have".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I'll update.

@jkbonfield jkbonfield added the vcf label Jun 18, 2019
@hts-specs-bot
Copy link

Changed PDFs as of 63bde13: VCFv4.3 (diff).

Copy link
Contributor

@d-cameron d-cameron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had some time to think about this and the phasing issue isn't as intractable as I'd initially thought.

Take the simplest case of a simple duplication:

1234567890 (ref position)
ATAGGTTCGC haplotype 1 (reference)
TTAGGTTCGGTTCGC haplotype 2 (variant)

The symbolic allele notation for this is trivial:

contig	1	snp	A	T	.	.		GT:PS	0|1:1
contig	4	dup	G	<DUP>	.	.	SVTYPE=DUP;SVLEN=5;END=8	GT:PS	0|1:1

Just switching to BND notation introduces aneuploidy:

contig	1	snp	A	T	.	.		GT:PSL	0|1:1,2
contig	4	bnd1	G	]contig:8]G	.	.	SVTYPE=BND;PARID=bnd2	GT:PSL	0|0|1|:1,2,2
contig	8	bnd2	C	C[contig:4[	.	.	SVTYPE=BND;PARID=bnd1	GT:PSL	0|1|0|:1,2,2

At a conceptual level, we can interpret a given PSL as a path through the derivate chromosome.
In the above trivial example, to traverse the duplication we take the ref path at bnd1, then
take the alt path at bnd2, return to bnd1 via the alt path, then traverse through bnd2 through
the ref path.

I believe this approach fully generalises to arbitrary paths through the graph if you follow the
convention that path traversal follows the allele for the given PSL according to their ordinal.
Explicitly add modulo wrap-around and SNVs in amplified regions can be represented as 0|1|:1,2
instead of having to write 0|1|1|1|1|1|1|1|1|:1,2,2,2,2,2,2,2,2 for every single CNV in an 8x amplified region.

Subclonality is still not addressed by this PR, but we can keep that as an independent issue.

VCFv4.3.tex Outdated
@@ -424,16 +427,17 @@ \subsubsection{Genotype fields}
No white-space or semi-colons permitted.
\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
\item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities.
\item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$.
\item GT (String): Genotype, encoded as allele values followed by either of $/$ or $\mid$.
The last separator may be ommitted if un-needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to explicitly state what implied value of the last separator is. 1/1 could be 1/1/ or it could be 1/1|. IMO it should default to the type of the final separator that is actually specified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm. I think that 0/1|2 means that the 0 & 2 are not phased and that only the 1 is phased. this is in agreement with 0|1/2 in which the 0 would be phased and the 1 & 2 are not. but I agree that 0|1|2 means that they are phased, so perhaps we say that if unspecified, the last haplotype is phased iff all the separators are | (which would also mean that in a mathematical sense a haploid is always phased...which is of course true)

VCFv4.3.tex Outdated
\item PQL (List of integers): The list of PQs one for each phase set in PSL (encoded like PQ)
\item PSL (List of non-negative 32-bit Integer): The list of PSs one for each pipe ($\mid$) in the GT field, specifying the phase set for the allele prior to the pipe.
A given sample-genotype must not have values for both PS and PSL.
However, they are interoperable, in that a PS mentioned in one variant can be references in a PSL in another.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not interoperable and can't be mixed. Take the following example:

contig	1	snp1	A	T	.	.		GT:PS	0|1:1
contig	2	snp2	A	G	.	.		GT:PSL	0/1|:1

The G at snp2 is phased with snp1, but we don't know if it's phased with the A or the T allele since PS applies to the set of all alleles whereas PSL applies to just an individual one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the conclusion that PS and PSL are not interoperable and can't be mixed. However, I believe the example is incorrect in showing PSL having a single value. According to the proposed definition, it must have two values.

VCFv4.3.tex Outdated
@@ -503,6 +507,10 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This convention will not work for PSL since you can be starting multiple PSL 'phase sets' at the same position.

VCFv4.3.tex Outdated
@@ -503,6 +507,10 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recommendation will also cause naming collisions since PSLs can span multiple chromsomes

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Oct 7, 2019

@d-cameron can you take over this PR?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Oct 18, 2019

If we get #442 merged, these could also be corrected in 4.4 to use P as their number.

Nope...I'd need the number of the phased alleles/haplotypes, not the ploidy.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Nov 3, 2019

due to the breaking change in the genotype field, I'm wondering if this should actually go into 4.4.

what do people think?

@hts-specs-bot
Copy link

Changed PDFs as of 3e0be24: VCFv4.3 (diff).

@yfarjoun
Copy link
Contributor Author

No-one cares about breaking 4.3? I'm happy to add this to 4.3 in that case...

@daviesrob
Copy link
Member

It looks like you've commented out the entire GT, PQL and PSL descriptions in the last commit. Was that intentional?

@lbergelson
Copy link
Member

Lets move this to 4.4 so we don't have the breaking change. Better to not rush it and not break anything.

@yfarjoun
Copy link
Contributor Author

There was a suggestion to make PSL have the same length as the GT field, and have "." to non-phased alleles.

@d-cameron mentioned in one of the conversations that there's a need to have a notation for NOT phase-set, meaning that one can tell that a particular allele is NOT phased with a phase-set but not specifying which phase set it is phased with. This feels like a can of worms...what if someone can tell that an allele is either setA or setB, should we also have a notation for that? what about a full language that can express dependencies between alleles ("allele 1 is setA or setB unless allele 2 is setC in which case allele1 is actually setD"....) That said, if there's an urgent need for this kind of expressivity, then we should consider providing it.

@hts-specs-bot
Copy link

Changed PDFs as of 0a99872: VCFv4.3 (diff).

@yfarjoun yfarjoun force-pushed the yf_high_ploidy_phase_indicator_vcf branch from 0a99872 to 0dd0712 Compare April 30, 2020 12:00
@hts-specs-bot
Copy link

Changed PDFs as of 0dd0712: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of e124af8: VCFv4.4 (diff).

VCFv4.4.tex Outdated
\begin{tabular}{ l l l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & SAMPLE1\\
chr19 & $5$ & . & T & G & . & PASS & DP=100 &GT:PSL & \tt{0|1:.,chr9*5*1}\\
chr20 & $10$ & . & A & T,G & . & PASS & DP=100 &GT:PSL & \tt{1|2/3|:chr20*10*1,.,chr9*5*1} \\
Copy link
Contributor

@jkbonfield jkbonfield Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an example of "1|2/3|" implies the binding of / and | is to the left. However the BCF spec indicates it is to the right so you'd need the extra symbol at the start and not the end.

From the table on page 32 of https://github.com/samtools/hts-specs/blob/master/VCFv4.3.pdf it shows "0 / 1 | 2" as 0x02 04 07 via the formula (allele+1)<<1 | phased. The third byte is the phased one, indicating the "|" is affecting the value to its right.

This really doesn't seem to be clearly defined in the spec, and is causing confusion. See samtools/htslib#1113 for an example of this very problem.

Copy link
Member

@pd3 pd3 Jul 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point and also the behavior of the reference implementation in htslib. I'd be open to modifying the existing specification and htslib since mixed ploidy is not a widely used feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no need...I agree that it's best to keep with the current example.

@hts-specs-bot
Copy link

Changed PDFs as of d1a742e: VCFv4.4 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of 09a30d7: VCFv4.4 (diff).

@yfarjoun
Copy link
Contributor Author

so, problem here is that the example vcf table ends up being too wide for the page....any advice?

@hts-specs-bot
Copy link

Changed PDFs as of 0427dbd: VCFv4.4 (diff).

@daviesrob
Copy link
Member

Maybe try making the example {verbatim} instead of {tabular}? And if all else fails, make the font smaller.

Although what we really need is someone with some major tex-fu to come up with something a bit like {verbatim} but that makes the tabs line up as for the table, and has them cut & paste as real tabs so you can copy the example and get something like real VCF format.

@jkbonfield
Copy link
Contributor

Change font size is easiest; \small or \footnotesize

There's also the tabto package which permits equal spaces or explicitly defined tab stops. It uses \tab though, but it's perhaps easier than table.

@d-cameron
Copy link
Contributor

@d-cameron mentioned in one of the conversations that there's a need to have a notation for NOT phase-set, meaning that one can tell that a particular allele is NOT phased with a phase-set but not specifying which phase set it is phased with.

This feels like a can of worms...

It is indeed quite messy but I think we need to work look out our design options for doing so. LINX is currently using non-standard fields form GRIDSS2 to actually do derivate chromosome reconstruction. Notably, during the construction of breakage-fusion-bridge rearrangements, SV phasing information is used including simultaneous cis and trans phasing of adjacent (amplified) SVs.

I'm looking at adding similar capabilities for Dragen SV/manta so it would be great to have this capability standardised - especially since long read sequencing contains exactly this sort of long range phasing information.

It still counts as two independent implementation even if they're both by me, right?

I'll try to come up with a design that's sufficiency expressive to handle my use cases, yet simple enough that the simple cases aren't an absolute mess to interpret.

Design goals:

  • Able to simulatenously phase variants as both cis & trans
    • cis/trans AF estimates (or CN?)
  • Able to specify long-range phasing information (ONT/bionano)
    • these data types contains approximate phased distance between DNA segments - possibly incorporate this explicitly?
  • define what we mean by phasing of SVs. cis=nearby; trans=distant. Trans phased SV can be on the same derivate chromosome

Doing this might also require addressing the issues that the current genotype fields have with aneuploidy.

d-cameron pushed a commit to d-cameron/hts-specs that referenced this pull request Jun 21, 2022
@d-cameron
Copy link
Contributor

Resurrecting this issue to incorporate into 4.4 and the fact that bundles aren't well defined has finally come up (#643) . The plan is:

  • Break BCF partial phasing; use suffix notation with final separate optional (phased if all others are phased)
  • PSL pretty much as-is
    -- PSL unique across VCF is very annoying to do incremental phasing but the string representation is rather verbose. Any preferences? I'd prefer numeric as I don't want yet another field separator character.
    -- Trans phasing of nearby SVs is just cis phasing of the ref (not sure why I didn't think of it in these terms earlier!)
    --- Example: 2 ref copies, and two more copies (each containing one DEL):
chrA 10 . T <DEL> . . SVLEN=5   GT:PSL 0|0|0|1:1,2,3,4
chrA 20 . T <DEL> . . SVLEN=10  GT:PSL 0|0|1|0:1,2,3,4

-- We can use CN to do the heavy lifting for aneuploid samples.
--- If we allow CN to be specified for any ALT allele, we don't need to duplicate any ALTs, and it provides an (optional) compressed GT representation at high copy number.

  • Need a PSO or PSI phase ordinal/index field which defines the order in which the variants are encountered.
    -- contained variants are apply to the modification. Example:
chrA 10 . A C,T                      GT:PSL:PSO 0/1|1|:0,1,1:0,1,3
chrA 5  . T <DUP> SVLEN=10;SVCLAIM=J GT:PSL:PSO 0/1|  :0,1  :0,2
-- 
This has the C and T SNVs phased with the <DUP>. The C SNV occurs on the first instance, the dup happens and we jump back to chrA:5, then the T occurs on the second instance of the dup.

Can anyone see any ambiguities in this ordering convention? I'm worried I've overlooked an edge case in which the 'direction' of traversal is unclear. Fix example, does missing SVs on circular chromosomes with foldback inversions break this model?




d-cameron added a commit that referenced this pull request Aug 23, 2022
Added PSO field to remove traversal ambiguity
Using preceding GT notation to match BCF
Added BCF clarification what to do with the missing first allele GT separator
Defined implicit GT separator based on the other separators
Removed absolete definition of bundles #643
@d-cameron
Copy link
Contributor

In 4.4

@d-cameron d-cameron closed this Nov 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants