
Commit

Incorporated @yfarjoun #421 into VCFv4.4
Added PSO field to remove traversal ambiguity
Using preceding GT notation to match BCF
Added BCF clarification what to do with the missing first allele GT separator
Defined implicit GT separator based on the other separators
Removed obsolete definition of bundles #643
d-cameron committed Aug 23, 2022
1 parent e1acf3f commit 008387a
Showing 1 changed file with 57 additions and 62 deletions.
119 changes: 57 additions & 62 deletions VCFv4.4.draft.tex
@@ -489,6 +489,10 @@ \subsubsection{Genotype fields}
PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
PQ & 1 & Integer & Phasing quality \\
PS & 1 & Integer & Phase set \\
PSL & P & String & Phase set list \\
PSO & P & Integer & Phase set list ordinal \\
PSQ & P & Integer & Phase set list quality \\
\end{longtable}
\begin{itemize}
@@ -503,17 +507,18 @@ \subsubsection{Genotype fields}
No whitespace or semicolons permitted.
\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
\item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities.
\item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$.
The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.
For diploid calls examples could be $0/1$, $1\mid0$, or $1/2$, etc.
Haploid calls, e.g.\ on Y, male non-pseudoautosomal X, or mitochondrion, are indicated by having only one allele value.
A triploid call might look like $0/0/1$.
If a call cannot be made for a sample at a given locus, `.' must be specified for each missing allele in the GT field (for example `$./.$' for a diploid genotype and `.' for haploid genotype).
The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes):
\begin{itemize}
\item $/$ : genotype unphased
\item $\mid$ : genotype phased
\end{itemize}
\item GT (String): Genotype, encoded as allele values, each preceded by either $/$ or $\mid$ depending on whether that allele is considered phased.
The first separator may be omitted, in which case it is implicitly defined as $/$ if any other separator is $/$, and as $\mid$ otherwise.
The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT and so on.
For diploid calls examples could be $0/1$, $1\mid0$, $/0/1$, or $1/2$, etc.
Haploid calls, e.g.\ on Y, male non-pseudoautosomal X, or mitochondria, should be indicated by having only one allele value.
A triploid call might look like $0/0/1$, and a partially phased triploid call could be $\mid0/1/2$ to indicate that the first allele is phased with another variant in the VCF.
If a call cannot be made for a sample at a given locus, `$.$' must be specified for each missing allele in the {\tt GT} field (for example `$./.$' for a diploid genotype and `$.$' for haploid genotype).
The meanings of the separators are as follows (see the {\tt PS} and {\tt PSL} fields below for more details on incorporating phasing information into the genotypes):
\begin{itemize}
\item $/$ : preceding allele is unphased
\item $\mid$ : preceding allele is phased (according to the phase-set indicated in {\tt PS} or {\tt PSL})
\end{itemize}
\item GL (Float): Genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
In presence of the GT field the same ploidy is expected; without GT field, diploidy is assumed.
@@ -583,6 +588,45 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\item PSL (List of Strings): The list of phase sets, one for each allele specified in the {\tt GT}.
Unphased alleles (without a $\mid$ separator before them) must have the value `$.$' in their corresponding position in the list.
Unlike {\tt PS} (which is defined per CHROM), records with different CHROM but the same phase-set name are considered part of the same phase set.
If an implementation cannot guarantee uniqueness of phase-set names across the VCF (for example, when phasing a streaming VCF or when each CHROM is processed independently in parallel), new phase-set names should be of the format CHROM*POS*ALLELE-NUMBER of the ``first'' allele which is included in this set, with ALLELE-NUMBER being the index of the allele in the {\tt GT} field, since multiple distinct phase-sets could start at the same position.\footnote{The `*' character is used as a separator since `:' is not reserved in the CHROM column.}
A given sample-genotype must not have values for both PS and PSL.
In addition, PS and PSL are not interoperable: a PS mentioned in one variant cannot be referenced in a PSL in another, since a phase set used in PS is not connected to any specific haplotype (i.e.\ first or second), whereas one used in PSL is.
Example:
\vspace{0.5em}
\begin{tabular}{ l l l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & SAMPLE1\\
chr19 & $5$ & . & T & G & . & PASS & DP=100 &GT:PSL & \tt{|0/1:chr9*5*1,.}\\
chr20 & $10$ & . & A & T,G & . & PASS & DP=100 &GT:PSL & \tt{|1/2|3:chr20*10*1,.,chr9*5*1} \\
chr20 & $15$ & . & G & C & . & PASS & DP=100 &GT:PSL & \tt{1|2:.,chr20*10*1}\\
\end{tabular}
\item PSO (List of integers): List of phase set ordinals.
For each phase-set name, PSO defines the order in which variants are encountered when traversing a derivative chromosome.
The missing value `$.$' should be used when the corresponding PSO value is missing.
For each phase-set name, PSO should be defined if any allele with that phase-set name on any record is a symbolic structural variant or is in breakpoint notation.
Variants in breakpoint notation must have the same PSL and PSO on both records.
Without explicitly specifying the derivative chromosome traversal order, multiple derivative chromosome reconstructions are possible.
Take for example this tandem duplication in a triploid organism with SNVs (ID/QUAL/FILTER columns removed for clarity):
\vspace{0.5em}
\begin{tabular}{ l l l l l l l l l l}
\#CHROM & POS & REF & ALT & INFO & FORMAT & SAMPLE1\\
chr1 & $10$ & T & $<$DUP$>$ & SVCLAIM=DJ & GT:PSL:PSO & \tt{/0/0|1:.,.,chr1*10*1:.,.,3}\\
chr1 & $20$ & A & G & . & GT:PSL:PSO & \tt{/0/0|0|1:.,chr1*10*1:.,.,4,1} \\
chr1 & $30$ & G & T & . & GT:PSL:PSO & \tt{/0/0|0|1:.,chr1*10*1:.,.,2,5} \\
\end{tabular}
Without PSO, it would be ambiguous as to which copy of the duplicated region the SNVs occur on.
In this example, the presence of the PSO field clarifies that the SNVs are cis phased with the duplication, the first SNV occurs on the first copy of the duplicated region, and second SNV on the second copy.
\item PSQ (List of integers): The list of PQs, one for each phase set in PSL (encoded like PQ).
The missing value `$.$' should be used when the corresponding PSL value is missing, or when the phasing is of unknown quality.
\end{itemize}
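The preceding-separator GT notation and its pairing with PSL can be illustrated with a small parser. This is a minimal sketch, not part of the specification: the function names \texttt{parse\_gt} and \texttt{pair\_psl} are hypothetical, and the handling of a haploid call with no separator (treated as phased) is only a literal reading of the implicit-separator rule above.

```python
import re

def parse_gt(gt):
    """Return [(allele, phased), ...] for a GT string; allele is None for '.'."""
    # Each allele is a run of digits or '.', optionally preceded by '/' or '|'.
    tokens = re.findall(r"([/|]?)(\d+|\.)", gt)
    if not tokens or "".join(sep + allele for sep, allele in tokens) != gt:
        raise ValueError(f"malformed GT: {gt!r}")
    seps = [sep for sep, _ in tokens]
    # Only the first separator may be omitted.
    if "" in seps[1:]:
        raise ValueError(f"missing separator in GT: {gt!r}")
    # Implicit first separator: '/' if any other separator is '/', else '|'.
    if seps[0] == "":
        seps[0] = "/" if "/" in seps[1:] else "|"
    return [(None if allele == "." else int(allele), sep == "|")
            for sep, (_, allele) in zip(seps, tokens)]

def pair_psl(gt, psl):
    """Pair each GT allele with its phase-set name (None for '.')."""
    alleles = parse_gt(gt)
    names = psl.split(",")
    if len(names) != len(alleles):
        raise ValueError("PSL must have one entry per GT allele")
    paired = []
    for (allele, phased), name in zip(alleles, names):
        # Unphased alleles must carry '.' in their PSL position.
        if not phased and name != ".":
            raise ValueError("unphased allele with a phase-set name")
        paired.append((allele, None if name == "." else name))
    return paired
```

For the first row of the PSL example, \texttt{pair\_psl("|0/1", "chr9*5*1,.")} pairs allele 0 with phase set \texttt{chr9*5*1} and leaves allele 1 without a phase set.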
@@ -1541,57 +1585,6 @@ \subsubsection{Clonal derivation relationships}
In the case of the duplication of a region within a haplotype, one copy retains the original haplotype identifier, and the others are considered to be novel haplotypes with their own unique identifiers.
All these novel haplotypes have in common their \textbf{haplotype ancestor} in the parent genome.
\subsubsection{Phasing adjacencies in an aneuploid context}
In a cancer genome, due to duplication followed by mutation, there can in principle exist any number of haplotypes in the sampled genome for a given location in the reference genome.
We assume each haplotype that the user chooses to name is named with a numerical haplotype identifier.
Although it is difficult with current technologies to associate haplotypes with novel adjacencies, it might be partially possible to deconvolve these connections in the near future.
We therefore propose the following notation to allow haplotype-ambiguous as well as haplotype-unambiguous connections to be described.
The general term for these haplotype-specific adjacencies is \textbf{bundles}.
The diagram in Figure 11 will be used to support examples below:
\begin{figure}[ht]
\centering
\includegraphics[width=4in,height=2.59in]{img/phasing-400x259.png}
\caption{Phasing}
\end{figure}
In this example, we know that in the sampled genome:
\begin{enumerate}
\item A reference bundle connects breakend U, haplotype 5 on chr13 to its partner, breakend X, haplotype 5 on chr13,
\item A novel bundle connects breakend U, haplotype 1 on chr13 to its mate breakend V, haplotype 11 on chr2, and finally,
\item A novel bundle connects breakend U, haplotypes 2, 3 and 4 on chr13 to breakend V, haplotypes 12, 13 or 14 on chr2 without any explicit pairing.
\end{enumerate}
These three are the bundles for breakend U. Each such bundle is referred to as a haplotype of the breakend U.
Each allele of a breakend corresponds to one or more haplotypes.
In the above case there are two alleles: the 0 allele, corresponding to the adjacency to the partner X, which has haplotype (1), and the 1 allele, corresponding to the two haplotypes (2) and (3) with adjacency to the mate V.
For each haplotype of a breakend, say the haplotype (2) of breakend U above, connecting the end of haplotype 1 on a segment of Chr 13 to a mate on Chr 2 with haplotype 11, in addition to the list of haplotype-specific adjacencies that define it, we can also specify in VCF several other quantities.
These include:
\begin{enumerate}
\item The depth of reads on the segment where the breakend occurs that support the haplotype, e.g., the depth of reads supporting haplotype 1 in the segment containing breakend U
\item The estimated copy number of the haplotype on the segment where the breakend occurs
\item The depth of paired-end or split reads that support the haplotype-specific adjacencies, e.g., that support the adjacency between haplotype 1 on Chr 13 to haplotype 11 on Chr 2
\item The estimated copy number of the haplotype-specific adjacencies
\item An overall quality score indicating how confident we are in this asserted haplotype
\end{enumerate}
These are specified using the DP, CN, BDP, BCN, and HQ subfields, respectively.
The total information available about the three haplotypes of breakend U in the figure above may be visualized in a table as follows.
\vspace{0.3cm}
\begin{tabular}{ l l l l }
Allele & 1 & 1 & 0 \\
Haplotype & 1$>$11 & 2,3,4$>$12,13,14 & 5$>$5 \\
Segment Depth & 5 & 17 & 4 \\
Segment Copy Number & 1 & 3 & 1 \\
Bundle Depth & 4 & 0 & 3 \\
Bundle Copy Number & 1 & 3 & 1 \\
Haplotype quality & 30 & 40 & 40 \\
\end{tabular}
\pagebreak
\subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
\label{unspecified-allele}
@@ -2037,6 +2030,7 @@ \subsubsection{Type encoding}
For one individual, each integer in the vector is organized as $(allele+1) << 1 \mid phased$ where allele is set to $-1$ if the allele in GT is a dot `.' (thus the higher bits are all 0).
The vector is padded with END\_OF\_VECTOR values if the GT has a lower ploidy than the vector length.
We note specifically that except for the END\_OF\_VECTOR byte, no other negative values are allowed in the GT array.
When processing VCF version 4.3 or earlier files, the phasing of the first allele should be treated as missing and inferred from the remaining alleles.
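The integer encoding above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function names are hypothetical, END\_OF\_VECTOR padding is not handled, and the inference of the first allele's phasing for VCFv4.3 or earlier data assumes the same rule as the implicit first GT separator (unphased if any later allele is unphased, phased otherwise).

```python
def encode_bcf_gt(alleles):
    """Encode [(allele, phased), ...] as (allele + 1) << 1 | phased.

    A missing allele (None) is treated as -1, so its higher bits are all 0.
    """
    return [(((-1 if allele is None else allele) + 1) << 1) | int(phased)
            for allele, phased in alleles]

def decode_bcf_gt(values):
    """Inverse of encode_bcf_gt; padding values are assumed to be stripped."""
    return [((v >> 1) - 1 if (v >> 1) else None, bool(v & 1)) for v in values]

def infer_first_phase(values):
    """Set the first phase bit from the remaining alleles (assumed rule).

    Intended for values read from VCFv4.3 or earlier files, where the
    first allele's phase bit is treated as missing.
    """
    rest_phased = all(v & 1 for v in values[1:])
    first = (values[0] & ~1) | int(rest_phased)
    return [first] + list(values[1:])
```

Under these assumptions, a phased diploid $0\mid1$ stored as \texttt{[2, 5]} by an older writer is normalised to \texttt{[3, 5]}, while a partially phased \texttt{[2, 4, 5]} is left unchanged because one of the later alleles is unphased.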
Examples:
@@ -2302,6 +2296,7 @@ \subsection{Changes between VCFv4.4 and VCFv4.3}
\item Deprecate SVTYPE INFO field preferring the use of symbolic alleles in the ALT field
\item Define new reserved INFO fields EVENT, EVENTTYPE and SVCLAIM
\item Redefined INFO field SVLEN to be always positive
\item Added Phase-Set List (PSL \& PSO \& PSQ) and allele-specific phasing notation (in GT)
\end{itemize}
\subsection{Changes to VCFv4.3}
