add reference blocksize and checkpointing to VCF #435

Closed
wants to merge 3 commits into from
Binary file modified VCFv4.3.pdf
Binary file not shown.
39 changes: 38 additions & 1 deletion VCFv4.3.tex
@@ -253,6 +253,17 @@ \subsubsection{Pedigree field format}
##pedigreeDB=URL
\end{verbatim}


\subsubsection{Reference block checkpoint}
Member

This should be placed after the final sentence (“See [PedigreeInDetail] for details”) of the preceding section.

Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype tags below), it can be useful to limit the genomic distance that must be scanned to find the ``top'' of the reference block.
To enable this, one may specify a Reference Block Checkpoint scheme:
\begin{verbatim}
##REFERENCE_BLOCK=<CHECKPOINT=1000>
Member

This is invalid as a structured meta-information line, because it does not have an ID=… subfield.

Are there plans to expand this in future with additional subfields within the <…>? If not, I'd suggest reworking this as a plain unstructured meta-information line, e.g.,

##REFERENCE_BLOCK_CHECKPOINT=1000

Contributor

Suggested change
##REFERENCE_BLOCK=<CHECKPOINT=1000>
##REFERENCE_BLOCK_CHECKPOINT=1000

just reifying this as a one-click GitHub suggestion.

\end{verbatim}
This indicates that no reference block will span a POS that is divisible by 1000, so a reader needs to check fewer than 1000 genomic positions to find out whether a `.' in the genotype field means ``missing'' or is part of a reference block.
Values other than 1000 can be used.
Non-positive values or a missing meta line indicate that there is no checkpointing.

\noindent See \ref{PedigreeInDetail} for details.
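
As a rough illustration of the checkpoint rule in this hunk, the following Python sketch computes where a reader could start scanning for a given query position. It assumes the unstructured `##REFERENCE_BLOCK_CHECKPOINT=<n>` form suggested in review rather than the structured line in the diff, and the function names `parse_checkpoint` and `scan_start` are hypothetical helpers, not part of any existing VCF library.

```python
from typing import Iterable, Optional

# Assumed header form (from the review suggestion, not a finalized spec).
CHECKPOINT_PREFIX = "##REFERENCE_BLOCK_CHECKPOINT="


def parse_checkpoint(header_lines: Iterable[str]) -> Optional[int]:
    """Return the checkpoint interval, or None when checkpointing is off.

    A missing meta line, a non-integer value, or a non-positive value are
    all treated as "no checkpointing".
    """
    for line in header_lines:
        if line.startswith(CHECKPOINT_PREFIX):
            try:
                value = int(line[len(CHECKPOINT_PREFIX):].strip())
            except ValueError:
                return None
            return value if value > 0 else None
    return None


def scan_start(pos: int, checkpoint: Optional[int]) -> int:
    """Return the first POS that must be read to classify a '.' at pos.

    With checkpointing, no block spans a multiple of the interval, so the
    scan can start at the largest multiple that is <= pos. Without it, the
    scan has to start at the beginning of the contig (POS 1).
    """
    if checkpoint is None:
        return 1
    return max(1, (pos // checkpoint) * checkpoint)
```

For a query at POS 1001 with a checkpoint of 1000, the scan would start at POS 1000 instead of the start of the contig.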


@@ -410,8 +421,8 @@ \subsubsection{Genotype fields}
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
PQ & 1 & Integer & Phasing quality \\
PS & 1 & Integer & Phase set \\
RBS & 1 & Integer & Reference Block Size\\
Member

Suggested change
RBS & 1 & Integer & Reference Block Size\\
RBS & 1 & Integer & Reference block size\\

\end{longtable}

\begin{itemize}
\renewcommand{\labelitemii}{$\circ$}
\item AD, ADF, ADR (Integer): Per-sample read depths for each allele; total (AD), on the forward (ADF) and the reverse (ADR) strand.
@@ -503,6 +514,32 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\item RBS (Integer): An integer describing the size of this genotype's reference block, or missing ``.'' if unknown.
A ``reference block'' is a set of adjacent loci that are determined to be reference with a particular confidence.
The RBS notation lets an implementation omit all information from subsequent genotypes and write the missing value (`.') instead, with the implication that
the confidence and other attributes of those missing genotypes are the same as in the anchor genotype (the one with the RBS value).
This can only be used when the genotype in the anchor variant is reference.
The numerical value of RBS is the difference between the last position (inclusive) of the reference block and POS.
Missing genotypes (`.') that are not covered by a reference block are to be interpreted as missing, i.e. no information is known about the site.
To disambiguate a `.' between being truly missing and part of a reference block, one would therefore need to ``look up'' and find the previous RBS FORMAT value in that sample.
In addition, any non-missing value (including `.:.' or `./.') would effectively break a reference block, and should be treated as a violation of the specification if RBS is specified, or as an implicit end of the block if RBS is unknown.
When reading the file from top to bottom, an implementation can simply remember the current RBS for each sample. However, when using the index to ``seek'' to a particular point of the reference, one may need to seek to an unknown location in the file.
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.
Contributor

Suggested change
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.
To assist in seeking, the \verb!##REFERENCE_BLOCK_CHECKPOINT! header line defines a multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.


For example (with CHROM, ID, REF, ALT, QUAL, FILTER, INFO fields/columns removed for brevity \& clarity):

\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\
Contributor

Suggested change
\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\
\#\#REFERENCE\_BLOCK\_CHECKPOINT=1000\\

See above comment by jmarshall


\begin{tabular}[c]{llll|l}
POS&FORMAT&Alice&Bob&comment\\
400 &GT:DP:RBS& 0/0:30:250& 0/1:20:.\\
500 & GT:DP:RBS& .& 0/1:30:150\\
649 &GT:DP:RBS& .& . &still in the reference block\\
650 &GT:DP:RBS& .& . &no information about this location\\
900 &GT:DP:RBS& 0/1:30& 0/0:20:100&block goes to 999 \\
1000 &GT:DP:RBS& 0/0:20:200& 0/1:20&there's a checkpoint here. \\
1001 &GT:DP:RBS& .& 0/0:20:200 & \\
\end{tabular}
\end{itemize}
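
To make the lookup in the example table concrete, here is a minimal Python sketch that classifies one sample's genotype at a query position as explicit, covered by a reference block, or truly missing. It assumes the convention the table's comments imply, namely that a block anchored at POS with RBS = n covers POS through POS + n - 1 (so the block at 900 with RBS 100 "goes to 999"); the RBS prose definition could be read as POS + n instead, in which case the `- 1` would be dropped. The `RECORDS` list and the `classify` function are hypothetical, and the sketch does not check that the anchor genotype is reference.

```python
# Rows of the example table: (POS, FORMAT, [Alice, Bob]).
RECORDS = [
    (400, "GT:DP:RBS", ["0/0:30:250", "0/1:20:."]),
    (500, "GT:DP:RBS", [".", "0/1:30:150"]),
    (649, "GT:DP:RBS", [".", "."]),
    (650, "GT:DP:RBS", [".", "."]),
    (900, "GT:DP:RBS", ["0/1:30", "0/0:20:100"]),
    (1000, "GT:DP:RBS", ["0/0:20:200", "0/1:20"]),
    (1001, "GT:DP:RBS", [".", "0/0:20:200"]),
]


def classify(records, sample, query_pos):
    """Classify the genotype of one sample at query_pos.

    records must be sorted by POS. Returns "explicit" when the sample has a
    non-missing entry at query_pos, "reference-block" when a preceding RBS
    anchor covers query_pos, and "missing" otherwise.
    """
    block_end = None  # last POS covered by the sample's open block, if any
    for pos, fmt, samples in records:
        if pos > query_pos:
            break
        field = samples[sample]
        if field != ".":
            # Any non-missing entry replaces (or ends) the current block.
            # Trailing FORMAT keys may be dropped, so RBS can be absent.
            values = dict(zip(fmt.split(":"), field.split(":")))
            rbs = values.get("RBS", ".")
            # Assumed convention: the block covers POS .. POS + RBS - 1.
            block_end = pos + int(rbs) - 1 if rbs != "." else None
            if pos == query_pos:
                return "explicit"
    covered = block_end is not None and query_pos <= block_end
    return "reference-block" if covered else "missing"
```

This version scans from the first record; with a checkpoint in the header, the same loop could start at the last checkpoint position at or before `query_pos` instead.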

