add reference blocksize and checkpointing to VCF #435
@@ -253,6 +253,17 @@ \subsubsection{Pedigree field format} | |||||
##pedigreeDB=URL | ||||||
\end{verbatim} | ||||||
|
||||||
|
||||||
\subsubsection{Reference block checkpoint} | ||||||
Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype fields below), it can be useful to limit the genomic distance that must be scanned to find the ``top'' of the reference block. | ||||||
To enable this, one may specify a Reference Block Checkpoint scheme: | ||||||
\begin{verbatim} | ||||||
##REFERENCE_BLOCK=<CHECKPOINT=1000> | ||||||
This is invalid as a structured meta-information line, because it does not have an Are there plans to expand this in future with additional subfields within the
Suggested change
just reifying this as a one-click GitHub suggestion. |
||||||
\end{verbatim} | ||||||
This indicates that no reference block will span a POS that is divisible by 1000, so that a reader needs to check fewer than 1000 genomic positions to determine whether a `.' in the genotype field means ``missing'' or is part of a reference block. | ||||||
Values other than 1000 can be used. | ||||||
Non-positive values or a missing meta line indicate that there is no checkpointing. | ||||||
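The header rule above can be sketched in Python. This is a minimal illustration, not part of the proposal: the function names `parse_checkpoint` and `scan_start` are hypothetical, the header syntax is copied from the example line (which is itself under review), and the sketch assumes that a block may start on a checkpoint position but never cross one.

```python
import re
from typing import Optional


def parse_checkpoint(header_line: str) -> Optional[int]:
    """Return the CHECKPOINT value, or None when checkpointing is off.

    Assumes the header syntax shown in the example line; a line that does
    not match, or a non-positive value, means "no checkpointing".
    """
    match = re.fullmatch(r"##REFERENCE_BLOCK=<CHECKPOINT=(-?\d+)>", header_line.strip())
    if match is None:
        return None
    value = int(match.group(1))
    return value if value > 0 else None


def scan_start(pos: int, checkpoint: Optional[int]) -> Optional[int]:
    """Lowest POS that could anchor a block covering `pos`.

    Returns None when there is no checkpointing, i.e. the scan is unbounded.
    Assumption: blocks may begin at a checkpoint POS but never cross one.
    """
    if checkpoint is None:
        return None
    # VCF positions are 1-based, so never report a start below 1.
    return max(1, (pos // checkpoint) * checkpoint)
```

With `CHECKPOINT=1000`, a reader positioned at POS 1001 would only need to look back as far as POS 1000; without a checkpoint the search has no lower bound.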
|
||||||
\noindent See \ref{PedigreeInDetail} for details. | ||||||
|
||||||
|
||||||
|
@@ -410,8 +421,8 @@ \subsubsection{Genotype fields} | |||||
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ | ||||||
PQ & 1 & Integer & Phasing quality \\ | ||||||
PS & 1 & Integer & Phase set \\ | ||||||
RBS & 1 & Integer & Reference block size \\ | ||||||
|
||||||
\end{longtable} | ||||||
|
||||||
\begin{itemize} | ||||||
\renewcommand{\labelitemii}{$\circ$} | ||||||
\item AD, ADF, ADR (Integer): Per-sample read depths for each allele; total (AD), on the forward (ADF) and the reverse (ADR) strand. | ||||||
|
@@ -503,6 +514,32 @@ \subsubsection{Genotype fields} | |||||
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. | ||||||
If the genotype in the GT field is unphased, the corresponding PS field is ignored. | ||||||
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). | ||||||
\item RBS (Integer): An integer describing the size of this genotype's reference block, or missing ``.'' if unknown. | ||||||
A ``reference block'' is a set of adjacent loci that are determined to be reference with a particular confidence. | ||||||
The RBS notation enables an implementation to avoid writing any information in subsequent genotypes and to place the missing value (`.') instead, with the implication that | ||||||
the confidence and other attributes of the missing genotypes are the same as those in the anchor genotype (the one with the RBS value). | ||||||
Clearly, this can only be used when the genotype in the anchor variant is reference. | ||||||
The numerical value of RBS is the number of positions in the reference block, counted from POS, so the last position (inclusive) of the block is POS + RBS $-$ 1. | ||||||
Missing genotypes (`.') that are not covered by a reference block are to be interpreted as missing, i.e. no information is known about the site. | ||||||
To disambiguate a `.' between being truly missing and part of a reference block, one would therefore need to ``look up'' and find the previous RBS FORMAT value in that sample. | ||||||
In addition, any non-missing value (including `.:.' or `./.') effectively breaks a reference block; it should be treated as a violation of the specification if RBS is specified, or as an implicit end of the block if RBS is unknown. | ||||||
When reading the file from top to bottom, an implementation can simply remember the current RBS for each sample. However, when using the index to ``seek'' to a particular point of the reference, the anchor record may lie at an unknown location earlier in the file. | ||||||
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read forward from the last checkpoint at or before the current position and be assured that it will find the reference block that overlaps the current position, if one exists. | ||||||
|
||||||
|
||||||
For example (with CHROM, ID, REF, ALT, QUAL, FILTER, INFO fields/columns removed for brevity \& clarity): | ||||||
|
||||||
\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\ | ||||||
Suggested change
See above comment by jmarshall |
||||||
|
||||||
\begin{tabular}[c]{llll|l} | ||||||
POS&FORMAT&Alice&Bob&comment\\ | ||||||
400 & GT:DP:RBS& 0/0:30:250& 0/1:20:.\\ | ||||||
500 & GT:DP:RBS& .& 0/1:30:150\\ | ||||||
649 & GT:DP:RBS& .& . &still in the reference block\\ | ||||||
650 & GT:DP:RBS& .& . &no information about this location\\ | ||||||
900 & GT:DP:RBS& 0/1:30& 0/0:20:100&block goes to 999 \\ | ||||||
1000 & GT:DP:RBS& 0/0:20:200& 0/1:20&there's a checkpoint here. \\ | ||||||
1001 & GT:DP:RBS& .& 0/0:20:200 & \\ | ||||||
\end{tabular} | ||||||
\end{itemize} | ||||||
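A reader-side sketch of the lookup described for RBS, in Python. It is an illustration under stated assumptions rather than a reference implementation: `interpret`, `block_end` and the `(POS, field)` record layout are hypothetical, the data is Alice's column from the example table, and a block anchored at POS with RBS = n is assumed to cover POS through POS + n - 1, as the comments in that table imply. The sketch does not check that the anchor genotype is actually reference.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, str]  # (POS, one sample's field in GT:DP:RBS order)

# Alice's column from the example table.
ALICE: List[Record] = [
    (400, "0/0:30:250"),
    (500, "."),
    (649, "."),
    (650, "."),
    (900, "0/1:30"),
    (1000, "0/0:20:200"),
    (1001, "."),
]


def block_end(pos: int, field: str) -> Optional[int]:
    """Last POS covered by the block anchored here, or None if no RBS is given."""
    parts = field.split(":")
    if len(parts) < 3 or parts[2] == ".":
        return None
    # Assumption: RBS counts positions starting at POS itself.
    return pos + int(parts[2]) - 1


def interpret(records: List[Record], index: int, checkpoint: Optional[int] = None) -> str:
    """Classify records[index] as 'explicit', 'reference block' or 'missing'."""
    pos, field = records[index]
    if field != ".":
        return "explicit"
    # With a checkpoint, no anchor below the last checkpoint can cover `pos`.
    floor = (pos // checkpoint) * checkpoint if checkpoint else 0
    for prev_pos, prev_field in reversed(records[:index]):
        if prev_pos < floor:
            break
        if prev_field == ".":
            continue  # keep looking up for the nearest non-missing field
        end = block_end(prev_pos, prev_field)
        return "reference block" if end is not None and end >= pos else "missing"
    return "missing"
```

For Alice, POS 649 resolves to the block anchored at 400, POS 650 falls just past it and is treated as missing, and POS 1001 is covered by the block restarted at the 1000 checkpoint.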
|
||||||
|
||||||
|
This should be placed after the final sentence (“See [PedigreeInDetail] for details”) of the preceding section.