Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutect2 Matched samples purity estimates fails #120

Closed
npatel-ah opened this issue May 12, 2020 · 23 comments
Closed

Mutect2 Matched samples purity estimates fails #120

npatel-ah opened this issue May 12, 2020 · 23 comments
Assignees
Labels
bug gatk4 All things related to GATK4/Mutect2

Comments

@npatel-ah
Copy link

Hello Team,

I am trying to use PureCN with a matched tumor and normal samples, however, it's throwing an error. I am attaching the log file here. I have used CNVkit for segmentation and Mutect2 (GATK 4.1.7.0) for variant detection.
Tumor01.PureCN.log

I tried a few things to troubleshoot, none seems to be working.

  1. First, I ran it without the "dbinfoflag" flag but it produced below error. Note that I did provide the "popafinfofield" flag.
WARN [2020-05-12 15:28:37] vcf.file has no DB info field for membership in germline databases. Found and used somatic status instead.
INFO [2020-05-12 15:28:38] 0 (0.0%) variants annotated as likely germline (DB INFO flag).
  1. So then I annotated VCF file with common SNPs (1kg AF > 1%) and used the COMMON as db flag and then got another error as shown in the log file.
  2. Also, I tried different option for fun segmentation (PBCBS isn't working)
  3. And Changed thresholds for --minpurity 0.05 and --maxpurity 0.99.
Rscript $PURECN_DIR/PureCN.R 
        --out PureCN_Analysis/$TUMOR_ID \
	--sampleid $TUMOR_ID \
	--tumor TumorCNN/$TUMOR_ID.cnr \
	--segfile TumorCNN/$TUMOR_ID.seg \
	--vcf ${TUMOR_ID}.filtered.anno.vcf \
	--snpblacklist hg19_simpleRepeats.bed \
	--genome hg19 \
	--popafinfofield gnomAD_exomes_controls_AF \
	--dbinfoflag COMMON \
	--force --postoptimize --seed 123 \
	--outvcf \
	--cores 12 \
	--funsegmentation Hclust \
	--parallel >& $TUMOR_ID.PureCN.log

Any suggestion to resolve the error is appreciated. If there is an issue with samples itself as stated in log file, what would be an alternative way to confirm the behavior?

A note, I don't have normal samples to run NormalDB.

Best,
nihir

@lima1
Copy link
Owner

lima1 commented May 14, 2020

Looks like you filter out known SNPs. We need the SNPs for allele-specific copy number calling (and PSCBS requires them for annotation). M2 requires the -genotype-germline-sites flag to call germline variants.

There is also something wrong with your off-target reads. For WES, they don't add much. If you cannot figure out why the log-ratio standard deviation is so high, you probably get better results without those.

@npatel-ah
Copy link
Author

Thank for your quick reply. I indeed use the "-genotype-germline-sites" to detect germline variants with Mutect2 and assuming the downstream filtering of M2 only assigns a flag to the FILTER field and not remove actual variants.
In terms of common variants, I do see them VCF, in fact below line from log confirms it, as the COMMON flag assigned to AF > 1% in 1000g.

48496 (69.6%) variants annotated as likely germline (COMMON INFO flag).

How would I run analysis without using "log-ratio standard deviation ". Is "--undosd" is an appropriate choice here?

just a note I am using xGen IDT BAIT file for all the analyses.

Best,
Nihir

@lima1
Copy link
Owner

lima1 commented May 14, 2020

I would re-run CNVkit without the off-target feature. The lines log-ratio standard deviation display the noise of the tumor vs normal coverage log2-ratio. In your case, the off-target ratios are super noisy, likely the reason you get the crash. You can also try running our normalization instead of CNVkit.

Not sure I understand your comment you don't have normal samples to build the NormalDB. You have the normals to run Mutect?

@npatel-ah
Copy link
Author

npatel-ah commented May 14, 2020

Thanks again, understood now I will try PureCN normalization.
I have a tumor and a matched normal sample. M2 was ran with the Pair. But if understand correctly, for NormalDB it requires some minimum number of samples.

@npatel-ah
Copy link
Author

Hello Markus, Thanks for all of your suggestions.
Before giving up on CNVkit I have tested it with one of the pair of Tumor and Normal bam files from a panel sequencing as well updated PureCN to 1.18.

This time the SD for log-ratios seems OK however, after running almost till end I got below error, for which I can't find any related post.

One thing I came across was the SOMATIC flag from issue #108 as well noticed in this tutorial section 2.1

For M2 do I need to run VariantAnnotation to add SOMATIC flag ? and is that what is causing all the issues? Attached here the log file.
PureCN_V1-18_Panel.log

Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: missing value where TRUE/FALSE needed
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
Execution halted

Thanks again for your time,
Nihir

@lima1 lima1 self-assigned this May 17, 2020
@lima1 lima1 added the bug label May 17, 2020
@lima1
Copy link
Owner

lima1 commented May 17, 2020

Hi Nihir,

looks like there is an issue where the mapping bias is NaN. This is likely causing the crash. I'll investigate, thanks for reporting. You should not need the SOMATIC flag for M2, it should automatically generate it.

The log file looks better now, indeed.

Markus

@lima1
Copy link
Owner

lima1 commented May 17, 2020

Can you try again with the M2 VCF as it is generated, i.e. without adding the COMMON flag etc.?

@npatel-ah
Copy link
Author

Hello Markus,

Here's the error. When I don't annotate with DB it seems to predict all variants as Somatic. How is it identifying if the variant is somatic or nor ? I don't see any mention of the somatic flag on recent M2 documentation.

INFO [2020-05-18 13:46:04] Using BiocParallel for parallel optimization.
INFO [2020-05-18 13:46:04] Loading coverage files...
INFO [2020-05-18 13:46:05] Found log2-ratio in tumor coverage data.
WARN [2020-05-18 13:46:05] Allosome coverage missing, cannot determine sex.
WARN [2020-05-18 13:46:05] Allosome coverage missing, cannot determine sex.
INFO [2020-05-18 13:46:06] Using 27582 intervals (10611 on-target, 16971 off-target).
INFO [2020-05-18 13:46:06] Ratio of mean on-target vs. off-target read counts: NaN
INFO [2020-05-18 13:46:06] Mean off-target bin size: 146326
INFO [2020-05-18 13:46:06] Loading VCF...
INFO [2020-05-18 13:46:06] Found 4343 variants in VCF file.
INFO [2020-05-18 13:46:06] Removing 464 triallelic sites.
WARN [2020-05-18 13:46:06] vcf.file has no DB info field for membership in germline databases. Found and used somatic status instead.
INFO [2020-05-18 13:46:06] 0 (0.0%) variants annotated as likely germline (DB INFO flag).
FATAL [2020-05-18 13:46:06] VCF either contains no germline variants or variants are not properly

FATAL [2020-05-18 13:46:06] annotated.

FATAL [2020-05-18 13:46:06]

FATAL [2020-05-18 13:46:06] This is most likely a user error due to invalid input data or

FATAL [2020-05-18 13:46:06] parameters (PureCN 1.18.0).

Error: VCF either contains no germline variants or variants are not properly
annotated.

This is most likely a user error due to invalid input data or
parameters (PureCN 1.18.0).
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
Execution halted

Thanks for your continuous support on the issues.

Best.
Nihir

@lima1
Copy link
Owner

lima1 commented May 18, 2020

Did you run Mutect with -genotype-germline-sites? Otherwise it will only call somatic.

@npatel-ah
Copy link
Author

Yes, I enabled the flag. Here is my command line for M2

	$GATK Mutect2 -R $GENOME \
	-L $TartgetInterval \
	-ip 50 \
	-I $TUMOR \
	-I $NORMAL \
	-tumor $TUMOR_ID \
	-normal $NORMAL_ID \
	--genotype-germline-sites true \
	--genotype-pon-sites true \
	-germline-resource $gnomAD \
	--native-pair-hmm-threads 24 \
	--native-pair-hmm-use-double-precision true \
	--disable-sequence-dictionary-validation \
	--f1r2-tar-gz $TUMOR_ID.f1r2.tar.gz \
	-O $TUMOR_ID.unfiltered.vcf.gz

@lima1
Copy link
Owner

lima1 commented May 18, 2020

Hmmm. 4000 variants is a weird number if this is whole exome. It's too high for only somatic, but far too low if it includes germline. Do you see something weird when you look at the P_GERMLINE and POP_AF fields? I.e. does it cover the whole expected range for 0 to 0.5-1?

@npatel-ah
Copy link
Author

Sorry if I didn't clarify, the last data is from Panel sequencing of about 400 genes. For my WES run, I was getting variants in a range of 70k to 80K.
But I think you have identified the culprit and it seems to be the "POPAF", which is always > 1 and because it's -log10 of POPAF. Also, I believe "P_GERMLINE" no longer exits or is replaced with "GERMQ".
Here is the header

##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype">
##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants">
##INFO=<ID=MBQ,Number=R,Type=Integer,Description="median base quality">
##INFO=<ID=MFRL,Number=R,Type=Integer,Description="median fragment length">
##INFO=<ID=MMQ,Number=R,Type=Integer,Description="median mapping quality">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="median distance from end of read">
##INFO=<ID=NALOD,Number=A,Type=Float,Description="Negative log 10 odds of artifact in normal with same allele fraction as tumor">
##INFO=<ID=NCount,Number=1,Type=Integer,Description="Count of N bases in the pileup">
##INFO=<ID=NLOD,Number=A,Type=Float,Description="Normal log 10 likelihood ratio of diploid het or hom alt genotypes">
##INFO=<ID=PON,Number=0,Type=Flag,Description="site found in panel of normals">

##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative log 10 population allele frequencies of alt alleles">

##INFO=<ID=ROQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to read orientation artifact">
##INFO=<ID=RPA,Number=R,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=SEQQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not sequencing errors">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##INFO=<ID=STRANDQ,Number=1,Type=Integer,Description="Phred-scaled quality of strand bias artifact">
##INFO=<ID=STRQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles in STRs are not polymerase slippage errors">
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing">

@lima1
Copy link
Owner

lima1 commented May 18, 2020

Oh, ok. I'll add code to check that POPAF is on log-scale and support for GERMQ. Thanks for reporting!

@lima1
Copy link
Owner

lima1 commented May 18, 2020

Hi Nihir, have a look when you have a chance. I don't have matched tumor/normal test data readily available right now, but the latest commit should work.

@npatel-ah
Copy link
Author

Hello Markus,

Wow! That was fast. Thank you very much for the quick response. I just tested the code from the latest commit and worked perfectly fine without any additional flags (i.e dbinfoflag or popinfoflag). It can successfully detect the somatic mutation and runs without any error.

Seems like no additional annotations are required for Matched samples ran with latest M2.

Here are the lines from log file just for information

INFO [2020-05-18 18:04:34] Loading VCF...
INFO [2020-05-18 18:04:35] Found 4343 variants in VCF file.
INFO [2020-05-18 18:04:35] Removing 464 triallelic sites.
WARN [2020-05-18 18:04:35] Found GERMQ info field with Phred scaled germline probabilities.
WARN [2020-05-18 18:04:35] vcf.file has no DB info field for membership in germline databases. Found and used somatic status instead.
INFO [2020-05-18 18:04:35] 2623 (67.6%) variants annotated as likely germline (DB INFO flag).
INFO [2020-05-18 18:04:35] Tumor_O-170000133 is tumor in VCF file.
INFO [2020-05-18 18:04:35] 28 homozygous and 3 heterozygous variants on chrX.
INFO [2020-05-18 18:04:35] Sex from VCF: M (Fisher's p-value: < 0.0001, odds-ratio: 15.03).
INFO [2020-05-18 18:04:36] Removing 1358 non heterozygous (in matched normal) germline SNPs.
INFO [2020-05-18 18:04:36] Removing 879 variants with AF < 0.030 or AF >= 1.000 or less than 6 supporting reads or depth < 15.
INFO [2020-05-18 18:04:39] Removing 48 blacklisted variants.
INFO [2020-05-18 18:04:39] Removing 0 low quality variants with BQ < 25.
INFO [2020-05-18 18:04:39] Total size of targeted genomic region: 2.22Mb (2.79Mb with 50bp padding).
INFO [2020-05-18 18:04:39] 13.8% of targets contain variants.
INFO [2020-05-18 18:04:39] Removing 135 variants outside intervals.
INFO [2020-05-18 18:04:39] Found SOMATIC annotation in VCF.
INFO [2020-05-18 18:04:39] Setting somatic prior probabilities for somatic variants to 0.999000 or to 0.000100 otherwise.
INFO [2020-05-18 18:04:39] Found SOMATIC annotation in VCF. Setting mapping bias to 0.994.
INFO [2020-05-18 18:04:39] Excluding 285 novel or poor quality variants from segmentation.
INFO [2020-05-18 18:04:39] Sample sex: ?
INFO [2020-05-18 18:04:39] Segmenting data...
INFO [2020-05-18 18:04:39] Loaded provided segmentation file Tumor_1 seg (format DNAcopy).
INFO [2020-05-18 18:04:39] Re-centering provided segment means (offset -0.0523).
INFO [2020-05-18 18:04:39] Setting prune.hclust.h parameter to 0.150000.
INFO [2020-05-18 18:04:39] Found 56 segments with median size of 35.05Mb.
INFO [2020-05-18 18:04:39] Removing 7 variants outside segments.
INFO [2020-05-18 18:04:39] Using 1452 variants.
INFO [2020-05-18 18:04:39] Mean standard deviation of log-ratios: 0.30
INFO [2020-05-18 18:04:39] Mean standard deviation of on-target log-ratios only: 0.24
INFO [2020-05-18 18:04:39] Mean standard deviation of off-target log-ratios only: 0.32
INFO [2020-05-18 18:04:39] 2D-grid search of purity and ploidy...

Best,
Nihir

@lima1
Copy link
Owner

lima1 commented May 18, 2020

Great. Closing now. Have a look at the Sampleid.pdf output, the SNPs should cleanly follow the provided segmentations. Also check the Sampleid_variants.csv file (or vcf if you provided --vcf). There might be improvements possible, for example variants that should be filtered based on their flags. Please open new issues with those suggestions.

@lima1 lima1 closed this as completed May 18, 2020
lima1 added a commit that referenced this issue May 18, 2020
@lima1
Copy link
Owner

lima1 commented May 18, 2020

I just found a bug that essentially turned off all flag filtering for M2 VCFs.

@npatel-ah
Copy link
Author

No worries, thanks for the heads up. Let me know whenever you have an update. Meanwhile, I will take a closer look at the results as you suggested.

@lima1
Copy link
Owner

lima1 commented May 18, 2020

I pushed a commit already. It should remove most variants not labeled as PASS, except germline and PoN.

@lima1 lima1 added the gatk4 All things related to GATK4/Mutect2 label May 26, 2020
@dakyung99
Copy link

dakyung99 commented Oct 29, 2023

Hi,
Regarding this issue, should I provide variants with PASS labeled after Mutect2 filter run? or should unfiltered raw Mutect2 output work for purecn input?

Thanks a lot!
DK

@lima1
Copy link
Owner

lima1 commented Oct 29, 2023

Just follow the somatic best practices as closely as possible: https://gatk.broadinstitute.org/hc/en-us/articles/360035531132 . So yes, include all filtering steps. To turn the GenomicsDB directory into a PureCN mapping bias RDS file, you can use the Docker image which comes with the genomicsdb R package.

If you have matched normals, be sure to provide --genotype-germline-sites true to not remove high quality germline SNPs. Also see #320.

@dakyung99
Copy link

@lima1 Thank you so much for prompt reply,
I've ve got one more question regarding germline variants VCFs- should I input germline called from matched normal?(e.g.) blood sample) or input germline mutations calls from tumour sample?

Again, Thanks for providing this wonderful tool!

@lima1
Copy link
Owner

lima1 commented Oct 30, 2023

You just provide the VCF as generated by Mutect2 with the --genotype-germline-sites. It needs the tumor SNP allelic fractions, but will use the normal allelic fractions for bias estimation when available.

@lima1 lima1 reopened this Oct 30, 2023
@lima1 lima1 closed this as completed Oct 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug gatk4 All things related to GATK4/Mutect2
Projects
None yet
Development

No branches or pull requests

3 participants