Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to set up sex of PON #324

Closed
lmanchon opened this issue Sep 15, 2023 · 17 comments
Closed

How to set up sex of PON #324

lmanchon opened this issue Sep 15, 2023 · 17 comments

Comments

@lmanchon
Copy link

Describe the issue
build PON from multiples samples, all samples are males

To Reproduce
Rscript $PURECN/Coverage.R --out-dir $OUTPUTS/normals --bam normals.list --intervals $OUT_REF/hg38_intervals.txt --cores 12 --force
ls -a $OUTPUTS/normals/*_loess.txt.gz | cat > PON_normal_coverages.list
Rscript $PURECN/NormalDB.R --out-dir $OUT_REF --coverage-files PON_normal_coverages.list --genome hg38 --assay PON --force

How to set up sex male if all my normal samples are males ?
I know that in cnvkit tool i can set --male-reference for male when i build reference (PON).

thank you--

@lima1
Copy link
Owner

lima1 commented Sep 15, 2023

It's not working automatically? Internally it will create sex-specific PoNs. Not sure I ever tested this thoroughly, but I expect it to work.

@lmanchon
Copy link
Author

it's strange i have no result on X and Y, PureCN detect anomalies on autosomal chromosomes but not on X and Y,
do i set some parameter to recover anomalies on chrX and chr Y ?

@lima1
Copy link
Owner

lima1 commented Sep 19, 2023

Ah, got it. Yes, the likelihood model only supports diploid genomes, so it ignores X/Y for males.

Keeping the X/Y segmentation is a fair feature request though. Not sure I find time for it soon.

@lima1
Copy link
Owner

lima1 commented Sep 19, 2023

You might be able to force the use of X/Y by providing sex="diploid". Not sure there are any side effects though. But might be enough to get the segmentation and merge it then back to the regular run.

@lmanchon
Copy link
Author

is NormalDB.R script accept the parameter sex="diploid" ?

@lima1
Copy link
Owner

lima1 commented Sep 19, 2023

PureCN.R.

@lmanchon
Copy link
Author

oh yes of course, sorry,
I confused normalDb with PureCN

@lmanchon
Copy link
Author

and i have another question because i obtain too much noisy results:
is there a parameter to adjust (such as --drop-low-coverage (per bin) in cnvkit) to reduce the noisy background ?

@lmanchon
Copy link
Author

my command was:
Rscript $PURECN/PureCN.R --out $OUTPUTS/$lib --tumor $OUTPUTS/${lib}"_recalibrated_coverage_loess.txt.gz" --sampleid $lib --vcf $OUTPUT_GATK_CNV/${lib}"_Mutect2.vcf.gz" --stats-file $OUTPUT_GATK_CNV/${lib}"_Mutect2.vcf.gz.stats" --fun-segmentation PSCBS --sex=diploid --cores=16 --normaldb $OUT_REF/normalDB_hg38.rds --mapping-bias-file $OUT_REF/mapping_bias_TWIST_hg38.rds --intervals $OUT_REF/BED_hg38_intervals.txt --genome hg38 --model betabin --post-optimize --seed 123

@lima1
Copy link
Owner

lima1 commented Sep 19, 2023

Can you post the log file? It will tell you exactly what it used for filtering low quality bins.

The --stats-file is only for Mutect1. Also have a look at #320 if you have many MBQ 20 variants.

@lmanchon
Copy link
Author

PSX-AD-001.log
here attachment log file

@lima1
Copy link
Owner

lima1 commented Sep 19, 2023

Looks fine, but you indeed suffer the same issue with MBQ 20. Install the version from the issue_320 branch:
BiocManager::install("lima1/PureCN", ref = "issue_320")

And then re-run PureCN.R with --min-base-quality 20

@lmanchon
Copy link
Author

ok i have installed the version from issue_320

Rscript $PURECN/PureCN.R -v
2.7.11

then:
Rscript $PURECN/PureCN.R --out $OUTPUTS/$lib --tumor $OUTPUTS/${lib}"_recalibrated_coverage_loess.txt.gz" --sampleid $lib --vcf $OUTPUT_GATK_CNV/${lib}"_Mutect2.vcf.gz" --fun-segmentation PSCBS --sex=diploid --cores=16 --min-base-quality 20 --normaldb $OUT_REF/normalDB_TWIST_hg38.rds --mapping-bias-file $OUT_REF/mapping_bias_TWIST_hg38.rds --intervals $OUT_REF/DIGI_BED_hg38_intervals.txt --genome hg38 --model betabin --post-optimize --seed 123

and same output with the same number of CNVs.

@lima1
Copy link
Owner

lima1 commented Sep 19, 2023

You should have now a much higher number of SNPs in the log file.

INFO [2023-09-19 16:42:29] 2.0% of targets contain variants.

That should be now > 10%.

That number should have dropped significantly:

INFO [2023-09-19 16:42:28] Removing 14110 low quality variants with non-offset BQ < 25.

@lmanchon
Copy link
Author

ok, yes i understand.
I have used PureCN and CNVkit, the 2 tools give me approximatively the same number of variants ~ 140
but when i use GATK, i obtain only 40 CNVs, GATK seems to be more stringent in calling variants, and it's not easy to deduce the best results.

@lima1
Copy link
Owner

lima1 commented Sep 19, 2023

You can use the GATK segmentation with PureCN via --fun-segmentation GATK. That should get you close to GATK, but with off-target read support.

PSCBS has a few unique features to combine off-target and on-target reads a bit smarter in the segmentation though. For low purity, I recommend PSCBS, for higher purity (like usually > 30%), GATK is probably cleaner.

It's a balance of sensitivity and specificity. CNVkit is tweaked for sensitivity and expected to give more segments.

@lmanchon
Copy link
Author

yes.
maybe by adding another caller I could make a difference: XHMM, Codex2 ?

@lima1 lima1 closed this as completed Oct 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants