-
Notifications
You must be signed in to change notification settings - Fork 6
Home
Welcome to the Picky wiki!
Picky is a structural variant pipeline for long reads developed by Genome Technologies, The Jackson Laboratory. Picky was initially designed for Oxford Nanopore long reads, but will also work for PacBio reads. Picky uses LAST to generate all possible alignments for a read, and 'pick' the representative alignments with a greedy algorithm instead of using last-split. Picky's picking does not assume colinearity nor mandate that each read base must only be aligned to at most a single genomic location.
-
How do I process Oxford nanopore sequencing data for structural variants?
-
How do I process PacBio sequencing data for structural variants?
Let's assume that LAST is installed, and you have your human sample sequencer output in "LongRead.fastq".
You could try out the public ONT dataset Scrappie chr20 FASTA as Picky's demonstration data.
# download faToFastq
curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToFastq
# make user executable
chmod u+x faToFastq
# download Scrappie based-called chr20 reads
curl -O http://s3.amazonaws.com/nanopore-human-wgs/na12878.chr20ScrappieFiltered.fasta
# convert fasta to fastq with default base quality 'H'
./faToFastq -qual=H na12878.chr20ScrappieFiltered.fasta LongRead.fastq
You may download the demo data results for comparison or for better understanding the output without pipeline execution. (.sam excluded by file size limit.)
last-755/src/lastdb -v -P 1 hg19.lastdb hg19.fa
NOTE: The LAST reference database needs to be build ONCE ONLY. The hg19 primary assembly took 50 min to build on a single core with 16GB RAM.
samtools dict -H hg19.fa > hg19.seq.dict
NOTE: The sequence dictionary needs to be build ONCE ONLY. It is used by picky callSV to generate .sam file header.
We use at least 4 threads for the compute intensive read alignment process. For faster turn-around, use more threads if your machine has more compute power.
# generate the bash script for processing
./picky.pl script --fastq LongRead.fastq --thread 4 > LongRead.sh
# change script to be executable
chmod u+x LongRead.sh
# start the script
./LongRead.sh
At the end of the script execution, you will have the SV records in the vcf file "LongRead.allsv.vcf", and the alignments in "LongRead.profile.sam".
See Oxford nanopore and PacBio sequence data processing for more details. See cluster support on scaling up on a PBS cluster.
NOTE: The public ONT dataset Scrappie chr20 FASTA with 277,054 reads took 2.5hr to complete using 16 cores and 20GB memory.
The generated content LongRead.sh from the previous step is as shown below and you may customize it before execution or re-execution. Depending on your installation, you have to adapt the first 3 blocks (of exports).
# general installation
export LASTAL=last-755/src/lastal
export PICKY=./picky.pl
# general configuration
export LASTALDB=hg19.lastdb
export LASTALDBFASTA=hg19.fa
export NTHREADS=4
# FASTQ file
export RUN=LongRead
# reads alignments
time (${LASTAL} -C2 -K2 -r1 -q3 -a2 -b1 -v -v -P${NTHREADS} -Q1 ${LASTALDB} ${RUN}.fastq 2>${RUN}.lastal.log | ${PICKY} selectRep --thread ${NTHREADS} --preload 6 1>${RUN}.align 2>${RUN}.selectRep.log)
# call SVs
time (cat ${RUN}.align | ${PICKY} callSV --oprefix ${RUN} --fastq ${RUN}.fastq --genome ${LASTALDBFASTA} --exclude=chrM --sam)
# generate VCF format
${PICKY} xls2vcf --xls ${RUN}.profile.DEL.xls --xls ${RUN}.profile.INS.xls --xls ${RUN}.profile.INDEL.xls --xls ${RUN}.profile.INV.xls --xls ${RUN}.profile.TTLC.xls --xls ${RUN}.profile.TDSR.xls --xls ${RUN}.profile.TDC.xls > ${RUN}.allsv.vcf