Skip to content

renatopuga/somatico

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gitpod ready-to-code

Open In Colab

Análise Somática

GATK 4 Mutect2 Somático

Aula Prática

image

Pipeline

Amostras Extras

  • WP190 (tumor) e WP191 (normal)
  • WP017 (tumor) e WP018 (normal)

Nota 1: Utilizar o af-gnomad chr13 e chr19.

Nota 2: Será preciso baixar o chr13 e chr19 da UCSC.

Download chr19

wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr19.fa.gz

Download chr13

wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr13.fa.gz

Concatenar os arquivos .fa.gz

Dica 1: zcat lê arquivos compactados .gz e zip

zcat chr13.fa.gz chr19.fa.gz | sed -e "s/chr//g" > hg19.fa

Gerar o index do arquivo hg19.fa

samtools faidx hg19.fa

< Primeiro é o chr9 >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/\
            ||----w |
            ||     ||

Workflow

Os arquivos BAM (tumor e normal) já foram gerados e estão prontos para a chamada de variates (ver Anexo 1). Então, agora vamos fazer download da referência chr9 e gerar o index com o programa samtools.

Download da Referência - chr9

  • Download
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr9.fa.gz
  • Alterar nome do header: DE: >chr9 para >9

Essa alteração é necessária pois no BAM a referência não tinha >chr era apenas >9.

zcat chr9.fa.gz | sed -e "s/chr//g" > chr9.fa
  • Verificar se alteração foi feita com o comando head
head chr9.fa

O comando head lê as 10 primeiras linha de um arquivo texto

samtools

Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:

samtools install

  • samtools install (Mac)
brew install samtools 
  • samtools install (Ubuntu)
sudo apt-get install samtools 
  • samtools install (Docker)
docker pull biocontainers/samtools

samtools faidx e index

  • samtools faidx
samtools faidx chr9.fa

GATK4

Version: 4.2.2.0

Genome Analysis Toolkit - Variant Discovery in High-Throughput Sequencing Data. https://gatk.broadinstitute.org/

GATK4 install

GATK4 install Docker

docker pull broadinstitute/gatk:4.2.2.0

GATK4 install Mac e Linux

  • Download
wget -c https://github.com/broadinstitute/gatk/releases/download/4.2.2.0/gatk-4.2.2.0.zip
  • Descompactar
unzip gatk-4.2.2.0.zip 
  • Testando gatk
./gatk-4.2.2.0/gatk

GATK4 .dict

./gatk-4.2.2.0/gatk CreateSequenceDictionary -R chr9.fa -O chr9.dict

GATK4 intervals

./gatk-4.2.2.0/gatk ScatterIntervalsByNs -R chr9.fa -O chr9.interval_list -OT ACGT

Mutect2

Call somatic SNVs and indels via local assembly of haplotypes

Mutect2 Tumor e Normal

O comando: samtools view -H normal_JAK2.bam você consegue pegar o campo SM: que contém o ID da amostra normal.

samtools view -H normal_JAK2.bam | grep RG | cut -f6 | sed -e "s/SM://g"
./gatk-4.2.2.0/gatk Mutect2 \
	-R chr9.fa \
	-I tumor_JAK2.bam \
	-I normal_JAK2.bam \
	-normal WP044 \
	--germline-resource af-only-gnomad-chr9.vcf.gz \
	-O somatic.vcf.gz \
	-L chr9.interval_list

Calcular Contaminação

GetPileupSummaries

Tabulates pileup metrics for inferring contamination

  • GetPileupSummaries Tumor
./gatk-4.2.2.0/gatk GetPileupSummaries \
	-I tumor_JAK2.bam \
	-V af-only-gnomad-chr9.vcf.gz \
	-L chr9.interval_list \
	-O tumor_JAK2.table
  • GetPileupSummaries Normal
./gatk-4.2.2.0/gatk GetPileupSummaries \
	-I normal_JAK2.bam \
	-V af-only-gnomad-chr9.vcf.gz \
	-L chr9.interval_list \
	-O normal_JAK2.table

CalculateContamination

Calculate the fraction of reads coming from cross-sample contamination

./gatk-4.2.2.0/gatk CalculateContamination \
	-I tumor_JAK2.table \
	-matched normal_JAK2.table \
	-O contamination.table

FilterMutectCalls

Filter somatic SNVs and indels called by Mutect2

./gatk-4.2.2.0/gatk FilterMutectCalls \
	-R chr9.fa \
	-V somatic.vcf.gz \
	--contamination-table contamination.table \
	-O filtered.vcf.gz

VEP ensembl - Anotação

VEP install

docker pull ensemblorg/ensembl-vep
  • Criar diretório de output do VEP com permissão total (aplicado apenas no gitpod)

Voltar para a casa

cd

Verificar se o diretório é o /home/gitpod

pwd

ex.: /home/gitpod

Copiar o arquivo filtered.vcf.gz

cp /workspace/somatico/filtered.vcf.gz .

Copiar o arquivo chr9.fa

cp /workspace/somatico/chr9.fa .

Copiar o arquivo chr9.fa.fai

cp /workspace/somatico/chr9.fa.fai .

Criar o diretorio vep_output

mkdir -p vep_output

Modificar a permissao do diretorio vep_output

chmod 777 vep_output
  • Aplicar apenas no Google Colab
mkdir -p vep_output
chmod 777 vep_output

rodar o vep

docker run -it --rm  -v $(pwd):/data ensemblorg/ensembl-vep ./vep  \
	-i /data/filtered.vcf.gz  \
	-o /data/vep_output/filtered.vep.tsv \
	--database --assembly GRCh37 --refseq  \
	--pick --pick_allele --force_overwrite --tab --symbol --check_existing \
	--fields "Location,SYMBOL,Consequence,Feature,Amino_acids,CLIN_SIG" \
	--fasta /data/chr9.fa \
	--individual all 

Panel of Normal (PoN)

Panel of Normal (PoN)

GATK Best Practices - Exome PoN

  • vcf
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf
  • vcf.idx
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.idx
  • Mutect2
./gatk-4.2.2.0/gatk Mutect2 \
  -R hg19.fa \
  -I tumor_wp190.bam \
  --germline-resource af-only-gnomad-chr13-chr19.vcf.gz \
  --panel-of-normals Mutect2-exome-panel.vcf \
  -L hg19.interval_list \
  -O WP190.somatic.pon.vcf.gz
  
  • CalculateContamination somente com o table do tumor (ex.: wp190)
./gatk-4.2.2.0/gatk CalculateContamination \
	-I tumor_wp190.table \
	-O WP190.contamination.pon.table
  • FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
	-R hg19.fa \
	-V WP190.somatic.pon.vcf.gz \
	--contamination-table WP190.contamination.pon.table \
	-O WP190.filtered.pon.vcf.gz

Anexo 1

Converter BAM para FASTQ

samtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP043/WP043.bam 9:5021937-5126899 | samtools bam2fq -1 tumor_R1.fq -2 tumor_R2.fq - 
samtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP044/WP044.bam 9:5021937-5126899 | samtools bam2fq -1 normal_R1.fq -2 normal_R2.fq - 

Converter BAM para JAK2 BAM

Aqui estão os comandos que foram utilizados para gerar os BAMs intermediários e como gerar arquivos FASTQs de regiões específicas do seu arquivo BAM completo.

Essas etapas não precisam ser executadas nesse pipeline

  • BAM para BAM
samtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP043/WP043.bam 9:5021937-5126899 > tumor_JAK2.bam
samtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP044/WP044.bam 9:5021937-5126899 > normal_JAK2.bam
  • Gerar index do BAM (.BAI)
samtools index tumor_JAK2.bam 
samtools index normal_JAK2.bam 

af-only-gnomad.vcf.gz (apenas região JAK2)

Google Clou af-only-gnomad.vcf.gz

  • Header do VCF
zgrep -w "\#" af-only-gnomad.raw.sites.chr.vcf.gz > header
  • Apenas Região do Gene JAK2
zgrep -w "^chr9" af-only-gnomad.raw.sites.chr.vcf.gz  | awk '$2>=5021937 && $2<=5126899' > JAK2.region
  • Concatenar header + vcf
cat header JAK2.region > af-only-gnomad-chr9.vcf
  • Compactar
bgzip af-only-gnomad-chr9.vcf
  • Index do VCF
tabix -p vcf af-only-gnomad-chr9.vcf.gz 

Anexo 2

Rodar amostras extras pareadas:

  • Com par:
sh run.chr13-chr19.sh tumor_JAK2.bam normal_JAK2.bam WP043 WP044
sh run.chr13-chr19.sh tumor_wp017.bam tumor_wp018.bam WP017 WP018
sh run.chr13-chr19.sh tumor_wp190.bam tumor_wp191.bam WP190 WP191
  • Com PoN:
sh run.chr13-chr19.pon.sh tumor_JAK2.bam normal_JAK2.bam WP043 WP044 
sh run.chr13-chr19.pon.sh tumor_wp017.bam tumor_wp018.bam WP017 WP018 
sh run.chr13-chr19.pon.sh tumor_wp190.bam tumor_wp191.bam WP190 WP191 

Código Fonte (run.chr13-chr19.pon.sh e run.chr13-chr19.sh)

# variaveis fixas
gnomad="af-only-gnomad-chr13-chr19.vcf.gz"
interval="hg19.interval_list"
genome="hg19.fa"

# bam e ids 
tumor=$1
normal=$2
id_tumor=$3
id_normal=$4

# rodando mutect2
./gatk-4.2.2.0/gatk Mutect2 \
	-R $genome \
	-I $tumor \
	-I $normal \
	-normal $id_normal \
	--germline-resource $gnomad \
	-O $id_tumor.somatic.vcf.gz \
	-L $interval

# getPileup tumor
./gatk-4.2.2.0/gatk GetPileupSummaries \
	-I $tumor \
	-V $gnomad \
	-L $interval \
	-O $id_tumor.table

# getPileup normal
./gatk-4.2.2.0/gatk GetPileupSummaries \
	-I $normal \
	-V $gnomad \
	-L $interval \
	-O $id_normal.table

# CalculateContamination
./gatk-4.2.2.0/gatk CalculateContamination \
	-I $id_tumor.table \
	-matched $id_normal.table \
	-O $id_tumor.contamination.table

# FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
	-R $genome \
	-V $id_tumor.somatic.vcf.gz \
	--contamination-table $id_tumor.contamination.table \
	-O $id_tumor.filtered.vcf.gz

Rodar amostras extras com Panel of Normal (broad institute).

# variaveis fixas
gnomad="af-only-gnomad-chr13-chr19.vcf.gz"
interval="hg19.interval_list"
genome="hg19.fa"
pon="Mutect2-exome-panel.vcf"

# bam e ids 
tumor=$1
id_tumor=$3

# rodando mutect2
./gatk-4.2.2.0/gatk Mutect2 \
	-R $genome \
	-I $tumor \
	--germline-resource $gnomad \
	--panel-of-normals $pon \
	-O $id_tumor.somatic.pon.vcf.gz \
	-L $interval

# getPileup tumor
./gatk-4.2.2.0/gatk GetPileupSummaries \
	-I $tumor \
	-V $gnomad \
	-L $interval \
	-O $id_tumor.pon.table

# CalculateContamination
./gatk-4.2.2.0/gatk CalculateContamination \
	-I $id_tumor.pon.table \
	-O $id_tumor.contamination.pon.table

# FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
	-R $genome \
	-V $id_tumor.somatic.pon.vcf.gz \
	--contamination-table $id_tumor.contamination.pon.table \
	-O $id_tumor.filtered.pon.vcf.gz