Skip to content

Commit

Permalink
Update manipulating_vcf.md
Browse files Browse the repository at this point in the history
  • Loading branch information
Netzach authored Aug 22, 2024
1 parent e5686f9 commit 275149b
Showing 1 changed file with 35 additions and 56 deletions.
91 changes: 35 additions & 56 deletions docs/Case_studies/manipulating_vcf.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ Finally, full gene names and functions found in this [file](https://ftp.flybase.

#### Chromosome 4

??? "_Hint_, what do we need?"
??? "_Hint, what do we need?_"
Extract chr4 from the vcf and the gff and make new files

??? "_Hint_"
Expand All @@ -46,34 +46,14 @@ Finally, full gene names and functions found in this [file](https://ftp.flybase.

`awk '/^4/{print $0}' dgrp2_trimmed.vcf > dgrp2_chr4.vcf`

##### *bonus result*
A table with counted and sorted different genomic features in chromosome 4.

??? "__example__"
```
1 chromosome
1 snoRNA
2 pre_miRNA
7 pseudogene
11 pseudogenic_transcript
26 ncRNA_gene
31 ncRNA
79 gene
295 mRNA
338 three_prime_UTR
571 five_prime_UTR
2740 CDS
3155 exon
```


#### Making awking the data easier
Start by splitting the task into sub-tasks. This makes it easier to see what happens and you might get interesting intermediary results.

??? "_Hint_, Example"
??? "_Hint, Example_"
Let's say we want to find out all genes that contains a variant and all variants that are located within a gene. What do we want to do first? Take a look at the vcf file. That is the one that contains all the variants. Then look at the gff file, which contains the genes and other annotations. Finally, take a look at the DNA sequence. You will need to combine all three to answer the question.

??? "_Hint_, Example, what do we need to get?"
??? "_Hint, Example, what do we need to get?_"
* Positions for SNPs and INDELs
* Positions for genes and CDSs
* Separation of variants (SNPs and INDELs) into two groups, inside and outside genes (and CDSs)
Expand All @@ -82,14 +62,35 @@ Start by splitting the task into sub-tasks. This makes it easier to see what hap
### The exercise
Identify the steps you need to do and what each step does. Open the hints if you get stuck.

#### *bonus result*
A table with counted and sorted different genomic features in chromosome 4.

#### *bonus result*
SNPs sorted by number. Just like the coins on day one.





## Hints
??? "_bonus result example_ *A table with counted and sorted different genomic features in chromosome 4.*"
```
1 chromosome
1 snoRNA
2 pre_miRNA
7 pseudogene
11 pseudogenic_transcript
26 ncRNA_gene
31 ncRNA
79 gene
295 mRNA
338 three_prime_UTR
571 five_prime_UTR
2740 CDS
3155 exon
```

### SNPs and INDELs
### Variants

??? "_Hint_, which output do we want?"
Get distribution of variants and list them in two separate files. For a bonus plot of the lengths of the INDELS, get the length of all INDELS into a third file
Expand Down Expand Up @@ -136,10 +137,7 @@ Identify the steps you need to do and what each step does. Open the hints if you
??? "_Solution_ proposed by Loïs Rancilhac - 2022.08.30"
`awk '/_SNP/ {SNP++; print $0 > "chr4_SNPs.vcf"} /_DEL/ {DEL++; print $0 > "chr4_DEL.vcf"; LENGTH=length($4)-length($5); print LENGTH > "Deletions_lengths.txt"} /_INS/ {INS++; print $0 > "chr4_INS.vcf"; LENGTH=length($5)-length($4); print LENGTH > "Insertions_lengths.txt"} END{print "SNPs: "SNP"\nInsertions: "INS"\nDeletions: "DEL}' chr4.vcf`

#### *bonus result:*
SNPs introduce sorted by number. Remember the coins...

??? "_bonus result example_"
??? "_bonus result example_ *SNPs sorted by number*"
```
1182 C->T
1133 G->A
Expand Down Expand Up @@ -197,28 +195,6 @@ SNPs introduce sorted by number. Remember the coins...
}
```



#### *Follow-up task:*
Count and sort the SNPs and INDELs in your output and compare to the output from the first step. (If you want to separate the results for the SNPs and the INDELs, how would you modify your code?)

??? "_Task result example_"
```
1 chromosome
1 pre_miRNA
1 snoRNA
6 pseudogene
9 pseudogenic_transcript
22 ncRNA_gene
28 ncRNA
79 gene
264 three_prime_UTR
290 five_prime_UTR
295 mRNA
1798 CDS
2181 exon
```

### Genes/CDSs only

??? "_Hint_, what features do we look for?"
Expand All @@ -231,7 +207,7 @@ Count and sort the SNPs and INDELs in your output and compare to the output from
`awk '$3 ~ /gene|CDS/' Drosophila_melanogaster.chr4.gff3 > Drosophila_melanogaster.chr4_genesCDSs.gff3`


### Final list of variants
### List of variants

??? "_Hint_, how do we classify the variants?"
Repeat step 3 for the SNPs/INDELs themselves, to see which are actually located inside genes
Expand Down Expand Up @@ -269,12 +245,15 @@ Count and sort the SNPs and INDELs in your output and compare to the output from
}
```

### Final result
Here we can see all SNPs and INDELs which are inside a relevant region. We have successfully made two gff:s containing all gene positions for genes with variants and genes without. Along the way we also got a list of all genes that contain variants too.

### Extra task
Re-do the files but this time include the gene ID (i.e. FBtr0089178 from column nine) and translate that into the full gene name found in this [file](https://ftp.flybase.org/releases/FB2020_06/precomputed_files/genes/fbgn_fbtr_fbpp_expanded_fb_2020_06.tsv.gz).
??? "_Hint, finding the names"
Find a common field

??? "_Hint, which field"
gene ID

??? "_Hint, where is that?"
column nine, not awk!

??? "__Solution__"
`awk 'FNR==1{++fileidx} fileidx==1{split($9,a,";|:");ingene[$1,$4,$5]=a[2]} fileidx==2{FS="\t";name[$3]=$5} fileidx==3{state="Not in gene";for (trip in ingene) {split(trip, t, SUBSEP); if ($1==t[1] && $2>=t[2] && $2<=t[3]) {state=(t[1] "\t" t[2] "\t" t[3] "\t" name[ingene[t[1],t[2],t[3]]])}} print $0, "\t", state }' Drosophila_melanogaster.chr4_genesCDSs.gff3 fbgn_fbtr_fbpp_expanded_fb_2020_06.tsv indels_Drosophila_chr4 > SNPsInNamedGenes_Drosophila_ch4`
<br>
Expand Down

0 comments on commit 275149b

Please sign in to comment.