
Galaxy for virologist training Exercise 8: Viralrecon

Title: Galaxy
Training dataset: SARS-CoV-2 downsampled sequencing data used to report variants and lineages to Spanish national epidemiologists.
Questions:
  • How many variants do the samples have?
  • Which lineage do the samples belong to?
Objectives:
  • Learn how to run the viralrecon steps in Galaxy's interface
  • Understand the results generated
Estimated time: 1h 15 min

In this report you will find all the information necessary to follow the steps to analyze SARS-CoV-2 data with Galaxy.

Training overview

During this training we will follow these steps:

  1. Create a new history and name it Viralrecon
  2. Upload data: Upload the data for the analysis
  3. Quality: Analysis of the quality of the raw reads
  4. Trimming: Quality trimming using fastp
  5. Mapping: Mapping reads to the reference genome with Bowtie2
  6. Stats: Mapping statistics with samtools and picard
  7. Amplicons: Preprocessing steps mandatory for amplicon sequencing data
  8. Variants: Variant calling and filtering
  9. Consensus: Consensus genome generation

From now on, each job we run in Galaxy will have a unique number identifying that process. These numbers can differ depending on the number of samples and on how many times you run or delete a process. This training's snapshots were taken using other samples, and some processes were deleted, so numbers and names MAY DIFFER. However, the steps you have to run are THE SAME.

History

  1. Create a new history by clicking the ➕ icon and name it Viralrecon

Data

We are going to upload files using these URLs, as seen in the Galaxy tutorial on the first day:

https://zenodo.org/record/5724464/files/SARSCOV2-1_R1.fastq.gz?download=1
https://zenodo.org/record/5724464/files/SARSCOV2-1_R2.fastq.gz?download=1
https://zenodo.org/record/5724464/files/SARSCOV2-2_R1.fastq.gz?download=1
https://zenodo.org/record/5724464/files/SARSCOV2-2_R2.fastq.gz?download=1

Prior to any analysis, we have to download the fasta reference genome using the following URL:

https://zenodo.org/record/5724970/files/GCF_009858895.2_ASM985889v3_genomic.200409.fna.gz?download=1

Also, you will download the BED file of the amplicon primers, which contains the position of each amplicon primer in the reference genome. Use this URL in the upload window:

https://zenodo.org/record/5724970/files/nCoV-2019.artic.V3.scheme.bed.txt?download=1

Finally, rename and tag the data as follows:

  • SARSCOV2-1_R1.fastq.gz to SARSCOV2-1_R1 with tags #sample1 and #forward
  • SARSCOV2-1_R2.fastq.gz to SARSCOV2-1_R2 with tags #sample1 and #reverse
  • SARSCOV2-2_R1.fastq.gz to SARSCOV2-2_R1 with tags #sample2 and #forward
  • SARSCOV2-2_R2.fastq.gz to SARSCOV2-2_R2 with tags #sample2 and #reverse
  • GCF_009858895.2_ASM985889v3_genomic.200409.fna.gz?download=1 to NC_045512.2 with tag #reference
  • nCoV-2019.artic.V3.scheme.bed.txt?download=1 to Amplicon bed with tag #amplicons

*Screenshot: viralrecon_tag*

Quality

Quality Analysis (FastQC)

Once we have the raw data, an important step is to analyze the quality of the reads, to determine whether they are good enough to work with. To do this, we have to look for the program "FastQC" in the search bar, then select FastQC Read Quality reports and set the following parameters, as shown here:

  • Select Multiple datasets and select the R1 and R2 fastq files for both samples
  • Hold Ctrl to select several datasets at once
  • Then go down and select Run tool

*Screenshot: viralrecon_fastqc*
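Outside Galaxy, the same report can be produced with the FastQC command-line tool; a minimal sketch, assuming a local FastQC install:

```bash
# Generates one HTML quality report per fastq file
fastqc SARSCOV2-1_R1.fastq.gz SARSCOV2-1_R2.fastq.gz \
       SARSCOV2-2_R1.fastq.gz SARSCOV2-2_R2.fastq.gz
```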

FastQC results visualization and interpretation questions

To visualize the information coming from FastQC, we just have to select the job of interest. In this case we are interested in the "Web page results", so for the sample whose results we want to see, we have to click on the eye icon to visualize the results in Galaxy:

For more information about FastQC output visit the FastQC website

Question

What is the read length? What type of sequencing are we doing?
150 bp maximum; 2x150 sequencing (paired-end data with 150 bp read length)
How many reads does sample1 have before trimming?
50,000
How many reads does sample2 have before trimming?
50,000 (this is because we downsampled the data manually; usually samples from the same run do not have the same number of reads)

Trimming

Quality trimming (Fastp)

Once we have checked the quality of our reads, it's important to trim low-quality nucleotides from those reads, for which we will use fastp. So, in the search bar look for fastp and then select "fastp - fast all-in-one preprocessing for FASTQ files". There, we will have to change some parameters to ensure the trimming is appropriate for this amplicon data (a command-line sketch follows the screenshots below). These are the fields we will have to change:

  1. Search for fastp in the tools and select fastp - fast all-in-one preprocessing for FASTQ files
  2. Select custom parameters:
    • Single-end or paired reads > Paired
      • Input 1 > Multiple datasets > Select R1 reads for both samples
      • Input 2 > Multiple datasets > Select R2 reads for both samples
    • Display Filter Options
      • Quality Filtering options
        • Qualified Quality Phred = 30
        • Unqualified percent limit = 10
      • Length Filtering Options
        • Length required = 50
    • Read modification options
      • PolyX tail trimming > Enable polyX tail trimming
      • Per read cutting by quality options
        • Cut by quality in front (5') > Yes
        • Cut by quality in tail (3') > Yes
        • Cutting mean quality = 30
  3. Finally, click on Run tool

*Screenshots: fastp1, fastp2, fastp3*
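For reference, an approximate command-line equivalent of these settings, shown for sample1 only; a sketch assuming a local fastp install (output file names are hypothetical):

```bash
# Quality filter (Phred >= 30, at most 10% unqualified bases), length filter
# (>= 50 bp), polyX tail trimming, and 5'/3' sliding-window trimming at mean Q30
fastp \
  -i SARSCOV2-1_R1.fastq.gz -I SARSCOV2-1_R2.fastq.gz \
  -o SARSCOV2-1_R1.trim.fastq.gz -O SARSCOV2-1_R2.trim.fastq.gz \
  --qualified_quality_phred 30 --unqualified_percent_limit 10 \
  --length_required 50 \
  --trim_poly_x \
  --cut_front --cut_tail --cut_mean_quality 30 \
  --html SARSCOV2-1.fastp.html
```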

A message will appear indicating that 6 results will be generated:

  1. Two files with the R1 trimmed reads, one per sample
  2. Two files with the R2 trimmed reads, one per sample
  3. Two files with the HTML results, one per sample

Fastp results

Once the fastp analysis is done, you can see the results by clicking on the eye icon ("View Data") in the fastp HTML results.

Among the most relevant results, you have:

  • Summary: Stats summary
    • After filtering: Statistics of the reads after quality filtering
      • reads passed filters: Reads remaining after quality filter trimming
      • reads with low quality: Reads that were removed due to low quality
      • reads too short: Reads that didn't pass the minimum length filter
    • After filtering: Plots after filtering
      • After filtering: read1: quality: Plot of R1 quality over read position. Usually quality decays in the last nucleotides.
      • After filtering: read2: quality: Same plot for R2.

For more information about fastp output visit the fastp GitHub

Question

How many reads are we keeping from sample1?
67,598 reads (67.60%)
How many reads did we lose for sample1 and why?
Low quality: 26,814 (26.81%)
Too many Ns: 6 (0%)
Too short: 5,582 (5.58%)
How many reads are we keeping from sample2?
65,172 (65.17%)
How many reads did we lose for sample2 and why?
Low quality: 29,148 (29.15%)
Too many Ns: 4 (0%)
Too short: 5,676 (5.68%)

Mapping

In order to call variants between the samples and the reference, it's mandatory to map the sample reads to the reference genome. To do this we need the fasta file of the reference and the Bowtie2 index of that fasta file.

Mapping reads with reference genome (Bowtie2)

Now we can start with the main mapping process. The first thing we have to do is look for the program "Bowtie2" in the search bar and then select "Bowtie2 - map reads against reference genome". Here we will have to set the following parameters, as shown here:

  1. Is this single or paired library > Paired-end
  2. FASTA/Q file #1: Multiple datasets > fastp Read 1 output for both samples
  3. FASTA/Q file #2: Multiple datasets > fastp Read 2 output for both samples
  4. Will you select a reference genome from your history or use a built-in index? > Use a built-in genome index
  5. Select reference genome > SARS-CoV-2 isolate Wuhan-Hu-1, complete genome (NC_045512.2).
  6. Do you want to use presets? > Very sensitive local
  7. Save the bowtie2 mapping statistics to the history > Yes
  8. Run tool

*Screenshots: bowtie1, bowtie2*
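Outside Galaxy, the mapping step could be sketched as follows, assuming local bowtie2 and samtools installs and a decompressed reference fasta (index and file names are hypothetical):

```bash
# Build the Bowtie2 index once from the reference fasta
bowtie2-build NC_045512.2.fna NC_045512.2

# Map sample1 with the very-sensitive-local preset; the mapping statistics
# go to stderr, which we save as the log file
bowtie2 -x NC_045512.2 \
  -1 SARSCOV2-1_R1.trim.fastq.gz -2 SARSCOV2-1_R2.trim.fastq.gz \
  --very-sensitive-local \
  2> SARSCOV2-1.bowtie2.log \
  | samtools sort -o SARSCOV2-1.sorted.bam -
samtools index SARSCOV2-1.sorted.bam
```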

Mapping results

Now we can see the mapping results for the samples. The Bowtie2 output is a .bam file, which is not easy for humans to read. This .bam file can be downloaded by clicking on the alignment dataset and then on the download icon. The resulting .gz file will contain the alignment .bam file, which can be loaded into software such as IGV together with the reference genome fasta file.

In our case, the file that can be visualized is the mapping stats file, which contains information such as the percentage of reads that aligned.

Question

What is the overall alignment rate for sample1?
99.69%
And the overall alignment rate for sample2?
97.73%

Stats

The previously shown files give little human-readable information, because mapping files are meant to be used by other programs. We can therefore use some programs to extract relevant statistics about the mapping process.

Samtools flagstat

The first program is Samtools, from which we will use the module samtools flagstat. To do this, we have to look in the search bar for "samtools flagstat" and then select "Samtools flagstat tabulate descriptive stats for BAM dataset". There, we just have to select the alignments we want mapping stats for (Bowtie2 on data X, data X and data X: alignments), for both samples. You can select them from the list via Multiple datasets, or click the folder icon (Browse datasets) to select the files from the history. Finally, select Execute.

*Screenshot: samtools_flagstat_1*
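The command-line equivalent is a one-liner per BAM file (file name hypothetical):

```bash
# Prints counts and percentages of mapped, properly paired and duplicate reads
samtools flagstat SARSCOV2-1.sorted.bam
```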

Samtools results

The samtools flagstat output gives information about the number and percentage of reads that mapped to the reference genome.

Question

How many reads mapped against the reference genome for sample1?
67390
And how many for sample2?
63695

Picard CollectWgsMetrics

Another program that gives statistical information about the mapping process is Picard. To run this program you just have to search "Collect Wgs Metrics" and then select "CollectWgsMetrics compute metrics for evaluating of whole genome sequencing experiments".

You have to change the following parameters:

  1. Select SAM/BAM dataset or dataset collection > Dataset collection > Select both bam files at once
  2. Load reference genome from > Local cache
  3. Using reference genome > SARS-CoV-2 isolate Wuhan-Hu-1, complete genome (NC_045512.2).
  4. Treat bases with coverage exceeding this value as if they had coverage at this value = 1000
  5. Select validation stringency > Lenient
  6. Run tool.

*Screenshot: picard_wgsmetrics1*

This process will generate one output file per .bam alignment file selected as input.
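A rough command-line equivalent, assuming a local Picard install (file names are hypothetical):

```bash
# Coverage above 1000x is capped at 1000; lenient validation tolerates
# minor BAM formatting issues instead of aborting
picard CollectWgsMetrics \
  I=SARSCOV2-1.sorted.bam \
  O=SARSCOV2-1.wgsmetrics.txt \
  R=NC_045512.2.fna \
  COVERAGE_CAP=1000 \
  VALIDATION_STRINGENCY=LENIENT
```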

Picard results

Picard results consist of quite long files, so it is best to download them and view them on your computer. To do so, click on the CollectWgsMetrics job you want to download, and then click on the save button:

Then you just have to open the file with Excel on your computer, and you will see a file with different columns containing information such as the percentage of the reference genome covered by reads at a specific depth, or the mean depth of coverage of the reference genome.

Question

What is the mean coverage for sample1?
162.190917
What percentage of the reference genome is covered at more than 10X by sample1 reads?
78.58%
What is the mean coverage for sample2?
163.623282
What percentage of the reference genome is covered at more than 10X by sample2 reads?
88.69%

Amplicons

After mapping the reads to the reference genome, we are interested in removing the amplicon primer sequences. To do that you will use a program called iVar, and you will need a BED file with the positions of those amplicon primers.

Trim amplicon sequences

Once you have the bed file, you just have to search for "ivar trim" in the search bar and select "ivar trim Trim reads in aligned BAM". Then follow these steps:

  1. Bam file > Select the alignment bam file generated with Bowtie2 for both samples.
  2. Source of primer information > Built-in.
  3. Primer scheme name > SARS-CoV-2-ARTICv3
  4. Include reads with no primers > Yes.
  5. Require a minimum length for reads to retain them after any trimming? > Yes and provide a custom threshold
  6. Minimum trimmed length threshold = 20

*Screenshot: ivar_trim1*
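Outside Galaxy, the same step could be sketched with a local copy of the ARTIC V3 primer scheme (output names are hypothetical):

```bash
# -e keeps reads with no primers; -m 20 sets the minimum trimmed read length
ivar trim -i SARSCOV2-1.sorted.bam \
  -b nCoV-2019.artic.V3.scheme.bed \
  -e -m 20 -p SARSCOV2-1.ivar_trim

# ivar trim writes an unsorted BAM, so sort and index it for the next steps
samtools sort -o SARSCOV2-1.ivar_trim.sorted.bam SARSCOV2-1.ivar_trim.bam
samtools index SARSCOV2-1.ivar_trim.sorted.bam
```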

iVar trim results

The output of iVar trim is a new BAM file with the amplicon primer positions removed, so there's no result to visualize.

Variants

Once we have the alignment statistics and the BAM files with amplicon primers trimmed, we can start with the variant calling process.

iVar variants

iVar variants calls variants (SNPs and indels) from the aligned, primer-trimmed BAM file. For each position of the alignment, the bases are compared against the reference, and a variant is reported only if it passes the minimum base quality threshold (default: 20) and the minimum allele frequency threshold.

  1. Search for ivar variants and select ivar variants Call variants from aligned BAM file
  2. Bam file > Select ivar trimmed bam files for both samples
  3. Reference > GCF_009858895.2_ASM985889v3 (as fasta)
  4. Minimum frequency threshold > 0.75
  5. Output format > Both tabular and VCF
  6. In VCF only output variants that PASS all filters > Yes

*Screenshot: ivar_variants*
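On the command line, ivar variants reads a samtools mpileup stream; a sketch with the thresholds used above (note that Galaxy's wrapper additionally converts iVar's tabular output to VCF, which is why we can select both output formats):

```bash
# -q 20: minimum base quality to count a base; -t 0.75: minimum allele frequency
samtools mpileup -aa -A -d 0 -B -Q 0 --reference NC_045512.2.fna \
  SARSCOV2-1.ivar_trim.sorted.bam \
  | ivar variants -p SARSCOV2-1 -q 20 -t 0.75 -r NC_045512.2.fna
```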

iVar results

iVar results consist of a VCF file containing all the variants found between the reference and the samples. Each line represents a variant, and the columns give information about that variant, such as the position in the reference genome, the reference allele, the alternate allele, whether that variant passed the filters, and so on.

These variants have passed the minimum quality filter, which we set to 20, and the minimum allele frequency of 75%.

Question

How many positions differ (variants) between the reference and sample1, passing all filters?
40
And how many between the reference and sample2, passing all filters?
40

Annotation with SnpEff

Once we have the variants called, it's interesting to annotate them, for which you will use SnpEff. Search for "snpeff" in the search bar and select "SnpEff eff: annotate variants for SARS-CoV-2", then change the following parameters:

  1. Sequence changes (SNPs, MNPs, InDels) > Select ivar output VCF for both samples
  2. Create CSV report, useful for downstream analysis (-csvStats) > Yes

*Screenshot: snpeff*
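For reference, a local SnpEff run could be sketched as follows; the database name NC_045512.2 is an assumption here — check the output of snpEff databases for the exact identifier:

```bash
# -csvStats writes the annotation statistics as CSV alongside the annotated VCF
snpEff eff -csvStats SARSCOV2-1.snpeff.csv NC_045512.2 \
  SARSCOV2-1.vcf > SARSCOV2-1.ann.vcf
```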

SnpEff results

SnpEff gives three different outputs, of which the most interesting are:

  1. Snpeff eff: A VCF file with the annotation results. It is very similar to the VCF generated by iVar variants, but with an additional annotation field in the last column containing relevant information about each variant.

  2. Snpeff eff CSV stats: A CSV file that contains statistics about the variant annotation process, such as the percentage of variants annotated, the percentage of variants that are MISSENSE or SILENT, the percentage that have HIGH, MODERATE or LOW effect, and so on.

Question

How many missense variants does sample1 have?
25
How many INDELs does sample1 have?
3
How many missense variants does sample2 have?
31
How many INDELs does sample2 have?
1

Consensus

Once we have the most relevant variants, the ones to be included in the consensus genome, you can start with the consensus genome generation.

Bcftools consensus

The first step consists of incorporating the called variants into the reference genome, for which you will search for "bcftools consensus" in the search bar and then select "bcftools consensus Create consensus sequence by applying VCF variants to a reference fasta file". In this module you have to select:

  1. VCF/BCF Data > VCF resulting from iVar variants for both samples
  2. Choose the source for the reference genome > Use a genome from the history
  3. Reference genome > Fasta file uploaded at the beginning.
  4. Execute

*Screenshot: bcftools_consensus*

This will generate a fasta file identical to the reference, except for those nucleotides that appear as variants in the VCF file.
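A command-line sketch of the same step, assuming bgzip, tabix and bcftools are installed (file names are hypothetical):

```bash
# bcftools consensus requires a compressed, indexed VCF
bgzip SARSCOV2-1.vcf
tabix -p vcf SARSCOV2-1.vcf.gz

# Apply the variants to the reference to obtain the raw consensus
bcftools consensus -f NC_045512.2.fna SARSCOV2-1.vcf.gz > SARSCOV2-1.consensus.fa
```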

Genome coverage calculation

At this point, we have the consensus viral genome, but variant calling is not reliable in low-coverage regions, so we cannot be sure that the consensus sequence is correct wherever the coverage depth is lower than 10X. The next step is therefore to determine which regions of the reference genome have a coverage lower than 10X.

To do that you will search for "bedtools genomecov" in the search bar and select "bedtools Genome Coverage compute the coverage over an entire genome", then you will have to set the following fields:

  1. Input type > BAM
  2. BAM file > iVar trim output bam files for both samples
  3. Output type > BedGraph coverage file
  4. Report regions with zero coverage > Yes
  5. Execute

*Screenshot: bedtools_genomecov*
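The equivalent command line, assuming a local bedtools install (file names are hypothetical):

```bash
# -bga writes BedGraph intervals, including regions with zero coverage
bedtools genomecov -ibam SARSCOV2-1.ivar_trim.sorted.bam -bga \
  > SARSCOV2-1.coverage.bed
```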

This process will generate a BED file in which each genomic position range of the reference genome has its coverage calculated. In the example below you can see that positions 2 to 54 of the reference genome have a coverage of 2X, so they will be masked later.

*Screenshot: bedtools_genomecov_result*

Regions filtering

From the resulting bedtools genomecov file, you are going to select those regions with a coverage lower than 10X. Write "awk" in the search bar, select "Text reformatting with awk", and set:

  1. File to process > Bedtools genome coverage file with the coverage regions, for both samples
  2. AWK Program = $4 < 10
    • This will keep all the lines (genomic regions) that have a value lower than 10 in the 4th column (coverage)
  3. Execute

*Screenshot: awk*
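On the command line this is a one-liner (file names are hypothetical):

```bash
# Keep only the intervals whose 4th column (depth) is below 10
awk '$4 < 10' SARSCOV2-1.coverage.bed > SARSCOV2-1.lowcov.bed
```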

The resulting file has exactly the same format as the bedtools genomecov output, but only contains the lines for genomic regions with coverage lower than 10X.

Masking the consensus genome

Now that you have the consensus genome and the regions with a sequencing depth lower than 10X, you are going to "mask" those regions in the consensus genome by replacing the nucleotides in those regions with "N"s. You have to search for "bedtools maskfasta", select "bedtools MaskFastaBed use intervals to mask sequences from a FASTA file" and then select the following parameters:

  1. BED/bedGraph/GFF/VCF/EncodePeak file > Select the BED files resulting from AWK text filter for both samples
  2. FASTA file > Select the consensus genome fasta file generated with Bcftools consensus, for both samples
  3. Execute

*Screenshot: bedtools_maskfasta*
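The equivalent bedtools command (file names are hypothetical):

```bash
# Replace the low-coverage intervals in the consensus with Ns
bedtools maskfasta -fi SARSCOV2-1.consensus.fa \
  -bed SARSCOV2-1.lowcov.bed -fo SARSCOV2-1.masked.fa
```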

The resulting file is the consensus genome generated previously, but it now contains Ns instead of A, T, G or C in the regions with less than 10X depth of coverage.

*Screenshot: bedtools_maskfasta_result*

You can download this fasta file and upload it to a public repository such as ENA or GISAID. You can also use it to build phylogenetic trees or for any other analysis based on the SARS-CoV-2 consensus fasta file.

Lineage

Now we are going to determine the lineage of the samples using a software called Pangolin. We are going to use the masked consensus genomes generated in the previous steps as follows:

  1. Search for the pangolin tool
  2. Select Pangolin Phylogenetic Assignment of Outbreak Lineages and set the following parameters:
  3. Select the bedtools MaskFastaBed output generated in the previous step as the input fasta file, for both samples
  4. Execute

*Screenshot: pangolin*
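Locally, pangolin runs directly on the masked consensus fasta (file name hypothetical):

```bash
# Writes lineage_report.csv with the assigned lineage per input sequence
pangolin SARSCOV2-1.masked.fa
```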

Now we are going to have a look at the results from pangolin. As you can see, results are in table format, with the name of the sequence first and then the lineage assigned.

Question

What is the lineage of sample1?
B.1.1.7 (Alpha)
And the lineage of sample2?
AY.33 (Delta)

All results

If you have any problem following this training and you want to look at the resulting files, you can access them through this URL:

https://usegalaxy.eu/u/s.varona/h/viralrecon

And the viralrecon workflow at:

https://usegalaxy.eu/u/s.varona/w/viralrecon2022