<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<meta name="author" content="DNA and RNA-seq NGS alignment" />
<title>NGS data analysis course</title>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet" href="../../../Commons/css_template_for_examples.css" type="text/css" />
</head>
<body>
<div id="header">
<h1 class="title"><a href="http://ngs-course.github.io/">NGS data analysis course</a></h1>
<h2 class="author"><strong>DNA and RNA-seq NGS alignment</strong></h2>
<h3 class="date"><em>(updated 28-05-2014)</em></h3>
</div>
<!-- COMMON LINKS HERE -->
<h1 id="preliminaries">Preliminaries</h1>
<p>In this hands-on we will learn how to align DNA and RNA-seq data with the most widely used software today. Building a whole-genome index requires a lot of RAM and almost one hour on a typical workstation, so <strong>in this tutorial we will work with chromosome 21</strong> to speed up the exercises. The same steps apply to a whole-genome alignment. Two DNA datasets have been simulated, one of high and one of low quality: the high quality dataset contains 0.1% mutations and the low quality one 1%. For RNA-seq, a 100bp and a 150bp dataset have been simulated.</p>
<h3 id="ngs-aligners-used">NGS aligners used:</h3>
<ul>
<li><a href="http://bio-bwa.sourceforge.net/" title="BWA">BWA</a> : BWA is a software package for mapping <strong>DNA</strong> low-divergent sequences against a large reference genome, such as the human genome.</li>
<li><a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" title="Bowtie2">Bowtie2</a> : <em>Bowtie 2</em> is an ultrafast and memory-efficient tool for aligning <strong>DNA</strong> sequencing reads to long reference sequences.</li>
<li><a href="http://tophat.cbcb.umd.edu/" title="TopHat">TopHat</a> : <em>TopHat</em> is a fast splice junction mapper for RNA-Seq reads. It aligns <strong>RNA-Seq</strong> reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.</li>
<li><a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> : <em>STAR</em> aligns <strong>RNA-seq</strong> reads to a reference genome using uncompressed suffix arrays.</li>
</ul>
<h3 id="other-software-used-in-this-hands-on">Other software used in this hands-on:</h3>
<ul>
<li><a href="http://samtools.sourceforge.net/" title="SAMtools">SAMTools</a> : SAM Tools <strong>provide various utilities</strong> for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.</li>
<li><a href="http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation" title="dwgsim">dwgsim</a> : dwgsim can perform whole <strong>genome simulation</strong>.</li>
<li><a href="http://www.cbil.upenn.edu/BEERS/" title="BEERS">BEERS</a> : BEERS is a <strong>simulation engine</strong> for generating <strong>RNA-Seq</strong> data.</li>
</ul>
<h3 id="file-formats-explored">File formats explored:</h3>
<ul>
<li><a href="http://samtools.sourceforge.net/SAMv1.pdf">SAM</a>: Sequence Alignment/Map format, plain text.</li>
<li><a href="http://www.broadinstitute.org/igv/bam">BAM</a>: Binary, compressed version of SAM.</li>
</ul>
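<p>Because SAM is plain text, standard shell tools can inspect it. The following is a minimal sketch that counts alignment records, assuming only that header lines start with <code>@</code> as the SAM specification defines. The function name <code>count_sam_records</code> and the file <code>example.sam</code> are illustrative and not part of the course materials:</p>

```shell
# count_sam_records FILE
# Prints the number of alignment records in a SAM file, i.e. every
# line that is not a header line (header lines start with "@").
count_sam_records() {
    grep -vc '^@' "$1"
}

# Tiny hand-written SAM file with two header lines and one record.
# The reference length (LN:1000) is a made-up value for illustration.
printf '@HD\tVN:1.0\n@SQ\tSN:21\tLN:1000\nr1\t0\t21\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n' > example.sam

count_sam_records example.sam
```

<p>A BAM file is binary, so it would first have to be converted back to text before a line-based tool like this can read it.</p>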
<h3 id="data-used-in-this-practical">Data used in this practical</h3>
<p>Create a <code>data</code> folder in your <strong>working directory</strong> and download the <strong>reference genome sequence</strong> to be used (human chromosome 21) and <em>simulated datasets</em> from <strong>Dropbox</strong> <a href="https://www.dropbox.com/sh/4qkqch7gyt888h7/AABD_i9ShwryfAqGeJ0yqqF3a">data</a>. For the rest of this tutorial the <strong>working directory</strong> will be <strong>cambridge_mda14</strong> and all the <strong>paths</strong> will be relative to that working directory:</p>
<pre><code>cd cambridge_mda14
mkdir data</code></pre>
<h5 id="download-reference-genome-from-ensembl">Download reference genome from <a href="http://www.ensembl.org/index.html" title="Ensembl">Ensembl</a></h5>
<p>Working with NGS data requires a high-end workstation and time for building the reference genome indexes and running the alignments. During this tutorial we will work only with chromosome 21 to keep runtimes short. You can download it from <strong>Dropbox</strong> <a href="https://www.dropbox.com/sh/4qkqch7gyt888h7/AABD_i9ShwryfAqGeJ0yqqF3a">data</a>, or from the <a href="http://www.ensembl.org/index.html" title="Ensembl">Ensembl</a> website by following the <em>Download</em> link at the top and then <em>Download data via FTP</em>. You can get there in one step by going to:</p>
<p><a href="http://www.ensembl.org/info/data/ftp/index.html">http://www.ensembl.org/info/data/ftp/index.html</a></p>
<p>You should see a species table with a Human (<em>Homo sapiens</em>) row and a <em>DNA (FASTA)</em> column. Follow that link, or go directly to <a href="ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/">ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/</a>, download the chromosome 21 file (<em>Homo_sapiens.GRCh37.75.dna.chromosome.21.fa.gz</em>), move it from your browser download folder to your <code>data</code> folder and uncompress it:</p>
<pre><code>mv Homo_sapiens.GRCh37.75.dna.chromosome.21.fa.gz path_to_local_data
gunzip path_to_local_data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa.gz</code></pre>
<p><strong>NOTE:</strong> For working with the whole reference genome the file to be downloaded is <em>Homo_sapiens.GRCh37.75.dna.toplevel.fa.gz</em></p>
<h5 id="copy-simulated-datasets">Copy simulated datasets</h5>
<p>For this hands-on we are going to use small DNA and RNA-seq datasets simulated from chromosome 21. The data have already been simulated using <em>dwgsim</em> (which is based on <em>wgsim</em> from SAMtools) for DNA and <em>BEERS</em> for RNA-seq. You can copy them from the shared resources on <strong>Dropbox</strong> <a href="https://www.dropbox.com/sh/4qkqch7gyt888h7/AABD_i9ShwryfAqGeJ0yqqF3a">data</a> into your <code>data</code> directory for this practical session. To prepare the data directory:</p>
<pre><code>cp path_to_course_materials/alignment/* your_local_data/</code></pre>
<p>The names of the folders and files describe the dataset, e.g. <code>dna_chr21_100_hq</code> stands for: <em>DNA</em> type of data from <em>chromosome 21</em> with <em>100</em>nt read lengths of <em>high</em> quality. Here <em>hq</em> means 0.1% mutations and <em>lq</em> means 1% mutations. Take a few minutes to understand the different files.</p>
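<p>The naming convention can be unpacked mechanically. This is a small sketch, assuming every dataset name has exactly four underscore-separated fields as in the example above. The function name <code>describe_dataset</code> and the output labels are hypothetical:</p>

```shell
# describe_dataset NAME
# Splits a dataset name such as dna_chr21_100_hq on underscores and
# prints its four fields. The subshell keeps the IFS change local.
describe_dataset() {
    (
        IFS=_
        set -- $1
        echo "type=$1 region=$2 read_length=$3 quality=$4"
    )
}

describe_dataset dna_chr21_100_hq
describe_dataset rna_chr21_150_lq
```

<p>The first call prints <code>type=dna region=chr21 read_length=100 quality=hq</code>.</p>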
<p><strong>NOTE:</strong> If you want to learn how to simulate DNA and RNA-seq for other conditions go down to the end of this tutorial.</p>
<h5 id="real-datasets">Real datasets</h5>
<p>If you have access to a high-end cluster, you can index and simulate whole-genome datasets, or download real datasets from these sources:</p>
<ul>
<li><a href="http://www.1000genomes.org/">1000genomes project</a></li>
<li><a href="https://www.ebi.ac.uk/ena/">European Nucleotide Archive (ENA)</a></li>
<li><a href="http://www.ncbi.nlm.nih.gov/sra">Sequence Read Archive (SRA)</a></li>
</ul>
<h3 id="installing-samtools">Installing SAMtools</h3>
<p>Check whether it is already installed by executing:</p>
<pre><code>samtools</code></pre>
<p>A list of commands should be printed. If not then proceed with the installation.</p>
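<p>The same check can be scripted for every tool used in this tutorial. The sketch below relies on the standard shell builtin <code>command -v</code>; the helper name <code>have_tool</code> is hypothetical:</p>

```shell
# have_tool NAME
# Succeeds (exit status 0) if NAME is found on the PATH.
have_tool() {
    command -v "$1" > /dev/null
}

# Report which of the programs used in this tutorial are available.
for tool in samtools bwa bowtie2 tophat2; do
    if have_tool "$tool"; then
        echo "$tool: already installed"
    else
        echo "$tool: not found, proceed with the installation"
    fi
done
```

<p>Unlike running the program with no arguments, this only tells you whether the executable is on the PATH, not which version it is.</p>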
<p>Download <a href="http://samtools.sourceforge.net/" title="SAMtools">SAMtools</a> from the <em>SF Download Page</em> link, move it to the working directory, then uncompress and compile it:</p>
<pre><code>mv samtools-0.1.19.tar.bz2 working_directory
cd working_directory
tar -jxvf samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make</code></pre>
<p>Check that the build is correct by executing it with no arguments: the available commands should be printed. If a <code>bin</code> folder exists in your home directory, you can also copy the program there to make it available on the PATH:</p>
<pre><code>./samtools
cp samtools ~/bin</code></pre>
<h1 id="exercise-1-ngs-genomic-dna-aligment">Exercise 1: NGS Genomic DNA alignment</h1>
<p>In this exercise we’ll learn how to download and install the aligners, build the reference genome index, and align in single-end and paired-end mode with the two most widely used DNA aligners: <em>BWA</em> and <em>Bowtie2</em>. But first, create an <code>aligners</code> folder to store the software and an <code>alignments</code> folder to store the results. Create both folders in your <em>working directory</em>, next to <code>data</code>, by executing:</p>
<pre><code>mkdir aligners alignments</code></pre>
<p>Now go to the <code>aligners</code> and <code>alignments</code> folders and create subfolders for <em>bwa</em> and <em>bowtie</em> to store the indexes and alignment results, returning to the working directory each time:</p>
<pre><code>cd aligners
mkdir bwa bowtie
cd ..</code></pre>
<p>and</p>
<pre><code>cd alignments
mkdir bwa bowtie
cd ..</code></pre>
<p><strong>NOTE:</strong> Now your working directory must contain 3 folders: data (with the reference genome of chrom. 21 and simulated datasets), aligners and alignments. Your working directory should be similar to this (notice that aligners have not been downloaded):</p>
<pre><code>.
├── aligners
│   ├── bowtie
│   └── bwa
├── alignments
│   ├── bowtie
│   └── bwa
└── data
    ├── dna_chr21_100_hq_read1.fastq
    ├── dna_chr21_100_hq_read2.fastq
    ├── dna_chr21_100_lq_read1.fastq
    ├── dna_chr21_100_lq_read2.fastq
    ├── Homo_sapiens_cDNAs_chr21.fa
    ├── Homo_sapiens.GRCh37.75.dna.chromosome.21.fa
    ├── rna_chr21_100_hq_read1.fastq
    ├── rna_chr21_100_hq_read2.fastq
    ├── rna_chr21_150_lq_read1.fastq
    └── rna_chr21_150_lq_read2.fastq
</code></pre>
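<p>The folder layout shown above can also be created in one step. This is a sketch, assuming it is run from the working directory; it does not download any data or software:</p>

```shell
# Run from the working directory (cambridge_mda14 in this tutorial).
# -p creates missing parent folders and does not fail if a folder
# already exists, so the command is safe to repeat.
mkdir -p data aligners/bwa aligners/bowtie alignments/bwa alignments/bowtie

# List the folders that now exist.
ls -d aligners/* alignments/* data
```

<p>Using <code>mkdir -p</code> avoids having to change into each folder first, so every later command can keep using paths relative to the working directory.</p>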
<h3 id="bwa">BWA</h3>
<p><a href="http://bio-bwa.sourceforge.net/" title="BWA">BWA</a> is probably the most used aligner for DNA. As the documentation states, it consists of three different algorithms: <em>BWA</em>, <em>BWA-SW</em> and <em>BWA-MEM</em>. The first algorithm, which is the oldest, is designed for Illumina sequence reads up to 100bp, while the other two are designed for longer sequences. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA for 70-100bp Illumina reads.</p>
<p>All these three algorithms come in the same binary so only one download and installation is needed.</p>
<h5 id="download-and-install">Download and install</h5>
<p>First check that bwa is not currently installed by executing:</p>
<pre><code>bwa</code></pre>
<p>A list of commands will be printed if already installed. If not you can continue with the installation.</p>
<p>You can click on <code>SF download page</code> link in the <a href="http://bio-bwa.sourceforge.net/" title="BWA">BWA</a> page or click directly to:</p>
<p><a href="http://sourceforge.net/projects/bio-bwa/files">http://sourceforge.net/projects/bio-bwa/files</a></p>
<p>Click on the latest version of BWA and wait a few seconds for the download to start. At the time of this tutorial the latest version is <strong>bwa-0.7.7.tar.bz2</strong>. Once downloaded, move it from your browser download folder to the aligners folder, uncompress it and compile it:</p>
<pre><code>mv bwa-0.7.7.tar.bz2 working_directory/aligners/bwa
cd working_directory/aligners/bwa
tar -jxvf bwa-0.7.7.tar.bz2
cd bwa-0.7.7
make
cp bwa ~/bin</code></pre>
<p>You can check that everything is all right by executing:</p>
<pre><code>bwa</code></pre>
<p>Some information about the software and commands should be listed.</p>
<h5 id="build-the-index">Build the index</h5>
<p>From your working directory, create a folder called <code>index</code> inside the <code>aligners/bwa</code> folder to store the BWA index and copy the reference genome into it:</p>
<pre><code>mkdir aligners/bwa/index
# adjust the paths below if your folders are laid out differently
cp data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa aligners/bwa/index/</code></pre>
<p>Now you can create the index by executing:</p>
<pre><code>bwa index aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa</code></pre>
<p>Some files will be created in the <code>index</code> folder. Those files constitute the index that BWA uses.</p>
<p><strong>NOTE:</strong> The index must be created only once. It will be used for all the different alignments with BWA.</p>
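<p>Since the index only needs to be built once, a script can check for it before calling <code>bwa index</code> again. The following is a sketch under the assumption that the index files sit next to the FASTA file with the suffixes that <code>bwa index</code> writes (<code>.amb</code>, <code>.ann</code>, <code>.bwt</code>, <code>.pac</code>, <code>.sa</code>). The helper name <code>bwa_index_exists</code> is hypothetical, and the example only prints a message instead of running BWA:</p>

```shell
# bwa_index_exists REFERENCE
# Succeeds if all BWA index files for REFERENCE are already present.
bwa_index_exists() {
    for suffix in amb ann bwt pac sa; do
        [ -f "$1.$suffix" ] || return 1
    done
    return 0
}

ref=aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa
if bwa_index_exists "$ref"; then
    echo "index found, skipping bwa index"
else
    echo "index missing, run: bwa index $ref"
fi
```

<p>Checking every suffix rather than a single file also catches an index build that was interrupted part-way.</p>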
<h5 id="aligning-with-bwa-mem-in-se-and-pe-modes">Aligning with BWA-MEM in SE and PE modes</h5>
<p>BWA-MEM is the recommended algorithm to use now. You can check the options by executing:</p>
<pre><code>bwa mem</code></pre>
<p>To align SE with BWA-MEM execute:</p>
<pre><code>bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_hq_read1.fastq > alignments/bwa/dna_chr21_100_hq_se.sam</code></pre>
<p>To align PE with BWA-MEM just execute the same command line with the two FASTQ files:</p>
<pre><code>bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_hq_read1.fastq data/dna_chr21_100_hq_read2.fastq > alignments/bwa/dna_chr21_100_hq_pe.sam</code></pre>
<p>Now you can use SAMtools to create the BAM files in the <em>alignments/bwa</em> folder:</p>
<pre><code>samtools view -S -b alignments/bwa/dna_chr21_100_hq_se.sam -o alignments/bwa/dna_chr21_100_hq_se.bam
samtools view -S -b alignments/bwa/dna_chr21_100_hq_pe.sam -o alignments/bwa/dna_chr21_100_hq_pe.bam</code></pre>
<p>Now you can do the same for the <strong>low</strong> quality datasets.</p>
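<p>One way to repeat the paired-end alignment for both quality levels is a small loop. The sketch below only prints the commands so they can be inspected first; it reuses the paths and read-group string from the commands above, and the helper name <code>print_bwa_mem_pe</code> is hypothetical:</p>

```shell
ref=aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa
rg='@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1'

# print_bwa_mem_pe QUALITY
# Prints (does not run) the paired-end BWA-MEM command for the
# dataset with the given quality suffix (hq or lq).
print_bwa_mem_pe() {
    prefix=dna_chr21_100_$1
    printf '%s\n' "bwa mem -t 4 -R \"$rg\" $ref data/${prefix}_read1.fastq data/${prefix}_read2.fastq > alignments/bwa/${prefix}_pe.sam"
}

for quality in hq lq; do
    print_bwa_mem_pe "$quality"
done
```

<p>Once the printed commands look right, they can be copied into the terminal one at a time.</p>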
<h5 id="aligning-with-aln-and-samsesampe-old-algorithms-in-se-and-pe-modes">Aligning with ALN and SAMSE/SAMPE (old algorithms) in SE and PE modes</h5>
<p>Now we are going to align the <strong>high</strong> quality dataset in SE and PE modes with the older algorithm. Single-end alignment requires 2 executions. The first uses the <code>aln</code> command, which takes the <code>fastq</code> file and creates a <code>sai</code> file; the second uses <code>samse</code> and the <code>sai</code> file to create the <code>sam</code> file. Results are stored in the <code>alignments</code> folder:</p>
<pre><code>bwa aln -t 4 -f alignments/bwa/dna_chr21_100_hq_se.sai aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_hq_read1.fastq
bwa samse -f alignments/bwa/dna_chr21_100_hq_se.sam aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa alignments/bwa/dna_chr21_100_hq_se.sai data/dna_chr21_100_hq_read1.fastq</code></pre>
<p>For paired-end alignments with BWA 3 executions are needed: 2 for <code>aln</code> command and 1 for <code>sampe</code> command:</p>
<pre><code>bwa aln -t 4 -f alignments/bwa/dna_chr21_100_hq_pe1.sai aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_hq_read1.fastq
bwa aln -t 4 -f alignments/bwa/dna_chr21_100_hq_pe2.sai aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_hq_read2.fastq
bwa sampe -f alignments/bwa/dna_chr21_100_hq_pe.sam aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa alignments/bwa/dna_chr21_100_hq_pe1.sai alignments/bwa/dna_chr21_100_hq_pe2.sai data/dna_chr21_100_hq_read1.fastq data/dna_chr21_100_hq_read2.fastq</code></pre>
<p>Now you can use SAMtools to create the BAM files in the <em>alignments/bwa</em> folder:</p>
<pre><code>samtools view -S -b alignments/bwa/dna_chr21_100_hq_se.sam -o alignments/bwa/dna_chr21_100_hq_se.bam
samtools view -S -b alignments/bwa/dna_chr21_100_hq_pe.sam -o alignments/bwa/dna_chr21_100_hq_pe.bam</code></pre>
<p>Now you can do the same for the <strong>low</strong> quality datasets.</p>
<h3 id="bowtie2">Bowtie2</h3>
<p><a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" title="Bowtie2">Bowtie2</a>, as its documentation states, is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads from about 50 up to a few hundred bases long. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.</p>
<h5 id="download-and-install-1">Download and install</h5>
<p>First check that bowtie2 is not currently installed by executing:</p>
<pre><code>bowtie2</code></pre>
<p>A list of commands will be printed if already installed. If not you can continue with the installation.</p>
<p>From <a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" title="Bowtie2">Bowtie2</a> go to <code>Latest Release</code> and download the program or go directly to:</p>
<p><a href="http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.1/">http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.1/</a></p>
<p>Click on the Linux version of Bowtie2 and wait a few seconds for the download to start. At the time of this tutorial the latest version is <strong>bowtie2-2.2.1-linux-x86_64.zip</strong>. Once downloaded, move it from your browser download folder to the aligners folder and uncompress it. There is no need to compile if you downloaded the Linux version:</p>
<pre><code>mv bowtie2-2.2.1-linux-x86_64.zip working_directory/aligners/bowtie
cd working_directory/aligners/bowtie
unzip bowtie2-2.2.1-linux-x86_64.zip
cd bowtie2-2.2.1</code></pre>
<p>You can check that everything is all right by executing:</p>
<pre><code>./bowtie2</code></pre>
<p>Detailed information about the software and commands should be listed.</p>
<h5 id="build-the-index-1">Build the index</h5>
<p>From your working directory, create a folder called <code>index</code> inside the <code>aligners/bowtie</code> folder to store the Bowtie2 index and copy the reference genome into it:</p>
<pre><code>mkdir aligners/bowtie/index
cp data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa aligners/bowtie/index/</code></pre>
<p>Now you can create the index by executing:</p>
<pre><code>bowtie2-build aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa</code></pre>
<p>Some files will be created in the <code>index</code> folder. Those files constitute the index that Bowtie2 uses.</p>
<p><strong>NOTE:</strong> The index must be created only once. It will be used for all the different alignments with Bowtie2.</p>
<h5 id="aligning-in-se-and-pe-modes">Aligning in SE and PE modes</h5>
<p>Mapping SE with Bowtie2 requires only 1 execution. To align the <strong>high</strong> quality dataset in SE mode execute:</p>
<pre><code>bowtie2 -q -p 4 -x aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -U data/dna_chr21_100_hq_read1.fastq -S alignments/bowtie/dna_chr21_100_hq_se.sam</code></pre>
<p>And create the BAM file using SAMtools:</p>
<pre><code>samtools view -S -b alignments/bowtie/dna_chr21_100_hq_se.sam -o alignments/bowtie/dna_chr21_100_hq_se.bam</code></pre>
<p>Mapping in PE also requires only one execution:</p>
<pre><code>bowtie2 -q -p 4 -x aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -1 data/dna_chr21_100_hq_read1.fastq -2 data/dna_chr21_100_hq_read2.fastq -S alignments/bowtie/dna_chr21_100_hq_pe.sam</code></pre>
<p>And create the BAM file using SAMtools:</p>
<pre><code>samtools view -S -b alignments/bowtie/dna_chr21_100_hq_pe.sam -o alignments/bowtie/dna_chr21_100_hq_pe.bam</code></pre>
<p>Repeat the same steps for the <strong>low</strong> quality dataset.</p>
<h3 id="more-exercises">More exercises</h3>
<ul>
<li>Try to simulate datasets with longer reads and more mutations to study which aligner behaves better</li>
<li>Test the aligner sensitivity to INDELS</li>
<li>Try the BWA-MEM algorithm on the low quality dataset and compare sensitivity. The same index is valid and only one execution is needed to produce the SAM file: <code>bwa mem aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_lq_read1.fastq</code></li>
</ul>
<h1 id="exercise-2-ngs-rna-seq-aligment">Exercise 2: NGS RNA-seq alignment</h1>
<p>In this exercise we’ll learn how to download, install and align in single-end and paired-end mode with one of the most widely used RNA-seq aligners: <em>TopHat2</em>. TopHat2 uses Bowtie2 as an aligner.</p>
<p><strong>NOTE:</strong> Two other commonly used RNA-seq aligners are <a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> and <a href="http://www.netlab.uky.edu/p/bioinfo/MapSplice2" title="MapSplice2">MapSplice2</a>. No guided exercises for them are documented in this tutorial, but users are encouraged to follow the instructions on their web sites.</p>
<p>Go to the <code>alignments</code> folder and create a folder for <em>tophat</em> to store the alignment results, then return to the working directory:</p>
<pre><code>cd alignments
mkdir tophat
cd ..</code></pre>
<p><strong>NOTE:</strong> No separate index is needed for TopHat: it uses Bowtie2 for alignment and reuses the Bowtie2 index built in Exercise 1.</p>
<h3 id="tophat2">TopHat2</h3>
<p><a href="#tophat2">TopHat2</a> describes itself as a <em>fast</em> splice junction mapper for RNA-Seq reads, a claim that does not always hold in practice. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.</p>
<h5 id="download-and-install-2">Download and install</h5>
<p>First check that tophat2 is not currently installed by executing:</p>
<pre><code>tophat2</code></pre>
<p>A list of commands will be printed if already installed. If not you can continue with the installation.</p>
<p>From <a href="#tophat2">TopHat2</a> go to <code>Releases</code> and download the Linux program by clicking the <em>Linux x86_64 binary</em> link.</p>
<p>At the time of this tutorial the latest version is <strong>tophat-2.0.10.Linux_x86_64.tar.gz</strong>. Once downloaded, move it from your browser download folder to the aligners folder and uncompress it. There is no need to compile if you downloaded the Linux version:</p>
<pre><code>mkdir -p working_directory/aligners/tophat
mv tophat-2.0.10.Linux_x86_64.tar.gz working_directory/aligners/tophat
cd working_directory/aligners/tophat
tar -zxvf tophat-2.0.10.Linux_x86_64.tar.gz
cd tophat-2.0.10.Linux_x86_64</code></pre>
<p>You can check that everything is all right by executing:</p>
<pre><code>./tophat2</code></pre>
<p>Detailed information about the software and commands should be listed.</p>
<p><strong>NOTE:</strong> TopHat uses Bowtie as the read aligner. You can use either Bowtie 2 (the default) or Bowtie (<code>--bowtie1</code>), and the corresponding Bowtie programs must be in your PATH. The index must be created with Bowtie, not TopHat. So, from your working directory, copy Bowtie2 into ~/bin:</p>
<pre><code># bowtie 2.2 does not work
cd aligners/bowtie/bowtie2-2.2.1
cp bowtie* ~/bin</code></pre>
<h5 id="aligning-in-se-and-pe-modes-1">Aligning in SE and PE modes</h5>
<p>To align in SE mode:</p>
<pre><code>tophat2 -o alignments/tophat/rna_chr21_100_hq_se aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/rna_chr21_100_hq_read1.fastq</code></pre>
<p>And for PE:</p>
<pre><code>tophat2 -o alignments/tophat/rna_chr21_100_hq_pe/ aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/rna_chr21_100_hq_read1.fastq data/rna_chr21_100_hq_read2.fastq</code></pre>
<p>Now align the low quality 150bp RNA dataset and compare the alignment statistics.</p>
<h3 id="star-and-mapsplice2">STAR and MapSplice2</h3>
<p><a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> and <a href="http://www.netlab.uky.edu/p/bioinfo/MapSplice2" title="MapSplice2">MapSplice2</a> are two other interesting RNA-seq aligners. <a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> offers great performance while still having good sensitivity. <a href="http://www.netlab.uky.edu/p/bioinfo/MapSplice2" title="MapSplice2">MapSplice2</a> usually shows better sensitivity but is several times slower.</p>
<h5 id="star-installation">STAR installation</h5>
<p>STAR comes compiled for Linux, so you only have to download the <em>tarball</em> and uncompress it:</p>
<pre><code>tar -zxvf STAR_2.3.0e.Linux_x86_64_static.tgz</code></pre>
<p>Read the documentation and try to align the simulated dataset.</p>
<h5 id="mapsplice2-installation">MapSplice2 installation</h5>
<p>MapSplice must be unzipped and compiled:</p>
<pre><code>unzip MapSplice-v2.1.6.zip
cd MapSplice-v2.1.6
make</code></pre>
<p>Read the documentation and try to align the simulated dataset.</p>
<h1 id="simulating-ngs-datasets">Simulating NGS datasets</h1>
<h3 id="dna">DNA</h3>
<p>Download <a href="http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation" title="dwgsim">dwgsim</a> from http://sourceforge.net/projects/dnaa/files/ to the <em>working_directory</em>, then uncompress and compile it:</p>
<pre><code>tar -zxvf dwgsim-0.1.10.tar.gz
cd dwgsim-0.1.10
make</code></pre>
<p>Check options by executing:</p>
<pre><code>./dwgsim</code></pre>
<p>Then, from inside the dwgsim folder, you can simulate 2 million reads of 150bp with a 2% mutation rate by executing:</p>
<pre><code>./dwgsim -1 150 -2 150 -y 0 -N 2000000 -r 0.02 ../data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa ../data/dna_chr21_100_low/dna_chr21_100_verylow</code></pre>
<h3 id="rna-seq">RNA-seq</h3>
<p><a href="http://www.cbil.upenn.edu/BEERS/" title="BEERS">BEERS</a> is a Perl-based program, so no compilation is needed. Just download it from http://www.cbil.upenn.edu/BEERS and uncompress it:</p>
<pre><code>tar xvf beers.tar</code></pre>
</body>
</html>