-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpseudo-it-nextflow.pl
executable file
·327 lines (263 loc) · 11.3 KB
/
pseudo-it-nextflow.pl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
#!/usr/bin/perl
##---------------------------------------------------------------------------##
## File:
## @(#) pseudo-it-nextflow.pl
## Author:
## Robert M. Hubley [email protected]
## Description:
## A script to iteratively call the NextFlow pseudo-it-nextflow.nf
## workflow. The NextFlow framework is a powerful framework for
## describing dependencies in a serial and parallel fashion.
## Unfortunately it does not handle iteration very well. This
## script simply sets up the data files for a NextFlow run calls
## the workflow in an iterative fashion.
##
#******************************************************************************
#* This software is provided ``AS IS'' and any express or implied *
#* warranties, including, but not limited to, the implied warranties of *
#* merchantability and fitness for a particular purpose, are disclaimed. *
#* In no event shall the authors or the Institute for Systems Biology *
#* liable for any direct, indirect, incidental, special, exemplary, or *
#* consequential damages (including, but not limited to, procurement of *
#* substitute goods or services; loss of use, data, or profits; or *
#* business interruption) however caused and on any theory of liability, *
#* whether in contract, strict liability, or tort (including negligence *
#* or otherwise) arising in any way out of the use of this software, even *
#* if advised of the possibility of such damage. *
#* *
#******************************************************************************
=head1 NAME
pseudo-it-nextflow.pl - Iteratively run the pseudo-it-nextflow workflow
=head1 SYNOPSIS
pseudo-it-nextflow.pl [-reference <fasta>] -PE1 <paired_end_R1.fastq>
-PE2 <paired_end_R2.fastq> [-SE <unpaired.fastq>]
[-iteration <#>] -outputDir <directory_for_resulsts>
Run 7 iterations of the pseudo-it-nextflow.nf workflow and save the results
of each round to "-outputDir". Optionally allow recovery of a failed run
by specifying the "-iteration" parameter and omiting the "-reference"
parameter ( in this case reference is assumed to be the output of the
previous (iteration-1) consensus file output.
For example:
Hrothgar:
ssh hrothgar
qlogin -pe sm 10 -q Yoda -P communitycluster
cd /lustre/scratch/username
nohup /lustre/work/daray/software/pseudo-it-nextflow-v0.4/pseudo-it-nextflow.pl \
-reference /lustre/work/foobar/foo.fa \
-PE1 /lustre/work/foobar/R1.fastq \
-PE2 /lustre/work/foobar/R2.fastq \
-SE /lustre/work/foobar/SE.fastq \
-cluster hrothgar \
-outputDir /lustre/scratch/username/results-genome1 >& run-genome1.log &
Quanah:
ssh quanah
qlogin -pe sm 12 -q omni -P quanah
cd /lustre/scratch/username
nohup /lustre/work/daray/software/pseudo-it-nextflow-v0.4/pseudo-it-nextflow.pl \
-reference /lustre/work/foobar/foo.fa \
-PE1 /lustre/work/foobar/R1.fastq \
-PE2 /lustre/work/foobar/R2.fastq \
-SE /lustre/work/foobar/SE.fastq \
-cluster quanah \
-outputDir /lustre/scratch/username/results-genome1 >& run-genome1.log &
If a run fails you will see an indication of that in the *.log file. Also
NextFlow should "qdel" all the running or queued jobs automatically. If it
looks like it was "Killed" rather than error'd out due to an data problem,
then you should just restart the run using the previous iterations output.
To do this simply alter the command line by removing the "-reference"
parameters and including a new "-iteration #" parameter. The iteration
number is the one you want to re-run. I.e if it died while working iteration
5 simply start:
nohup /lustre/work/daray/software/pseudo-it-nextflow-v0.4/pseudo-it-nextflow.pl \
-PE1 /lustre/work/foobar/R1.fastq \
-PE2 /lustre/work/foobar/R2.fastq \
-SE /lustre/work/foobar/SE.fastq \
-cluster quanah \
-iteration 5 \
-outputDir /lustre/scratch/username/results-genome1 >& run-genome1.log &
=head1 DESCRIPTION
A script to iteratively run the NextFlow port of the Pseudo-it project.
The Pseudo-it project, originally written by Brice Sarver
( https://github.com/bricesarver/pseudo-it ), is an approach that
"iteratively generates pseudoreferences, incorporating sample-specific
variation and reducing mapping biases.".
By porting it to NextFlow aspects of the workflow can be massively
parallelized ( particularlly index building, mapping, and variant
calling ) while other aspects of the flow can remain serial. In
addition the parallel aspects may be run on a single node with many
processors or on many nodes through a job-scheduler.
=head1 Workflow Overview
[serial] Split FASTQ files into batches
[parallel] Index starting reference FASTA file
- make dictionary
- make faIdx
- make BWT
[parallel] Map Read Batches ( bwa - both paired and optional single ended reads )
- Added sort to each job to speed up merging
[serial] Merge BAM files
[serial] Generate read groups ( picard AddOrReplaceReadGroups )
[serial] Dedup ( picard MarkDuplicates )
[serial] Index ( samtools index )
- No need to run gatk RealignerTargetCreator ( only needed by
UnifiedGenotyper )
- No need to run gatk IndelRealigner ( only needed by
UnifiedGenotyper )
[serial] Create BED file batches for HaplotypeCaller
[parallel] Variant calling ( gatk HaplotypeCaller )
- Latest version of gatk doesn't support the "-stand_emit_conf 10"
option anymore. From my reading of the docs it may be that this
was overriden by the "-stand_call_conf 30" parameter anyway.
[serial] Merge VCF files
[serial] Select SNPs ( gatk SelectVariants, gatk VariantFiltration )
[serial] Call consensus ( gatk FastaAlternateReferenceMaker )
=head1 NextFlow Environment
NextFlow maintains an execution state database that needs to be accessible
to the node running the workflow. In addition this database needs to
stored on a filesystem that support file locking. Since /lustre does not
support file locking I choose to create a wrapper script for NextFlow to
set the NextFlow database and run directory to the /tmp/<username> directory.
If the directory doesn't already exist it's created before NextFlow is started
up. In this way a locking file system is guaranteed. Unfortunately this means
that it's important to remember which system NextFlow was initiated on in order
to use features like "-resume". The NextFlow working directories on the other
hand need to be located on a shared filesytem ( file locking not necessary ).
The wrapper currently sets this to the NXF_WORK subdirectory in the current
working directory. The protocol for running on Hrothgar would then be:
# Login to the cluster head node
ssh hrothgar
# Pick an adequately large queue to run the NextFlow environment from
qlogin -pe sm 10 -q Yoda -P communitycluster
# Change to a shared filesystem to run from
cd /lustre/scratch/myworkdir
# Run NextFlow wrapper on a workflow
/lustre/work/daray/software/Nextflow/workflow run my-workflow.nf
# Now...you can finde a new NXF_WORK directory containinig all the
# NextFlow intermediate files:
ls NXF_WORK
# And you can look at the NextFlow database and report files:
ls /tmp/myusername
The options are:
=over 4
=item -reference
The starting reference to align reads to
=item -bwaBatchSize <#>
Optional parameter to control the size of the mapping batches. Currently
set to 5000000. By increasing this, jobs will run slower on the cluster ( typically
not a good way to batch ) but the number of jobs submitted to the queue will
be smaller This is important if the queue admins have a limit on the number
of jobs that can be submitted at once.
=item -haploBatchSize <#>
Optional parameter to control the size of the HaplotypeCaller batches. Currently
defaults to 20000000.
=back
=head1 DEPENDENCIES
Java 1.8u155 JRE, GATK, Picard, Samtools, BWA
=head1 SEE ALSO
=head1 COPYRIGHT
Copyright 2017 Robert Hubley, Institute for Systems Biology
=head1 AUTHOR
Robert Hubley <[email protected]>
=cut
#
# Modules
#
use strict;
use Getopt::Long;
use Cwd;
use Data::Dumper;
use warnings;
use strict;
use File::Spec;
use FindBin;
#
# Where NextFlow and the pseudo-it-nextflow.nf file exist:
#
my $NextFlow = "/lustre/work/daray/software/Nextflow/nextflow";
my $pseudoItWorkflow = "$FindBin::RealBin/pseudo-it-nextflow.nf";
my $Version = "0.1";
#
# Option processing
# e.g.
# -t: Single letter binary option
# -t=s: String parameters
# -t=i: Number paramters
#
my @getopt_args = (
'-version', # print out the version and exit
"-reference=s",
"-PE1=s",
"-PE2=s",
"-SE=s",
"-name=s",
"-cluster=s",
"-iteration=i",
"-outputDir=s",
"-bwaBatchSize=i",
"-haploBatchSize=i"
);
my %options = ();
Getopt::Long::config("noignorecase", "bundling_override");
unless (GetOptions(\%options, @getopt_args)) {
usage();
}
sub usage {
print "$0 - $Version\n\n";
exec "pod2text $0|less";
exit;
}
if ( ! keys( %options ) ||
( ! ( $options{'PE1'} && $options{'PE2'} &&
$options{'cluster'} && $options{'outputDir'} ) ) )
{
usage;
}
my $iteration = 1;
$iteration = $options{'iteration'} if ( exists $options{'iteration'} );
my $username = getpwuid($<);
my $outputDir = $options{'outputDir'};
mkdir($outputDir) if ( ! -d $outputDir);
if ( ! -d "$outputDir/iter-0" && exists $options{'reference'} )
{
mkdir("$outputDir/iter-0");
system("ln -s $options{'reference'} $outputDir/iter-0/consensus.fa" );
}
mkdir( "$outputDir/iter-$iteration" ) if ( ! -d "$outputDir/iter-$iteration" );
if ( ! exists $options{'bwaBatchSize'} )
{
$options{'bwaBatchSize'} = 5000000;
}
if ( ! exists $options{'haploBatchSize'} )
{
$options{'haploBatchSize'} = 20000000;
}
while ( $iteration < 8 && $iteration > 0 )
{
$options{'outputDir'} = $outputDir . "/iter-$iteration";
$options{'reference'} = "$outputDir/iter-" . ( $iteration-1 ) . "/consensus.fa";
my $optionStr = "";
foreach my $option ( keys %options )
{
$optionStr .= " --$option " . $options{$option};
}
my $reportFile = "nf-$$-iter-$iteration-report";
my $timelineFile = "nf-$$-iter-$iteration-timeline";
my $dagFile = "nf-$$-iter-$iteration-dag";
my $iterCmd = "$NextFlow run $pseudoItWorkflow $optionStr -with-report $reportFile -with-timeline $timelineFile -with-dag $dagFile ";
# Run
print "Running iteration $iteration\n";
print "cmd: $iterCmd\n";
system($iterCmd);
if ( -s "$outputDir/iter-$iteration/consensus.fa" &&
-s "$outputDir/iter-$iteration/filteredSNPs.vcf" &&
-s "/tmp/$username/$reportFile" )
{
# Copy report and trace.txt files to shared folders from the /tmp/user file system
system("mv /tmp/$username/$reportFile $outputDir/iter-$iteration");
system("mv /tmp/$username/$timelineFile $outputDir/iter-$iteration");
system("mv /tmp/$username/$dagFile $outputDir/iter-$iteration");
$iteration++;
}else {
$iteration = 0;
}
}
1;