#!/bin/bash
# This script describes each step of the pipeline for training FILET and running it on an actual
# real dataset. Example commands are given, along with the input files required to get the
# pipeline started. Thus, if you clone the repository and cannot run through the whole pipeline,
# then something is wrong and you may be missing one of the dependencies (see the README file).
# Some steps in this pipeline may take a fair amount of time (e.g. several hours), so you may wish
# to run it overnight. On much larger datasets you may wish to use a compute cluster and split
# up the computation. Anyway, here we go:
# Step 0: Simulate your training data (not shown here). This can be done with msmove, written by
# Daniel Garrigan and Anthony Geneva (https://github.com/geneva/msmove).
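#
# As a rough illustration, a no-migration training set might be generated with a command along
# these lines (the parameter values here are made up for illustration; consult the msmove
# documentation for its migration-related options before simulating the two migration classes):
# ./msmove 34 10000 -t 68 -r 68 10000 -I 2 20 14 -ej 0.5 2 1 | gzip > trainingSims/noMig.msOut.gz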
n1=20 # haploid sample size of population 1
n2=14 # haploid sample size of population 2
windowSize=10000 # length (in bp) of each window, for simulation and classification
# Step 1: Make directories for intermediate files and final classification output
mkdir -p trainingSimsStats trainingSets classifier featureVectorsToClassify results
# Step 2: Calculate summary statistics for each input file in our training simulations.
# This step may take several hours for very large training sets. The input for twoPopnStats_forML
# (which is piped in via stdin) is ms-style simulation output. n1 and n2 give the sample sizes of
# the two population samples in these simulated data. The output file contains one tab-separated
# line for each simulation; this line gives the values of various population genetic summary
# statistics calculated within the simulated region.
# Note: the masking step described in our paper and used for simulans/sechellia analysis is
# omitted here for simplicity.
#
# normalizeTwoPopnStats.py simply takes the output from twoPopnStats_forML and for statistics that
# scale with window size (e.g. pi, the number of segregating sites, the number of private alleles)
# the value of the statistic is divided by the number of sites in the window. The script takes
# two arguments: a file that has one line for each simulation replicate in the input. Each line
# contains only a single number: the fraction of unmasked data (i.e. one minus the fraction of
# missing data; just use the word None if no masking was performed), and the window size (10 kb
# in this example). The output is again a tab-separated list of values for each simulation.
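#
# For instance, had masking been applied, one would supply a file of per-replicate unmasked
# fractions in place of None, along these lines (unmaskedFracs.txt is a hypothetical file name):
# cat trainingSims/noMig.msOut | ./twoPopnStats_forML $n1 $n2 | python normalizeTwoPopnStats.py unmaskedFracs.txt $windowSize > trainingSimsStats/noMig.msOut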
gunzip trainingSims/*.msOut.gz # first we have to unzip the simulation output
for inFile in `ls trainingSims/ | grep '\.msOut$'`
do
    cat trainingSims/$inFile | ./twoPopnStats_forML $n1 $n2 | python normalizeTwoPopnStats.py None $windowSize > trainingSimsStats/$inFile
done
# Step 3: Aggregate our training data into one labeled data matrix to be used for training.
# The first argument is the directory which contains all input files, and the second argument
# is the name of the output file. The input file names must be noMig.msOut (no migration, assigned
# label "0"), mig12.msOut (training examples for migration from population 2 into population 1,
# assigned label "1"), and mig21.msOut (training examples for migration from population 1 into
# population 2, assigned label "2"). The output file contains all of these feature vectors but
# with an additional column: the appropriate class label.
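#
# At this point trainingSimsStats/ should contain three files named as required above:
#   trainingSimsStats/noMig.msOut  (assigned label "0")
#   trainingSimsStats/mig12.msOut  (assigned label "1")
#   trainingSimsStats/mig21.msOut  (assigned label "2")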
python buildThreeClassTrainingSet.py trainingSimsStats/ trainingSets/threeClass.fvec
# Step 4: Train the classifier. The first two arguments are the paths to the training data and
# classifier pickle (see below) to be written, respectively. The remaining arguments are the names
# of all statistics to be included in the classifier, or simply the word "all" (without quotes)
# if using all features in the training data.
#
# Note: this step may take several hours for ~10000 training examples.
#
# Additional note: for this example I have omitted several statistics that are sensitive to the
# length of the sequence but for which this dependency is not linear (haplotype counts and IBS
# tract length statistics). It would be better to include these only if the training data have
# been masked in the same manner as the data the user wishes to classify.
python trainFiletClassifier.py trainingSets/threeClass.fvec classifier/threeClass.p pi1 hetVar1 ss1 private1 tajd1 ZnS1 pi2 hetVar2 ss2 private2 tajd2 ZnS2 Fst snn dxy_mean dxy_min gmin zx dd1 dd2 ddRank1 ddRank2
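# If you wished to train on only a subset of statistics, the call would look the same but with
# fewer feature names, e.g. (classifier/smallerSet.p is a hypothetical output path):
# python trainFiletClassifier.py trainingSets/threeClass.fvec classifier/smallerSet.p Fst snn dxy_min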
# Step 5: Meanwhile, we need some data to classify. The program pgStatsBedSubpop_forML does this
# for us. The arguments, in order, are the phased sequence data for population/species 1 in our
# region or chromosome of interest (in fasta format), the same for population 2, a fasta file
# with one entry representing the inferred ancestral state for each position in our region
# of interest (can just use data from the reference genome if not using outgroup information -- in
# this case be sure not to incorporate into your classifier either Fay and Wu's thetaH or H
# statistics for either population), a BED file specifying the coordinates of the windows within
# our region of interest (which can be an entire chromosome/scaffold or a piece of one; all
# coordinates here are zero-based and relative to the beginning of the region included in the
# fasta files--i.e. if our fasta files contain sequence for chr1:1001-2000 then the BED entry
# "chr1\t0\t100" corresponds to chr1:1001-1100), and the maximum fraction of missing data per
# site (e.g. with the value of 0.5 used in the example command below, if >=50% of phased
# haplotypes are 'N' at a given site, then the entire site is ignored during the calculation
# of summary statistics).
# Note that for the fasta-formatted input files, each sequence must be the same length (i.e. the
# same chromosome/scaffold/region) though different files may have different numbers of sequences
# (i.e. pop1.fa and pop2.fa will have one entry for each of their haploid sample sizes n1 and n2,
# and anc.fa should have only one entry). The BED format is described here:
# https://genome.ucsc.edu/FAQ/FAQformat.html#format1
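#
# For example, pop1.fa should contain n1=20 entries of identical length, along these lines
# (sequence names and bases here are purely illustrative):
# >pop1_hap1
# ACGTTACGNN...
# >pop1_hap2
# ACGTAACGTT...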
#
# If the user wishes to perform a sliding-window analysis, they need only ensure that the
# input BED file contains sets of coordinates with the desired degree of overlap (see the
# sketch below).
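#
# As a sketch, 10 kb windows sliding by 1 kb across a 100 kb region could be generated like so
# (the chromosome name and region length are placeholders for your own data):
# for start in `seq 0 1000 90000`; do echo -e "chr1\t$start\t$((start+10000))"; done > dataToClassify/coords.bed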
#
# The normalizePgStats.py script divides all measures of variation that scale with the number of
# sites (e.g. pi, the number of segregating sites, the number of private alleles) by the number
# of sites not filtered from the window (the numSites field from pgStatsBedSubpop_forML's output).
# Windows missing at least 75% of all sites are filtered out by this script.
#
# The output of this step is a tab-delimited file giving the coordinates for each window, the
# number of unmasked sites in the window and the values of various summary statistics.
#
# Note: the example files used by this script are from chr4 in the simulans-sechellia dataset.
# These data may not be of adequate quality for analysis, but are useful for testing the pipeline.
./pgStatsBedSubpop_forML dataToClassify/pop1.fa dataToClassify/pop2.fa dataToClassify/anc.fa dataToClassify/coords.bed 0.5 | python normalizePgStats.py $windowSize > featureVectorsToClassify/test.ss
# Step 6: Use classifier to produce classifications on the real data set.
# The first argument is the path to the classifier pickle, the second is the data we wish to
# classify (described above), the third is a floating point number between zero and one
# specifying the critical posterior probability threshold for rejecting the no-migration class,
# and the fourth is the path where we wish to write our results.
# Note on the third argument: if FILET's predicted posterior probability for the
# no-migration class is less than probThreshold, then FILET will reject this class and emit as its
# prediction the more probable of the two migration classes. We often use 0.05 as this value, but
# selection of this value should be informed by results on an independent simulated test set in
# order to obtain the desired tradeoff between sensitivity and specificity, as the posterior
# probabilities estimated by FILET may not be well calibrated. In our experience this is especially
# so when the training set is very small.
# Note: If we reject the no introgression class and our two introgression classes happen to have
# equal posterior probability, then we will emit class "-1" which you can think of as introgression
# of perfectly ambiguous direction. This should be rare.
python classifyChromosome.py classifier/threeClass.p featureVectorsToClassify/ 0.05 results/
# Output format:
# 1. Chromosome
# 2. Starting coordinate of the classified window
# 3. Ending coordinate
# 4. The number of sites in the window that are not 'N' across the entire sample (treating this as our number of informative sites)
# 5. The class label of the chosen class
# 6. The posterior probability of class zero
# 7. The posterior probability of class one
# 8. The posterior probability of class two
# Class 0 means no introgression. Class 1 is introgression from population 2 into population 1, and class 2
# is introgression from population 1 into population 2. (Class -1 is described above.)
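#
# An example output line (all values here are illustrative only) might look like:
# chr4	0	10000	8893	1	0.02	0.95	0.03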
# Note about pickles: this pipeline saves our classifier as a "pickle" using scikit-learn's joblib
# library. Beware that pickles can contain executable python code, so if your pickles are tampered
# with there is the potential for execution of malicious code. Make sure your pickles cannot be
# edited by anyone else! For more information on pickles and joblib, see
# http://scikit-learn.org/stable/modules/model_persistence.html
#
# The joblib pickles can consist of many files, so it is best to save each pickle in its own
# directory. Note that the formatting of pickles is system-dependent, so you may have to re-train
# the classifier on each machine where you wish to use it.
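#
# For reference, the saved classifier can later be loaded in Python along these lines (a sketch,
# assuming the pickle was written with scikit-learn's bundled joblib as in this pipeline):
# python -c "from sklearn.externals import joblib; clf = joblib.load('classifier/threeClass.p')"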