---------- old normalization -------------------------------
(A) Create reference genome and theoretical fragments
See set_up_reference.txt
(B) Partition the genome into a set number of bins, such that each bin contains the same number of aligned simulated fragments
(1) [get_bin_bounds.sh]
(2) [chrom_sizes.py]
(3) [get_goodzones.py]
(4) [sep_bin_boundaries.py]
(C) Calculate the GC content and median fragment length for each bin
(5) [compute_gc_content.sh]
(6) [compute_gc_content.py]
(7) [compute_len_distrib.py]
(D) Process sequencing data: see "processing_pipeline.txt"
(E) Count the data samples into genomic bins
(8) [get_bin_counts_qsubs.py]
(9) [count_and_dedup.py]
(F) Combine counts for multiple samples
(10) [combine_varbin_data.py]
(G) Normalize bin counts and plot profiles
(11) [normalize_GC_segment.R]
--------- Method specifications ------------------------------
(1) [get_bin_bounds.sh] creates 5000 bins for the hybrid reference genome. Removes theoretical alignments for fragments whose length falls outside a fixed range (this range should be an input, but is currently hardcoded).
Run: <qsub get_bin_bounds.sh>
Inputs: None
Outputs: file containing chromosome lengths, file of sorted unique mappers, file of number of alignments per chromosome, file of genomic bin boundaries
Dependencies: 1 - the SAM alignment file for theoretical fragments from (A)
2 - (2), (3), (4)
3 - SAMtools software
4 - directory containing the chromosome fasta files
(2) [chrom_sizes.py] creates a file containing each chromosome’s length
Run: <python chrom_sizes.py genome_path output_name.txt>
Inputs: 1 - the directory containing the chromosome fasta files for the genome being examined; each chromosome should be separated into its own file.
2 - the name of the output file
Outputs: 1 - a text file with 3 columns; 1st col is chromosome name, 2nd col is the length of the chromosome, 3rd is the relative start position of that chromosome in the entire genome. Chromosomes are sorted numerically.
Dependencies: directory containing the chromosome fasta files
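Sketch: one way the three-column chromosome size table could be produced. This is an illustration, not the code in chrom_sizes.py. The function names, the .fa/.fasta extension filter, and the 0-based cumulative offset used for the third column are all assumptions.

```python
import os
import re


def fasta_length(path):
    """Count sequence characters in a FASTA file, skipping '>' header lines."""
    total = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def natural_key(name):
    """Sort key that orders chr2 before chr10 by comparing digit runs as integers."""
    tokens = re.findall(r"\d+|\D+", name)
    return [(0, int(t), "") if t.isdigit() else (1, 0, t) for t in tokens]


def chrom_size_rows(genome_dir):
    """Return (chrom, length, genome_offset) rows, one per chromosome FASTA file.

    Assumes one chromosome per file and that the chromosome name is the
    file name without its extension (hypothetical convention).
    """
    names = [f for f in os.listdir(genome_dir) if f.endswith((".fa", ".fasta"))]
    rows, offset = [], 0
    for fname in sorted(names, key=natural_key):
        length = fasta_length(os.path.join(genome_dir, fname))
        rows.append((os.path.splitext(fname)[0], length, offset))
        offset += length  # next chromosome starts where this one ends
    return rows


def write_chrom_sizes(rows, out_path):
    """Write the rows as a tab-separated text file."""
    with open(out_path, "w") as out:
        for chrom, length, offset in rows:
            out.write(f"{chrom}\t{length}\t{offset}\n")
```

Usage: write_chrom_sizes(chrom_size_rows(genome_path), "output_name.txt").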
(3) [get_goodzones.py]
Run: <python get_goodzones.py alignmentdata.sam>
Inputs: alignmentdata.sam, a paired-end alignment file from (A) with fragments outside a length range removed
Outputs: 1 - A .bed file containing all alignments, sorted by chromosome. First column is chrom name, 2nd is starting pos of that alignment on the chrom, 3rd is the ending pos of that alignment on the chrom. Sorted numerically by chromosome.
2 - Text file with the number of alignments per chromosome
Dependencies: None
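Sketch: a minimal way to turn paired-end SAM records into BED-style fragment intervals plus per-chromosome counts. It is not the code in get_goodzones.py. It assumes each fragment is represented by its leftmost mate (positive TLEN), that BED coordinates are 0-based half-open, and that sorting is lexicographic by chromosome name and then by position, as step (4) expects. The function name is hypothetical.

```python
from collections import Counter


def sam_to_fragments(sam_lines):
    """Return (fragments, counts) from an iterable of SAM text lines.

    fragments: sorted list of (chrom, start, end) in BED coordinates.
    counts:    Counter mapping chrom -> number of fragments.
    """
    fragments = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        rname, pos, tlen = fields[2], int(fields[3]), int(fields[8])
        # Keep only the leftmost mate of each pair so a fragment is counted once.
        if rname == "*" or tlen <= 0:
            continue
        start = pos - 1  # SAM POS is 1-based, BED start is 0-based
        fragments.append((rname, start, start + tlen))
    fragments.sort()  # lexicographic by chromosome name, then by start
    counts = Counter(chrom for chrom, _, _ in fragments)
    return fragments, counts
```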
(4) [sep_bin_boundaries.py]
Run: <python sep_bin_boundaries.py num_mappers.txt sorted_mappers.bed chrom_sizes.txt output_loc>
Inputs: 1 - the txt file output of (3) that contains the number of alignments per chromosome
2 - the .bed file output of (3) that contains all alignments. Must be sorted in lexicographic order by chromosome name, then by position.
3 - the text file output of (2) containing chrom sizes
4 - the directory where the output should be written
Outputs: 1 - a text file containing the bin boundaries for the hybrid genome. Columns are: chromosome, chrom bin pos, absolute bin start pos, chrom bin end, bin length, number of reads mapping to the bin
Dependencies:
Note: the number of bins to be created is currently hardcoded to 5000, but this should be an option
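Sketch: how bins with equal numbers of alignments can be cut on a single chromosome. How sep_bin_boundaries.py divides the 5000 bins among chromosomes is not stated here, so this example takes the per-chromosome bin count as a parameter. The function name and the returned tuple layout are assumptions.

```python
from bisect import bisect_left


def equal_count_bins(starts, chrom_length, n_bins):
    """Split [0, chrom_length) into n_bins half-open bins with roughly equal counts.

    starts must be the sorted alignment start positions on one chromosome.
    Returns a list of (bin_start, bin_end, bin_length, n_alignments).
    """
    n = len(starts)
    if n_bins < 1 or n < n_bins:
        raise ValueError("need at least one alignment per bin")
    bins = []
    bin_start = 0
    for i in range(1, n_bins + 1):
        if i == n_bins:
            bin_end = chrom_length  # last bin runs to the chromosome end
        else:
            # Boundary sits at the start of the alignment at the i-th quantile.
            bin_end = starts[(i * n) // n_bins]
        # Count the starts that fall in [bin_start, bin_end).
        count = bisect_left(starts, bin_end) - bisect_left(starts, bin_start)
        bins.append((bin_start, bin_end, bin_end - bin_start, count))
        bin_start = bin_end
    return bins
```

Repeated start positions at a quantile can make neighbouring bins uneven, or even empty, so the counts are only approximately equal on real data.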
(5) [compute_gc_content.sh]
Run: <qsub compute_gc_content.sh>
Inputs: None
Outputs: 1 - A text file containing GC content values for each chromosome
2 - A text file containing the median fragment length for each chromosome
Dependencies: directory containing all the chromosome fasta files
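Sketch: the two per-bin quantities named in (C), GC content and median fragment length, computed for one bin. This is an illustration under assumptions, not the code in compute_gc_content.py or compute_len_distrib.py. It assumes that N bases are excluded from the GC denominator, and that a fragment belongs to the bin containing its start. The function names are hypothetical.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases of seq (N and other symbols ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else float("nan")


def bin_stats(chrom_seq, bin_start, bin_end, fragments):
    """Return (gc_content, median_fragment_length) for the bin [bin_start, bin_end).

    fragments: list of (start, end) intervals on this chromosome, 0-based half-open.
    The median is None when no fragment starts inside the bin.
    """
    lengths = [end - start for start, end in fragments if bin_start <= start < bin_end]
    gc = gc_fraction(chrom_seq[bin_start:bin_end])
    return gc, (median(lengths) if lengths else None)
```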
(6) [compute_gc_content.py]
Run: <python compute_gc_content.py genome_dir>
Inputs:
Outputs:
Dependencies:
(7) [compute_len_distrib.py]
Run:
Inputs:
Outputs:
Dependencies:
(8) [get_bin_counts_qsubs.py]
Run:
Inputs:
Outputs:
Dependencies:
(9) [count_and_dedup.py]
Run:
Inputs:
Outputs:
Dependencies:
(10) [combine_varbin_data.py]
Run:
Inputs: None
Outputs:
Dependencies:
(11) [normalize_GC_segment.R]
Run: </usr/bin/R CMD BATCH normalize_GC_segment.R>
Inputs: None
Outputs: png files with the copy number profile for each sample in the "uber" file of all samples
Dependencies: 1 - "uber" file from (10)
2 - A file containing ...