-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathGN_Assignment3.Rmd
321 lines (263 loc) · 13.4 KB
/
GN_Assignment3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
---
title: "GN Assignment3"
author: "Grant Nickles"
date: "10/12/2019"
output: html_document
---
BEFORE RUNNING THE CODE SET THE WORKING DIRECTORY TO THE REPOSITORY FOLDER. Everything should work after this point.
First I will call the packages needed to graph and manipulate the data.
```{r}
library(ggplot2)
library(tidyverse)
library(dplyr)
library(forcats)
```
#**Data Inspection**
##**Inspection of fang_et_al_genotypes.txt**
Reading in the data
```{r}
fang = read.delim("fang_et_al_genotypes.txt", header = TRUE, sep = "\t")
```
Checking the rows and columns of the file: The file appears to have 2782 rows and 986 columns. Thus there are 2782 unique observations. This is a very large file and thus it will be unreasonable to by hand look at all the columns or data points. The byte size is 12,163,888.
```{r}
dim(fang)
object.size(fang)
```
Checking on the first six rows of the data I'm able to pull a couple pieces of information. The Samples are labled in the "Sample_ID" colume. Missing data appears to be indicated with "?" and the diploid copies of each SNP location is separted with "/". The 2nd and 3rd column both have other info on the samples but are not SNP data.
```{r}
View(fang)
```
##**Inspection of snp_position.txt**
Reading in the data
```{r}
snp = read.delim("snp_position.txt", header = TRUE, sep = "\t")
```
Checking the rows and columns on the SNP file there appears to be 15 columns and 983 rows. Supbracting the rows in the fang file that are not relavent to the SNP locations the amount of SNP observations matches on both files. The file overall is 324,472 bytes.
```{r}
dim(snp)
object.size(snp)
```
This file however has the formating of data in a transposed version of the fang file. In addition it has the chromosome and position that the SNP are located on. The rest of the column headers are extra info that may not be needed later on.
```{r}
View(snp)
names(snp)
```
#**Data Processing**
First I'm going to sepearte out the groups that way I don't have to later worry about this column altogether.
```{r}
teosinteTarget = c('ZMMIL','ZMMLR','ZMMMR')
teosinte = filter(fang, Group %in% teosinteTarget)
maizeTarget = c('ZMPBA', 'ZMPIL', 'ZMPJA')
maize = filter(fang, Group %in% maizeTarget)
```
Now I need to transpose the files after removing the columns I don't care about.
```{r}
teosinte = teosinte %>%
select(1, 4:986)
maize = maize %>%
select(1, 4:986)
snp = snp %>%
select(1,3,4)
```
Next I transposed the fang data for the maize and teosinte, wrote it to a seperate file without the headers(they will be added back later when combined with the SNP file), and read that file back into R with the headers. Now the SNP locations are going across the top of the file. I then took the SNP first column and added it onto the beginging of the fang file -> Labeledfang. Now we have two files that are ready for merging!
```{r}
Tteosinte = as.data.frame(t(teosinte))
Tmaize = as.data.frame(t(maize))
```
```{r}
write_csv(Tteosinte, "TeoNoHeaders", col_names = FALSE)
write_csv(Tmaize, "MaizeNoHeaders", col_names = FALSE)
```
```{r}
Tteosinte2 = read_csv("TeoNoHeaders", col_names = TRUE)
Tmaize2 = read_csv("MaizeNoHeaders", col_names = TRUE)
Sample_ID = snp %>%
select(1)
LabeledTeo = cbind(Sample_ID, Tteosinte2)
LabeledMaize = cbind(Sample_ID, Tmaize2)
```
Now I'm going to merge both files by the first column that they share. But first I need to add back in the group column. The files are now formated as specified in the rubric, with the SNP_ID in the first column, Chromosome in the second column, and Position in the third column.
```{r}
MergedTeo = merge(snp, LabeledTeo, by="SNP_ID")
MergedMaize = merge(snp, LabeledMaize, by="SNP_ID")
```
Great! Now we have a file to work with for the rest of the data analysis. However I want to write this file into a .csv doc because I will be working with it a lot. That way I have it if ever needed.
```{r}
write_csv(MergedTeo, "MergedTeo", col_names = TRUE)
write_csv(MergedMaize, "MergedMaize", col_names = TRUE)
```
##**Teosinte**
ZMMIL, ZMMLR, and ZMMMR
*The file that I'm working with at the start is MergedTeo*
First I'm going to filter out by the chromosome, arrange that such that the position is ordered in ascending order, and then write those as seperate files. I was having issuse at first due to the posion column's data type being listed as Factor and not a numeric value. By adding in the line of code that converted the data to numeric I was able to get around this.
```{r}
setwd("./teosinte/Ascending")
#I want to store the files in this specific folder hence why I had to change to working directory
i = 1
while(i <= 10){
a = i
ChromTeosinteAscending = filter(MergedTeo, MergedTeo$Chromosome == a)
ChromTeosinteAscending$Position = as.numeric(as.character(ChromTeosinteAscending$Position))
ChromTeosinteAscending = ChromTeosinteAscending[order(ChromTeosinteAscending$Position), ]
write_csv(ChromTeosinteAscending, sprintf("Ascending_TeoChromosome_%s", a), col_names = TRUE)
i = i + 1
}
```
First step is to replace the ? characters with the - character.
```{r}
i = 4
DecMergedTeo = MergedTeo
for(i in names(DecMergedTeo)){
DecMergedTeo[[i]] = gsub("\\?","-", DecMergedTeo[[i]])
}
```
Now I need to as last time seperate out the chromosomes and order them in decreasing order.
```{r}
setwd("./teosinte/Decreasing")
#again I changed the working directory to put the files in the correct folder
i = 1
while(i <= 10){
a = i
ChromTeosinteDecreasing = filter(DecMergedTeo, DecMergedTeo$Chromosome == a)
ChromTeosinteDecreasing$Position = as.numeric(as.character(ChromTeosinteDecreasing$Position))
ChromTeosinteDecreasing = ChromTeosinteDecreasing[order(ChromTeosinteDecreasing$Position, decreasing = TRUE), ]
write_csv(ChromTeosinteDecreasing, sprintf("Decreasing_TeoChromosome_%s", a), col_names = TRUE)
i = i + 1
}
```
Great! Now I have all 20 of the files I needed organized into files within the repository. Now the easy part, doing the same with the maize files. Since most of the code will be recycled I'm not going to put as many comments over every chunk I run.
##**Maize**
ZMPBA, ZMPIL, and ZMPJA
*The file that I'm working with at the start is MergedMaize*
First as last time I'm doing the ascending order.
```{r}
setwd("./maize/Ascending")
#I want to store the files in this specific folder hence why I had to change to working directory
i = 1
while (i <= 10){
a = i
ChromMaizeAscending = filter(MergedMaize, MergedMaize$Chromosome == a)
ChromMaizeAscending$Position = as.numeric(as.character(ChromMaizeAscending$Position))
ChromMaizeAscending = ChromMaizeAscending[order(ChromMaizeAscending$Position), ]
write_csv(ChromMaizeAscending, sprintf("Ascending_MaizeChromosome_%s", a), col_names = TRUE)
i = i + 1
}
```
Now I will replace the ? characters with the - character.
```{r}
i = 4
DecMergedMaize = MergedMaize
for(i in names(DecMergedMaize)){
DecMergedMaize[[i]] = gsub("\\?","-", DecMergedMaize[[i]])
}
```
Next I'm seperating out the chromosomes and ordering them in decreasing order.
```{r}
setwd("./maize/Decreasing")
#again I changed the working directory to put the files in the correct folder
i = 1
while(i <= 10){
a = i
ChromMaizeDecreasing = filter(DecMergedMaize, DecMergedMaize$Chromosome == a)
ChromMaizeDecreasing$Position = as.numeric(as.character(ChromMaizeDecreasing$Position))
ChromMaizeDecreasing = ChromMaizeDecreasing[order(ChromMaizeDecreasing$Position, decreasing = TRUE), ]
write_csv(ChromMaizeDecreasing, sprintf("Decreasing_MaizeChromosome_%s", a), col_names = TRUE)
i = i + 1
}
```
#**Part 2: Data Visualizaion**
First since we now want to graph all of the data for the maize and teosinte fang file, I'm going to repeat the steps I did earlier but combining both groups.
```{r}
library(reshape2)
library(plyr)
MergedMaize2 = MergedMaize %>%
select(1, 4:978)
MergedCombined = merge(MergedMaize2, MergedTeo, by="SNP_ID")
MergedCombined = merge(snp, MergedCombined, by="SNP_ID")
MergedCombined = rename(MergedCombined, c("Position.x" = "Position", "Chromosome.x" = "Chromosome"))
```
##**SNPs per chromosome**
###**Total number of SNPs in our dataset per chromosomes
```{r}
ggplot(data = MergedCombined) +
geom_bar(mapping = aes(x = Chromosome, fill = Chromosome)) +
ggtitle("SNPs per chromosome")
```
###**Distribution of SNPs on chromosomes and my personal graph**
Because it is impossible to graph the SNP positions when they are listed as unknown or multiple I am first going to drop this data. The dropped data is now in MergedCombinedFiltered. There are several ways to show the disrubution of the of the SNPs on the chromosomes so I chosed to graph the distruptions on the length of the chromosome, and on the chromosome altogether.
```{r}
MergedCombined$Position = as.numeric(as.character(MergedCombined$Position))
MergedCombinedFiltered = filter(MergedCombined, as.numeric(as.character(Position)) >= 1)
MergedCombinedFiltered = subset(MergedCombined, !(Chromosome %in% "multiple" | Chromosome %in% "unknown" | Position %in% "multiple" | Position %in% "unknown" | Position %in% ""))
Chromosomes = c("multiple", "unknown", "")
levels(MergedCombinedFiltered$Chromosome)[match(Chromosomes, levels(MergedCombinedFiltered$Chromosome))] = NA
MergedCombinedFiltered = MergedCombinedFiltered %>%
mutate(PositionRange = cut(Position, 5));
```
The **first graph** is the density split by all the chromosomes.
The **second graph** has the PositionRange on the X-axis. This mutated category is just the Position column binned into five groups. The choice of 5 is arbitary, but it shows that several of the chromosomes have SNPs in roughly the 1st bin only. This could be from those chromosomes being smaller, or the SNPs measured just all being in that region of those chromosomes. **this is my personal graph that we were asked to create.**
```{r}
ggplot(data = MergedCombinedFiltered) +
geom_density(mapping=aes(x = Chromosome, fill = Chromosome), alpha = 0.15) +
ggtitle("SNP density on each chromosome")
Chromosomes = c("multiple", "unknown", "")
levels(MergedCombinedFiltered$Chromosome)[match(Chromosomes, levels(MergedCombinedFiltered$Chromosome))] = NA
ggplot(data = MergedCombinedFiltered) +
geom_density(mapping=aes(x = PositionRange, fill = Chromosome), alpha = 0.15) +
ggtitle("SNP density on its position on the chromosome") +
theme(axis.text.x=element_blank())
##NOTE the graph for this one says fewer than two data points have been dropped as it dropped the position ranges that were null
```
###**Missing data and amount of heterozygosity**
For this graph we don't even need the data in the SNP file. I'm going to format the fang file to be how we want it.
```{r}
teosinteMaizeTarget = c('ZMMIL','ZMMLR','ZMMMR', 'ZMPBA', 'ZMPIL', 'ZMPJA')
plot_final = filter(fang, Group %in% teosinteMaizeTarget)
plot_final = plot_final %>%
select(1, 4:986)
plot_final = plot_final %>%
pivot_longer(-Sample_ID, names_to = "SNP_Name", values_to = "SNP_values")
#first I need to convert all the homozygotes the phrase "homozygous"
homozygotes = c("T/T", "C/C", "A/A", "G/G")
levels(plot_final$SNP_values)[match(homozygotes, levels(plot_final$SNP_values))] = "homozygous"
#now I'm going to change all the missing data to say "missing"
missing_data = c("?/?", "?/(A|G|C|T)", "(A|G|C|T)/?")
levels(plot_final$SNP_values)[match(missing_data, levels(plot_final$SNP_values))] = "missing_data"
#lastly I'm going to convert any other data to heterozygous
heterozygotes = c("A/G", "A/C","A/T", "G/C", "G/T","G/A", "T/C","T/A", "T/G","C/A", "C/T","C/G")
levels(plot_final$SNP_values)[match(heterozygotes, levels(plot_final$SNP_values))] = "heterozygotes"
```
**Plotting the graphs**
The first graph is showing the density of the heterozygous/homozygous/missing data across all the samples
```{r}
ggplot(date = plot_final) +
geom_bar(mapping = aes(x = plot_final$Sample_ID, fill = plot_final$SNP_values), position = "fill") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
ggtitle("Density of homo/heterozygous data \nacross SNP reads for every Sample") +
labs( y = "Density", x = "Samples")
```
The second graph is showing the density of the heterozygous/homozygous/missing data across all the groups
```{r}
plot_final2 = filter(fang, Group %in% teosinteMaizeTarget)
plot_final2 = plot_final2 %>%
select(3:986)
plot_final2 = plot_final2 %>%
pivot_longer(-Group, names_to = "SNP_Name", values_to = "SNP_values")
#first I need to convert all the homozygotes the phrase "homozygous"
levels(plot_final2$SNP_values)[match(homozygotes, levels(plot_final2$SNP_values))] = "homozygous"
#now I'm going to change all the missing data to say "missing"
levels(plot_final2$SNP_values)[match(missing_data, levels(plot_final2$SNP_values))] = "missing_data"
#lastly I'm going to convert any other data to heterozygous.
levels(plot_final2$SNP_values)[match(heterozygotes, levels(plot_final2$SNP_values))] = "heterozygotes"
#for the purposes of getting useful information, I want to change the group names to teosinte or maize depending on which category they fall in
levels(plot_final2$Group)[match(maizeTarget, levels(plot_final2$Group))] = "Maize"
levels(plot_final2$Group)[match(teosinteTarget, levels(plot_final2$Group))] = "Teosinte"
```
Now I'm going to graph it out seperating out the maize and teosinte groups.
```{r}
ggplot(date = plot_final2) +
geom_bar(mapping = aes(x = plot_final2$Group, fill = plot_final2$SNP_values), position = "fill") +
ggtitle("Density of homo/heterozygous data \nacross Maize and Teosinte groups") +
labs( y = "Density", x = "Samples")
```