GN_Assignment3.Rmd

---
title: "GN Assignment3"
author: "Grant Nickles"
date: "10/12/2019"
output: html_document
---
BEFORE RUNNING THE CODE SET THE WORKING DIRECTORY TO THE REPOSITORY FOLDER. Everything should work after this point.

First I will call the packages needed to graph and manipulate the data.
```{r}
library(ggplot2)
library(tidyverse)
library(dplyr)
library(forcats)
```

#**Data Inspection**

##**Inspection of fang_et_al_genotypes.txt**

Reading in the data
```{r}
fang = read.delim("fang_et_al_genotypes.txt", header = TRUE, sep = "\t")
```
Checking the rows and columns of the file: The file appears to have 2782 rows and 986 columns. Thus there are 2782 unique observations. This is a very large file and thus it will be unreasonable to by hand look at all the columns or data points. The byte size is 12,163,888.
```{r}
dim(fang)
object.size(fang)
```

Checking on the first six rows of the data I'm able to pull a couple pieces of information. The Samples are labled in the "Sample_ID" colume. Missing data appears to be indicated with "?" and the diploid copies of each SNP location is separted with "/". The 2nd and 3rd column both have other info on the samples but are not SNP data.
```{r}
View(fang)
```


##**Inspection of snp_position.txt**

Reading in the data
```{r}
snp = read.delim("snp_position.txt", header = TRUE, sep = "\t")
```

Checking the rows and columns on the SNP file there appears to be 15 columns and 983 rows. Supbracting the rows in the fang file that are not relavent to the SNP locations the amount of SNP observations matches on both files. The file overall is 324,472 bytes.
```{r}
dim(snp)
object.size(snp)
```
This file however has the formating of data in a transposed version of the fang file. In addition it has the chromosome and position that the SNP are located on. The rest of the column headers are extra info that may not be needed later on.
```{r}
View(snp)
names(snp)
```


#**Data Processing**

First I'm going to sepearte out the groups that way I don't have to later worry about this column altogether.
```{r}
teosinteTarget = c('ZMMIL','ZMMLR','ZMMMR')
teosinte = filter(fang, Group %in% teosinteTarget)

maizeTarget = c('ZMPBA', 'ZMPIL', 'ZMPJA')
maize = filter(fang, Group %in% maizeTarget)
```

Now I need to transpose the files after removing the columns I don't care about.
```{r}
teosinte = teosinte %>%
  select(1, 4:986)
maize = maize %>%
  select(1, 4:986)
snp = snp %>%
  select(1,3,4)
```

Next I transposed the fang data for the maize and teosinte, wrote it to a seperate file without the headers(they will be added back later when combined with the SNP file), and read that file back into R with the headers. Now the SNP locations are going across the top of the file. I then took the SNP first column and added it onto the beginging of the fang file -> Labeledfang. Now we have two files that are ready for merging!

```{r}
Tteosinte = as.data.frame(t(teosinte))
Tmaize = as.data.frame(t(maize))
```

```{r}
write_csv(Tteosinte, "TeoNoHeaders", col_names = FALSE)
write_csv(Tmaize, "MaizeNoHeaders", col_names = FALSE)
```

```{r}
Tteosinte2 = read_csv("TeoNoHeaders", col_names = TRUE)
Tmaize2 = read_csv("MaizeNoHeaders", col_names = TRUE)
Sample_ID = snp %>%
  select(1)
LabeledTeo = cbind(Sample_ID, Tteosinte2)
LabeledMaize = cbind(Sample_ID, Tmaize2)
```

Now I'm going to merge both files by the first column that they share. But first I need to add back in the group column. The files are now formated as specified in the rubric, with the SNP_ID in the first column, Chromosome in the second column, and Position in the third column.
```{r}
MergedTeo = merge(snp, LabeledTeo, by="SNP_ID")
MergedMaize = merge(snp, LabeledMaize, by="SNP_ID")
```

Great! Now we have a file to work with for the rest of the data analysis. However I want to write this file into a .csv doc because I will be working with it a lot. That way I have it if ever needed. 
```{r}
write_csv(MergedTeo, "MergedTeo", col_names = TRUE)
write_csv(MergedMaize, "MergedMaize", col_names = TRUE)
```

##**Teosinte**
ZMMIL, ZMMLR, and ZMMMR
*The file that I'm working with at the start is MergedTeo*

First I'm going to filter out by the chromosome, arrange that such that the position is ordered in ascending order, and then write those as seperate files. I was having issuse at first due to the posion column's data type being listed as Factor and not a numeric value. By adding in the line of code that converted the data to numeric I was able to get around this.
```{r}
setwd("./teosinte/Ascending")
#I want to store the files in this specific folder hence why I had to change to working directory
i = 1
while(i <= 10){
  a = i
  ChromTeosinteAscending = filter(MergedTeo, MergedTeo$Chromosome == a)
  ChromTeosinteAscending$Position = as.numeric(as.character(ChromTeosinteAscending$Position))
  ChromTeosinteAscending = ChromTeosinteAscending[order(ChromTeosinteAscending$Position), ]
  write_csv(ChromTeosinteAscending, sprintf("Ascending_TeoChromosome_%s", a), col_names = TRUE)
  i = i + 1
}
```

First step is to replace the ? characters with the - character.
```{r}
i = 4 
DecMergedTeo = MergedTeo
for(i in names(DecMergedTeo)){
  DecMergedTeo[[i]] = gsub("\\?","-", DecMergedTeo[[i]])
}

```
Now I need to as last time seperate out the chromosomes and order them in decreasing order.
```{r}
setwd("./teosinte/Decreasing")
#again I changed the working directory to put the files in the correct folder

i = 1
while(i <= 10){
  a = i
  ChromTeosinteDecreasing = filter(DecMergedTeo, DecMergedTeo$Chromosome == a)
  ChromTeosinteDecreasing$Position = as.numeric(as.character(ChromTeosinteDecreasing$Position))
  ChromTeosinteDecreasing = ChromTeosinteDecreasing[order(ChromTeosinteDecreasing$Position, decreasing = TRUE), ]
  write_csv(ChromTeosinteDecreasing, sprintf("Decreasing_TeoChromosome_%s", a), col_names = TRUE)
  i = i + 1
}
```
Great! Now I have all 20 of the files I needed organized into files within the repository. Now the easy part, doing the same with the maize files. Since most of the code will be recycled I'm not going to put as many comments over every chunk I run.

##**Maize**
ZMPBA, ZMPIL, and ZMPJA
*The file that I'm working with at the start is MergedMaize*

First as last time I'm doing the ascending order.
```{r}
setwd("./maize/Ascending")
#I want to store the files in this specific folder hence why I had to change to working directory
i = 1
while (i <= 10){
  a = i
  ChromMaizeAscending = filter(MergedMaize, MergedMaize$Chromosome == a)
  ChromMaizeAscending$Position = as.numeric(as.character(ChromMaizeAscending$Position))
  ChromMaizeAscending = ChromMaizeAscending[order(ChromMaizeAscending$Position), ]
  write_csv(ChromMaizeAscending, sprintf("Ascending_MaizeChromosome_%s", a), col_names = TRUE)
  i = i + 1
}

```

Now I will replace the ? characters with the - character.
```{r}
i = 4 
DecMergedMaize = MergedMaize
for(i in names(DecMergedMaize)){
  DecMergedMaize[[i]] = gsub("\\?","-", DecMergedMaize[[i]])
}

```
Next I'm seperating out the chromosomes and ordering them in decreasing order.
```{r}
setwd("./maize/Decreasing")
#again I changed the working directory to put the files in the correct folder

i = 1
while(i <= 10){
  a = i
  ChromMaizeDecreasing = filter(DecMergedMaize, DecMergedMaize$Chromosome == a)
  ChromMaizeDecreasing$Position = as.numeric(as.character(ChromMaizeDecreasing$Position))
  ChromMaizeDecreasing = ChromMaizeDecreasing[order(ChromMaizeDecreasing$Position, decreasing = TRUE), ]
  write_csv(ChromMaizeDecreasing, sprintf("Decreasing_MaizeChromosome_%s", a), col_names = TRUE)
  i = i + 1
}
```

#**Part 2: Data Visualizaion**
First since we now want to graph all of the data for the maize and teosinte fang file, I'm going to repeat the steps I did earlier but combining both groups.
```{r}
library(reshape2)
library(plyr)
MergedMaize2 = MergedMaize %>%
  select(1, 4:978)
MergedCombined = merge(MergedMaize2, MergedTeo, by="SNP_ID")
MergedCombined = merge(snp, MergedCombined, by="SNP_ID")
MergedCombined = rename(MergedCombined, c("Position.x" = "Position", "Chromosome.x" = "Chromosome"))
```


##**SNPs per chromosome**
###**Total number of SNPs in our dataset per chromosomes
```{r}
ggplot(data = MergedCombined) +
  geom_bar(mapping = aes(x = Chromosome, fill = Chromosome)) +
  ggtitle("SNPs per chromosome")

```
###**Distribution of SNPs on chromosomes and my personal graph**
Because it is impossible to graph the SNP positions when they are listed as unknown or multiple I am first going to drop this data. The dropped data is now in MergedCombinedFiltered. There are several ways to show the disrubution of the of the SNPs on the chromosomes so I chosed to graph the distruptions on the length of the chromosome, and on the chromosome altogether. 
```{r}
MergedCombined$Position = as.numeric(as.character(MergedCombined$Position))
MergedCombinedFiltered = filter(MergedCombined, as.numeric(as.character(Position)) >= 1)
MergedCombinedFiltered = subset(MergedCombined, !(Chromosome %in% "multiple" | Chromosome %in% "unknown" | Position %in% "multiple" | Position %in% "unknown" | Position %in% ""))
Chromosomes = c("multiple", "unknown", "")
levels(MergedCombinedFiltered$Chromosome)[match(Chromosomes, levels(MergedCombinedFiltered$Chromosome))] = NA
MergedCombinedFiltered = MergedCombinedFiltered %>%
  mutate(PositionRange = cut(Position, 5));
```
The **first graph** is the density split by all the chromosomes. 

The **second graph** has the PositionRange on the X-axis. This mutated category is just the Position column binned into five groups. The choice of 5 is arbitary, but it shows that several of the chromosomes have SNPs in roughly the 1st bin only. This could be from those chromosomes being smaller, or the SNPs measured just all being in that region of those chromosomes. **this is my personal graph that we were asked to create.**


```{r}
ggplot(data = MergedCombinedFiltered) + 
  geom_density(mapping=aes(x = Chromosome, fill = Chromosome), alpha = 0.15) +
  ggtitle("SNP density on each chromosome")

Chromosomes = c("multiple", "unknown", "")
levels(MergedCombinedFiltered$Chromosome)[match(Chromosomes, levels(MergedCombinedFiltered$Chromosome))] = NA

ggplot(data = MergedCombinedFiltered) + 
  geom_density(mapping=aes(x = PositionRange, fill = Chromosome), alpha = 0.15) +
  ggtitle("SNP density on its position on the chromosome") +
  theme(axis.text.x=element_blank())
##NOTE the graph for this one says fewer than two data points have been dropped as it dropped the position ranges that were null
```

###**Missing data and amount of heterozygosity**

For this graph we don't even need the data in the SNP file. I'm going to format the fang file to be how we want it. 


```{r}
teosinteMaizeTarget = c('ZMMIL','ZMMLR','ZMMMR', 'ZMPBA', 'ZMPIL', 'ZMPJA')
plot_final = filter(fang, Group %in% teosinteMaizeTarget)
plot_final = plot_final %>%
  select(1, 4:986)

plot_final = plot_final %>% 
  pivot_longer(-Sample_ID, names_to = "SNP_Name", values_to = "SNP_values")
#first I need to convert all the homozygotes the phrase "homozygous"
homozygotes = c("T/T", "C/C", "A/A", "G/G")
levels(plot_final$SNP_values)[match(homozygotes, levels(plot_final$SNP_values))] = "homozygous"

#now I'm going to change all the missing data to say "missing"
missing_data = c("?/?", "?/(A|G|C|T)", "(A|G|C|T)/?")
levels(plot_final$SNP_values)[match(missing_data, levels(plot_final$SNP_values))] = "missing_data"

#lastly I'm going to convert any other data to heterozygous
heterozygotes = c("A/G", "A/C","A/T", "G/C", "G/T","G/A", "T/C","T/A", "T/G","C/A", "C/T","C/G")
levels(plot_final$SNP_values)[match(heterozygotes, levels(plot_final$SNP_values))] = "heterozygotes"

```
**Plotting the graphs**
The first graph is showing the density of the heterozygous/homozygous/missing data across all the samples
```{r}
ggplot(date = plot_final) +
  geom_bar(mapping = aes(x = plot_final$Sample_ID, fill = plot_final$SNP_values), position = "fill") +
  theme(axis.text.x=element_blank(),
  axis.ticks.x=element_blank()) +
  ggtitle("Density of homo/heterozygous data \nacross SNP reads for every Sample") +
  labs( y = "Density", x = "Samples")
```

The second graph is showing the density of the heterozygous/homozygous/missing data across all the groups
```{r}
plot_final2 = filter(fang, Group %in% teosinteMaizeTarget)
plot_final2 = plot_final2 %>%
  select(3:986)

plot_final2 = plot_final2 %>% 
  pivot_longer(-Group, names_to = "SNP_Name", values_to = "SNP_values")
#first I need to convert all the homozygotes the phrase "homozygous"
levels(plot_final2$SNP_values)[match(homozygotes, levels(plot_final2$SNP_values))] = "homozygous"

#now I'm going to change all the missing data to say "missing"
levels(plot_final2$SNP_values)[match(missing_data, levels(plot_final2$SNP_values))] = "missing_data"

#lastly I'm going to convert any other data to heterozygous.
levels(plot_final2$SNP_values)[match(heterozygotes, levels(plot_final2$SNP_values))] = "heterozygotes"

#for the purposes of getting useful information, I want to change the group names to teosinte or maize depending on which category they fall in
levels(plot_final2$Group)[match(maizeTarget, levels(plot_final2$Group))] = "Maize"
levels(plot_final2$Group)[match(teosinteTarget, levels(plot_final2$Group))] = "Teosinte"
```

Now I'm going to graph it out seperating out the maize and teosinte groups.
```{r}
ggplot(date = plot_final2) +
  geom_bar(mapping = aes(x = plot_final2$Group, fill = plot_final2$SNP_values), position = "fill") +
  ggtitle("Density of homo/heterozygous data \nacross Maize and Teosinte groups") +
  labs( y = "Density", x = "Samples")
```