-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathA2_NoahZeidenberg_1048535.Rmd
534 lines (429 loc) · 24.1 KB
/
A2_NoahZeidenberg_1048535.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
---
title: "A2_NoahZeidenberg_1048535"
author: "Noah Zeidenberg"
date: "2024-10-14"
output:
pdf_document:
latex_engine: xelatex
html_document:
df_print: paged
---
### Introduction
The concept and application of classification is inherent to genomics. Whether the classification is taxonomic, genetic, or chemical- it is vital that information is categorized correctly in order to inform current and future research models. As such, this project aims to compare the efficiency of a random forest (RF) classifier in distinguishing between two genes with similar functions within the taxon Aves, using k-mers of various sizes. Bone Morphogenetic Protein 4 (BMP4) and Alpha-2-HS-Glycoprotein (AHSG) are two genes with similar but significant roles in bone growth, and are predicted to strongly influence organism size [1]. Extracting distinguishing features from both sequences is crucial to studies aiming to taxonomically classify birds, as well as many other mammals.
K-mer analysis is one of many alignment-free techniques used for sequence classification. The choice of k-mer size is critical as it influences the feature representation of sequences and the performance of machine learning algorithms [2]. However, determining the optimal k size for a given set of sequences is an NP-hard problem, and common computational methods perform poorly at large k sizes. Using the Jellyfish program for efficiently computing occurrences of 7-, 14- and 21-mers, I will compare the performance of an RF classifier across small and large k-mers in categorizing BMP4 and AHSG sequences.
#### Load libraries
```{r error = FALSE, message=FALSE, warning=FALSE}
# some global settings
knitr::opts_chunk$set(echo = TRUE) # always include code in output doc
options(install.packages.ask = FALSE) # no need to ask to install packages/dependancies
options(crayon.enabled = FALSE) # could not knit with invalid characters from caret
library("pacman") # pacman automatically checks if a package is installed and (if not, installs it then) loads it into the current environment
pacman::p_load("dplyr", # for performing major dataframe selections/transformations/etc.
"tidyverse", # has numerous helpful functions I may use
"readr", # ""
"rentrez", # for using EUtils, connecting to NCBIs dbs
"DECIPHER", # for performing and multiple sequence alignments
"taxize", # works well with the bold package -> e.g. "downstream" fn reduces load on BOLD query
"xml2", # for parsing certain entrez responses
"randomForest", # using an RF model for comparison
"caret", # for confusion matrices
"keras", "tensorflow", # for training the DNN
"purrr", # I prefer map functions to the base::apply family
"ggplot2", "patchwork", # better plotting visualization and control than base plots
update = FALSE,
dependencies = TRUE)
```
##### Load other necessary variables
```{r}
# set rentrez API key from environment variable
tryCatch(rentrez::set_entrez_key(Sys.getenv("ENTREZ_KEY")))
# source functions
source("./functions.R")
# set the seed
set.seed(36) # for reproducability. 36 is 18 (lucky number) * 2 (for assignment #2 :)
```
#### Fetch Data from NCBI
```{r}
# load from text file rather than re-run rentrez functions
if (!exists("gene_data") & file.exists("./doc/gene_data.txt")) {
gene_data <- read_file(file = "./doc/gene_data.txt")
}
# again, only run if necessary
if (!exists("gene_data")) {
# search from NCBI's gene database for either gene in Aves returns ~ 300 entries
genes_query <- rentrez::entrez_search(db = "gene",
term = '(BMP4 OR AHSG OR "Alpha 2-HS Glycoprotein" OR "bone morphogenetic protein 4") AND "Aves"[porgn] AND alive[prop]',
use_history = TRUE) # using web_history object to mitigate size-issues in the subsequent API request
# pull info for each entry using web_history object
gene_data <- rentrez::entrez_fetch(db = "gene", web_history = genes_query$web_history, rettype = NULL, retmode = "default")
}
# save as text file rather than re-run rentrez functions
if (!file.exists("./doc/gene_data.txt")) {
write(gene_data, file = "./doc/gene_data.txt")
}
```
```{r echo = FALSE, error = FALSE, eval = FALSE, include = FALSE}
# not needed in output ^^. just to show that yes, I did look at my data before working with it
gene_data %>% cat # entries are formatted consistently, can apply a vectorized function...
```
```{r}
# split up each entry for parsing
entries <- str_split(gene_data, "\\n\\d+\\.\\s+")[[1]] %>% # split at each integer followed by a '.' and at least one whitespace character
trimws # trim white space around each line
entries %>%
strsplit(split = "\n") %>% # split into substrings by newline
.[[20]] %>% # take the 20th entry, for example
cat(sep = "\n") # concatenate the entry by newline
# Apply vectorized function to extract the ID, gene name, species name, accession ID and the start/stop bp numbers identifying where the gene resides within the fasta file for said accession ID.
df_spec <- purrr::map_dfr(entries, extract_info) # equivalent (though faster) to base::apply but output is a dataframe
# Checking for genes names in df_spec
unique(df_spec$gene)
# keep only the rows where the gene name is BMP4 or AHSG (some entries are from rentrez returning similar genes such as BMP7 or PAX6, which are also involved in bone development)
df_spec <- df_spec %>%
mutate(gene = str_trim(gene)) %>% # Remove any leading/trailing whitespace or special characters
filter(gene %in% c("BMP4", "AHSG")) %>% # still ~ 270 observations
drop_na # remove rows with NAs using tidyr's drop_na fn
# Visualize
df_spec %>% head(3)
```
```{r warning = FALSE}
# load from file if already run previously
if (file.exists("./doc/df_seqspec.tsv")) {
df_seqspec <- read_tsv("./doc/df_seqspec.tsv", show_col_types = FALSE)
}
# "" otherwise:
if (!exists("df_seqspec")) {
# apply vectorized function that leverages purrr's pmap ("parallel" map) to iterate over each row in df_spec and perform the entrez_fetch() using multiple column values in df_spec.
ls_fastas <- pmap(df_spec[,c("accession_id", "bp_start", "bp_stop")],
function(accession_id, bp_start, bp_stop) {
return(entrez_fetch(db = "nuccore", rettype = "fasta",
id = accession_id, seq_start = bp_start, seq_stop = bp_stop))
Sys.sleep(0.1) # to avoid rate-limiting
})
# export fastas to file, for use with Jellyfish later
ls_fastas %>%
unlist %>% # flatten to
writeLines(., con = "./doc/BMP4_AHSG_Aves.fasta")
# clean fasta files (remove header)
clean_fastas <- map(ls_fastas, clean_fasta)
# convert to dataframe and bind to df_spec
df_seqspec <- do.call(rbind, clean_fastas) %>% # bind rows
as.data.frame %>% # convert to df
setNames("sequence") %>% # set the column name to "sequence"
cbind(df_spec, .) # bind to df_spec
# save as file
write_tsv(df_seqspec, file = "./doc/df_seqspec.tsv")
}
# write fasta files for each gene to two separate folders, to be used in counting k-mers and training classifiers
if (!file.exists("./data/AHSG/NC_021681.1.fasta")) { # test case, it will exist only if this code is run
write_fasta_files(df_seqspec[,c("gene", "accession_id", "sequence")])
}
# drop unnecessary rows
df_seqspec <- df_seqspec[, c("gene", "species_name", "accession_id", "sequence"), drop = TRUE]
```
#### Function: Sequence Quality Check, Clean and Filter Based on Gaps and "N" Content ----
```{r}
quality_check_and_filter_sequences <- function(df) {
# Initialize a list to store quality check results
Results <- list()
# Count initial number of sequences
initial_count <- nrow(df)
# Quality Checks
# Check for sequences containing gaps ('-')
Results$gaps_count <- sum(str_detect(df$sequence, "-"))
# Check for sequences starting with 'N'
Results$start_N_count <- sum(str_detect(df$sequence, "^N"))
# Check for sequences ending with 'N'
Results$end_N_count <- sum(str_detect(df$sequence, "N$"))
# Clean sequences
df <- df %>%
mutate(
sequence = str_remove(sequence, "^[-N]+"), # Remove leading gaps and 'N's
sequence = str_remove(sequence, "[-N]+$"), # Remove trailing gaps and 'N's
sequence = str_remove_all(sequence, "-+") # Remove all remaining gaps
)
# Check for sequences with high 'N' content (>=5% of sequence length)
Results$high_N_content_count <- sum(str_count(df$sequence, "N") >= (0.05 * str_length(df$sequence)))
# Print quality check results
cat("Quality Check Results:\n")
cat("Count of sequences with gaps:", Results$gaps_count, "\n")
cat("Count of sequences starting with 'N':", Results$start_N_count, "\n")
cat("Count of sequences ending with 'N':", Results$end_N_count, "\n")
cat("Count of sequences with high 'N' content:", Results$high_N_content_count, "\n")
# Filter sequences with high 'N' content
df <- df %>%
filter(str_count(sequence, "N") < (0.05 * str_length(sequence))) # Remove sequences
# Calculate number of final and filtered out sequences
final_count <- nrow(df)
filtered_count <- initial_count - final_count
# Print the number of initial, filltered out and final sequences
cat("Initial sequence count:", initial_count, "\n")
cat("Total sequences filtered out:", filtered_count, "\n")
cat("Final sequence count:", final_count, "\n")
# Return the filtered dataframe
return(df)
}
# Apply the sequence quality check and filtering function to df_seqspec
df_seqspec <- quality_check_and_filter_sequences(df_seqspec)
```
```{r}
# show quartiles for sequence length in AHSG and BMP4, respectively
cat("Summary of quartiles in sequence lengths for AHSG \n")
df_seqspec %>%
filter(gene == "AHSG") %>%
pull(sequence) %>%
width %>%
summary
cat("\n Summary of quartiles in sequence lengths for BMP4 \n")
df_seqspec %>%
filter(gene == "BMP4") %>%
pull(sequence) %>%
width %>%
summary
```
### Filter outliers with very long sequences ----
```{r echo = TRUE, fig.show = "hide"}
# Extract sequence lengths for each gene
ahsg_lengths <- nchar(df_seqspec %>% filter(gene == "AHSG") %>% pull(sequence))
bmp4_lengths <- nchar(df_seqspec %>% filter(gene == "BMP4") %>% pull(sequence))
# Initial boxplot of sequence lengths by gene
boxplot(ahsg_lengths, bmp4_lengths,
names = c("AHSG", "BMP4"),
main = "Initial Distribution of Sequence Lengths by Gene",
ylab = "Sequence Length",
col = c("lightblue", "lightgreen"))
# Filter out BMP4 sequences with lengths greater than 5000 (extreme outliers)
df_seqspec <- df_seqspec %>%
filter(!(gene == "BMP4" & nchar(sequence) > 5000))
# Check the count of sequences for each gene
table(df_seqspec$gene)
# Extract sequence lengths for each gene after filtering outliers
ahsg_lengths <- nchar(df_seqspec %>% filter(gene == "AHSG") %>% pull(sequence))
bmp4_lengths <- nchar(df_seqspec %>% filter(gene == "BMP4") %>% pull(sequence))
# Boxplot of sequence lengths by gene after filtering
boxplot(ahsg_lengths, bmp4_lengths,
names = c("AHSG", "BMP4"),
main = "Distribution of Sequence Lengths by Gene (Filtered)",
ylab = "Sequence Length",
col = c("lightblue", "lightgreen"))
# Filtered out extreme outliers with sequence length > 5000.
# Filtering by quartiles was not feasible, as it would have reduced the dataset to only 124 sequences across both genes.
```
# Function to calculate k-mer frequencies for DNA sequences in a dataframe
```{r}
calculate_kmer_frequencies <- function(df_seqspec, k) {
df_kmer <- df_seqspec$sequence %>%
DNAStringSet() %>% # Convert sequences to DNAStringSet for DNA operations
oligonucleotideFrequency(width = k, as.prob = TRUE) %>% # Calculate k-mer frequencies as proportions
as.data.frame() %>% # Convert the result to a dataframe for further manipulation
mutate_all(~ round(., 2)) %>% # Round all frequency values to 2 decimal places
cbind(df_seqspec[, c("gene", "accession_id"), drop = TRUE], .) # Combine with gene and accession_id columns
return(df_kmer) # Return the resulting dataframe with k-mer frequencies
}
# Calculate 1-mer, 2-mer, and 3-mer frequencies using the kmer_function
df_1mer <- calculate_kmer_frequencies(df_seqspec, 1)
df_2mer <- calculate_kmer_frequencies(df_seqspec, 2)
df_3mer <- calculate_kmer_frequencies(df_seqspec, 3)
```
##### Generate a density plot of the probabilities of 1-, 2-, and 3-mer frequencies
```{r}
# add a column to each data frame to indicate k-mer size
df_1mer_hist <- df_1mer %>% mutate(kmer_size = "1-mer")
df_2mer_hist <- df_2mer %>% mutate(kmer_size = "2-mer")
df_3mer_hist <- df_3mer %>% mutate(kmer_size = "3-mer")
# combine data frames
df_combined <- bind_rows(df_1mer_hist, df_2mer_hist, df_3mer_hist)
# reshape to long format (each k-mer probability in one row)
df_long <- df_combined %>%
pivot_longer(cols = -c(accession_id, gene, kmer_size), names_to = "kmer", values_to = "probability")
# Plot density of k-mer probabilities by gene and k-mer size
densityplot <- ggplot(df_long, aes(x = probability, fill = gene, color = gene)) +
# Adjust transparency and outline size
geom_density(alpha = 0.3, linewidth = 0.8) +
# Set different colors for each gene
scale_fill_manual(values = c("AHSG" = "#E69F00", "BMP4" = "#56B4E9")) +
scale_color_manual(values = c("AHSG" = "#D55E00", "BMP4" = "#0072B2")) +
# Facet by k-mer size with independent y-axes
facet_wrap(~ kmer_size, scales = "free") +
# Set x-axis to start from zero
scale_x_continuous(limits = c(0, NA)) +
# Add labels for axes, title, and subtitle
labs(
x = "Probability of k-mer Occurrence",
y = "Density",
title = "Density Plot of k-mer Probabilities for Different k-mer Sizes",
subtitle = "Comparison of AHSG and BMP4 genes"
) +
# Minimal theme for a clean look
theme_minimal() +
# Customize plot theme for improved readability and styling
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 20),
plot.subtitle = element_text(hjust = 0.5, size = 18),
axis.title = element_text(size = 18, face = "bold"),
axis.text = element_text(size = 16),
strip.text = element_text(size = 18, face = "bold"),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 16, face = "bold")
)
# Save the plot if it doesn't already exist
if (!file.exists("./figs/small_mer_densityplot.png")) {
ggsave(filename = "./figs/small_mer_densityplot.png", plot = densityplot, width = 12, height = 8, dpi = 600)
}
# Show plot in R markdown (below)
```
![Figure Caption: Density plot showing the distribution of k-mer probabilities for AHSG and BMP4 genes across different k-mer sizes. An inverse relationship is observed between k-mer size and the probability of occurrence, where larger k-mer sizes correspond to lower probabilities of dominant patterns.](./figs/small_mer_densityplot.png)
```{r eval = FALSE}
# I attempted to compute larger k-mers using Jellyfish (installed in the bash shell according to the instructions on gmarcais/jellyfish github)
# I commented out certain lines since Jellyfish is likely only installed on my computer
#system("jellyfish --version") # check that jellyfish is working in the current R proj. directory
# computing 7-, 14- and 21-mers
#file_paths <- c(list.files("./doc/BMP4", full.names = T),
# list.files("./doc/AHSG", full.names = T))
# jellyfish(file_paths) # takes a very long time to run
# load from tsv files
if (file.exists("./doc/df_7mer.tsv")) {
df_7mer <- read_tsv("./doc/df_7mer.tsv", show_col_types = FALSE)
df_7mer[is.na(df_7mer)] <- 0 # replace NAs with 0
} else {
df_7mer <- bind_rows( # load data for each k-mer size
kmer_counts_from_files("./doc/BMP4", 7),
kmer_counts_from_files("./doc/AHSG", 7)
)
write_tsv(df_7mer, "./doc/df_7mer.tsv")
rm(df_7mer) # must be removed from local memory to proceed, otherwise my computer crashes
}
# again for 14-mers. My computer was not able to run this, I had to run it on another computer that has 16GB RAM
#if (file.exists("./doc/df_14mer.tsv")) {
# df_14mer <- read_tsv("./doc/df_14mer.tsv", show_col_types = FALSE)
# df_14mer[is.na(df_14mer)] <- 0 # ""
#} else {# Load data for each k-mer size
# df_14mer <- bind_rows(
# kmer_counts_from_files("./doc/BMP4", 14),
# kmer_counts_from_files("./doc/AHSG", 14)
# )
# write_tsv(df_14mer, "./doc/df_14mer.tsv")
# rm(df_14mer) # must be removed from local memory to proceed, otherwise my computer crashes
#}
# My computer was not able to compute 21-mers, neither was a 16GB RAM computer.
#if (file.exists("./doc/df_21mer.tsv")) {
# df_21mer <- read_tsv("./doc/df_21mer.tsv", show_col_types = FALSE)
# df_21mer[is.na(df_21mer)] <- 0 # ""
#} else {# Load data for each k-mer size * not run
# df_21mer <- bind_rows(
# kmer_counts_from_files("./doc/BMP4", 21),
# kmer_counts_from_files("./doc/AHSG", 21)
# )
# write_tsv(df_21mer, "./doc/df_21mer.tsv")
# rm(df_21mer) # must be removed from local memory to proceed, otherwise my computer crashes
#}
```
#### Train random forest classifier
```{r}
# apply defined function to each set of kmer data
if(!file.exists("./doc/confusionMatrix_7mer.txt")) {
map2(list(df_1mer, df_2mer, df_3mer, df_7mer),
c("1", "2", "3", "7"),
RF_confMatrix)
}
# an example of the computed confusion matrices
cat("An example of the computed confusion matrices, for 1-mer data\n")
readRDS("./doc/confusionMatrix_1mer.txt")
```
Table 1: Summary of RF classification metrics, where *AHSG* is the positive class and *BMP4* is the negative class.
| Model | 95% CI (Accuracy) | Sensitivity | Specificity | Kappa |
|--------|-------------------|-------------|-------------|-------|
| 1-mers | (93.23%, 99.97%) | 97.56% | 100.00% | 97.5% |
| 2-mers | (95.49%, 100%) | 100% | 100% | 100% |
| 3-mers | "" | "" | "" | "" |
| 7-mers | "" | "" | "" | "" |
#### Train Deep Neural Network Classifier
```{r, eval = FALSE, message=FALSE, warning=FALSE, sanitize=TRUE}
# I included only the example for 1mers, as I struggled to vectorize the code and I would have been over the page limit. However, the only change I made was to "k".
# one-hot encode sequence data
one_hot_encoded_data <- lapply(df_seqspec$sequence, one_hot_encode)
# first bind the feature matrix and labels
feature_matrix <- t(sapply(df_seqspec$sequence, generate_kmer_counts, k = 1))
training_data <- data.frame(feature_matrix, label = as.factor(df_seqspec$gene))
# split data into training and test (70/30 split)
trainIndex <- createDataPartition(training_data$label, p = 0.7, list = FALSE)
testIndex <- setdiff(seq_len(nrow(training_data)), trainIndex)
train_data <- training_data[trainIndex, ]
test_data <- training_data[-trainIndex, ]
# need to define model input shape for tensorflow
sequence_length <- ncol(one_hot_encoded_data[[1]])
input_shape <- c(sequence_length, 4) # 4 for A, T, G and C
# define the layers and parameters in the model, according to reference [3]
model <- keras_model_sequential()
# add layers individually
model$add(layer_conv_1d(filters = 64, kernel_size = 16, activation = "relu", input_shape = input_shape))
model$add(layer_max_pooling_1d(pool_size = 13, strides = 13))
model$add(bidirectional(layer = layer_lstm(units = 64, return_sequences = TRUE)))
model$add(layer_dropout(rate = 0.5))
model$add(layer_flatten())
model$add(layer_dense(units = 512, activation = "relu"))
model$add(layer_dropout(rate = 0.5))
model$add(layer_dense(units = 1, activation = "sigmoid"))
# compile using TensorFlow
tf$keras$models$Sequential$compile(
model,
loss = "binary_crossentropy",
optimizer = "adam",
metrics = list("accuracy")
)
# convert list of matrices to array
train_x <- array(unlist(one_hot_encoded_data[trainIndex]), dim = c(length(trainIndex), sequence_length, 4))
test_x <- array(unlist(one_hot_encoded_data[-trainIndex]), dim = c(length(testIndex), sequence_length, 4))
# add labels
train_y <- as.numeric(train_data$label == "AHSG") # Adjust labels as needed
test_y <- as.numeric(test_data$label == "BMP4")
# convert train and test data to TensorFlow-compatible format
train_x <- array_reshape(train_x, dim = c(dim(train_x)[1], dim(train_x)[2], dim(train_x)[3]))
test_x <- array_reshape(test_x, dim = c(dim(test_x)[1], dim(test_x)[2], dim(test_x)[3]))
train_y <- as.array(train_y)
test_y <- as.array(test_y)
# I could only fit using TensorFlow's backend format
history_1mer <- tf$keras$models$Sequential$fit(
model,
x = train_x,
y = train_y,
epochs = as.integer(30),
batch_size = as.integer(32),
validation_data = list(test_x, test_y)
)
# save history object
history_metrics <- list(
accuracy = history_1mer$history$accuracy,
val_accuracy = history_1mer$history$val_accuracy,
loss = history_1mer$history$loss,
val_loss = history_1mer$history$val_loss
)
saveRDS(history_metrics, file = "./doc/DNN_history_1mer.rds")# Create figure comparing accuracy per epoch for all the trained models
```
#### Create plots for DNN results
```{r}
# Create plot comparing accuracy, loss, val_accuracy and val_loss for the four (1, 2, 3, 7) models
p1 <- DNN_plot("./doc/DNN_history_1mer.rds", k = 1)
p2 <- DNN_plot("./doc/DNN_history_2mer.rds", k = 2)
p3 <- DNN_plot("./doc/DNN_history_3mer.rds", k = 3)
p4 <- DNN_plot("./doc/DNN_history_7mer.rds", k = 7)
# Output as one figure
combined_plot <- (p1 + p2 + p3 + p4) +
plot_layout(ncol = 2, nrow = 2) +
plot_annotation(tag_levels = 'A') # Labels each plot as A, B, C, D
# Save plot
if (!file.exists("./figs/DNN_performance.png")) {
ggsave(filename = "./figs/DNN_performance.png", plot = combined_plot, width = 12, height = 8, dpi = 600)
}
```
![Figure 2: Line plots of DNN training and validation accuracy over time, with training and validation loss as error margins scaled to 10% of their reported value.](./figs/DNN_performance.png)
### Results and Discussion
The significant range in size of the BMP4 sequences presented a challenge for alignment-based methods- some of the issues that arose during my preliminary analysis were 1. long stretches of gaps in shorter sequences 2. increased computational load due to the largest sequences being \> 150 kbps and 3. loss of resolution in phylogenetic comparisons between species. Instead, k-mers were used for classification, and lengths 1, 2, 3, and 7 were compared as features for an RF classifier and a hybrid-DNN inspired by DanQ [3]. The RF classifier performed well even for kmers of length 1, implying that the genes were not similar enough to present a challenge. In the future I would attempt the same pipeline on more similar genes, such as two BMP class genes[1]. The performance of the RF classifier did not improve after k \> 2, likely for the above reason (Table 1). Interestingly, the DNN performed okay on 1-mers (\~ 60% accuracy) , but began overfitting after epoch 25 (Fig 2A). The DNN performed worse on 2-, 3- and 7-mers and had extreme overfitting on each (Fig 2B-D).
This pipeline has potential for improvement in both the model parameters (of both RF and DNN) and the quality of the input data. What I find particularly interesting is that the DNN may outperform the RF as the number of classification bins increases (i.e. from a binary classificiation problem of BMP4 or AHSG, to factors of species as well) [3].
##### Acknowledgments
I would like to thank Brittany for giving me an extension, and though I was still not able to complete the components of the pipeline I wished to, I learned a lot.
##### References
1. Wu, L., Jiao, X., Zhang, D., Cheng, Y., Song, G., Qu, Y., and Lei, F. (2021). Comparative genomics and evolution of avian specialized traits. Current Genomics 22, 496–511. <https://doi.org/10.2174/1389202923666211227143952>.
2. Marçais, G., and Kingsford, C. (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770. <https://doi.org/10.1093/bioinformatics/btr011>.
3. Quang, D., and Xie, X. (2016). DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Research *44*, e107. <https://doi.org/10.1093/nar/gkw226.>