swap_speaqeasy.Rmd

---
title: "Identifying sample swaps"
author: 
  - name: Joshua M. Stolz
    affiliation:
    - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
  - name: Louise Huuki
    affiliation:
    - *libd
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: '9/18/2020'
---


```{r 'setup', echo = FALSE, warning = FALSE, message = FALSE}
timestart <- Sys.time()

## Bib setup
library("knitcitations")

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = "to.doc", citation_format = "text", style = "html")

## Write bibliography information
bibs <- c(
    SummarizedExperiment = citation("SummarizedExperiment"),
    devtools = citation("devtools"),
    jaffelab = citation("jaffelab"),
    knitcitations = citation("knitcitations"),
    pheatmap = citation("pheatmap"),
    R = citation(),
    tidyr = citation("tidyr"),
    here = citation("here"),
    rmarkdown = citation("rmarkdown"),
    VariantAnnotation = citation("VariantAnnotation")
)
write.bibtex(bibs, file = "swap_analysis.bib")
bib <- read.bibtex("swap_analysis.bib")

## Assign short names
names(bib) <- names(bibs)
```

```{r libraries, message = FALSE, warning = FALSE}
library("pheatmap")
library("tidyr")
library("jaffelab")
library("here")
library("VariantAnnotation")
library("SummarizedExperiment")
library("devtools")
library("BiocStyle")
```


In order to resolve the swaps to our best ability we need four data sets. Here we have load snpGeno_example which is from our TOPMed imputed genotype data, a phenotype data sheet (pd_example), a VCF file of the relevant SPEAQeasy output (SPEAQeasy), and our current genotype sample sheet (brain_sentrix). This file is wrote in the directory listed below.

```{r load_data, echo=TRUE}
load(here("sample_selection", "snpsGeno_example.RData"), verbose = TRUE)
load(here("sample_selection", "pd_example.Rdata"), verbose = TRUE)
Speaqeasy <-
    readVcf(here(
        "pipeline_outputs",
        "merged_variants",
        "mergedVariants.vcf.gz"
    ),
    genome = "hg38"
    )
brain_sentrix <- read.csv(here("brain_sentrix_speaqeasy.csv"))
```

We can see that the genotype is represented in the form of 0s,1s, and 2s. The rare 2s are a result of multiallelic snps and we will drop those. 0 represent the reference allele with ones representing the alternate. We can see the distribution below.

```{r explore_speaqeasy_speaqeasy, echo=TRUE}
Geno_speaqeasy <- geno(Speaqeasy)$GT
table(Geno_speaqeasy)
```

Given this we convert we convert the Genotype data from SPEAQeasy to numeric data. The "./." were values that could not accurately be determined and are replaced with NA.
```{r data_prep_speaqeasy, echo=TRUE}
colnames_speaqeasy <- as.data.frame(colnames(Geno_speaqeasy))
colnames(colnames_speaqeasy) <- c("a")
samples <-
    separate(colnames_speaqeasy,
        a,
        into = c("a", "b", "c"),
        sep = "_"
    )
samples <- paste0(samples$a, "_", samples$b)
samples <- as.data.frame(samples)
colnames(Geno_speaqeasy) <- samples$samples
Geno_speaqeasy[Geno_speaqeasy == "./."] <- NA
Geno_speaqeasy[Geno_speaqeasy == "0/0"] <- 0
Geno_speaqeasy[Geno_speaqeasy == "0/1"] <- 1
Geno_speaqeasy[Geno_speaqeasy == "1/1"] <- 2
class(Geno_speaqeasy) <- "numeric"
corner(Geno_speaqeasy)
```

We then make a correlation matrix to find the possible mismatches between samples.

```{r cor_speaqeasy, echo=TRUE}
speaqeasy_Cor <- cor(Geno_speaqeasy, use = "pairwise.comp")
corner(speaqeasy_Cor)
```

Here in the heatmap below we can see that several points do not correlate with themselves in a symmetrical matrix. This could be mismatches, but it also could be a result of a brain being sequenced twice. We will dig more into this later on.

```{r check_correlations_speaqeasy, echo=FALSE}
pheatmap(
    speaqeasy_Cor,
    cluster_rows = FALSE,
    show_rownames = FALSE,
    cluster_cols = FALSE,
    show_colnames = FALSE
)
```

```{r pdf_cor_speaqeasy, eval=FALSE, echo = FALSE}
## For some reason, I need to run this manually to save the PDF
dir.create(here("pdf_swaps"), showWarnings = FALSE)
pdf(here("pdf_swaps", "heatmap_cor_speaqeasy.pdf"))
pheatmap(
    speaqeasy_Cor,
    cluster_rows = FALSE,
    show_rownames = FALSE,
    cluster_cols = FALSE,
    show_colnames = FALSE
)
dev.off()
```


We repeat the process for the genotype data from TOPMed. First creating our numeric data for the genotypes.

```{r data_prep_genotypes, echo=FALSE}
names <- rownames(snpsGeno_example)
Geno_example <- geno(snpsGeno_example)$GT
table(Geno_example)
Geno_example[Geno_example == ".|."] <- NA
Geno_example[Geno_example == "0|0"] <- 0
Geno_example[Geno_example == "1|0"] <- 1
Geno_example[Geno_example == "0|1"] <- 1
Geno_example[Geno_example == "1|1"] <- 2
class(Geno_example) <- "numeric"
rownames(Geno_example) <- names
correlation_genotype <- cor(Geno_example, use = "pairwise.comp")
corner(correlation_genotype)
```

In this case the data only appears to have samples that match themselves. However there is the potential for a second kind of error where a brain has two samples, however the do not match each other. 

```{r check_correlations_genotypes, echo=FALSE}
pheatmap(
    correlation_genotype,
    cluster_rows = FALSE,
    show_rownames = FALSE,
    cluster_cols = FALSE,
    show_colnames = FALSE
)
```

```{r pdf_cor_topmed, eval=FALSE, echo = FALSE}
## For some reason, I need to run this manually to save the PDF
pdf(here("pdf_swaps", "heatmap_cor_topmed.pdf"))
pheatmap(
    correlation_genotype,
    cluster_rows = FALSE,
    show_rownames = FALSE,
    cluster_cols = FALSE,
    show_colnames = FALSE
)
dev.off()
```

In order to dig into this further we will collapse the correlation matrices into a data table shown below.

```{r make_table_genotypes, echo=TRUE}
corLong <-
    data.frame(cor = signif(as.numeric(correlation_genotype), 3))
corLong$rowSample <-
    rep(colnames(snpsGeno_example), times = ncol(snpsGeno_example))
corLong$colSample <-
    rep(colnames(snpsGeno_example), each = ncol(snpsGeno_example))
corLong <- corLong[!is.na(corLong$cor), ]
head(corLong)
```

```{r check_for_swaps_correlations_genotypes, echo=TRUE}
corLong2 <- data.frame(cor = signif(as.numeric(speaqeasy_Cor), 3))
corLong2$rowSample <-
    rep(colnames(Geno_speaqeasy), times = ncol(Geno_speaqeasy))
corLong2$colSample <-
    rep(colnames(Geno_speaqeasy), each = ncol(Geno_speaqeasy))
corLong2 <- corLong2[!is.na(corLong2$cor), ]
head(corLong2)
```

```{r format_genotype_data_sets, echo=FALSE}
brain_sentrix_present <-
    subset(brain_sentrix, ID %in% colnames(snpsGeno_example))
brain_sentrix_match_row <-
    match(corLong$rowSample, brain_sentrix_present$ID)
brain_sentrix_match_col <-
    match(corLong$colSample, brain_sentrix_present$ID)
corLong$rowBrain <-
    brain_sentrix_present$BrNum[brain_sentrix_match_row]
corLong$colBrain <-
    brain_sentrix_present$BrNum[brain_sentrix_match_col]
corLong$rowBatch <-
    brain_sentrix_present$Batch[brain_sentrix_match_row]
corLong$colBatch <-
    brain_sentrix_present$Batch[brain_sentrix_match_col]
speaqeasy_match_row <-
    match(corLong2$rowSample, pd_example$SAMPLE_ID)
speaqeasy_match_col <-
    match(corLong2$colSample, pd_example$SAMPLE_ID)
corLong2$rowBrain <- pd_example$BrNum[speaqeasy_match_row]
corLong2$colBrain <- pd_example$BrNum[speaqeasy_match_col]
```

We can check these tables for columns where different brains are strongly correlated and where the same brain fails to match itself. Below is the output of those analysis for the TOPMed genotypes.

```{r check_swaps_genotype_data_sets, echo=FALSE}
corLong[corLong$rowBrain == corLong$colBrain & corLong$cor < .8, ]

corLong[!corLong$rowBrain == corLong$colBrain & corLong$cor > .8, ]
```

And we do this again for the SPEAQeasy data.

```{r check_swaps_speaqeasy_data_sets, echo=FALSE}
corLong2[corLong2$rowBrain == corLong2$colBrain &
    corLong2$cor < .8, ]
# Mismatches
corLong2[!corLong2$rowBrain == corLong2$colBrain &
    corLong2$cor > .8, ]
```

We will next compare the correlation between the SPEAQeasy samples and the TOPMed samples. In order to do this we need to subset the genotypes for only SNPs that are common between the two. We can see that we have 656 snps common between the 42 samples.

```{r subset_genotype_data_sets, echo=FALSE}
Geno_speaqeasy_subset <-
    Geno_speaqeasy[rownames(Geno_speaqeasy) %in% rownames(Geno_example), ]
snpsGeno_subset <-
    Geno_example[rownames(Geno_example) %in% rownames(Geno_speaqeasy_subset), ]
dim(Geno_speaqeasy_subset)
dim(snpsGeno_subset)
```

As we did before we create a correlation matrix this time between the two data sets. 

```{r make_correlation_genotype_speaqeasy, echo=FALSE}
correlation_genotype_speaq <-
    cor(snpsGeno_subset, Geno_speaqeasy_subset, use = "pairwise.comp")
```


```{r make_corlong3_table, echo=FALSE}
corLong3 <-
    data.frame(cor = signif(as.numeric(correlation_genotype_speaq), 3))
corLong3$rowSample <-
    rep(colnames(snpsGeno_example), times = ncol(snpsGeno_example))
corLong3$colSample <-
    rep(colnames(Geno_speaqeasy), each = ncol(Geno_speaqeasy))
```

Check to correlation between SPEAQeasy and Genotype for mismatches and swaps.

```{r check_mismatches, echo=FALSE}
speaqeasy_match_col <-
    match(corLong3$colSample, pd_example$SAMPLE_ID)
corLong3$colBrain <- pd_example$BrNum[speaqeasy_match_col]
brain_sentrix_match_row <-
    match(corLong3$rowSample, brain_sentrix_present$ID)
corLong3$rowBrain <-
    brain_sentrix_present$BrNum[brain_sentrix_match_row]
# Fails to match
corLong3[corLong3$rowBrain == corLong3$colBrain &
    corLong3$cor < .8, ]
# Mismatches
corLong3[!corLong3$rowBrain == corLong3$colBrain &
    corLong3$cor > .8, ]
```

We can see from this from this analysis there are a few swaps present between RNA and DNA samples here. We can categorize them as simple and complex sample swaps. Because the two Br2275 do not match each other and also match nothing else we will be forced to consider this a complex swap and drop the sample. In the case of Br2473 it is a simple swap with Br2260 in both cases. This can be amended by swapping with in the phenotype data sheet manually. Now we have our accurate data outputs and will need to fix our ranged summarized experiment object for our SPEAQeasy data.

```{r swap_pd_sheet, eval=FALSE}
load(here(
    "pipeline_outputs",
    "count_objects",
    "rse_gene_Jlab_experiment_n42.Rdata"
))

## drop sample from rse with SPEAQeasy data
ids <- pd_example$SAMPLE_ID[pd_example$BrNum == "Br2275"]
rse_gene <- rse_gene[, !rse_gene$SAMPLE_ID == ids[1]]
rse_gene <- rse_gene[, !rse_gene$SAMPLE_ID == ids[2]]

# resolve swaps and drops in pd_example
pd_example <- pd_example[!pd_example$SAMPLE_ID == ids[1], ]
pd_example <- pd_example[!pd_example$SAMPLE_ID == ids[2], ]
ids2 <- pd_example$SAMPLE_ID[pd_example$BrNum == "Br2260"]
ids3 <- pd_example$SAMPLE_ID[pd_example$BrNum == "Br2473"]
pd_example$SAMPLE_ID[pd_example$Sample_ID == ids2] <- "Br2473"
pd_example$SAMPLE_ID[pd_example$Sample_ID == ids3] <- "Br2260"

# reorder phenotype data by the sample order present in the 'rse_gene' object
pd_example <-
    pd_example[match(rse_gene$SAMPLE_ID, pd_example$SAMPLE_ID), ]

# add important colData to 'rse_gene'
rse_gene$BrainRegion <- pd_example$BrainRegion
rse_gene$Race <- pd_example$Race
rse_gene$PrimaryDx <- pd_example$PrimaryDx
rse_gene$Sex <- pd_example$Sex
rse_gene$AgeDeath <- pd_example$AgeDeath

# add correct BrNum to colData for rse_gene
colData(rse_gene)$BrNum <- pd_example$BrNum

save(rse_gene, file = "rse_speaqeasy.RData")
```

# Reproducibility

This analysis report was made possible thanks to:
    
* R `r citep(bib[['R']])`
* `r CRANpkg('devtools')` `r citep(bib[['devtools']])`
* `r CRANpkg('here')` `r citep(bib[['here']])`
* `r Githubpkg('LieberInstitute/jaffelab')` `r citep(bib[['jaffelab']])`
* `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`
* `r CRANpkg('pheatmap')` `r citep(bib[['pheatmap']])`
* `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])`
* `r Biocpkg('SummarizedExperiment')` `r citep(bib[['SummarizedExperiment']])`
* `r CRANpkg('tidyr')` `r citep(bib[['tidyr']])`
* `r Biocpkg('voom')` `r citep(bib[['voom']])`
* `r Biocpkg('VariantAnnotation')` `r citep(bib[['VariantAnnotation']])`

[Bibliography file](swap_analysis.bib)

```{r bibliography, results='asis', echo=FALSE, warning = FALSE}
# Print bibliography
bibliography()
```

```{r reproducibility}
# Time spent creating this report:
diff(c(timestart, Sys.time()))

# Date this report was generated
message(Sys.time())

# Reproducibility info
options(width = 120)
devtools::session_info()
```