-
Notifications
You must be signed in to change notification settings - Fork 18
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into jen-omalley/add-bluesky-footer
- Loading branch information
Showing
3 changed files
with
836 additions
and
0 deletions.
There are no files selected for viewing
175 changes: 175 additions & 0 deletions
175
analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,175 @@ | ||
--- | ||
title: "Identify max rank for AUCell" | ||
author: Ally Hawkins | ||
date: "`r Sys.Date()`" | ||
output: | ||
html_document: | ||
toc: true | ||
toc_depth: 3 | ||
--- | ||
|
||
This notebook is used to find the appropriate `aucMaxRank` to use with `AUCell` for all samples in `SCPCP00015`. | ||
|
||
From the [`AUCell` vignette](https://bioconductor.org/packages/release/bioc/vignettes/AUCell/inst/doc/AUCell.html): | ||
|
||
> It is important to check that most cells have at least the number of expressed/detected genes that are going to be used to calculate the AUC (aucMaxRank in calcAUC()). The histogram provided by AUCell_buildRankings() allows to quickly check this distribution. | ||
Here we will look at a histogram showing the number of genes detected per cell for each sample and use that information to pick a threshold (percentage of genes) to use across all samples. | ||
|
||
## Setup | ||
|
||
```{r packages} | ||
suppressPackageStartupMessages({ | ||
# load required packages | ||
library(SingleCellExperiment) | ||
library(ggplot2) | ||
}) | ||
# Set default ggplot theme | ||
theme_set( | ||
theme_classic() | ||
) | ||
``` | ||
|
||
|
||
```{r base paths} | ||
# The base path for the OpenScPCA repository, found by its (hidden) .git directory | ||
repository_base <- rprojroot::find_root(rprojroot::is_git_root) | ||
# The current data directory, found within the repository base directory | ||
data_dir <- file.path(repository_base, "data", "current", "SCPCP000015") | ||
# paths to all processed sce files | ||
sce_files <- list.files(data_dir, pattern = "*processed.rds", full.names = TRUE, recursive = TRUE) | ||
``` | ||
|
||
```{r} | ||
# read in SCE files | ||
sce_list <- sce_files |> | ||
purrr::map(readr::read_rds) | ||
``` | ||
|
||
|
||
## Filter `SingleCellExperiment` objects | ||
|
||
The default `aucMaxRank` is calculated as `ceiling(0.05*nrow(exprMat))` where the `exprMat` is the counts matrix. | ||
For the processed `SingleCellExperiment` objects in ScPCA we include all genes in the index, regardless of if they are detected in our sample. | ||
This means the rankings might include a bunch of genes that aren't actually present and the `aucMaxRank` would be calculated based on a set of genes that aren't even there, setting the threshold higher than it should be. | ||
Because of this, we probably want to filter the `SingleCellExperiment` objects to remove any genes that are not detected in any cells prior to calculating ranks or running `AUCell`. | ||
|
||
```{r} | ||
# get a list of filtered sce | ||
# remove any genes that are not detected in any cells | ||
sce_list <- sce_list |> | ||
purrr::map(\(sce){ | ||
genes_to_remove <- rowData(sce)$detected > 0 | ||
filtered_sce <- sce[genes_to_remove > 0 , ] | ||
}) | ||
``` | ||
|
||
## Number of genes detected | ||
|
||
Below we will look at the distribution of the number of genes detected per cell for each sample. | ||
|
||
|
||
```{r, fig.height = 10, fig.width=10, warning=FALSE} | ||
# make a list of the `countByCell` as AUCell does | ||
names(sce_list) <- stringr::str_extract(sce_files, "SCPCS\\d+") | ||
count_df_list <- sce_list |> | ||
purrr::map(counts) |> | ||
purrr::map( | ||
\(counts) { | ||
countByCell <- Matrix::colSums(counts > 0, na.rm = TRUE) | ||
data.frame(count_by_cell = countByCell) | ||
}) | ||
# first, make the plots | ||
count_df_list |> | ||
purrr::imap( | ||
\(count_df, sample_id) { | ||
count_hist <- ggplot(count_df) + | ||
aes(x = count_by_cell) + | ||
geom_histogram(color = "grey20", fill = "skyblue", bins = 30) + | ||
xlim(0, max(count_df$count_by_cell)) + | ||
xlab("Number of genes detected per cell") | ||
count_box <- ggplot(count_df) + | ||
aes(x = count_by_cell, y = "") + | ||
geom_boxplot(color = "grey20", fill = "skyblue", outlier.size = 0.5) + | ||
theme_void() + | ||
xlim(0, max(count_df$count_by_cell)) + | ||
ggtitle(sample_id) | ||
count_box / count_hist + patchwork::plot_layout(heights = c(0.2, 1)) | ||
} | ||
) |> | ||
patchwork::wrap_plots(nrow = 5) | ||
# second, print a table of the stats | ||
count_df_list |> | ||
purrr::map( | ||
\(count_df) { | ||
c( | ||
min=min(count_df$count_by_cell), | ||
quantile(count_df$count_by_cell, c(.01,.05, .10, .50, 1)) | ||
) | ||
}) |> | ||
dplyr::bind_rows(.id = "sample_id") | ||
``` | ||
|
||
|
||
Now we will look at the potential thresholds for `aucMaxRank` using 5% and 1% of genes. | ||
The numbers printed here represent the total number of genes that would be used as the `aucMaxRank` for a given dataset using the threshold. | ||
We can then use the histograms to see how many cells have at least that number of genes detected. | ||
Remember, we want to pick a number of genes that are detected in most cells. | ||
If we picked a max rank higher than the number of genes detected in most cells, the non-detected genes that are randomly ordered would play an outsized role in our AUC values. | ||
|
||
### 5% | ||
|
||
```{r} | ||
# get max rank using 1% threshold | ||
threshold_list <- sce_list |> | ||
purrr::map_dbl(\(sce){ceiling(nrow(counts(sce))*0.05)}) | ||
# make it legible | ||
data.frame( | ||
max_rank = threshold_list | ||
) | ||
``` | ||
|
||
This looks too high based on our histograms. | ||
We would have a lot of cells where the number of genes detected is lower than the max rank, which is not ideal. | ||
|
||
### 1% | ||
|
||
```{r} | ||
# get max rank using 1% threshold | ||
threshold_list <- sce_list |> | ||
purrr::map_dbl(\(sce){ceiling(nrow(counts(sce))*0.01)}) | ||
# make it legible | ||
data.frame( | ||
max_rank = threshold_list | ||
) | ||
``` | ||
|
||
This looks better to me. | ||
I also think there's enough variation in some of the samples that we don't want to just pick one max rank for all samples, but instead use a percentage of the total genes detected. | ||
|
||
## Conclusions | ||
|
||
- Prior to running `AUCell` we should remove any genes that are not detected in any cells. | ||
- Using a cutoff of 1% of the total genes detected seems like a reasonable choice for the `aucMaxRank`. | ||
This allows us to account for the variation in genes detected per sample, but we should be sure to be careful when interpreting results as using different cutoffs for each sample could impact the AUC values. | ||
|
||
## Session info | ||
|
||
```{r session info} | ||
# record the versions of the packages used in this analysis and other environment information | ||
sessionInfo() | ||
``` | ||
|
||
|
659 changes: 659 additions & 0 deletions
659
analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.html
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters