Merge branch 'main' into jen-omalley/add-bluesky-footer

AlexsLemonade · Jan 16, 2025 · 8364bc7 · 8364bc7
2 parents 9737755 + f7f0046
commit 8364bc7
Show file tree

Hide file tree

Showing 3 changed files with 836 additions and 0 deletions.
diff --git a/analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd b/analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd
@@ -0,0 +1,175 @@
+---
+title: "Identify max rank for AUCell"
+author: Ally Hawkins
+date: "`r Sys.Date()`"
+output:
+  html_document:
+    toc: true
+    toc_depth: 3
+---
+
+This notebook is used to find the appropriate `aucMaxRank` to use with `AUCell` for all samples in `SCPCP00015`. 
+
+From the  [`AUCell` vignette](https://bioconductor.org/packages/release/bioc/vignettes/AUCell/inst/doc/AUCell.html):
+
+> It is important to check that most cells have at least the number of expressed/detected genes that are going to be used to calculate the AUC (aucMaxRank in calcAUC()). The histogram provided by AUCell_buildRankings() allows to quickly check this distribution. 
+
+Here we will look at a histogram showing the number of genes detected per cell for each sample and use that information to pick a threshold (percentage of genes) to use across all samples. 
+
+## Setup
+
+```{r packages}
+suppressPackageStartupMessages({
+  # load required packages
+  library(SingleCellExperiment)
+  library(ggplot2)
+})
+
+# Set default ggplot theme
+theme_set(
+  theme_classic()
+)
+```
+
+
+```{r base paths}
+# The base path for the OpenScPCA repository, found by its (hidden) .git directory
+repository_base <- rprojroot::find_root(rprojroot::is_git_root)
+
+# The current data directory, found within the repository base directory
+data_dir <- file.path(repository_base, "data", "current", "SCPCP000015")
+
+# paths to all processed sce files
+sce_files <- list.files(data_dir, pattern = "*processed.rds", full.names = TRUE, recursive = TRUE)
+```
+
+```{r}
+# read in SCE files 
+sce_list <- sce_files |> 
+  purrr::map(readr::read_rds)
+```
+
+
+## Filter `SingleCellExperiment` objects
+
+The default `aucMaxRank` is calculated as `ceiling(0.05*nrow(exprMat))` where the `exprMat` is the counts matrix. 
+For the processed `SingleCellExperiment` objects in ScPCA we include all genes in the index, regardless of if they are detected in our sample. 
+This means the rankings might include a bunch of genes that aren't actually present and the `aucMaxRank` would be calculated based on a set of genes that aren't even there, setting the threshold higher than it should be. 
+Because of this, we probably want to filter the `SingleCellExperiment` objects to remove any genes that are not detected in any cells prior to calculating ranks or running `AUCell`. 
+
+```{r}
+# get a list of filtered sce
+# remove any genes that are not detected in any cells 
+sce_list <- sce_list |> 
+  purrr::map(\(sce){
+    
+    genes_to_remove <- rowData(sce)$detected > 0 
+    filtered_sce <- sce[genes_to_remove > 0 , ]
+    
+  })
+```
+
+## Number of genes detected
+
+Below we will look at the distribution of the number of genes detected per cell for each sample.  
+
+
+```{r, fig.height = 10, fig.width=10, warning=FALSE}
+# make a list of the `countByCell` as AUCell does
+names(sce_list) <- stringr::str_extract(sce_files, "SCPCS\\d+")
+count_df_list <- sce_list |>
+  purrr::map(counts) |>
+  purrr::map(
+    \(counts) {
+      countByCell <- Matrix::colSums(counts > 0, na.rm = TRUE)
+      data.frame(count_by_cell = countByCell)
+    })
+
+# first, make the plots
+count_df_list |>
+  purrr::imap(
+    \(count_df, sample_id) {
+        count_hist <- ggplot(count_df) + 
+          aes(x = count_by_cell) + 
+          geom_histogram(color = "grey20", fill = "skyblue", bins = 30) + 
+          xlim(0, max(count_df$count_by_cell)) +
+          xlab("Number of genes detected per cell")
+  
+        count_box <- ggplot(count_df) + 
+          aes(x = count_by_cell, y = "") + 
+          geom_boxplot(color = "grey20", fill = "skyblue", outlier.size = 0.5) + 
+          theme_void() + 
+          xlim(0, max(count_df$count_by_cell)) + 
+          ggtitle(sample_id)
+      
+      count_box / count_hist + patchwork::plot_layout(heights = c(0.2, 1))
+    }
+  ) |>
+  patchwork::wrap_plots(nrow = 5)
+
+
+# second, print a table of the stats
+count_df_list |>
+  purrr::map(
+     \(count_df) {
+         c(
+          min=min(count_df$count_by_cell), 
+          quantile(count_df$count_by_cell, c(.01,.05, .10, .50, 1))
+        )
+     }) |>
+  dplyr::bind_rows(.id = "sample_id")
+```
+
+
+Now we will look at the potential thresholds for `aucMaxRank` using 5% and 1% of genes. 
+The numbers printed here represent the total number of genes that would be used as the `aucMaxRank` for a given dataset using the threshold. 
+We can then use the histograms to see how many cells have at least that number of genes detected. 
+Remember, we want to pick a number of genes that are detected in most cells.
+If we picked a max rank higher than the number of genes detected in most cells, the non-detected genes that are randomly ordered would play an outsized role in our AUC values.
+
+### 5%  
+
+```{r}
+# get max rank using 1% threshold
+threshold_list <- sce_list |> 
+  purrr::map_dbl(\(sce){ceiling(nrow(counts(sce))*0.05)}) 
+
+# make it legible
+data.frame(
+  max_rank = threshold_list
+)
+```
+
+This looks too high based on our histograms. 
+We would have a lot of cells where the number of genes detected is lower than the max rank, which is not ideal. 
+
+### 1% 
+
+```{r}
+# get max rank using 1% threshold
+threshold_list <- sce_list |> 
+  purrr::map_dbl(\(sce){ceiling(nrow(counts(sce))*0.01)}) 
+
+# make it legible
+data.frame(
+  max_rank = threshold_list
+)
+```
+
+This looks better to me. 
+I also think there's enough variation in some of the samples that we don't want to just pick one max rank for all samples, but instead use a percentage of the total genes detected. 
+
+## Conclusions
+
+- Prior to running `AUCell` we should remove any genes that are not detected in any cells.
+- Using a cutoff of 1% of the total genes detected seems like a reasonable choice for the `aucMaxRank`. 
+This allows us to account for the variation in genes detected per sample, but we should be sure to be careful when interpreting results as using different cutoffs for each sample could impact the AUC values. 
+
+## Session info 
+
+```{r session info}
+# record the versions of the packages used in this analysis and other environment information
+sessionInfo()
+```
+
+
diff --git a/analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.html b/analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.html
diff --git a/analyses/cell-type-ewings/exploratory_analysis/README.md b/analyses/cell-type-ewings/exploratory_analysis/README.md
@@ -23,6 +23,8 @@ Additionally, we look at expression of marker genes from `references/visser-all-
 6. `06-tumor-cell-state-assignments.Rmd`: This notebook looks at assigning tumor cell states in `SCPCL000822` and `SCPCL000824`. 
 Tumor cells are grouped into `EWS-FLI` high, `EWS-FLI` low, and `EWS-FLI` middle based on expression of marker genes. 
 
+7. `07-identify-max-rank-aucell.Rmd`: This notebook identifies an appropriate threshold for `aucMaxRank` to use with `AUCell`. 
+
 ## Annotation notebooks
 
 The `annotation_notebooks` folder contains exploratory notebooks used to explore and validate annotations in individual samples.