Skip to content

Commit

Permalink
Merge branch 'main' into jen-omalley/add-bluesky-footer
Browse files Browse the repository at this point in the history
  • Loading branch information
Jen-OMalley authored Jan 16, 2025
2 parents 9737755 + f7f0046 commit 8364bc7
Show file tree
Hide file tree
Showing 3 changed files with 836 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,175 @@
---
title: "Identify max rank for AUCell"
author: Ally Hawkins
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_depth: 3
---

This notebook is used to find the appropriate `aucMaxRank` to use with `AUCell` for all samples in `SCPCP00015`.

From the [`AUCell` vignette](https://bioconductor.org/packages/release/bioc/vignettes/AUCell/inst/doc/AUCell.html):

> It is important to check that most cells have at least the number of expressed/detected genes that are going to be used to calculate the AUC (aucMaxRank in calcAUC()). The histogram provided by AUCell_buildRankings() allows to quickly check this distribution.
Here we will look at a histogram showing the number of genes detected per cell for each sample and use that information to pick a threshold (percentage of genes) to use across all samples.

## Setup

```{r packages}
suppressPackageStartupMessages({
# load required packages
library(SingleCellExperiment)
library(ggplot2)
})
# Set default ggplot theme
theme_set(
theme_classic()
)
```


```{r base paths}
# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)
# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current", "SCPCP000015")
# paths to all processed sce files
sce_files <- list.files(data_dir, pattern = "*processed.rds", full.names = TRUE, recursive = TRUE)
```

```{r}
# read in SCE files
sce_list <- sce_files |>
purrr::map(readr::read_rds)
```


## Filter `SingleCellExperiment` objects

The default `aucMaxRank` is calculated as `ceiling(0.05*nrow(exprMat))` where the `exprMat` is the counts matrix.
For the processed `SingleCellExperiment` objects in ScPCA we include all genes in the index, regardless of if they are detected in our sample.
This means the rankings might include a bunch of genes that aren't actually present and the `aucMaxRank` would be calculated based on a set of genes that aren't even there, setting the threshold higher than it should be.
Because of this, we probably want to filter the `SingleCellExperiment` objects to remove any genes that are not detected in any cells prior to calculating ranks or running `AUCell`.

```{r}
# get a list of filtered sce
# remove any genes that are not detected in any cells
sce_list <- sce_list |>
purrr::map(\(sce){
genes_to_remove <- rowData(sce)$detected > 0
filtered_sce <- sce[genes_to_remove > 0 , ]
})
```

## Number of genes detected

Below we will look at the distribution of the number of genes detected per cell for each sample.


```{r, fig.height = 10, fig.width=10, warning=FALSE}
# make a list of the `countByCell` as AUCell does
names(sce_list) <- stringr::str_extract(sce_files, "SCPCS\\d+")
count_df_list <- sce_list |>
purrr::map(counts) |>
purrr::map(
\(counts) {
countByCell <- Matrix::colSums(counts > 0, na.rm = TRUE)
data.frame(count_by_cell = countByCell)
})
# first, make the plots
count_df_list |>
purrr::imap(
\(count_df, sample_id) {
count_hist <- ggplot(count_df) +
aes(x = count_by_cell) +
geom_histogram(color = "grey20", fill = "skyblue", bins = 30) +
xlim(0, max(count_df$count_by_cell)) +
xlab("Number of genes detected per cell")
count_box <- ggplot(count_df) +
aes(x = count_by_cell, y = "") +
geom_boxplot(color = "grey20", fill = "skyblue", outlier.size = 0.5) +
theme_void() +
xlim(0, max(count_df$count_by_cell)) +
ggtitle(sample_id)
count_box / count_hist + patchwork::plot_layout(heights = c(0.2, 1))
}
) |>
patchwork::wrap_plots(nrow = 5)
# second, print a table of the stats
count_df_list |>
purrr::map(
\(count_df) {
c(
min=min(count_df$count_by_cell),
quantile(count_df$count_by_cell, c(.01,.05, .10, .50, 1))
)
}) |>
dplyr::bind_rows(.id = "sample_id")
```


Now we will look at the potential thresholds for `aucMaxRank` using 5% and 1% of genes.
The numbers printed here represent the total number of genes that would be used as the `aucMaxRank` for a given dataset using the threshold.
We can then use the histograms to see how many cells have at least that number of genes detected.
Remember, we want to pick a number of genes that are detected in most cells.
If we picked a max rank higher than the number of genes detected in most cells, the non-detected genes that are randomly ordered would play an outsized role in our AUC values.

### 5%

```{r}
# get max rank using 1% threshold
threshold_list <- sce_list |>
purrr::map_dbl(\(sce){ceiling(nrow(counts(sce))*0.05)})
# make it legible
data.frame(
max_rank = threshold_list
)
```

This looks too high based on our histograms.
We would have a lot of cells where the number of genes detected is lower than the max rank, which is not ideal.

### 1%

```{r}
# get max rank using 1% threshold
threshold_list <- sce_list |>
purrr::map_dbl(\(sce){ceiling(nrow(counts(sce))*0.01)})
# make it legible
data.frame(
max_rank = threshold_list
)
```

This looks better to me.
I also think there's enough variation in some of the samples that we don't want to just pick one max rank for all samples, but instead use a percentage of the total genes detected.

## Conclusions

- Prior to running `AUCell` we should remove any genes that are not detected in any cells.
- Using a cutoff of 1% of the total genes detected seems like a reasonable choice for the `aucMaxRank`.
This allows us to account for the variation in genes detected per sample, but we should be sure to be careful when interpreting results as using different cutoffs for each sample could impact the AUC values.

## Session info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```


Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions analyses/cell-type-ewings/exploratory_analysis/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,8 @@ Additionally, we look at expression of marker genes from `references/visser-all-
6. `06-tumor-cell-state-assignments.Rmd`: This notebook looks at assigning tumor cell states in `SCPCL000822` and `SCPCL000824`.
Tumor cells are grouped into `EWS-FLI` high, `EWS-FLI` low, and `EWS-FLI` middle based on expression of marker genes.

7. `07-identify-max-rank-aucell.Rmd`: This notebook identifies an appropriate threshold for `aucMaxRank` to use with `AUCell`.

## Annotation notebooks

The `annotation_notebooks` folder contains exploratory notebooks used to explore and validate annotations in individual samples.

0 comments on commit 8364bc7

Please sign in to comment.