Identify appropriate max rank to use with AUCell #990

allyhawkins · 2025-01-15T22:52:49Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #984

What is the goal of this pull request?

The goal of this notebook is to look at the numbers of genes detected per cell for each sample in SCPCP000015 and use that to define an appropriate aucMaxRank for use with AUCell.

From the AUCell vignette:

It is important to check that most cells have at least the number of expressed/detected genes that are going to be used to calculate the AUC (aucMaxRank in calcAUC()). The histogram provided by AUCell_buildRankings() allows to quickly check this distribution.

Briefly describe the general approach you took to achieve this goal.

This is a pretty small notebook that reads in all the SCE objects for all samples and then looks at the histogram of number of genes detected and different percentage thresholds that could be used to define aucMaxRank.

The default value for aucMaxRank is calculated as ceiling(0.05*nrow(exprMat)) where the exprMat is the counts matrix. Because of this, all genes in the matrix are used to calculate the aucMaxRank, including genes that aren't expressed in any cells. With ScPCA data we provide all genes regardless of if they are present in the sample or not. This ultimately could be inflating the aucMaxRank higher than we actually want it to be. So the first step I'm doing here is actually removing any genes that aren't present in the sample at all. I think we generally want to use this with AUCell.
I looked at both 1% and 5% for setting aucMaxRank for each sample, and to me 1% looks like a good value to use. Note that the exact aucMaxRank would vary slightly by sample, but I think that's okay since the distributions are different across samples. Some have much lower genes detected so by using a percentage we would account for that. Are we okay with using a percentage rather than a set value for number of genes that would be the same across all samples?

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes, next up is a script to actually run AUCell with our defined gene sets and this set threshold.

Results

What types of results does your code produce (e.g., table, figure)?

Here's the rendered copy of the notebook:
07-identify-max-rank-aucell.html.zip

What is your summary of the results?

As I stated above, I think we want to use a threshold percentage of 1% so that aucMaxRank is ceiling(0.01*nrow(exprMat)) for each sample.

Author checklists

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

sjspielman

Your conclusion here seems reasonable enough to me to proceed with. But, I found this report hard to read in general because of how the plots show up. I ended up going down a bit of a rabbit hole through AUCell to see if we could find a way to make those come out a bit nicer, and I found that that the plotting code in the AUCell_buildRankings function is totally separate from the code that builds the rankings, so unless you need the rankings specifically, you don't need this function for plotting. The plots only use SCE counts. Once I realized that, I got excited and made it ggplot :)

Here's what I ended up cobbling together (requires patchwork) as alternative for the plots. It does split up the plots and quantiles, but I think it comes out cleaner. Replace the chunk with AUCell::AUCell_buildRankings to see in action, and let me know what you think of this approach or something that builds off of this!

# make a list of the `countByCell` as AUCell does
names(sce_list) <- stringr::str_extract(sce_files, "SCPCS\\d+")
count_df_list <- sce_list |>
  purrr::map(counts) |>
  purrr::map(
    \(counts) {
      countByCell <- Matrix::colSums(counts > 0, na.rm = TRUE)
      data.frame(count_by_cell = countByCell)
    })

# first, make the plots
count_df_list |>
  purrr::imap(
    \(count_df, sample_id) {
        count_hist <- ggplot(count_df) + 
          aes(x = count_by_cell) + 
          geom_histogram(color = "grey20", fill = "skyblue", bins = 30) + 
          xlim(0, max(count_df$count_by_cell)) +
          xlab("Number of genes detected per cell")
  
        count_box <- ggplot(count_df) + 
          aes(x = count_by_cell, y = "") + 
          geom_boxplot(color = "grey20", fill = "skyblue", outlier.size = 0.5) + 
          theme_void() + 
          xlim(0, max(count_df$count_by_cell)) + 
          ggtitle(sample_id)
      
      count_box / count_hist + plot_layout(heights = c(0.2, 1))
    }
  ) |>
  patchwork::wrap_plots(nrow = 5)


# second, print a table of the stats
count_df_list |>
  purrr::map(
     \(count_df) {
         c(
          min=min(count_df$count_by_cell), 
          quantile(count_df$count_by_cell, c(.01,.05, .10, .50, 1))
        )
     }) |>
  dplyr::bind_rows(.id = "sample_id")

analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd

sjspielman · 2025-01-16T13:12:14Z

analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd

+- Prior to running `AUCell` we should remove any genes that are not detected in any cells.
+- Using a cutoff of 1% of the total genes detected seems like a reasonable choice for the `aucMaxRank`. 


This seems totally reasonable to me to use a percentage to account for the different number of expressed genes across samples. These calculations are pretty quick but I guess we could export a TSV of the thresholds and read them in for use; I don't think that's necessary though unless we want a separate record of the thresholds.

I was thinking that we make the percentage an argument to a script that runs AUCell and then calculate it for each sample in the script using ceiling(nrow(counts)*percentage)

Yeah, that's what I was assuming for next steps here. I was kind of thinking more like, "what if we need this for supplementary info" or so, but I think that's really unlikely. It's also easy to get later!

…-rank-aucell.Rmd Co-authored-by: Stephanie Spielman <[email protected]>

allyhawkins · 2025-01-16T17:49:08Z

But, I found this report hard to read in general because of how the plots show up

I agree that I don't love the automatic plots from AUCell, but I didn't think it was worth changing too much and trying to re-create. But thank you for doing that! It does look better!

I also made some other small adjustments to generally make the report more visually appealing, but conclusions have not changed.
New report:
07-identify-max-rank-aucell.html.zip

sjspielman

LGTM! 🐈‍⬛

sjspielman · 2025-01-16T18:07:34Z

analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd

+# set seed
+set.seed(2025)


You don't need this in this notebook, but great work on the date!!! 🎉

Suggested change

# set seed

set.seed(2025)

I needed it when using the build rankings function and I was so proud of myself for remembering the year!

allyhawkins · 2025-01-16T19:15:22Z

Noting that I am canceling the workflow since this notebook is not run in CI.

allyhawkins added 3 commits January 15, 2025 16:40

notebook for identifying appropriate max rank

26757ba

update readme

f10d18c

add missing link

24ad79b

allyhawkins requested a review from jaclyn-taroni as a code owner January 15, 2025 22:52

allyhawkins requested review from sjspielman and removed request for jaclyn-taroni January 15, 2025 22:53

sjspielman reviewed Jan 16, 2025

View reviewed changes

allyhawkins and others added 2 commits January 16, 2025 11:32

Update analyses/cell-type-ewings/exploratory_analysis/07-identify-max…

8fe46b8

…-rank-aucell.Rmd Co-authored-by: Stephanie Spielman <[email protected]>

make the report pretty

ed9c24d

allyhawkins requested a review from sjspielman January 16, 2025 17:49

sjspielman approved these changes Jan 16, 2025

View reviewed changes

remove seed and add comment about using percentage

7e0219c

allyhawkins merged commit f7f0046 into AlexsLemonade:main Jan 16, 2025
1 of 3 checks passed

allyhawkins deleted the allyhawkins/aucell-max-rank branch January 16, 2025 19:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Identify appropriate max rank to use with AUCell #990

Identify appropriate max rank to use with AUCell #990

allyhawkins commented Jan 15, 2025

sjspielman left a comment

sjspielman Jan 16, 2025

allyhawkins Jan 16, 2025

sjspielman Jan 16, 2025

allyhawkins commented Jan 16, 2025

sjspielman left a comment

sjspielman Jan 16, 2025

allyhawkins Jan 16, 2025

allyhawkins commented Jan 16, 2025

		- Prior to running `AUCell` we should remove any genes that are not detected in any cells.
		- Using a cutoff of 1% of the total genes detected seems like a reasonable choice for the `aucMaxRank`.

Identify appropriate max rank to use with AUCell #990

Identify appropriate max rank to use with AUCell #990

Conversation

allyhawkins commented Jan 15, 2025

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

Author checklists

Analysis module and review

Reproducibility checklist

sjspielman left a comment

Choose a reason for hiding this comment

sjspielman Jan 16, 2025

Choose a reason for hiding this comment

allyhawkins Jan 16, 2025

Choose a reason for hiding this comment

sjspielman Jan 16, 2025

Choose a reason for hiding this comment

allyhawkins commented Jan 16, 2025

sjspielman left a comment

Choose a reason for hiding this comment

sjspielman Jan 16, 2025

Choose a reason for hiding this comment

allyhawkins Jan 16, 2025

Choose a reason for hiding this comment

allyhawkins commented Jan 16, 2025