Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify appropriate max rank to use with AUCell #990

Merged

Conversation

allyhawkins
Copy link
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #984

What is the goal of this pull request?

The goal of this notebook is to look at the numbers of genes detected per cell for each sample in SCPCP000015 and use that to define an appropriate aucMaxRank for use with AUCell.

From the AUCell vignette:

It is important to check that most cells have at least the number of expressed/detected genes that are going to be used to calculate the AUC (aucMaxRank in calcAUC()). The histogram provided by AUCell_buildRankings() allows to quickly check this distribution.

Briefly describe the general approach you took to achieve this goal.

This is a pretty small notebook that reads in all the SCE objects for all samples and then looks at the histogram of number of genes detected and different percentage thresholds that could be used to define aucMaxRank.

  • The default value for aucMaxRank is calculated as ceiling(0.05*nrow(exprMat)) where the exprMat is the counts matrix. Because of this, all genes in the matrix are used to calculate the aucMaxRank, including genes that aren't expressed in any cells. With ScPCA data we provide all genes regardless of if they are present in the sample or not. This ultimately could be inflating the aucMaxRank higher than we actually want it to be. So the first step I'm doing here is actually removing any genes that aren't present in the sample at all. I think we generally want to use this with AUCell.
  • I looked at both 1% and 5% for setting aucMaxRank for each sample, and to me 1% looks like a good value to use. Note that the exact aucMaxRank would vary slightly by sample, but I think that's okay since the distributions are different across samples. Some have much lower genes detected so by using a percentage we would account for that. Are we okay with using a percentage rather than a set value for number of genes that would be the same across all samples?

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes, next up is a script to actually run AUCell with our defined gene sets and this set threshold.

Results

What types of results does your code produce (e.g., table, figure)?

Here's the rendered copy of the notebook:
07-identify-max-rank-aucell.html.zip

What is your summary of the results?

As I stated above, I think we want to use a threshold percentage of 1% so that aucMaxRank is ceiling(0.01*nrow(exprMat)) for each sample.

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@allyhawkins allyhawkins requested review from sjspielman and removed request for jaclyn-taroni January 15, 2025 22:53
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your conclusion here seems reasonable enough to me to proceed with. But, I found this report hard to read in general because of how the plots show up. I ended up going down a bit of a rabbit hole through AUCell to see if we could find a way to make those come out a bit nicer, and I found that that the plotting code in the AUCell_buildRankings function is totally separate from the code that builds the rankings, so unless you need the rankings specifically, you don't need this function for plotting. The plots only use SCE counts. Once I realized that, I got excited and made it ggplot :)

Here's what I ended up cobbling together (requires patchwork) as alternative for the plots. It does split up the plots and quantiles, but I think it comes out cleaner. Replace the chunk with AUCell::AUCell_buildRankings to see in action, and let me know what you think of this approach or something that builds off of this!

# make a list of the `countByCell` as AUCell does
names(sce_list) <- stringr::str_extract(sce_files, "SCPCS\\d+")
count_df_list <- sce_list |>
  purrr::map(counts) |>
  purrr::map(
    \(counts) {
      countByCell <- Matrix::colSums(counts > 0, na.rm = TRUE)
      data.frame(count_by_cell = countByCell)
    })

# first, make the plots
count_df_list |>
  purrr::imap(
    \(count_df, sample_id) {
        count_hist <- ggplot(count_df) + 
          aes(x = count_by_cell) + 
          geom_histogram(color = "grey20", fill = "skyblue", bins = 30) + 
          xlim(0, max(count_df$count_by_cell)) +
          xlab("Number of genes detected per cell")
  
        count_box <- ggplot(count_df) + 
          aes(x = count_by_cell, y = "") + 
          geom_boxplot(color = "grey20", fill = "skyblue", outlier.size = 0.5) + 
          theme_void() + 
          xlim(0, max(count_df$count_by_cell)) + 
          ggtitle(sample_id)
      
      count_box / count_hist + plot_layout(heights = c(0.2, 1))
    }
  ) |>
  patchwork::wrap_plots(nrow = 5)


# second, print a table of the stats
count_df_list |>
  purrr::map(
     \(count_df) {
         c(
          min=min(count_df$count_by_cell), 
          quantile(count_df$count_by_cell, c(.01,.05, .10, .50, 1))
        )
     }) |>
  dplyr::bind_rows(.id = "sample_id")

Comment on lines +113 to +114
- Prior to running `AUCell` we should remove any genes that are not detected in any cells.
- Using a cutoff of 1% of the total genes detected seems like a reasonable choice for the `aucMaxRank`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems totally reasonable to me to use a percentage to account for the different number of expressed genes across samples. These calculations are pretty quick but I guess we could export a TSV of the thresholds and read them in for use; I don't think that's necessary though unless we want a separate record of the thresholds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that we make the percentage an argument to a script that runs AUCell and then calculate it for each sample in the script using ceiling(nrow(counts)*percentage)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's what I was assuming for next steps here. I was kind of thinking more like, "what if we need this for supplementary info" or so, but I think that's really unlikely. It's also easy to get later!

@allyhawkins
Copy link
Member Author

But, I found this report hard to read in general because of how the plots show up

I agree that I don't love the automatic plots from AUCell, but I didn't think it was worth changing too much and trying to re-create. But thank you for doing that! It does look better!

I also made some other small adjustments to generally make the report more visually appealing, but conclusions have not changed.
New report:
07-identify-max-rank-aucell.html.zip

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🐈‍⬛

Comment on lines 33 to 34
# set seed
set.seed(2025)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this in this notebook, but great work on the date!!! 🎉

Suggested change
# set seed
set.seed(2025)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed it when using the build rankings function and I was so proud of myself for remembering the year!

@allyhawkins
Copy link
Member Author

Noting that I am canceling the workflow since this notebook is not run in CI.

@allyhawkins allyhawkins merged commit f7f0046 into AlexsLemonade:main Jan 16, 2025
1 of 3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/aucell-max-rank branch January 16, 2025 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Identify appropriate max rank to use for AUCell with Ewing samples
2 participants