-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Identify appropriate max rank to use with AUCell #990
Identify appropriate max rank to use with AUCell #990
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your conclusion here seems reasonable enough to me to proceed with. But, I found this report hard to read in general because of how the plots show up. I ended up going down a bit of a rabbit hole through AUCell to see if we could find a way to make those come out a bit nicer, and I found that that the plotting code in the AUCell_buildRankings
function is totally separate from the code that builds the rankings, so unless you need the rankings specifically, you don't need this function for plotting. The plots only use SCE counts. Once I realized that, I got excited and made it ggplot :)
Here's what I ended up cobbling together (requires patchwork
) as alternative for the plots. It does split up the plots and quantiles, but I think it comes out cleaner. Replace the chunk with AUCell::AUCell_buildRankings
to see in action, and let me know what you think of this approach or something that builds off of this!
# make a list of the `countByCell` as AUCell does
names(sce_list) <- stringr::str_extract(sce_files, "SCPCS\\d+")
count_df_list <- sce_list |>
purrr::map(counts) |>
purrr::map(
\(counts) {
countByCell <- Matrix::colSums(counts > 0, na.rm = TRUE)
data.frame(count_by_cell = countByCell)
})
# first, make the plots
count_df_list |>
purrr::imap(
\(count_df, sample_id) {
count_hist <- ggplot(count_df) +
aes(x = count_by_cell) +
geom_histogram(color = "grey20", fill = "skyblue", bins = 30) +
xlim(0, max(count_df$count_by_cell)) +
xlab("Number of genes detected per cell")
count_box <- ggplot(count_df) +
aes(x = count_by_cell, y = "") +
geom_boxplot(color = "grey20", fill = "skyblue", outlier.size = 0.5) +
theme_void() +
xlim(0, max(count_df$count_by_cell)) +
ggtitle(sample_id)
count_box / count_hist + plot_layout(heights = c(0.2, 1))
}
) |>
patchwork::wrap_plots(nrow = 5)
# second, print a table of the stats
count_df_list |>
purrr::map(
\(count_df) {
c(
min=min(count_df$count_by_cell),
quantile(count_df$count_by_cell, c(.01,.05, .10, .50, 1))
)
}) |>
dplyr::bind_rows(.id = "sample_id")
analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd
Outdated
Show resolved
Hide resolved
- Prior to running `AUCell` we should remove any genes that are not detected in any cells. | ||
- Using a cutoff of 1% of the total genes detected seems like a reasonable choice for the `aucMaxRank`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems totally reasonable to me to use a percentage to account for the different number of expressed genes across samples. These calculations are pretty quick but I guess we could export a TSV of the thresholds and read them in for use; I don't think that's necessary though unless we want a separate record of the thresholds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that we make the percentage an argument to a script that runs AUCell and then calculate it for each sample in the script using ceiling(nrow(counts)*percentage)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's what I was assuming for next steps here. I was kind of thinking more like, "what if we need this for supplementary info" or so, but I think that's really unlikely. It's also easy to get later!
…-rank-aucell.Rmd Co-authored-by: Stephanie Spielman <[email protected]>
I agree that I don't love the automatic plots from AUCell, but I didn't think it was worth changing too much and trying to re-create. But thank you for doing that! It does look better! I also made some other small adjustments to generally make the report more visually appealing, but conclusions have not changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🐈⬛
# set seed | ||
set.seed(2025) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this in this notebook, but great work on the date!!! 🎉
# set seed | |
set.seed(2025) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed it when using the build rankings function and I was so proud of myself for remembering the year!
Noting that I am canceling the workflow since this notebook is not run in CI. |
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
Closes #984
What is the goal of this pull request?
The goal of this notebook is to look at the numbers of genes detected per cell for each sample in
SCPCP000015
and use that to define an appropriateaucMaxRank
for use withAUCell
.From the
AUCell
vignette:Briefly describe the general approach you took to achieve this goal.
This is a pretty small notebook that reads in all the SCE objects for all samples and then looks at the histogram of number of genes detected and different percentage thresholds that could be used to define
aucMaxRank
.aucMaxRank
is calculated asceiling(0.05*nrow(exprMat))
where theexprMat
is the counts matrix. Because of this, all genes in the matrix are used to calculate theaucMaxRank
, including genes that aren't expressed in any cells. With ScPCA data we provide all genes regardless of if they are present in the sample or not. This ultimately could be inflating theaucMaxRank
higher than we actually want it to be. So the first step I'm doing here is actually removing any genes that aren't present in the sample at all. I think we generally want to use this withAUCell
.aucMaxRank
for each sample, and to me 1% looks like a good value to use. Note that the exactaucMaxRank
would vary slightly by sample, but I think that's okay since the distributions are different across samples. Some have much lower genes detected so by using a percentage we would account for that. Are we okay with using a percentage rather than a set value for number of genes that would be the same across all samples?If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes, next up is a script to actually run
AUCell
with our defined gene sets and this set threshold.Results
What types of results does your code produce (e.g., table, figure)?
Here's the rendered copy of the notebook:
07-identify-max-rank-aucell.html.zip
What is your summary of the results?
As I stated above, I think we want to use a threshold percentage of 1% so that
aucMaxRank
isceiling(0.01*nrow(exprMat))
for each sample.Author checklists
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.