How can I subset the model to a select few cell types/clusters before training the model? #129

Open
yojetsharma opened this issue Aug 17, 2024 · 6 comments

Comments

@yojetsharma

I am using Human_Developing_Brain.pkl as the model to annotate my query dataset. However, I am only interested in a select few cell types/clusters. Is there a function to subset the model to those clusters?
Thank you!

@ChuanXu1
Collaborator

@yojetsharma, please refer to this question #128

@yojetsharma
Author

I did try that:

>>> ref
CellTypist model with 129 cell types and 1000 features
    date: 2022-10-29 21:02:53.713593
    details: cell types from the first-trimester developing human brain
    source: https://doi.org/10.1126/science.adf1226
    version: v1
    cell types: Brain erythrocytes, Brain fibroblasts, ..., Ventral midbrain radial glia
    features: VWA1, HES5, ..., BGN
>>> celltypist.samples.downsample_adata(ref, n_cells=1000, by=(cell_types['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC']), mode='each', return_index=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'cell_types' is not defined
>>> celltypist.samples.downsample_adata(ref, n_cells=1000, by=['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'], mode='each', return_index=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/envs/scarches/lib/python3.9/site-packages/celltypist/samples.py", line 89, in downsample_adata
    celltypes = np.unique(adata.obs[by])
AttributeError: 'Model' object has no attribute 'obs'

@ChuanXu1
Collaborator

@yojetsharma, if you are just selecting a subset of cell types, just use adata = adata[adata.obs.cell_types.isin(['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'])].copy()

@yojetsharma
Author

Right, thanks for this. Does the following error mean the package is installed incorrectly on my end?

>>> ref_adata=ref[ref.cell_types.isin(['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'])].copy()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'isin'
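
The traceback above points at a type mismatch rather than an installation problem: `.isin` is a method of pandas objects such as the columns of `adata.obs`, whereas `ref.cell_types` on the loaded model is a NumPy array, which has no such method. The minimal sketch below uses a made-up three-label array as a stand-in for `ref.cell_types` and shows the function form `np.isin`. It only selects label names; it does not subset a model.

```python
import numpy as np

# Hypothetical stand-in for ref.cell_types (a plain NumPy array of label names).
cell_types = np.array(["Forebrain neuron", "Brain fibroblasts", "Forebrain OPC"])
wanted = ["Forebrain neuron", "Forebrain OPC"]

# cell_types.isin(wanted) would raise AttributeError: ndarray has no .isin method.
# NumPy provides the same membership test as a function instead.
mask = np.isin(cell_types, wanted)

# Boolean indexing keeps only the label names that are in `wanted`.
selected = cell_types[mask]
```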

@yojetsharma
Author

I tried Boolean indexing on the reference model (downloaded from the CellTypist models), since model.cell_types is a plain NumPy array:

selected_cell_types = [
    'Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 
    'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 
    'Telencephalon glioblast', 'Telencephalon neuron', 
    'Telencephalon radial glia', 'Telencephalon neuroblast', 
    'Telencephalon neuronal IPC'
]

# Create a Boolean mask
mask = np.isin(adata.cell_types, selected_cell_types)

# Subset the AnnData object using the mask
ref= adata[mask].copy()

But the above still didn't work, most likely because the model is not an AnnData object. Does this mean I will need to download the data from the source, train a model on it myself, and then use that model in CellTypist?

@ChuanXu1
Collaborator

@yojetsharma, it is not possible to subset a trained model. You need to subset your AnnData object to the cell types of interest and re-train the model on it using celltypist.train
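
A rough sketch of that subset-then-retrain workflow follows. It assumes you have the reference data itself as an AnnData object (not the `.pkl` model), that its labels live in an `obs` column, and that the expression matrix is in the log-normalized form `celltypist.train` expects. The helper names `normalize_selection` and `retrain_on_subset`, the `label_key` argument, and the output path are hypothetical and only for illustration. The list posted earlier in this thread contains 'Forebrain neuroblast' twice, so the first helper deduplicates the selection and reports any label that is missing from the reference.

```python
def normalize_selection(wanted, available):
    """Deduplicate `wanted` (keeping order) and check every label exists."""
    unique = list(dict.fromkeys(wanted))
    known = set(available)
    missing = [name for name in unique if name not in known]
    if missing:
        raise ValueError(f"labels not found in reference: {missing}")
    return unique


def retrain_on_subset(adata, label_key, wanted, out_path=None):
    """Subset an AnnData reference to `wanted` labels and train a new model.

    Assumption: adata.X is log1p-normalized to 10,000 counts per cell.
    """
    import celltypist  # imported lazily so the helper above runs without it

    keep = normalize_selection(wanted, adata.obs[label_key].unique())
    # Same Boolean-mask subsetting suggested earlier in the thread.
    subset = adata[adata.obs[label_key].isin(keep)].copy()
    model = celltypist.train(subset, labels=label_key)
    if out_path is not None:
        model.write(out_path)  # save as .pkl for later annotation runs
    return model


# Example call (hypothetical column name and file name):
# model = retrain_on_subset(adata, "cell_types", selected_cell_types,
#                           out_path="forebrain_subset.pkl")
```

The re-trained model only knows the selected cell types, so every query cell will be assigned to one of them; whether that is appropriate depends on what the query dataset contains.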
