Add Normalized Mutual Information metric (#332)
* add SIMLR dimensionality reduction method
* add description and reference
* add SIMLR reference
* change default n_dim and write output to file
* Add SIMLR entry
* Update documentation URL
  Co-authored-by: Kai Waldrant <[email protected]>
* Reformat code
* Use explicit namespaces
* Add new metric normalized_mutual_information
* Add reference for adjusted rand index
* change metric from normalized_mutual_information to clustering_performance
* add .obs["cell type"] to slots
* perform leiden clustering on embedding and compute NMI and ARI scores
* fix typo
* Compute neighbors if not already stored in input object
  Co-authored-by: Robrecht Cannoodt <[email protected]>
* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py
  Use key_max to store best clustering
  Co-authored-by: Robrecht Cannoodt <[email protected]>
* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py
  Co-authored-by: Robrecht Cannoodt <[email protected]>
* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py
  Co-authored-by: Robrecht Cannoodt <[email protected]>
* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py
  Co-authored-by: Robrecht Cannoodt <[email protected]>
* Make sure that the key is unique
* add slot to common dataset
* add key for cluster labels

---------

Co-authored-by: Kai Waldrant <[email protected]>
Co-authored-by: Robrecht Cannoodt <[email protected]>
1 parent a87afdc, commit d9e4454
Showing 6 changed files with 168 additions and 0 deletions.
61 changes: 61 additions & 0 deletions
src/tasks/dimensionality_reduction/metrics/clustering_performance/config.vsh.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
__merge__: ../../api/comp_metric.yaml | ||
|
||
functionality: | ||
  name: clustering_performance | ||
  info: | ||
    metrics: | ||
      - name: normalized_mutual_information | ||
        label: NMI | ||
        summary: Normalized Mutual Information (NMI) measures the concordance between the clustering obtained from the reduced-dimensional embeddings and the cell labels. | ||
        description: | | ||
          The Normalized Mutual Information (NMI) measures the similarity between the cluster labels obtained by clustering the dimensionality reduction embedding and the true cell labels. It normalizes the Mutual Information (MI) score so that results lie between 0 (no mutual information) and 1 (perfect correlation). | ||
          Mutual Information quantifies the "amount of information" obtained about one random variable by observing the other random variable. Assuming two label assignments X and Y, it is given by: | ||
          $MI(X,Y) = \sum_{x \in X}\sum_{y \in Y}P(x,y)\log\left(\frac{P(x,y)}{P(x)P'(y)}\right)$, | ||
          where P(x,y) is the joint probability mass function of X and Y, and P(x), P'(y) are the marginal probability mass functions of X and Y respectively. The mutual information is normalized by a generalized mean of the entropies H(X) and H(Y), selected by the nmi_avg_method argument. Normalized Mutual Information is therefore defined as: | ||
          $NMI(X,Y) = \frac{MI(X,Y)}{mean(H(X),H(Y))}$, | ||
          where H(X) and H(Y) are the entropies of X and Y respectively. A higher NMI score suggests that the method is more effective at preserving relevant information. | ||
        reference: emmons2016analysis | ||
        documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html | ||
        repository_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html | ||
        min: 0 | ||
        max: 1 | ||
        maximize: true | ||
      - name: adjusted_rand_index | ||
        label: ARI | ||
        summary: Adjusted Rand Index (ARI) measures the similarity between the clustering of the reduced-dimensional embeddings and the true cell types. | ||
        description: | | ||
          Adjusted Rand Index (ARI) measures the similarity between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in both the predicted clustering (from the reduced-dimensional embedding) and the true clustering (cell type labels). It is the Rand Index (RI) adjusted for chance. | ||
          Taking C as the cell type labels and K as the clustering of the reduced-dimensional embedding, the Rand Index is defined as: | ||
          $RI = \frac{a + b}{\binom{n_{samples}}{2}}$, | ||
          where 'a' is the number of pairs of elements that are in the same set in C and in the same set in K, 'b' is the number of pairs of elements that are in different sets in C and in different sets in K, and $\binom{n_{samples}}{2}$ is the total number of possible pairs in the dataset. Random label assignments can be discounted as follows: | ||
          $ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$, | ||
          where E[RI] is the expected RI of random labellings. | ||
        reference: santos2009on | ||
        documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score | ||
        repository_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score | ||
        min: 0 | ||
        max: 1 | ||
        maximize: true | ||
|
||
  # Component-specific parameters | ||
  arguments: | ||
    - name: "--nmi_avg_method" | ||
      type: string | ||
      default: arithmetic | ||
      description: Method to compute normalizer in the denominator for normalized mutual information score calculation. | ||
      choices: [ min, geometric, arithmetic, max ] | ||
|
||
  resources: | ||
    - type: python_script | ||
      path: script.py | ||
|
||
platforms: | ||
  - type: docker | ||
    image: ghcr.io/openproblems-bio/base_python:1.0.2 | ||
    setup: | ||
      - type: python | ||
        packages: [ scikit-learn, scanpy, leidenalg ] | ||
  - type: native | ||
  - type: nextflow | ||
    directives: | ||
      label: [ midtime, midmem, midcpu ] |
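
The two formulas in the metric descriptions above can be checked by hand on small label vectors. The following is a minimal, standard-library-only sketch and not the code this commit runs (the component calls scikit-learn). The helper names `entropy`, `mutual_information`, `nmi`, `ari` and `NORMALIZERS` are hypothetical, the ARI uses the equivalent pair-counting form of the chance correction, and returning 0.0 or 1.0 in the degenerate cases is an assumption of this sketch that a library may handle differently.

```python
from collections import Counter
from math import comb, log, sqrt


def entropy(labels):
    """Shannon entropy H (natural log) of one label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """MI(X,Y) = sum over (x,y) of P(x,y) * log(P(x,y) / (P(x) * P'(y)))."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


# Generalized means mirroring the choices of --nmi_avg_method.
NORMALIZERS = {
    "min": min,
    "geometric": lambda hx, hy: sqrt(hx * hy),
    "arithmetic": lambda hx, hy: (hx + hy) / 2,
    "max": max,
}


def nmi(x, y, avg_method="arithmetic"):
    """NMI(X,Y) = MI(X,Y) / mean(H(X), H(Y))."""
    denom = NORMALIZERS[avg_method](entropy(x), entropy(y))
    # Assumption of this sketch: report 0.0 when the normalizer is zero.
    return mutual_information(x, y) / denom if denom > 0 else 0.0


def ari(c, k):
    """Chance-corrected pair agreement between labelings c and k."""
    n = len(c)
    # Pairs placed together in both labelings ('a' in the description).
    together_both = sum(comb(v, 2) for v in Counter(zip(c, k)).values())
    together_c = sum(comb(v, 2) for v in Counter(c).values())
    together_k = sum(comb(v, 2) for v in Counter(k).values())
    expected = together_c * together_k / comb(n, 2)
    maximum = (together_c + together_k) / 2
    if maximum == expected:
        return 1.0  # assumption: degenerate labelings count as identical
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
print(nmi(truth, [1, 1, 0, 0]), ari(truth, [1, 1, 0, 0]))  # same partition, renamed
print(nmi(truth, [0, 1, 0, 1]), ari(truth, [0, 1, 0, 1]))  # crossed partition
```

Both scores depend only on the partition, so renaming the clusters leaves them unchanged. On the crossed partition the ARI falls below zero, which is what "adjusted for chance" implies when agreement is worse than random.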
63 changes: 63 additions & 0 deletions
src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
import anndata as ad | ||
import scanpy as sc | ||
from sklearn.metrics import normalized_mutual_info_score | ||
from sklearn.metrics import adjusted_rand_score | ||
|
||
## VIASH START | ||
par = { | ||
    'input_embedding': 'resources_test/dimensionality_reduction/pancreas/embedding.h5ad', | ||
    'input_solution': 'resources_test/dimensionality_reduction/pancreas/solution.h5ad', | ||
    'output': 'output.h5ad', | ||
    'nmi_avg_method': 'arithmetic' | ||
} | ||
meta = { | ||
    'functionality_name': 'clustering_performance' | ||
} | ||
## VIASH END | ||
|
||
print('Reading input files', flush=True) | ||
input_embedding = ad.read_h5ad(par['input_embedding']) | ||
input_solution = ad.read_h5ad(par['input_solution']) | ||
|
||
print('Compute metrics', flush=True) | ||
|
||
# Perform Leiden clustering on the dimensionality reduction embedding | ||
# over a grid of 20 resolutions (0.1, 0.2, ..., 2.0) | ||
n = 20 | ||
resolutions = [2 * x / n for x in range(1, n + 1)] | ||
score_max = 0 | ||
res_max = resolutions[0] | ||
key_max = None | ||
score_all = [] | ||
|
||
# Build the neighbor graph on the embedding if the input does not store one | ||
if "neighbors" not in input_embedding.uns: | ||
    sc.pp.neighbors(input_embedding, use_rep="X_emb") | ||
|
||
for res in resolutions: | ||
    key_added = f"X_emb_leiden_{res}" | ||
    sc.tl.leiden(input_embedding, resolution=res, key_added=key_added) | ||
    score = normalized_mutual_info_score(input_solution.obs["cell_type"], input_embedding.obs[key_added], average_method=par['nmi_avg_method']) | ||
    score_all.append(score) | ||
|
||
    # Always accept the first clustering so key_max is set even if every score is 0 | ||
    if key_max is None or score_max < score: | ||
        score_max = score | ||
        res_max = res | ||
        key_max = key_added | ||
|
||
# Compute NMI score of the best clustering | ||
nmi = normalized_mutual_info_score(input_solution.obs["cell_type"], input_embedding.obs[key_max], average_method=par['nmi_avg_method']) | ||
|
||
# Compute ARI score of the same clustering | ||
ari = adjusted_rand_score(input_solution.obs["cell_type"], input_embedding.obs[key_max]) | ||
|
||
print("Write output AnnData to file", flush=True) | ||
output = ad.AnnData( | ||
    uns={ | ||
        'dataset_id': input_embedding.uns['dataset_id'], | ||
        'normalization_id': input_embedding.uns['normalization_id'], | ||
        'method_id': input_embedding.uns['method_id'], | ||
        'metric_ids': [ 'normalized_mutual_information', 'adjusted_rand_index' ], | ||
        'metric_values': [ nmi, ari ] | ||
    } | ||
) | ||
output.write_h5ad(par['output'], compression='gzip') |
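
The selection logic in the script (sweep a resolution grid, score each clustering, keep the best) can be illustrated without scanpy or real data. This is a rough sketch under stated assumptions: `best_clustering`, `toy_cluster` and `rand_index` are hypothetical names, with `toy_cluster` standing in for `sc.tl.leiden` and the plain Rand index standing in for the NMI call. Only the resolution grid expression is taken from the script.

```python
from itertools import combinations


def best_clustering(resolutions, cluster_fn, score_fn):
    """Return (resolution, labels, score) of the best-scoring clustering.

    cluster_fn(res) -> labels and score_fn(labels) -> float are placeholders
    for the clustering call and the metric used in the script.
    """
    best = None
    for res in resolutions:
        labels = cluster_fn(res)
        score = score_fn(labels)
        # Accept the first result unconditionally, then only strict improvements.
        if best is None or score > best[2]:
            best = (res, labels, score)
    return best


def rand_index(a, b):
    """Fraction of sample pairs on which two labelings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Same grid as the script: 20 resolutions from 0.1 to 2.0.
n = 20
resolutions = [2 * x / n for x in range(1, n + 1)]

truth = [0, 0, 0, 1, 1, 1]


def toy_cluster(res):
    # Toy stand-in: a higher resolution splits the samples into more blocks.
    k = max(1, round(res * 2))
    return [i * k // len(truth) for i in range(len(truth))]


best_res, best_labels, best_score = best_clustering(
    resolutions, toy_cluster, lambda labels: rand_index(truth, labels)
)
print(best_res, best_labels, best_score)
```

Because only strict improvements replace the current best, ties resolve to the lowest resolution that reaches the top score. Accepting the first result unconditionally means a best clustering is always available, even when every score is zero.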