Add Normalized Mutual Information metric (#332)
* add SIMLR dimensionality reduction method

* add description and reference

* add SIMLR reference

* change default n_dim and write output to file

* Add SIMLR entry

* Update documentation URL

Co-authored-by: Kai Waldrant <[email protected]>

* Reformat code

* Use explicit namespaces

* Add new metric normalized_mutual_information

* Add reference for adjusted rand index

* change metric from normalized_mutual_information to clustering_performance

* add .obs["cell type"] to slots

* perform leiden clustering on embedding and compute NMI and ARI scores

* fix typo

* Compute neighbors if not already stored in input object

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

Use key_max to store best clustering

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Make sure that the key is unique

* add slot to common dataset

* add key for cluster labels

---------

Co-authored-by: Kai Waldrant <[email protected]>
Co-authored-by: Robrecht Cannoodt <[email protected]>
3 people authored Feb 2, 2024
1 parent a87afdc commit d9e4454
Showing 6 changed files with 168 additions and 0 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -309,6 +309,8 @@

* `methods/simlr`: Added new SIMLR method.

* `metrics/clustering_performance`: Added new metric to assess clustering on the reduced dimensional embeddings using NMI and ARI.


## match_modalities (PR #201)

32 changes: 32 additions & 0 deletions src/common/library.bib
@@ -400,6 +400,23 @@ @article{efremova2020cellphonedb
}


@article{emmons2016analysis,
title = {Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale},
volume = {11},
ISSN = {1932-6203},
url = {http://dx.doi.org/10.1371/journal.pone.0159161},
doi = {10.1371/journal.pone.0159161},
number = {7},
journal = {PLOS ONE},
publisher = {Public Library of Science (PLoS)},
author = {Emmons, Scott and Kobourov, Stephen and Gallant, Mike and B\"{o}rner, Katy},
editor = {Dovrolis, Constantine},
year = {2016},
month = jul,
pages = {e0159161}
}


@article{eraslan2019single,
title = {Single-cell {RNA}-seq denoising using a deep count autoencoder},
author = {G\"{o}kcen Eraslan and Lukas M. Simon and Maria Mircea and Nikola S. Mueller and Fabian J. Theis},
@@ -1091,6 +1108,21 @@ @article{rodriques2019slide
}


@InProceedings{santos2009on,
author = {Santos, Jorge M. and Embrechts, Mark},
editor = {Alippi, Cesare and Polycarpou, Marios and Panayiotou, Christos and Ellinas, Georgios},
title = {On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification},
booktitle = {Artificial Neural Networks -- ICANN 2009},
year = {2009},
publisher = {Springer Berlin Heidelberg},
address = {Berlin, Heidelberg},
pages = {175--184},
isbn = {978-3-642-04277-5},
doi = {10.1007/978-3-642-04277-5_18},
url = {https://doi.org/10.1007/978-3-642-04277-5_18}
}


@article{sarkar2021separating,
title = {Separating measurement and expression models clarifies confusion in single-cell {RNA} sequencing analysis},
author = {Abhishek Sarkar and Matthew Stephens},
@@ -13,6 +13,11 @@ info:
name: normalized
description: Normalized expression values
required: true
obs:
- type: string
name: cell_type
description: Classification of the cell type based on its characteristics and function within the tissue or organism.
required: true
var:
- type: double
name: hvg_score
5 changes: 5 additions & 0 deletions src/tasks/dimensionality_reduction/api/file_solution.yaml
@@ -13,6 +13,11 @@ info:
name: normalized
description: Normalized expression values
required: true
obs:
- type: string
name: cell_type
description: Classification of the cell type based on its characteristics and function within the tissue or organism.
required: true
var:
- type: double
name: hvg_score
@@ -0,0 +1,61 @@
__merge__: ../../api/comp_metric.yaml

functionality:
name: clustering_performance
info:
metrics:
- name: normalized_mutual_information
label: NMI
summary: Normalized Mutual Information (NMI) is a measure of the concordance between clustering obtained from the reduced-dimensional embeddings and the cell labels.
description: |
The Normalized Mutual Information (NMI) is a measure of the similarity between cluster labels obtained from the clustering of dimensionality reduction embeddings and the true cell labels. It is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation).
Mutual Information quantifies the "amount of information" obtained about one random variable by observing the other random variable. Assuming two label assignments X and Y, it is given by:
$MI(X,Y) = \sum_{x=1}^{|X|}\sum_{y=1}^{|Y|}P(x,y)\log\left(\frac{P(x,y)}{P(x)P'(y)}\right)$,
where P(x,y) is the joint probability mass function of X and Y, and P(x), P'(y) are the marginal probability mass functions of X and Y respectively. The mutual information is normalized by some generalized mean of H(X) and H(Y). Therefore, Normalized Mutual Information can be defined as:
$NMI(X,Y) = \frac{MI(X,Y)}{mean(H(X),H(Y))}$,
where H(X) and H(Y) are the entropies of X and Y respectively. A higher NMI score suggests that the method is effective at preserving the relevant information.
reference: emmons2016analysis
documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html
repository_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html
min: 0
max: 1
maximize: true
- name: adjusted_rand_index
label: ARI
summary: Adjusted Rand Index (ARI) is a measure of the similarity between the cluster assignment obtained from the reduced-dimensional embeddings and the true cell type labels.
description: |
Adjusted Rand Index (ARI) is a measure of similarity between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted (from the reduced dimensional embeddings) and true clusterings (cell type labels). It is the Rand Index (RI) adjusted for chance.
Taking C as the cell type labels and K as the clustering of the reduced-dimensional embedding, the Rand Index can be defined as:
$RI = \frac{a + b}{{C}_{2}^{n_{samples}}}$,
where 'a' is the number of pairs of elements that are in the same set in C and in the same set in K, 'b' is the number of pairs of elements that are in different sets in C and in different sets in K, and ${C}_{2}^{n_{samples}}$ is the total number of possible pairs in the dataset. Random label assignments can be discounted as follows:
$ARI = \frac{RI - E[RI]}{max(RI) - E[RI]}$,
where E[RI] is the expected RI of random labellings.
reference: santos2009on
documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
repository_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
min: 0
max: 1
maximize: true

# Component-specific parameters
arguments:
- name: "--nmi_avg_method"
type: string
default: arithmetic
description: Method to compute normalizer in the denominator for normalized mutual information score calculation.
choices: [ min, geometric, arithmetic, max ]

resources:
- type: python_script
path: script.py

platforms:
- type: docker
image: ghcr.io/openproblems-bio/base_python:1.0.2
setup:
- type: python
packages: [ scikit-learn, scanpy, leidenalg ]
- type: native
- type: nextflow
directives:
label: [ midtime, midmem, midcpu ]
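The two metrics configured above correspond directly to scikit-learn calls. A minimal sketch with toy label vectors (not taken from this repository) showing both scores and the effect of the `--nmi_avg_method` argument:

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

truth = [0, 0, 1, 1]  # toy ground-truth cell type labels
pred = [1, 1, 0, 0]   # a clustering that only permutes the label names

# Both scores are invariant to permutations of the cluster label names,
# so an identical partition scores perfectly (both ~ 1.0)
nmi = normalized_mutual_info_score(truth, pred)
ari = adjusted_rand_score(truth, pred)

# `average_method` selects the generalized mean of H(truth) and H(pred)
# used as the normalizer; here one true cluster is split in two
split = [0, 0, 1, 2]
nmi_min = normalized_mutual_info_score(truth, split, average_method="min")
nmi_arith = normalized_mutual_info_score(truth, split, average_method="arithmetic")
# min normalization is more forgiving than arithmetic for this split
```

The choice of `average_method` matters exactly when the two partitions have different entropies, as in the over-clustered `split` example.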
@@ -0,0 +1,63 @@
import anndata as ad
import scanpy as sc
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score

## VIASH START
par = {
'input_embedding': 'resources_test/dimensionality_reduction/pancreas/embedding.h5ad',
'input_solution': 'resources_test/dimensionality_reduction/pancreas/solution.h5ad',
'output': 'output.h5ad',
'nmi_avg_method': 'arithmetic'
}
meta = {
'functionality_name': 'clustering_performance'
}
## VIASH END

print('Reading input files', flush=True)
input_embedding = ad.read_h5ad(par['input_embedding'])
input_solution = ad.read_h5ad(par['input_solution'])

print('Compute metrics', flush=True)

# Perform Leiden clustering on dimensionality reduction embedding
n = 20
resolutions = [2 * x / n for x in range(1, n + 1)]
score_max = 0
res_max = resolutions[0]
key_max = None
score_all = []

if "neighbors" not in input_embedding.uns:
sc.pp.neighbors(input_embedding, use_rep="X_emb")

for res in resolutions:
key_added = f"X_emb_leiden_{res}"
sc.tl.leiden(input_embedding, resolution=res, key_added=key_added)
score = normalized_mutual_info_score(input_solution.obs["cell_type"], input_embedding.obs[key_added], average_method=par['nmi_avg_method'])
score_all.append(score)

if score_max < score:
score_max = score
res_max = res
key_max = key_added

# Compute NMI score at the best-scoring resolution
nmi = normalized_mutual_info_score(input_solution.obs["cell_type"], input_embedding.obs[key_max], average_method=par['nmi_avg_method'])

# Compute ARI score at the best-scoring resolution
ari = adjusted_rand_score(input_solution.obs["cell_type"], input_embedding.obs[key_max])

print("Write output AnnData to file", flush=True)
output = ad.AnnData(
uns={
'dataset_id': input_embedding.uns['dataset_id'],
'normalization_id': input_embedding.uns['normalization_id'],
'method_id': input_embedding.uns['method_id'],
'metric_ids': [ 'normalized_mutual_information', 'adjusted_rand_index' ],
'metric_values': [ nmi, ari ]
}
)
output.write_h5ad(par['output'], compression='gzip')
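The sweep-and-keep-the-best pattern in the script above can be exercised without scanpy. A minimal sketch that swaps Leiden-over-resolutions for KMeans-over-cluster-counts on synthetic blobs (all data and names here are illustrative, not part of this repository):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

rng = np.random.default_rng(0)
# Two well-separated blobs standing in for a reduced-dimensional embedding
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
truth = np.repeat([0, 1], 50)

# Sweep a clustering parameter and keep the assignment with the best NMI,
# mirroring the resolution loop in script.py
best_score, best_labels = 0.0, None
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = normalized_mutual_info_score(truth, labels)
    if best_score < score:
        best_score, best_labels = score, labels

# Report both metrics for the winning assignment, as the script does
ari = adjusted_rand_score(truth, best_labels)
```

As in the script, ARI is computed only once, on the clustering that maximized NMI over the sweep.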
