Add Normalized Mutual Information metric (#332)

* add SIMLR dimensionality reduction method
* add description and reference
* add SIMLR reference
* change default n_dim and write output to file
* Add SIMLR entry
* Update documentation URL

  Co-authored-by: Kai Waldrant <[email protected]>

* Reformat code
* Use explicit namespaces
* Add new metric normalized_mutual_information
* Add reference for adjusted rand index
* change metric from normalized_mutual_information to clustering_performance
* add .obs["cell type"] to slots
* perform leiden clustering on embedding and compute NMI and ARI scores
* fix typo
* Compute neighbors if not already stored in input object

  Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

  Use key_max to store best clustering

  Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

  Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

  Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py

  Co-authored-by: Robrecht Cannoodt <[email protected]>

* Make sure that the key is unique
* add slot to common dataset
* add key for cluster labels

---------

Co-authored-by: Kai Waldrant <[email protected]>
Co-authored-by: Robrecht Cannoodt <[email protected]>
openproblems-bio · Feb 2, 2024 · d9e4454 · d9e4454
1 parent a87afdc
Showing 6 changed files with 168 additions and 0 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -309,6 +309,8 @@
 
 * `methods/simlr`: Added new SIMLR method.
 
+* `metrics/clustering_performance`: Added new metric to assess clustering on the reduced dimensional embeddings using NMI and ARI.
+
 
 ## match_modalities (PR #201)
 

diff --git a/src/common/library.bib b/src/common/library.bib
@@ -400,6 +400,23 @@ @article{efremova2020cellphonedb
 }
 
 
+@article{emmons2016analysis,
+  title = {Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale},
+  volume = {11},
+  ISSN = {1932-6203},
+  url = {http://dx.doi.org/10.1371/journal.pone.0159161},
+  doi = {10.1371/journal.pone.0159161},
+  number = {7},
+  journal = {PLOS ONE},
+  publisher = {Public Library of Science (PLoS)},
+  author = {Emmons,  Scott and Kobourov,  Stephen and Gallant,  Mike and B\"{o}rner,  Katy},
+  editor = {Dovrolis,  Constantine},
+  year = {2016},
+  month = jul,
+  pages = {e0159161}
+}
+
+
 @article{eraslan2019single,
 	title = {Single-cell {RNA}-seq denoising using a deep count autoencoder},
 	author = {G\"{o}kcen Eraslan and Lukas M. Simon and Maria Mircea and Nikola S. Mueller and Fabian J. Theis},
@@ -1091,6 +1108,21 @@ @article{rodriques2019slide
 }
 
 
+@InProceedings{santos2009on,
+	author = {Santos, Jorge M. and Embrechts, Mark},
+	editor = {Alippi, Cesare and Polycarpou, Marios and Panayiotou, Christos and Ellinas, Georgios},
+	title = {On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification},
+	booktitle = {Artificial Neural Networks -- ICANN 2009},
+	year = {2009},
+	publisher = {Springer Berlin Heidelberg},
+	address = {Berlin, Heidelberg},
+	pages = {175--184},
+	isbn = {978-3-642-04277-5}, 
+	doi = {10.1007/978-3-642-04277-5_18},
+	url = {https://doi.org/10.1007/978-3-642-04277-5_18}
+}
+
+
 @article{sarkar2021separating,
 	title = {Separating measurement and expression models clarifies confusion in single-cell {RNA} sequencing analysis},
 	author = {Abhishek Sarkar and Matthew Stephens},

diff --git a/src/tasks/dimensionality_reduction/api/file_common_dataset.yaml b/src/tasks/dimensionality_reduction/api/file_common_dataset.yaml
@@ -13,6 +13,11 @@ info:
         name: normalized
         description: Normalized expression values
         required: true
+    obs: 
+      - type: string
+        name: cell_type
+        description: Classification of the cell type based on its characteristics and function within the tissue or organism.
+        required: true
     var:
       - type: double
         name: hvg_score

diff --git a/src/tasks/dimensionality_reduction/api/file_solution.yaml b/src/tasks/dimensionality_reduction/api/file_solution.yaml
@@ -13,6 +13,11 @@ info:
         name: normalized
         description: Normalized expression values
         required: true
+    obs: 
+      - type: string
+        name: cell_type
+        description: Classification of the cell type based on its characteristics and function within the tissue or organism.
+        required: true
     var:
       - type: double
         name: hvg_score

diff --git a/src/tasks/dimensionality_reduction/metrics/clustering_performance/config.vsh.yaml b/src/tasks/dimensionality_reduction/metrics/clustering_performance/config.vsh.yaml
@@ -0,0 +1,61 @@
+__merge__: ../../api/comp_metric.yaml
+
+functionality:
+  name: clustering_performance
+  info:
+    metrics:
+      - name: normalized_mutual_information
+        label: NMI
+        summary: Normalized Mutual Information (NMI) is a measure of the concordance between clustering obtained from the reduced-dimensional embeddings and the cell labels.
+        description: |
+          The Normalized Mutual Information (NMI) is a measure of the similarity between cluster labels obtained from the clustering of dimensionality reduction embeddings and the true cell labels. It is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). 
+          Mutual Information quantifies the "amount of information" obtained about one random variable by observing the other random variable. Assuming two label assignments X and Y, it is given by: 
+            $MI(X,Y) = \sum_{x=1}^{|X|}\sum_{y=1}^{|Y|} P(x,y)\log\left(\frac{P(x,y)}{P(x)P'(y)}\right)$, 
+          where |X| and |Y| are the numbers of distinct labels in X and Y, P(x,y) is the joint probability mass function of X and Y, and P(x), P'(y) are the marginal probability mass functions of X and Y respectively. The mutual information is normalized by some generalized mean of H(X) and H(Y). Therefore, Normalized Mutual Information can be defined as: 
+            $NMI(X,Y) = \frac{MI(X,Y)}{\mathrm{mean}(H(X),H(Y))}$, 
+          where H(X) and H(Y) are the entropies of X and Y respectively. A higher NMI score suggests that the clusters found in the embedding agree more closely with the cell type labels, i.e. that the method preserves the relevant information.
+        reference: emmons2016analysis
+        documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html
+        repository_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html
+        min: 0
+        max: 1
+        maximize: true
+      - name: adjusted_rand_index
+        label: ARI
+        summary: Adjusted Rand Index (ARI) is a measure of the similarity between the clustering of the reduced-dimensional embeddings and the true cell types.
+        description: |
+          Adjusted Rand Index (ARI) is a measure of similarity between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted (from the reduced dimensional embeddings) and true clusterings (cell type labels). It is the Rand Index (RI) adjusted for chance.
+          Taking C as the cell type labels and K as the clustering of the reduced dimensional embedding, the Rand Index can be defined as:
+            $RI = \frac{a + b}{{C}_{2}^{n_{samples}}}$,
+          where 'a' is the number of pairs of elements that are in the same set in C and in the same set in K, 'b' is the number of pairs of elements that are in different sets in C and in different sets in K, and ${C}_{2}^{n_{samples}}$ is the total number of possible pairs in the dataset. Random label assignments can be discounted as follows: 
+            $ARI = \frac{RI - E[RI]}{max(RI) - E[RI]}$, 
+          where E[RI] is the expected RI of random labellings.
+        reference: santos2009on
+        documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
+        repository_url: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
+        min: 0
+        max: 1
+        maximize: true
+
+  # Component-specific parameters
+  arguments:
+    - name: "--nmi_avg_method"
+      type: string
+      default: arithmetic
+      description: Method to compute normalizer in the denominator for normalized mutual information score calculation. 
+      choices: [ min, geometric, arithmetic, max ]
+
+  resources:
+    - type: python_script
+      path: script.py
+
+platforms:
+  - type: docker
+    image: ghcr.io/openproblems-bio/base_python:1.0.2
+    setup:
+      - type: python
+        packages: [ scikit-learn, scanpy, leidenalg ]
+  - type: native
+  - type: nextflow
+    directives:
+      label: [ "midtime", midmem, midcpu ]
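The two metric descriptions in this config can be checked by hand with a small pure-Python sketch. It is not the code used by the component (which calls scikit-learn); the function names `entropy`, `mutual_information`, `nmi` and `adjusted_rand_index` are illustrative, and the ARI is written in the usual pair-counting form over the contingency table, which is an assumption about how the `RI`-based formula above is evaluated in practice.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a label assignment (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """MI(X, Y) = sum over (x, y) of P(x,y) * log(P(x,y) / (P(x) * P'(y)))."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def nmi(x, y, average_method="arithmetic"):
    """MI normalized by a generalized mean of H(X) and H(Y).

    The choices mirror the --nmi_avg_method argument of this component.
    """
    hx, hy = entropy(x), entropy(y)
    if hx == 0 and hy == 0:  # both labelings consist of a single cluster
        return 1.0
    mi = mutual_information(x, y)
    if mi < 1e-12:  # no shared information; also avoids dividing by zero
        return 0.0
    normalizer = {
        "min": min(hx, hy),
        "geometric": math.sqrt(hx * hy),
        "arithmetic": (hx + hy) / 2,
        "max": max(hx, hy),
    }[average_method]
    return mi / normalizer


def adjusted_rand_index(x, y):
    """Pair-counting ARI: (index - expected) / (max_index - expected)."""
    index = sum(math.comb(c, 2) for c in Counter(zip(x, y)).values())
    sum_x = sum(math.comb(c, 2) for c in Counter(x).values())
    sum_y = sum(math.comb(c, 2) for c in Counter(y).values())
    expected = sum_x * sum_y / math.comb(len(x), 2)
    max_index = (sum_x + sum_y) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (index - expected) / (max_index - expected)


cell_type = ["a", "a", "b", "b"]   # toy "true" labels
clusters = [0, 1, 2, 3]            # toy over-clustered labeling
for method in ("min", "geometric", "arithmetic", "max"):
    print(method, nmi(cell_type, clusters, average_method=method))
print("ARI", adjusted_rand_index(cell_type, clusters))
```

On the toy labels the four averaging methods give visibly different NMI values, which is why the choice is exposed as an argument. The sketch also shows that the ARI is not confined to non-negative values: labelings that agree less than expected by chance score below 0.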
diff --git a/src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py b/src/tasks/dimensionality_reduction/metrics/clustering_performance/script.py
@@ -0,0 +1,63 @@
+import anndata as ad
+import scanpy as sc
+from sklearn.cluster import KMeans
+from sklearn.metrics import normalized_mutual_info_score
+from sklearn.metrics import adjusted_rand_score
+
+## VIASH START
+par = {
+  'input_embedding': 'resources_test/dimensionality_reduction/pancreas/embedding.h5ad',
+  'input_solution': 'resources_test/dimensionality_reduction/pancreas/solution.h5ad',
+  'output': 'output.h5ad',
+  'nmi_avg_method': 'arithmetic'
+}
+meta = {
+  'functionality_name': 'clustering_performance'
+}
+## VIASH END
+
+print('Reading input files', flush=True)
+input_embedding = ad.read_h5ad(par['input_embedding'])
+input_solution = ad.read_h5ad(par['input_solution'])
+
+print('Compute metrics', flush=True)
+
+# Perform Leiden clustering on dimensionality reduction embedding
+n = 20
+resolutions = [2 * x / n for x in range(1, n + 1)]
+score_max = 0
+res_max = resolutions[0]
+key_max = None
+score_all = []
+
+if "neighbors" not in input_embedding.uns:
+  sc.pp.neighbors(input_embedding, use_rep="X_emb")
+
+for res in resolutions:
+  key_added = f"X_emb_leiden_{res}"
+  sc.tl.leiden(input_embedding, resolution=res, key_added=key_added)
+  score = normalized_mutual_info_score(input_solution.obs["cell_type"], input_embedding.obs[key_added], average_method = par['nmi_avg_method'])
+  score_all.append(score)
+
+  if key_max is None or score_max < score:
+    score_max = score
+    res_max = res
+    key_max = key_added
+
+# Compute NMI scores
+nmi = normalized_mutual_info_score(input_solution.obs["cell_type"], input_embedding.obs[key_max], average_method = par['nmi_avg_method'])
+
+# Compute ARI scores
+ari = adjusted_rand_score(input_solution.obs["cell_type"], input_embedding.obs[key_max])
+
+print("Write output AnnData to file", flush=True)
+output = ad.AnnData(
+  uns={
+    'dataset_id': input_embedding.uns['dataset_id'],
+    'normalization_id': input_embedding.uns['normalization_id'],
+    'method_id': input_embedding.uns['method_id'],
+    'metric_ids': [ 'normalized_mutual_information', 'adjusted_rand_index' ],
+    'metric_values': [ nmi, ari ]
+  }
+)
+output.write_h5ad(par['output'], compression='gzip')
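The selection logic in `script.py` (sweep a grid of resolutions, score each clustering against the cell type labels, keep the best key) can be isolated in a small sketch without scanpy. Everything here is illustrative: `resolution_grid`, `select_best` and `rand_index` are hypothetical helper names, the candidate labelings are hand-written stand-ins for Leiden output, and the unadjusted Rand index is used as a stand-in scorer where the script uses NMI.

```python
from itertools import combinations


def resolution_grid(n=20, max_resolution=2.0):
    """Same grid as the script: n evenly spaced values in (0, max_resolution]."""
    return [max_resolution * x / n for x in range(1, n + 1)]


def rand_index(truth, pred):
    """Fraction of sample pairs on which two labelings agree (stand-in scorer)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs
    )
    return agree / len(pairs)


def select_best(candidates, score_fn):
    """Return (key, score) of the highest-scoring labeling.

    Starting from -inf guarantees a key is always selected, even if every
    candidate scores 0.
    """
    best_key, best_score = None, float("-inf")
    for key, labels in candidates.items():
        score = score_fn(labels)
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score


truth = ["a", "a", "b", "b"]
# Hand-written stand-ins for clusterings at three resolutions (assumption).
candidates = {
    "X_emb_leiden_0.1": [0, 0, 0, 0],
    "X_emb_leiden_1.0": [0, 0, 1, 1],
    "X_emb_leiden_2.0": [0, 1, 2, 3],
}
best_key, best_score = select_best(candidates, lambda p: rand_index(truth, p))
print(len(resolution_grid()), best_key, best_score)
```

Because the resolution is tuned against the true labels, the reported scores describe the best achievable agreement over the grid rather than the agreement at a fixed, label-independent resolution.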