scNMT-seq as a case study for epigenetic regulation

Overview and biological question

scRNA-seq technologies have enabled the identification of transcriptional profiles associated with lineage diversification and cell fate commitment [@doi:10.15252/msb.20178046], but the role of epigenetic layers remains poorly understood [@doi:10.1016/j.stem.2014.05.008]. In contrast to the first two hackathons, which leveraged datasets from complementary technologies to enable high molecular and spatial resolution of biological systems, the third hackathon used scNMT-seq datasets spanning disparate molecular scales (e.g. DNA and RNA measurements) to improve our understanding of cell fate decisions.

scNMT-seq is one of the first experimental protocols that enable simultaneous quantification of RNA expression and epigenetic information from individual cells [@doi:10.1038/s41467-018-03149-4]. Briefly, cells are incubated with a GpC methyltransferase enzyme that labels accessible GpC sites via DNA methylation. Thus, GpC methylation marks can be interpreted as direct read-outs for chromatin accessibility, whereas CpG methylation marks can be interpreted as endogenous DNA methylation. By physically separating the genomic DNA from the mRNA, scNMT-seq can profile RNA expression, DNA methylation and chromatin accessibility read-outs from the same cell. This third hackathon focused on data integration strategies to detect global covariation between RNA expression and DNA methylation variation from scNMT-seq data in a mouse gastrulation study [@doi:10.1038/s41586-019-1825-8].

Gastrulation is a major lineage specification event in mammalian embryos that is accompanied by profound transcriptional rewiring and epigenetic remodeling [@doi:10.1093/humupd/dmy021]. In this study, four developmental stages were profiled, spanning exit from pluripotency to germ layer commitment (E4.5 to E7.5). For simplicity in this hackathon, we focused on the integration of RNA expression and DNA methylation, quantified over the following genomic contexts: gene bodies, promoters, CpG islands, and DHS open sites. A total of 799 cells passed quality control (Figure {@fig:scnmtseq}A). Preliminary analyses using dimensionality reduction methods confirmed that all four embryonic stages could be separated on the basis of RNA expression (Figure {@fig:scnmtseq}B). The main challenge was to leverage the multi-faceted nature of the measurements to better resolve the single-cell subpopulations from distinct embryonic stages.

Computational challenges

Our participants considered three computational strategies (see Vignettes). MOSAIC (Multi-Omics Supervised Integrative Clustering), an algorithm inspired by survClust [@doi:10.1101/2020.05.11.084798], classifies samples by creating weighted distance matrices across data modalities, where the weights are defined as the maximum of the ratio of cluster-specific vs. population log-likelihoods (Figure {@fig:scnmtseq}C). LIGER is an unsupervised non-negative matrix factorization model for manifold alignment that assumes a common feature space by aggregating DNA methylation over gene-centric elements (promoters or gene bodies) but allows cells to vary between data modalities [@doi:10.1016/j.cell.2019.05.006] (Figure {@fig:scnmtseq}D). Multi-block sparse Projection to Latent Structures (multiblock sPLS) is a sparse generalization of canonical correlation analysis that maximizes paired covariances between the RNA data set and each of the other genomic context data sets [@doi:10.1093/biostatistics/kxu001; @doi:10.1371/journal.pcbi.1005752] (Figure {@fig:scnmtseq}E).
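The idea of combining per-modality distance matrices with weights can be illustrated with a minimal sketch. This is not the MOSAIC implementation: the weights are simply passed in as numbers rather than derived from log-likelihood ratios, and the function names and toy values are hypothetical.

```python
from math import sqrt

def euclidean_distances(rows):
    """Pairwise Euclidean distances between cells (one row per cell)."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
            dist[i][j] = dist[j][i] = d
    return dist

def integrate_distances(distance_matrices, weights):
    """Weighted average of distance matrices that share the same cells.

    The weights are normalized to sum to one. How they are chosen is left
    open here (assumption: they are supplied by the caller).
    """
    total = sum(weights)
    n = len(distance_matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for dist, weight in zip(distance_matrices, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += (weight / total) * dist[i][j]
    return combined

# Toy data: three cells measured in two modalities (values are made up).
rna = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
methylation = [[0.0], [0.0], [1.0]]

combined = integrate_distances(
    [euclidean_distances(rna), euclidean_distances(methylation)],
    weights=[1.0, 3.0],
)
```

In this toy case cells 0 and 2 are identical in RNA but differ in methylation, so only the integrated matrix separates all three cells.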

{#fig:scnmtseq width="60%"}

Caption Figure: Overview of hackathon analyses for the scNMT-seq challenge. A Summary of the data modalities analyzed, including different putative regulatory regions. B UMAP of RNA measurements using 671 highly variable genes shows separation of the four embryonic stages.
C Supervised analysis using view-specific and integrative distance measures with MOSAIC: based on Adjusted Mutual Information and Standardized Pooled Within Sum of Squares, the integrated solution identifies five clusters of cell populations and outperforms the individual (single-omics) analyses.
D LIGER joint alignment using gene body methylation and RNA expression: cells are colored by stage (left) or original data modality (right). E Unsupervised integration using multiblock sPLS: cells are projected into the space spanned by the components of each data view, which are maximally correlated across views. For performance assessment, two types of analyses were considered: either omitting the missing DNA methylation values or incorporating imputed values. K-means clustering analysis based on the multiblock sPLS components was used to calculate balanced accuracy measures.

Challenge 1: Defining genomic features

The first challenge presented in this hackathon concerns the definition of the input data. The output of single-cell bisulfite sequencing is a set of binary DNA methylation measurements for individual CpG sites. Integrative analysis at the CpG level is extremely challenging due to the sparsity levels, the binary nature of the read-outs, and the difficulty of interpreting individual dinucleotides. To address these problems, DNA methylation measurements are typically aggregated over pre-defined sets of genomic elements (e.g. promoters, enhancers, etc.). This preprocessing step reduces sparsity, permits the calculation of binomial rates that are approximately continuous, and can also improve interpretability of the model output.
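As a minimal sketch of this aggregation step, the following computes a methylation rate per genomic element for a single cell on a single chromosome. The function name, the `min_sites` threshold, and the toy coordinates are assumptions for illustration, not part of the hackathon pipeline.

```python
from bisect import bisect_left, bisect_right

def methylation_rate(cpg_calls, regions, min_sites=1):
    """Aggregate binary CpG calls into per-region methylation rates.

    cpg_calls: list of (position, call) for one cell and one chromosome,
               where call is 1 (methylated) or 0 (unmethylated).
    regions:   dict mapping a region name to (start, end), both inclusive.
    Returns a dict mapping each region to a rate in [0, 1], or None when
    fewer than `min_sites` CpGs were observed (treated as a missing value).
    """
    calls = sorted(cpg_calls)
    positions = [pos for pos, _ in calls]
    rates = {}
    for name, (start, end) in regions.items():
        lo = bisect_left(positions, start)   # first CpG at or after start
        hi = bisect_right(positions, end)    # one past the last CpG at or before end
        n_sites = hi - lo
        if n_sites < min_sites:
            rates[name] = None
        else:
            n_methylated = sum(call for _, call in calls[lo:hi])
            rates[name] = n_methylated / n_sites
    return rates

# Toy example: five observed CpGs and three hypothetical promoters.
cell = [(105, 1), (130, 1), (180, 0), (410, 0), (455, 0)]
regions = {
    "promoter_A": (100, 200),
    "promoter_B": (400, 500),
    "promoter_C": (900, 1000),
}
rates = methylation_rate(cell, regions)
```

Here `promoter_C` has no covered CpG and is returned as `None`, which keeps an uncovered element distinct from an unmethylated one (`promoter_B`, rate 0).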

We observed remarkable differences in integration performance between genomic contexts. In MOSAIC, stages were better separated when using DNA methylation measurements on promoter regions and at least four clusters (AMI = 0.45). Interestingly, this setting performed better than using RNA expression alone (AMI = 0.40). Notably, when using an integrated solution across data modalities, stages were better classified (AMI = 0.68) (Figure {@fig:scnmtseq}C). LIGER, which was also applied in the first hackathon, requires a common feature space to align cells profiled for different data modalities. This hackathon provides unambiguous cell matching between the data modalities and thus represents a gold standard for testing this approach. LIGER was applied to gene expression and gene body methylation: the poor alignment suggested a complex coupling of gene expression and gene body methylation during gastrulation (Figure {@fig:scnmtseq}D). Finally, multiblock sPLS identified covarying components between RNA expression and DNA methylation that separated cell stages in all putative regulatory contexts considered (Figure {@fig:scnmtseq}E). Taken together, these results confirmed that the appropriate selection of the feature space is critical for a successful integration with RNA expression.

Challenge 2: Missing values in DNA methylation

Single-cell bisulfite sequencing protocols are limited by incomplete CpG coverage because of the low amounts of starting material. Nonetheless, in contrast to scRNA-seq, missing data can be distinguished from dropouts. Integrative methods can be divided into approaches that can handle missing values (e.g. MOSAIC and multiblock sPLS, which omit the missing values during inference) and approaches that require a priori imputation (e.g. LIGER). In this hackathon, missing values in the methylation data were imputed using nearest neighbor averaging (as implemented in the impute package [@doi:10.18129/B9.bioc.impute]).
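To make nearest neighbor averaging concrete, here is a simplified sketch in Python. It is not the algorithm of the impute package: the distance (mean squared difference over co-observed entries), the choice of `k`, and the toy matrix are assumptions for illustration.

```python
def knn_impute(matrix, k=2):
    """Fill None entries with the average of the k most similar rows.

    matrix: list of rows (e.g. genomic features), each a list of values
            across columns (e.g. cells), with None marking missing values.
    """
    def distance(row_a, row_b):
        # Mean squared difference over columns observed in both rows.
        shared = [(a, b) for a, b in zip(row_a, row_b)
                  if a is not None and b is not None]
        if not shared:
            return float("inf")
        return sum((a - b) ** 2 for a, b in shared) / len(shared)

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        missing = [j for j, value in enumerate(row) if value is None]
        if not missing:
            continue
        # Rank the other rows by similarity, using the original data only.
        scored = [(distance(row, matrix[idx]), idx)
                  for idx in range(len(matrix)) if idx != i]
        neighbours = [idx for d, idx in sorted(scored) if d != float("inf")]
        for j in missing:
            # Take the k closest rows that were actually observed in column j.
            donors = [matrix[idx][j] for idx in neighbours
                      if matrix[idx][j] is not None][:k]
            if donors:
                imputed[i][j] = sum(donors) / len(donors)
    return imputed

# Toy methylation rates: four features (rows) across four cells (columns).
rates = [
    [0.9, 0.8, None, 0.7],
    [0.8, 0.8, 0.6, 0.7],
    [0.1, 0.2, 0.3, None],
    [1.0, 0.8, 0.8, 0.6],
]
filled = knn_impute(rates, k=2)
```

Entries that have no usable neighbour stay `None`, so a downstream method still has to decide how to treat them.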

We compared the integration performance of multiblock sPLS with either the original or the imputed data. The components associated with each data set showed varying degrees of separation of the embryonic stages, depending on the genomic context (Figure {@fig:scnmtseq}E). Accuracy measures based on k-means clustering analysis on the multiblock sPLS components showed that gene body methylation components were better at characterizing embryonic stage after imputation (from 70% with original data to 86% after imputation).
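Balanced accuracy is the average of the per-class recall, which prevents the most abundant stage from dominating the score. The sketch below assumes that each k-means cluster is first labelled with the majority stage of its cells; the source does not state how clusters were matched to stages, so that step, the function names, and the toy labels are assumptions.

```python
from collections import Counter, defaultdict

def label_clusters_by_majority(true_labels, cluster_ids):
    """Replace each cluster id with the most frequent true label in that cluster."""
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1
    majority = {cluster: counts.most_common(1)[0][0]
                for cluster, counts in members.items()}
    return [majority[cluster] for cluster in cluster_ids]

def balanced_accuracy(true_labels, predicted_labels):
    """Mean of per-class recall over the classes present in true_labels."""
    totals = Counter(true_labels)
    hits = Counter()
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == prediction:
            hits[truth] += 1
    return sum(hits[label] / totals[label] for label in totals) / len(totals)

# Toy example: ten cells from three stages and their cluster assignments.
stages = ["E4.5"] * 2 + ["E5.5"] * 4 + ["E6.5"] * 4
clusters = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]

predicted = label_clusters_by_majority(stages, clusters)
score = balanced_accuracy(stages, predicted)
```

In this toy case one E5.5 cell falls into the E6.5 cluster, so recall is 1, 0.75 and 1 for the three stages.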

Missing values in regulatory context data represent a topical challenge in data analysis, and further methodological developments are needed to either handle or accurately estimate missing values.

Challenge 3: Linking epigenetic features to gene expression

One of the main advantages of scNMT-seq is the ability to link epigenetic variation with gene expression in an unbiased manner. Transcriptional activation is associated with specific chromatin states near the gene of interest. This includes deposition of activating histone marks such as H3K27ac, H3K4me3 and H3K36me3, binding of transcription factors, promoter and/or enhancer demethylation and chromatin remodeling. All these events are closely interconnected and leave a footprint across multiple molecular layers that can only be (partially) recovered by performing an association analysis between a specific chromatin read-out and mRNA expression. However, given the large number of genes and regulatory regions, the number of tests can become prohibitively large, with an associated multiple testing burden. In addition, some of our analyses have shown that the correlations between epigenetic layers and RNA expression calculated from individual genomic features can be generally weak or spurious.
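The multiple testing burden can be illustrated with the Benjamini-Hochberg procedure, a standard way to control the false discovery rate across many association tests. The source does not state which correction was used, so this choice and the toy p-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Testing every gene against every regulatory region scales multiplicatively
# (hypothetical counts, for illustration only).
n_genes, n_regions = 5_000, 20_000
n_tests = n_genes * n_regions

# Toy p-values for four gene-region pairs.
p_values = [0.001, 0.04, 0.03, 0.2]
q_values = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(q_values) if q < 0.05]
```

Two of the four pairs have raw p-values below 0.05 that no longer pass after adjustment, and the penalty grows with the number of tests, which is why restricting the set of candidate pairs matters.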

A practical and straightforward approach from a computational perspective involves considering only putative regulatory elements within each gene's genomic neighborhood. Nonetheless, this might miss important links with regulatory elements located far away from the neighborhood.
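A minimal sketch of this neighborhood-based strategy follows. The window size, the gene and element names, and the coordinates are all hypothetical.

```python
def link_by_proximity(genes, elements, window=10_000):
    """Link each gene to the regulatory elements near its transcription start site.

    genes:    dict mapping gene name to (chromosome, tss_position).
    elements: dict mapping element name to (chromosome, start, end).
    An element is linked when it overlaps [tss - window, tss + window]
    on the same chromosome.
    """
    links = {}
    for gene, (chrom, tss) in genes.items():
        links[gene] = sorted(
            name
            for name, (e_chrom, start, end) in elements.items()
            if e_chrom == chrom and start <= tss + window and end >= tss - window
        )
    return links

# Hypothetical genes and regulatory elements.
genes = {"GeneA": ("chr1", 50_000), "GeneB": ("chr2", 20_000)}
elements = {
    "enh1": ("chr1", 42_000, 43_000),        # 7 kb upstream of GeneA
    "enh2": ("chr1", 1_500_000, 1_501_000),  # about 1.45 Mb away from GeneA
    "enh3": ("chr2", 29_500, 30_500),        # overlaps the edge of GeneB's window
}
links = link_by_proximity(genes, elements)
```

The distal element `enh2` is never tested against `GeneA`, which is the limitation of a fixed window: the candidate set shrinks, but long-range links are lost.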

In recent years, chromosome conformation capture experiments have uncovered a complex network of chromatin interactions inside the nucleus, connecting regions that are separated by multiple megabases along the genome and are potentially involved in gene regulation. Early genome-wide contact maps generated by Hi-C uncovered domains spanning on the order of 1 Mb (in humans) within which genes would be coordinately regulated. Thus, a second strategy to associate putative regulatory elements and genes is to build on existing promoter-centered chromatin contact networks to restrict the association analysis to putative regulatory elements that are in 3D contact with genes. Although this is a promising strategy to reduce the complexity of the association analysis, most of our 3D interaction datasets are produced in bulk samples, and it is so far unclear to what extent these structures are preserved across individual cells. While single-cell conformation capture experiments remain limited by data sparsity and high levels of technical noise, we envision that technological advances in this area will deepen our understanding of the regulatory roles of chromatin states.
