Skip to content

Latest commit

 

History

History
68 lines (39 loc) · 10.8 KB

60.software.md

File metadata and controls

68 lines (39 loc) · 10.8 KB

Software strategies to enable analyses of multimodal single-cell experiments

Open-source software is essential in bioinformatics and computational biology. Benchmark datasets, analysis pipelines, and the development of multimodal genome-scale experiments are all enabled through community-developed, open-source software, and data sharing platforms. A wide array of genomics frameworks for multi-platform single-cell data have been developed in R and Python. Along with other software, these frameworks use standardized licensing in Creative Commons, Artistic, or GNU so that all components are accessible for full vetting by the community (see List of sofware)). Our hackathons hinged on the central challenges such as widescale adoption, extension, and collaboration to enable inference and visualization of the multimodal single-cell experiments in our analytic frameworks. We designed each case study to leverage and build on these open frameworks to further develop and evaluate robust benchmarking strategies. Easy to use data packages to distribute the multi-omics data and reproducible vignettes were key outputs from our workshop.

Collaboration enabled through continuous integration

Open-source software efforts facilitate a community-level coordinated approach to support collaboration rather than duplication of effort between groups working on similar problems. Real-time improvements to the tool-set should be feasible, respecting the needs for stability, reliability, and continuity of access to evolving components. To that end, exploration and engagement with all these tools is richly enabled through code sharing resources. Our hackathons directly leveraged through GitHub with our reproducible analyses reports to enable continuous integration of changes to source codes (using Github Action), and containerized snapshots of the analyses environments. The hackathons analyses conducted in R were assembled into R packages to facilitate libraries loading, while those conducted in Python enabled automatic installation and deployment

Usability and adoption by the community

Robust software ecosystems are required to build broad user bases [@doi:10.1126/science.aaf6162; @https://chaoss.github.io/grimoirelab/; @http://ceur-ws.org/Vol-987/3.pdf]. Bioconductor is one example of such ecosystem, that provides multiplatform and continuous delivery of contributed software while assisting a wide range of users with standardized documentation, tests, community forums, and workshops [@https://bioconductor.org; @https://bioconductor.org/checkResults/; @https://bioconductor.org/support/]. In the case of the hackathons, the R/Bioconductor ecosystem for multi-omics enabled data structures and vignettes to support reproducible, open-source, open development analysis. During this workshop, we identified key software goals needed to advance the methods and interpretation of multi-omics.

Challenge 1: data accessibility

Providing data to the scientific community is a long-standing issue. A particular challenge in our hackathons was that each data modality was characterized by a different collection of features from possibly non-overlapping collections of samples (see common challenges section). Thus, common data structures are needed to store and operate on these data collections, and support data dissemination with robust metadata and implementation of analytical frameworks.

The MultiAssayExperiment integrative data class from Bioconductor was our class of choice to enable the collation of standard data formats, easy data access, and processing. It uses the S4 object-oriented structure in R [@doi:10.18129/B9.bioc.MultiAssayExperiment; @doi:10.1158/0008-5472.CAN-17-0344] and includes several features to support multi-platform genomics data analysis, to store features from multiple data modalities (e.g. gene expression units from scRNA-seq and protein units in sc-proteomics) from either the same or distinct cells, biological specimen of origin, or from multiple dimensions (e.g. spatial coordinates, locations of eQTLs). This class also enables to store sample metadata (e.g. study, center, phenotype, perturbation) and provides a map between the datasets from different assays for downstream analysis.

In our hackathons, pre-processing steps applied to the raw data were fully documented. The input data were stored as MultiAssayExperiment objects that were centrally managed and hosted on ExperimentHub[@doi:10.18129/B9.bioc.ExperimentHub] as a starting point for all analyses. The SingleCellMultiModal package was used to query the relevant datasets for each analysis [doi:10.18129/B9.bioc.SingleCellMultiModal] (Figure {@fig:spatialExpt}). Text-based machine-readable data were also made available for non-R users, and also to facilitate alternative data preprocessing for participants.

Besides efficient data storage, several hackathon contributors used the MultiAssayExperiment class to implement further data processing and extraction of spatial information from raster objects in their analyses. This infrastructure was readily suitable for the spatial and scNMT-seq hackathons but the lack of overlap between samples in the spatial proteomics hackathon revealed an important area of future work to link biologically related datasets without direct feature or sample mappings for multi-omics analysis. Further, our hackathons highlighted the need for scalability of storing and efficiently retrieving single-cell data datasets [@doi:10.18129/B9.bioc.DelayedArray; @doi:10.18129/B9.bioc.rhdf5]. New algorithms are emerging, that allow for data to be stored in memory or on disk (e.g. [@doi:10.1101/2020.05.27.119438; @doi:10.18129/B9.bioc.mbkmeans] in R or [@doi:10.1186/s13059-017-1382-0] in Python).

{#fig:spatialExpt width="50%"}

Caption figure: A Software infrastructure using Bioconductor for the first hackathon to combine seqFISH-based SpatialExperiment and SingleCellExperiment instances into a MultiAssayExperiment. B To combine these two different experiments, the seqFISH data were stored into a SpatialExperiment S4 class object, while the scRNA-seq data were stored into a SingleCellExperiment class object [@doi:10.18129/B9.bioc.SingleCellExperiment]. These objects were then stored into a MultiAssayExperiment class object and released with the SingleCellMultiModal Bioconductor package [@doi:10.18129/B9.bioc.SingleCellMultiModal].

Challenge 2: software infrastructure to handle assay-specific features

The hackathons further highlighted emerging challenges to handle different data modalities.

RNA-seq has well-defined units and IDs (e.g., transcript names), but other assays need to be summarized at different genomic scales (e.g., gene promoters, exons, introns, or gene bodies), as was highlighted in the scNMT-seq hackathon. Tools such as the GenomicRanges R package [@doi:10.18129/B9.bioc.GenomicRanges] have been proposed to compute summaries at different scales and overlaps between signal (e.g., ATAC-seq peaks) and genomic annotation.

Further, the observations of different modalities may not be directly comparable: for instance, gene expression may be measured from individual cells in single-cell RNA-seq, but spatial transcriptomics may have a finer (sub-cellular) or coarser (multi-cellular) resolution. Methods such as SPOTlight [@doi:10.1101/2020.06.03.131334] can be used to deconvolute multi-cellular spots signal.

Finally, in the absence of universal standards, the metadata available may vary from modalities, or independent studies (e.g. spatial proteomics), thus urging the need from the computational biology community to define the minimum set of metadata variables necessary for each assay, as well as for pairs of assays to be comparable for common analyses.

Challenge 3: accessible vizualization

Our brainstorm discussions on the Data Interpretation Challenge highlighted the importance of novel data visualization strategies to make sens of multi-modal data analyses. Often, these visualization strategies rely on heatmaps or reduced dimension plots, and utilize color to represent the different dimensions. These colors and low dimensional plots facilitate pattern detection and interpretation of increasingly complex and rich data. However, relying on color for interpretation leads to difficulties in perceiving patterns for a substantial proportion of the population with color vision deficiencies and can result in different data interpretations between individuals.

Presenting accessible scientific information requires the inclusion of colorblind friendly visualizations [@https://doi.org/10.1038/nmeth.1618; @doi:10.1038/nmeth0810-573] standardized as default settings through use of color palettes such as R/viridis [@https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html] and dittoSeq [@doi:10.18129/B9.bioc.dittoSeq] with a limit of 10 colors. Additional visual cues to differentiate regions or cells can also reduce the dependence on colors using hatched areas or point shapes. The inclusion an "accessibility caption" accompanying figures which to guide the reader's perception of the images would also greatly benefit broader data accessibility. Thus, implementing community standards for accessible visualizations is essential for bioinformatics software communities to ensure standardized interpretation of multi-platform single-cell data.