Ocelli: an open-source tool for the analysis and visualization of developmental multimodal single-cell data

Introduction

Multimodal data comprises distinct feature sets (called modalities or views) acquired by different detectors describing the same object [1, 2]. For example, a video includes modalities such as a video recording, an audio track, and often text captions. Genome-wide single-cell technologies [3] can nowadays simultaneously measure multiple modalities from the same cell, including (i) transcriptome and chromatin accessibility [4–7]; (ii) transcriptome and DNA methylation [8]; (iii) transcriptome and chromatin proteins or histone modifications [9, 10]; (iv) transcriptome and protein epitope [11]; (v) transcriptome, chromatin accessibility, and protein epitope [12]; (vi) multiple histone modifications and DNA-bound proteins [13–17]; and (vii) open and closed chromatin [18].

Each single-cell data modality is high dimensional. The number of measured features spans from hundreds in the case of protein epitopes to hundreds of thousands for chromatin-accessible sites. The multiple modalities profiled from the same single cell correspond to consecutive stages of gene expression, from its regulation by modifying chromatin architecture and engaging transcription-initiation proteins to the synthesis of mRNA and protein molecules. For example, chromatin becomes accessible before gene expression commences, resulting in temporal coherence between modalities. Thus, all modalities need to be modeled simultaneously to visualize the developmental progression or activation of cells and lineage branching accurately.

The inherent feature of data generated by single-cell technologies is a high level of technical noise resulting in data sparsity due to under-sampling of features from the same cell [19]. Additionally, different modalities have profoundly distinct statistical properties, which makes it challenging to find a joint representative data manifold. Modeling developmental processes is an additional challenge as cellular differentiation is a continuous and nonlinear process [20]. The origin of non-linearity stems from the regulation of gene expression, where even the most straightforward regulation by a single transcription factor in an open-loop scenario results in nonlinear transcriptional kinetics that generates non-trivial gene expression patterns [21].

Multiple methods have been developed to deal with high-throughput high-dimensional unimodal single-cell data to study developmental processes. For example, Monocle [22] leverages a cellular minimum spanning tree (or DDRTree [23] in Monocle 2 and 3 [24]) to find dimensionally reduced gene expression space; Velocyto [25] and scVelo [26] model cellular transitions from transcriptional kinetics (named RNA velocity), leveraging the relationship between unspliced and spliced mRNA molecules within cells; MultiVelo [27] models epigenome–transcriptome interactions to predict cell fates; Waddington-OT [28] employs optimal transport theory to find temporal couplings of differentiating cells; diffusion maps [20, 29–32] embed cells onto a low-dimensional manifold by eigendecomposing the diffusion operator. The significant advantages of diffusion maps are their robustness to noise, high sensitivity to finding rare and transient cell states [28], and ability to preserve nonlinear local and global structures, together with a continuum of developmental cell transitions. The structure and continuum are preserved because diffusion maps utilize a distance metric (diffusion time) relevant to differentiation processes [20, 31].

Recently, several methods have been developed to generate low-dimensional joint embeddings from multimodal single-cell data. The methods can generally be divided into deep learning and probabilistic categories. Deep learning [33] is the dominant branch of machine learning that gained popularity due to the increasing computational capabilities of modern computers [34]. While the variety of deep learning models is enormous, autoencoder architecture [35] has become the leading tool for visualizing multimodal single-cell data [36–39]. Autoencoder is a neural network that encodes and compresses complex, high-dimensional data. Its objective is to learn and perform two functions: an encoder that transforms the input data and a decoder that reconstructs the input data from the encoded representation. The performance of an autoencoder is measured by comparing input and reconstructed data - with the goal of them being similar. Autoencoders proved to be a potent model architecture for problems such as dimension reduction [33], drug discovery [40], and image processing [41, 42]. The advantage of deep learning methods is their theoretical ability to learn very complex functions [43]. However, in practice, they are prone to underfitting or overfitting. Additionally, they are computationally expensive and require extensive data sizes to converge. Probabilistic methods for finding joint multimodal latent spaces from single-cell data include Weighted Nearest Neighbors [44] (WNN), and Multi-Omics Factor Analysis v2 [45] (MOFA+). WNN integrates modalities by constructing a weighted nearest-neighbor graph based on a weighted average of modality similarities. The graph is then visualized with UMAP [46]. MOFA+ [45] factorizes multimodal single-cell data to latent factors and corresponding feature weights. It uses stochastic variational inference [47, 48] for scalability of the parameter inference. Other approaches based on integrative methods exist, such as MOJITOO [49] that uses canonical correlation analysis to find low-dimensional multimodal data representation.

Here, we introduce Ocelli, a new multi-part diffusion-based analysis and visualization strategy for developmental multimodal data. Ocelli leverages topic modeling and novel multimodal diffusion maps (MDM) to reduce dimensionality and to preserve the ordering, continuity, and branching of developmental trajectories. Explicit incorporation of modality-specific transition probabilities allows us to generate more informative developmental data visualization. Moreover, multimodal diffusion-based modeling provides an efficient solution to deal with such caveats of single-cell genomics data as sparsity of measured features. We show that Ocelli facilitates the extraction and analysis of developmental subtrajectories from single-cell atlases. In addition, Ocelli can be leveraged to investigate a broad spectrum of biological processes involving cell state transitions, such as disease onset and progression or cellular responses to perturbations. Notably, Ocelli was designed to work on an arbitrary number and type of modalities and optimized to handle the scalability of current single-cell profiling technologies. To ensure scalable performance, Ocelli’s core functionalities are implemented using Ray [50]. Ocelli is available as an open-source software package at https://github.com/TabakaLab/ocelli along with tutorials and documentation at https://ocelli.readthedocs.io.

Materials and methods

Basic notation

We use the following notation. Matrices and vectors are written in bold. Superscript refers to modality, and the subscript to coordinates. As a result, |$\mathcal {M} = \mathbf {M}^{1:M}$| represents multimodal data consisting of M modalities, where modality |$\mathbf {M}^m \in \mathbb {R}^{C, F^m}$| comprises C cells and F^m features. |$\mathbf {M}^m_{c, f}$| is an activity value of feature f from cell c in modality m.

Multimodal weights

We model cell differentiation as a process of cellular movement in the latent space of single-cell features. According to such an interpretation, expanding distances in the cell’s neighborhood indicate a developmental process. Inspired by previous work on WNN [44], we quantify this expansion using multimodal weights, i.e. cell-specific distributions over modalities, assessing the likelihood of a developmental process occurring in modalities. The values of the weights are computed by comparing distances from a cell to its cross-modality nearest neighbors.

The computation of multimodal weights for cell c proceeds as follows. Let |$\mathbf {d}_c^{m,l}$| be a vector of distances between cell c and its nearest neighbors from modality l in the feature space of modality m. Intuitively, |$\mathbf {d}_c^{m,l}$| encodes cross-modality behavior of cell c neighborhood from modality l in modality m. We normalize cross-modality distances using modality-specific empirical cumulative distribution functions (ECDFs) of distances between cells. ECDF of modality m, denoted ECDF^m, scales distances to a [0,1] range according to the distribution of distances between cells in modality m. We estimate ECDFs by uniformly sampling P pairs of cell embeddings from each modality and calculating distances between them. The value of P depends on the total number of cells; however, usually, P = 1000 is enough to generate robust results. Scaled distances, denoted |$\text{ECDF}^m(\mathbf {d}_c^{m,l})$|⁠, measure the expansion of the neighborhood of cell c from modality l in modality m, with respect to the global structure of modality m. Next, we score each modality as a sum over the remaining modalities of ECDF-scaled median distances. |$s_c^m$| is a score for modality m in cell c.

$$\begin{eqnarray*} s_c^m = \sum _{l=1:M, l \ne m}\text{median}(\text{ECDF}^m(\mathbf {d}_c^{m,l})) \end{eqnarray*}$$

(1)

Then, we smoothen the |$s_c^m$| scores by averaging them over the nearest neighbors of cell c from the highest-scoring modality. Lastly, we normalize updated scores using a softmax function with the parameter α. The value of α should be greater or equal to 1; we used α = 10. Normalized scores produce cell-specific multimodal weights |$\mathbf {w}_c^m$|⁠, which sum to 1 for each cell.

$$\begin{eqnarray*} \mathbf {w}_c^m = \frac{e^{\alpha s_c^m }}{\sum _{l = 1:M} e^{\alpha s_c^l } } \end{eqnarray*}$$

(2)

Algorithm description

The algorithm is a multimodal generalization of the diffusion maps [20, 29]. The underlying idea is that high-affinity cells, i.e. cells close to each other in the single-cell feature space, are phenotypically closely related, and the direction in this space describes the differentiation trajectory. Diffusion maps model the resulting probabilistic diffusion process as a Markov chain, which is then eigendecomposed into a low-dimensional embedding. Multimodal diffusion maps (MDM) create a multimodal Markov chain (MMC) that extends this approach to multimodal data.

MDM models multimodal data |$\mathcal {M}$| as a diffusion process. Firstly, MDM separately computes the affinities between cells and their nearest neighbors for each modality M. This step is performed using a density-adjusted symmetric Gaussian kernel κ

$$\begin{eqnarray*} \kappa (\mathbf {M}, c_1, c_2) = \exp {(-\frac{d(\mathbf {M}_{c_1, :}, \mathbf {M}_{c_2, :})^2}{\varepsilon _{c_1} \varepsilon _{c_2}})} \end{eqnarray*}$$

(3)

Equation (3) shows a formula for cell affinity κ between cells c₁ and c₂ in modality M. d( ·, ·) is the Euclidean metric and ϵ_c denotes a distance from cell c to its N^th neighbor to account for local data density (usually N = 20 is enough for robust results).

MDM models each modality as an unimodal Markov chain. The diffusion process of modality m is represented by a square matrix |$\mathbf {K}^m \in \mathbb {R}^{C, C}$|⁠. |$\mathbf {K}^m_{c_1, c_2}$| is a weighted affinity between cells c₁ and c₂.

$$\begin{eqnarray*} \mathbf {K}^m_{c_1, c_2} = \mathbf {K}^m_{c_2, c_1} = \frac{\mathbf {w}_{c_1}^m + \mathbf {w}_{c_2}^m}{2} \kappa (\mathbf {M}^m, c_1, c_2) \end{eqnarray*}$$

(4)

Next, MDM constructs the multimodal diffusion process K by element-wise summing weighted and row-normalized matrices K^{1: M}. The normalization is conducted using diagonal matrices D^m such that |$\mathbf {D}^m_{c, c} = (\sum _{i=1:C} \mathbf {K}^m_{c, i})^{-1}$|⁠.

$$\begin{eqnarray*} \mathbf {K}= \sum _{m=1:M} \mathbf {D}^m\mathbf {K}^m \end{eqnarray*}$$

(5)

For efficient eigendecomposition to eigenvectors and eigenvalues, MDM converts K to a matrix |$\widehat{\mathbf {K}}$| that has the following properties: (i) symmetric (⁠|$\widehat{\mathbf {K}}= \widehat{\mathbf {K}}^T$|⁠), (ii) positive semi-definite (⁠|$\forall _{\mathbf {v}\in \mathbb {R}^C} \mathbf {v}^T \widehat{\mathbf {K}}\mathbf {v}\ge 0$|⁠), and (iii) non-negative (⁠|$\widehat{\mathbf {K}}\ge 0$|⁠). These properties guarantee |$\widehat{\mathbf {K}}$| to be a Hermitian matrix with real and non-negative eigenvalues. MDM row-normalizes K (using a diagonal row-normalization matrix D such that D_{c, c} = (∑_{i = 1: C}K_{c, i})⁻¹) and then symmetrizes it along the diagonal to meet the conditions mentioned above.

$$\begin{eqnarray*} \widehat{\mathbf {K}}= \mathbf {D}\mathbf {K}+ (\mathbf {D}\mathbf {K})^T \end{eqnarray*}$$

(6)

MDM eigendecomposes |$\widehat{\mathbf {K}}$| to create the low-dimensional embedding of multimodal data |$\mathcal {M}$|⁠. Eigendecomposition factorizes |$\widehat{\mathbf {K}}$| to a product of the eigenvector matrix Q (where Q_{:, i} is the i^th eigenvector), the eigenvalue vector Λ (where Λ_i is the i^th eigenvalue), and Q⁻¹.

$$\begin{eqnarray*} \widehat{\mathbf {K}} = \mathbf {Q}\mathbf {\Lambda }\mathbf {Q}^{-1} \end{eqnarray*}$$

(7)

Let |$\widehat{\mathbf {\Lambda }}$| be the sorted eigenvalue vector with a decreasing order and |$\widehat{\mathbf {Q}}$| the eigenvector matrix with columns sorted accordingly. The N-dimensional MDM embedding of multimodal data |$\mathcal {M}$|⁠, denoted |$\text{MDM}(\mathcal {M})$|⁠, is defined as follows.

$$\begin{eqnarray*} \text{MDM}(\mathcal {M}) = [\widehat{\mathbf {\Lambda }}_2 \widehat{\mathbf {Q}}_{:, 2}, \widehat{\mathbf {\Lambda }}_3 \widehat{\mathbf {Q}}_{:, 3}, \dots , \widehat{\mathbf {\Lambda }}_{N+1} \widehat{\mathbf {Q}}_{:, N+1}] \in \mathbb {R}^{C,N} \end{eqnarray*}$$

(8)

Ocelli employs scikit-learn [51], an ARPACK [52] wrapper, to compute the top N + E + 1 eigenvalues and corresponding eigenvectors. The quality of the computed eigenvectors may degrade as eigenvalues decrease; hence, E more than necessary eigenvectors are computed and discarded; we use E = 10. MDM ignores |$\widehat{\mathbf {\Lambda }}_1$|⁠, the largest eigenvalue, as it is non-informative in diffusion maps to recover developmental processes [20].

Joint visualization of MDM components

MDM-generated low-dimensional embedding of multimodal data can be visualized in 2 or 3 dimensions using any dimension reduction algorithm. Ocelli has built-in wrappers for UMAP (umap-learn v0.5.3) [46] and ForceAtlas2 (v1.0.3) [53], which are recommended for well-clustered (e.g., postmitotic cells) or well-connected data (e.g., differentiating cells), respectively.

ForceAtlas2 is an algorithm for force-directed graph visualization. Ocelli provides three methods for converting MDM embeddings into graphs: (i) the nearest neighbors-based graph connects the nearest neighbors in the MDM latent space; (ii) the transitions-based graph connects a cell to its nearest neighbors in the MDM latent space with the highest developmental transition probabilities, e.g., from RNA velocity [25, 26], MultiVelo [27], or Waddington-OT [28]; (iii) in the transitions-based graph, if additional information about cell timestamps is available, the nearest neighbors in the MDM latent space are selected only among cells from the subsequent timestamp (except cells from a terminal timestamp, when nearest neighbors are found in the same timestamp). MDM latent space captures information about the global structure of the data. Involving transitions between cells orders them locally in the visualization. A constructed graph has a predefined number of edges coming out of each node that is constant across all cells. If fewer cells have non-zero transitions in a cell’s MDM neighborhood, the graph edges link to the unconnected MDM nearest neighbors.

Single-modality data exploration

Ocelli uses topic modeling to employ the multimodal analysis workflow to explore unimodal single-cell data. Ocelli splits data into latent single-cell modalities using latent Dirichlet allocation (LDA) [54], a probabilistic generative algorithm for learning relationships (called topics) between features. LDA models each cell as a multinomial mixture of topics, represented as distributions over features. Ocelli groups single-cell features according to their most probable topics and leaves only the top features according to LDA’s feature-topic distribution. The constructed groups form topic-based latent modalities.

Ocelli can visualize unimodal single-cell data by (i) constructing latent modalities using topic modeling and (ii) training MDM on latent modalities with LDA’s cell-topic multinomial distribution used as weights. This procedure reduces the inherent noise of single-cell data as we train MDM only on topic-specific features, and the impact of each latent modality is cell-specific using LDA’s cell-topic distribution for MDM weights.

Diffusion-based multimodal imputation

Diffusion-based models can impute values of sparse single-cell features [19] by iteratively applying a diffusion operator to a feature count matrix G. For example, MDM produces a multimodal diffusion operator |$\widehat{\mathbf {K}}$|⁠. The t-step imputation proceeds as follows.

$$\begin{eqnarray*} \mathbf {G}_\text{imputed} = \widehat{\mathbf {K}}^t\mathbf {G} \end{eqnarray*}$$

(9)

Eigendecomposition of |$\widehat{\mathbf {K}}$| from equation (7) greatly simplifies running multiple diffusion steps.

$$\begin{eqnarray*} \mathbf {G}_\text{imputed} & = & (\mathbf {Q}\mathbf {\Lambda }\mathbf {Q}^{-1})^t \mathbf {G} \nonumber\\ && = \overbrace{(\mathbf {Q}\mathbf {\Lambda }\mathbf {Q}^{-1}) \dots (\mathbf {Q}\mathbf {\Lambda }\mathbf {Q}^{-1})}^t \mathbf {G}\nonumber\\ && = \mathbf {Q}\mathbf {\Lambda }^t \mathbf {Q}^{-1} \mathbf {G} \end{eqnarray*}$$

(10)

The only difficulty with equation (10), the computation of the inverse matrix Q⁻¹, can be evaded. |$\widehat{\mathbf {K}}$| is symmetric, and ARPACK [52] produces unit-norm eigenvectors, making the eigenvector matrix Q orthonormal. Since the inverse of an orthonormal matrix equals its transpose, we derived an efficient imputation procedure with a constant complexity as the number of steps t varies.

$$\begin{eqnarray*} \mathbf {G}_\text{imputed} & =& \mathbf {Q}\mathbf {\Lambda }^t \mathbf {Q}^{-1} \mathbf {G}\nonumber\\ && = \mathbf {Q}\mathbf {\Lambda }^t \mathbf {Q}^{\text{T}} \mathbf {G} \end{eqnarray*}$$

(11)

Ocelli imputes values of sparse multimodal single-cell features with an approximated version of the abovementioned procedure. It uses MDM-generated eigenvectors and eigenvalues in place of Q and Λ, respectively.

We investigated the computational time of Ocelli’s imputation on a 16-CPU machine in two experiments with t = 1 imputation steps (complexity is constant as t changes). (i) We fixed the number of features to 1000 and generated sparse matrices with 10 000, 50 000, 100 000, 500 000, and 1 000 000 cells. (ii) We fixed the number of cells to 50 000 and generated sparse matrices with 1, 10, 100, 1000, and 10 000 features. Values of matrices were sampled from a binomial distribution that samples 1 with probability p = 0.2 and 0 otherwise. We generated 50 eigenvalues and corresponding eigenvectors from a uniform distribution defined on a unit segment [0, 1].

Benchmarking simulated data

We created and investigated two simulated developmental datasets. The binary tree dataset has three modalities and 6000 observations grouped into six types (A-F). Modalities 1, 2, and 3 introduce developmental lineages A–B, C–D, and E–F. We sampled observations from three-dimensional (3D) Gaussian distributions and then downsampled them to 1000 observations per type with probabilities accounting for local densities. Equation (12) defines the distribution behind the downsampling procedure. P(x) is a probability of observation x being downsampled and NN_N(x) is the N^th nearest neighbor of observation x. We used N = 25.

$$\begin{eqnarray*} P(x) \propto d(x, \text{NN}_{N}(x))^3 \end{eqnarray*}$$

(12)

The developmental process was introduced to the dataset by assigning order to observations (pseudotime) based on the distance to manually selected developmental starting points. Then, we added noise by permuting every 100 cells along the pseudotime. During the multimodal analysis with Ocelli, we computed an MDM embedding (parameters: 10 dimensions, 20 nearest neighbors) and visualized it with ForceAtlas2 as a 30-edge-per-node nearest neighbor graph.

The rare transitions dataset has 4500 observations grouped into nine types (A-I). Modalities 1 and 2 introduce developmental lineages A–C and D–I, respectively. Observations A-C in the first modality were sampled uniformly from 3D segments. The remaining observations originated from Gaussian distributions as in the binary tree dataset but resulted in 500 observations per type after downsampling. Pseudotime and noise were introduced analogically, as in the binary tree dataset. During the multimodal analysis with Ocelli, we computed an MDM embedding (parameters: 20 dimensions, 60 nearest neighbors) and visualized it with ForceAtlas2 as a 60-edge-per-node nearest neighbor graph.

We evaluated the quality of multimodal weights computed with Ocelli and WNN [44]. Weights from both methods were calculated using 60 nearest neighbors and compared to target weights using the mean squared error (MSE). Target modality weights equal 1 for lineages introduced in the modality and 0 for the remaining lineages.

Benchmarking scalability

Scalability is a significant challenge for emerging computational methods in high-throughput single-cell genomics. We tested the scalability of Cobolt (v0.0.1) [36], Matilda (the only, unnumbered release) [38], MIRA (mira-multiome v2.1.1) [55], MOFA+ (mofapy2 v0.7.0 and mofax v0.3.6) [45], MOJITOO (v1.0) [49], MultiVI (scvi-tools v0.19.0) [39], Ocelli (v0.1.1), scMM (v1.0.0) [37], and WNN (Seurat v5.0.3) [44] methods.

Benchmarking was conducted on matrices generated from the hair follicle SHARE-seq dataset [6] in the following steps. (i) We sampled 10 000, 50 000, and 100 000 cells from the dataset, (ii) we selected 1000 highly variable RNA-seq genes, (iii) we selected 1000 highly variable ATAC-seq genes from the gene activity matrix (generated using Signac [56] by mapping chromatin accessibility fragments to gene regions, including the region 2kb upstream from the promoter), (iv) we selected 1000 highly variable ATAC-seq peaks, (v) we modified counts of each cell by randomly adding single counts to 10 of its features per modality to avoid cells with the same levels of features. We used a peak count matrix for methods that require or recommend peaks instead of gene activities. Data has been preprocessed using Scanpy (v1.9.6) [57]. We ran all methods on a 16-CPU machine, with an additional NVIDIA Tesla T4 GPU if the method requires GPU-based computation. We measured only the training time of each method, excluding data preprocessing and downstream visualization. We employed the following training procedures: (Cobolt) 1000 RNA-seq genes and 1000 ATAC-seq peaks trained on GPU for 100 epochs; (Matilda) 1000 RNA-seq genes and 1000 ATAC-seq gene activities trained on GPU for 100 epochs; (MIRA) 1000 RNA-seq genes and 1000 ATAC-seq peaks trained on GPU; (MOFA+) 1000 RNA-seq genes and 1000 ATAC-seq gene activities trained on GPU in the most accurate convergence mode; (MOJITOO) 1000 RNA-seq genes and 1000 ATAC-seq gene activities trained without GPU; (MultiVI) 1000 RNA-seq genes and 1000 ATAC-seq peaks trained on GPU for 100 epochs; (Ocelli) 1000 RNA-seq genes and 1000 ATAC-seq gene activities trained without GPU on exact (ENNs, computed with scikit-learn v1.0.2 [51]) and approximate (ANNs, computed with nmslib v2.1.1 [58] nearest neighbors; (scMM) 1000 RNA-seq genes and 1000 ATAC-seq peaks trained on GPU for 100 epochs; (WNN) 1000 RNA-seq genes and 1000 ATAC-seq gene activities trained without GPU. We specified the dimensionality of the resulting latent space to 20 if applicable.

The reported times include only the execution time of the algorithm, excluding preprocessing steps. However, we acknowledge that preprocessing also contributes to overall runtime. Here, we report topic modeling training times. We trained 20 LDA topics on the scalability benchmark training data using 16 parallel jobs for 30 iterations (Ocelli’s default setting, while scikit-learn’s default number of iterations is 10). The approximate preprocessing times, rounded to the nearest second, are as follows. For 10 000 cells, the LDA training time was 6 seconds for 1000 RNA-seq genes, 5 s for 1000 ATAC-seq gene activities, and 5 s for 1000 ATAC-seq peaks. For 50 000 cells, the LDA training time was 19 seconds for 1000 RNA-seq genes, 18 s for 1000 ATAC-seq gene activities, and 22 s for 1000 ATAC-seq peaks. For 100 000 cells, the LDA training time was 39 s for 1000 RNA-seq genes, 33 s for 1000 ATAC-seq gene activities, and 34 s for 1000 ATAC-seq peaks.

To provide a more accurate estimate of preprocessing time, we also report LDA training times for the analyzed datasets. Here, we trained 20 LDA topics for 30 iterations using 50 parallel jobs unless otherwise specified. For the hair follicle SHARE-seq dataset, which includes 7160 cells, training took 15 s for 6731 RNA genes and 1 min 36 s for 17 495 ATAC gene activities. In the human bone marrow ASAP-seq dataset of 10 927 cells, the training time was 1 min 35 s for 3000 ATAC gene activities and 14 s for 238 proteins. For the human bone marrow NTT-seq dataset with 5236 cells, the process took 33 s for 43,170 H3K27me3 peaks and 26 s for 71 253 H3K27ac peaks. The peripheral blood mononuclear cells (PBMCs) DOGMA-seq dataset, containing 7516 cells, required 3 min 18 s for 84 846 ATAC peaks, 1 min 23 s for 36 601 RNA genes, and 7 s for 210 proteins. In the pancreatic endocrinogenesis RNA-seq dataset of 3606 cells, the training time for 5465 genes was 36 s. For larger datasets, the BMMCs SHARE-seq dataset, which includes 842 440 cells and was processed using 100 jobs, took 44 min 2 s for 33 721 RNA genes and 49 min 57 s for 189 259 ATAC peaks. Similarly, in the iPSCs reprogramming dataset with 68 703 cells, also run on 100 jobs, training required 12 min 18 s for 16 817 RNA genes.

Benchmarking cohesiveness of signatures

RNA-seq signatures and ATAC-seq motifs (called together ‘signatures’ in this section) are a powerful aid in understanding the biology behind single-cell data. This section describes a benchmark evaluating the coherency of embedded cells with similar signatures in each modality. We define the activity of a signature in a cell as a mean z-score of gene expression for RNA-seq signatures or a motif variability score for ATAC-seq motifs. The cohesiveness benchmark is based on two criteria, as follows. (i) Close neighborhoods of highly signature-active cells should also be signature-active. We select 5 % of cells with the highest activity and compute a score ψ_c for each selected cell c defined as a mean signature activity score among its n = 3 nearest neighbors. (ii) Signature-active cells should be in a cohesive embedding region; disconnected high-activity clusters may indicate a disorganized embedding. We employ a density-based clustering algorithm DBSCAN [59] to establish a connective structure among the 5% of most active cells. DBSCAN creates a graph |$\mathcal {G}$| connecting cells within ϵ distance from each other. We set ϵ equal to the embedding’s estimated median distance between cells based on 1000 random pairs of cells. We consider the median distance a fair threshold of proximity between cells.

The cohesiveness score of a signature for an embedding is calculated as follows. We compute a score Ψ_G for each connected component |$G \in \mathcal {G}$| of the connectivity graph. A lower Ψ_G value indicates a better cohesiveness within G.

$$\begin{eqnarray*} \Psi _G = \exp (-\sum _{\text{c}\in G} \psi _\text{c}) \end{eqnarray*}$$

(13)

Then, we compute a total cohesiveness score Φ of an embedding over all connected graph components |$\mathcal {G}$|⁠.

$$\begin{eqnarray*} \Phi = -\log (\sum _{G\in \mathcal {G}} \Psi _G) \end{eqnarray*}$$

(14)

Φ penalizes graphs with multiple connected components. Note that the scale is inverted; higher values indicate a better-connected and locally more consistent signature portrayal.

We examined the cohesiveness scores on the hair follicle dataset [6]. For ATAC-seq, we used the 150 most-variable motifs computed using chromVAR v0.3 [60] from the peak count matrix. For RNA-seq, we ran a community-finding Infomap [61, 62] algorithm using the igraph v1.5.0 [63] package and kept clusters with more than 10 genes, resulting in nine signatures. Cohesiveness scores were summed over all RNA-seq gene signatures for each evaluated embedding and divided by a maximum score so that the highest-scoring embedding scored 1. We analogically summed and scaled ATAC-seq motif cohesiveness scores. The final multimodal cohesiveness score is a sum of scaled RNA-seq and ATAC-seq scores with a maximum possible score of 2.

Benchmarking RNA velocity vector field alignment

The RNA velocity alignment benchmark evaluates whether low-dimensional embeddings preserve RNA velocity dynamics. In the hair follicle dataset, we scored each cell as median cosine similarity between the cell’s projected developmental direction and vectors directed towards n = 3 cells with the highest RNA transition probabilities from the selected cell. Cosine similarities show high values if an embedding orders cells along the RNA velocity vector field. The score of an embedding is a median score across all cells. We computed projected developmental directions and RNA velocity transitions with scVelo [26] on log-normalized 1000 highly variable genes.

Benchmarking RNA velocity-based terminal state separation

The terminal state separation benchmark quantifies the separation of developmental lineage fates in low-dimensional embeddings. It uses absorption probabilities, which estimate how likely each cell is to transition into each identified terminal state based on RNA velocity. In the hair follicle dataset, we identified four terminal states and their absorption probabilities with CellRank v1.5.1 [64]. Let |$\mathcal {T}$| be a set of terminal states. For each terminal state |$t\in \mathcal {T}$|⁠, we set its location as mean coordinates of 1 % of cells with the highest corresponding absorption probabilities. We score the separation of terminal state t as:

$$\begin{eqnarray*} \text{score}(n, t) = \text{ap}(n, t, t) - \sum _{t_0\in T \setminus \lbrace t\rbrace } \text{ap}(n, t, t_0) \end{eqnarray*}$$

(15)

where ap(n, t₁, t₂) is a mean absorption probability of terminal state t₁ among n nearest neighbors of the terminal state t₂. score(n, t) rewards embeddings where absorption probabilities of terminal state t dominate its neighborhood. High absorption probabilities of remaining terminal states in t’s neighborhood penalize the score. We computed scores for n ∈ {20, 40, …, 980, 1000}.

Benchmarking multimodal imputation

We tested the performance of multimodal imputation using Ocelli and MultiVI [39] on the hair follicle dataset [6]. The training data comprised 5000 highly variable RNA-seq genes and 28 501 ATAC-seq peaks (selected by removing 5% of most active peaks from 30 000 highly variable peaks). We measured imputation performance on raw and downsampled data by randomly removing p = 10%, …, 80% of counts from count matrices. Each test was performed in two variants: (i) by training a separate model for each downsampling rate p and (ii) by training a single model on raw data and using it to impute downsampled data.

To investigate the ability of multimodal imputation to recover cell–cell correlations, we devised a benchmark that measures how well the correlation matrix distinguishes between same-type and different-type cells. We score a cell–cell correlation matrix as a difference between the mean correlation of same-type cells and the mean correlation of different-type cells. The correlation matrix showing a clear cell type structure will score high, while a correlation matrix that does not distinguish between cell types will score low. When computing cell–cell correlations, we used 5000 highly variable RNA-seq genes and 10 000 highly variable ATAC-seq peaks.

To test the robustness of multimodal imputation, we tested the similarity of imputed data against non-zero raw data counts. We computed MSE between raw data non-zero counts and corresponding imputed values for 5000 highly variable RNA-seq genes and 10 000 highly variable ATAC-seq peaks. We present results as boxplots showing the distribution of MSE over features.

$$\begin{eqnarray*} \text{MSE}(\mathbf {x}, \mathbf {y}) = \frac{1}{n}\sum _{i=1}^n(x_i-y_i)^2 \end{eqnarray*}$$

(16)

Multimodal imputation can also reconstruct feature-feature correlations. We computed correlation matrices using imputed data of 5000 highly variable RNA-seq genes and 10 000 highly variable ATAC-seq peaks. We clustered features of imputed raw data into 8 hierarchical Ward clusters for each modality using the scikit-learn package.

Results

Overview of Ocelli

Ocelli is an explainable multimodal framework (Fig. 1A and “Materials and methods”) to learn a low-dimensional representation of developmental trajectories. In the data preprocessing step, we find modality-specific programs with topic modeling using LDA. LDA topics provide a highly compressed yet sensitive representation [65]. The training is fully unsupervised and flexible to any single-cell modality. Topic modeling captures each modality’s underlying thematic structure by identifying which features tend to occur together across cells. Also, topic modeling is well-suited to developmental data, as topic proportions are continuous distributions, allowing for gradual changes in modality-specific programs.

Figure 1.

Ocelli overview. (A) Schematic view of Ocelli workflow. The Ocelli core comprises multimodal dimension reduction with MDMs and data visualization functionalities. Preprocessed modalities, modeled individually as diffusion processes, are merged into a weighted MMC and then decomposed into the MDM latent space. Ocelli can visualize MDM embeddings using a force-directed layout that employs cellular transition probabilities or any other dimension reduction algorithm, e.g., UMAP. (B-C) Application of Ocelli to simulated data: the binary tree dataset with three modalities and six lineages A–F (B) and the rare transitions dataset with two modalities and nine lineages A–I (C). From top: a diagram of lineage trajectories (top), projections of modalities with lineage annotations and distribution of inferred multimodal weights (middle), an Ocelli embedding visualized using ForceAtlas2 with lineage annotations, and a developmental pseudotime progression (bottom).

In the second part of the model, we compute cell-specific modality weights that determine the importance of each modality for each cell from the topic latent representations. This approach resembles the WNN [44] method. However, we compute cell-cell affinities using ECDFs of distances between cells. Their values encode distances with respect to the global structure of the topic latent space. Having multimodal weights, we construct the MMC as a weighted sum of the unimodal affinities between cells. Multimodal weights quantify how informative modalities’ affinities are about the underlying developmental process. Next, we determine the latent space of MDM by factoring the MMC into eigenvectors and eigenvalues. This step further reduces the dimensionality of features, denoises the data, and allows us to find underlying developmental processes and cell ordering from the data. Then, we build the nearest neighbor cell-cell graph in the space of MDM components.

Finally, to obtain a latent space that reflects all modalities and additional information about cell fates, we leverage modality-inferred or -measured transition probabilities between cells to correct the cell-cell graph created from MDM. The transition probabilities between cells can be inferred from, e.g., RNA velocity [25, 26], optimal transport [28] algorithm, or determined from lineage-tracing experiments [66]. We modify the MDM nearest neighbors graph by rewiring edges to connect cells with the highest transition probabilities (see Methods for details). This step is used only for cells with determined transitions. The corrected cell-cell graph is visualized in 2D or 3D as a Force-directed Layout Embedding (FLE) [28] with ForceAtlas2 [53, 67] or Uniform Manifold Approximation and Projection (UMAP) [46].

Modeling differentiation with Ocelli

First, we validated the analysis strategy of Ocelli in inferring developmental progression and lineage branching of cells using simulated data. We modeled the developmental process as a binary tree composed of three modalities with two lineages branching in each modality (Fig. 1B). Ocelli identified the importance of each modality, reconstructed the branching points, and preserved the continuity of developmental trajectories with high accuracy. Next, we considered a developmental process with transient developmental transitions in the first modality (Fig. 1C, lineages A–C) followed by lineage branching in the second modality (lineages D-I). Ocelli displayed remarkable sensitivity in reconstructing transient developmental transitions and lineage branching (Fig. 1C). Finally, we compared the modality weights distributions and visualizations generated from simulated multimodal data between Ocelli and WNN, a direct competitor graph-based method (Supplementary Fig. S1). Comparison of distributions of modality weights between the methods (Supplementary Fig. S1A) shows that Ocelli accurately infers the distributions (Supplementary Fig. S1B), minimizing the MSEs between ground truth and inferred modality weights. The gains in accuracy of Ocelli modality weights are further confirmed by visualizing MDM latent spaces computed using modality weights from both Ocelli and WNN (Supplementary Fig. S1C). Ocelli generally generates more coherent and compact visualizations of trajectories while preserving their continuity and recovering lineage branching (Supplementary Fig. S1D).

Having demonstrated its performance on simulated data, we then used Ocelli to infer lineage dynamics of the regenerative subset of the hair follicle (Fig. 2A) from a dataset generated from simultaneous profiling of chromatin accessibility and gene expression by the SHARE-seq method [6]. Ocelli deconvoluted lineages building the hair shaft (HS) and inner root sheath (IRS) from transit-amplifying cells (TACs) (Fig. 2B), correctly identifying the higher weighting of chromatin accessibility in progenitor cells (TACs) and of RNA in differentiated cells (Fig. 2C and D). The obtained visualization shows two populations of progenitor cells and four populations of terminal differentiated cells. To annotate these cell populations, we computed gene signature activities determined from Smart-seq2 dataset [68]. Ocelli clearly distinguished HS cells (medulla (HS-Me), cortex (HS-Co)) and IRS cells (Huxley’s layer (IRS-Hu), Henle’s layer (IRS-He)) (Fig. 2E). In addition, CellRank [64] analysis confirms four terminally differentiated states that overlap with these gene signature activities (Fig. 2F).

Figure 2.

Reconstruction of developmental pathways in the hair follicle SHARE-seq dataset, comprising chromatin accessibility and transcriptome modalities. (A) Schematic depicting the differentiation pathways in the regenerative part of hair follicle. (B) Ocelli’s visualization of the single-cell SHARE-seq data of the regenerative part of hair follicle using scVelo [26] RNA velocities. (C) Inferred chromatin accessibility weights show higher levels for progenitor TACs. In contrast, the transcriptome weights are elevated for differentiated cells. (D) A distribution of 10% of cells with the highest weights of each modality. (E) Gene signature activity [68], computed as gene expression mean z-scores, of HS: medulla (HS-Me) and cortex (HS-Co); and IRS: Huxley’s layer (IRS-Hu) and Henle’s layer (IRS-He). (F) CellRank [64] analysis detects four terminal differentiation states. Mean absorption probability shows a variety within TAC. IRS cells are more likely to develop from TAC-1 and HS cells from TAC-2.

Diffusion-based multimodal imputation

The single-cell profiling of cells inevitably generates data of high sparsity levels, resulting in 0-inflated feature values that obstruct the inference of relationships between features or cells of the same or different types. The data’s sparsity originates from inefficient feature detection in single cells ranging from 10% to 45% of expressed RNA molecules in scRNA-seq and 1–10% of accessible sites in scATAC-seq [69]. Many unimodal imputation methods have been developed in recent years to address the sparsity in single-cell data [70]. In general, the imputation methods can be categorized into three categories: (i) probabilistic ones that directly model the data sparsity (Saver [71], scImpute [72]), (ii) smooth-based methods that smooth or diffuse feature values among nearby cells (MAGIC [19], kNN-smoothing [73]), and (iii) deep learning methods that first identify latent spaces and then reconstruct the feature matrices (AutoImpute [74], scVI [75]). The extensive benchmarking [70] of scRNA-seq data imputation methods revealed that smooth-based approaches, e.g., MAGIC [19] and kNN-smoothing [73] are the top unimodal imputation methods. MAGIC uses the diffusion operator to first learn data manifold [29] and then restores the missing expression values by exponentiation of the Markov matrix to a power t and multiplying the result by original expression matrix. While the unimodal imputation methods can be bluntly applied to each modality separately, multimodal imputation has the potential to resolve more nuanced differences between cells along the differentiation axis and, thereby, improve imputations.

Here, we demonstrate several key properties of Ocelli that help impute high-dimensional signals in multimodal data in a highly efficient way. First, instead of exponentiation of the multimodal Markov matrix, which is time-consuming, we eigendecompose it (Fig. 3A). Then, the exponentiation of the Markov matrix is equivalent to the exponentiation of eigenvalues, which is of constant complexity as the number of diffusion steps t varies. It results in highly scalable and efficient imputation of unobserved data for a large number of cells (Fig. 3B, upper panel) and features (Fig. 3B, lower panel). The salient feature of imputations is the recovery of gene–gene relationships and regulatory interactions [19]. First, we tested the imputations on two canonical markers of Henle’s lineage, Krt71 and Krt73. Ocelli’s multimodal imputations recovered the correlations between the expression of the marker genes (Fig. 3C) and the chromatin accessibilities at their promoters (Fig. 3D). To further test the usability of Ocelli’s multimodal imputations, we explored the recovery of correlations between features from different modalities or among cells. In the first case, we imputed gene expression, promoter, and distal element accessibility for 50 top gene markers of clusters identified earlier as terminal states. Ocelli restored the coactivity of regulatory elements found using GREAT [76] and genes within differentiated cells (Fig. 3E). The second case involved imputing the dataset’s expression values of all genes and regulatory elements. Cell-to-cell correlation heatmaps of Ocelli-imputed RNA estimates (t = 1) (Fig. 3F) and Ocelli-imputed promoter and distal elements accessibility estimates (t = 1) (Fig. 3G) showed that multimodal imputations help to reconstruct similarities between closely related cells.

Figure 3.

Ocelli enables the reconstruction of highly sparse features of each modality using multimodal imputations. (A) The exponentiation of the multimodal affinity matrix is equivalent to eigendecomposition with exponentiated eigenvalues, significantly improving the computational time of imputations. (B) Ocelli’s runtimes for multimodal imputation for the increasing number of cells (top) and features being imputed (bottom). (C) Gene expression levels (top) and scatterplots of gene–gene relationships (bottom) for two canonical markers of the IRS Henle’s layer (Krt71 and Krt73) with different amounts of diffusion. (D) Chromatin accessibility of promoters for the markers in (C). (E) Correlations of gene expression to chromatin accessibility of promoters and distal elements before and after Ocelli imputations (t = 1) for differentiated cells. (F and G) Cell–cell correlations computed on gene expression (F) and chromatin accessibility (G) levels before and after imputation (t = 1) with Ocelli.

We examined the accuracy and robustness of multimodal imputation with Ocelli and MultiVI [39], a deep learning-based method, on the hair follicle dataset. MultiVI [39] is a leading multimodal imputation probabilistic model that leverages neural network encoders to find modality-specific latent representations that are aligned and merged into a multimodal representation. We employed multiple data sparsity levels by downsampling RNA and ATAC count matrices by up to 80% to test model sensitivity. Furthermore, each test was performed in two variants. In the first one, denoted separate models, we trained a separate model for each downsampling rate. In the second one, denoted single model, we trained a model on raw count matrices and used it to impute downsampled data. The separate model variant measures the method’s resistance to a significant increase in sparsity, while the single model variant checks the model robustness for degrading data quality. We tested the reconstruction of cell–cell relationships by examining post-imputation correlations between cells (Supplementary Fig. S2A and B). Cell–cell relationships are well-distinguished if correlations of same-type cells are noticeably higher than those of different-type cells. In Supplementary Fig. S2A, we show mean cell–cell correlations between same- and different-type cells and their differences. Ocelli showed higher differences, which translates to better recovery of cell–cell relationships. Additionally, we explored correlations between modality features (Supplementary Fig. S2C and D). Ocelli showed high robustness when comparing non-zero counts from raw data and imputed values using MSE (Supplementary Fig. S2C). As a result, Ocelli’s multimodal imputation is a powerful tool for reducing count sparsity, preserving pre-imputation signal, and reconstructing cell–cell and feature–feature relationships.

Benchmarking analysis

We further performed extensive benchmarking of Ocelli against other recently developed methods for single-cell multimodal data analysis and visualization regarding computational runtime and quality of the generated data embeddings. For the latter benchmarking analyses, we wanted to explore the quality of the visualization of developmental single-cell data focusing on the regenerative part of hair follicle (Fig. 2). While some benchmarking strategies have been developed for assessing analysis and visualization of well-clustered data, such as silhouette score or adjusted Rand index score [49], we designed new benchmarks related to the visualization of developmental multimodal data and transitions between cells (Fig. 4, Supplementary Figs S3, S4), “Materials and methods”). Ocelli was benchmarked against (i) graph-based methods—WNN [44], MIRA [55]; (ii) statistical framework for multimodal factor analysis, MOFA+ [45]; (iii) deep learning-based methods - Cobolt [36], scMM [37], Matilda [38], MultiVI [39]; and (iv) canonical correlation analysis method MOJITOO [49] (Supplementary Fig. S3A, “Materials and methods”). Ocelli is flexible and can accommodate transition probabilties generated using any method. Here, we explored two velocity dynamics strategies by running Ocelli with transition probabilities from scVelo [26] and MultiVelo [27].

Figure 4.

Benchmarking Ocelli against competing multimodal dimensionality reduction methods on a SHARE-seq dataset of the regenerative part of hair follicle. (A) Ocelli improves computation time compared to state-of-the-art methods. (B) The cohesiveness score quantifies the tendency of grouping cells with similar transcriptomic and chromatin accessibility signatures. Bar plots compile cohesiveness scores computed over multiple parameters for each method. The left bar plot shows the highest scores achieved by each method. The right bar plot shows the variability of each method’s scores across tested parameter values. (C) The vector field score measures the alignment of an embedding and the RNA velocity vector field. The score is the median cosine similarity between the cell’s projected developmental direction and directions toward cells with the highest RNA velocity transitions. Bar plots compile vector field scores computed over multiple parameters for each method. The left bar plot shows the highest scores achieved by each method. The right bar plot shows the variability of each method’s scores across tested parameter values. (D) Net absorption evaluates the separation of developmental terminal states in embeddings. Net absorption of a terminal state is its mean neighborhood absorption after subtracting mean absorptions of remaining terminal states. High net absorption translates to a well-separated terminal state. Plots show best-performing models over hyperparameter search for each method. (E) Compilation of benchmark scores.

First, Ocelli exhibits improved computational runtimes compared to other competing multimodal methods (Fig. 4A). For example, reconstructing a joint embedding of 100 000 cells took Ocelli less than 3 min, outperforming all competing multimodal methods. We used GPU-accelerated computations for MOFA+ [45], MIRA [55] and deep learning methods and CPUs for others for these speed tests. We excluded preprocessing steps and used the same data dimensions to perform fair comparisons between the methods (“Materials and methods”). For Ocelli, we determined the runtimes for exact (ENN) and approximate (ANN) approaches of the computation of nearest neighbors.

Next, we assessed how the aggregate signal of related features in each modality is displaced on the embedding. We used gene signature scores for the RNA modality. In the case of the ATAC modality, we leveraged transcription factor motif scores (“Materials and methods”). A high-quality, developmental data embedding should group cells of similar scores together. A failed embedding can have high signature scores, for example, related to the same cell identity separated, which would suggest alternative lineages. To quantify this, we introduced the signature cohesiveness score (“Materials and methods”). The evaluated embedding is dissected into a connectivity graph using the DBSCAN [59] algorithm. Then, each connected graph component is scored based on the presence of the most signature-active cells and the signature activity in their neighborhoods. The cohesiveness score favors an embedding in which most signature-active cells are grouped in a single connected graph component and if neighborhoods of highly signature-active cells are also highly signature-active. We considered different scenarios for data visualization (“Materials and methods”): (i) optimal preprocessing steps for each method suggested by developers of each method; (ii) usage of all features or initial reduction of data dimensionality by selecting highly-variable features; (iii) final data embedding on two-dimensional (2D) or in 3D using UMAP or FLE. The cumulative cohesiveness score for RNA and ATAC modality is the highest for Ocelli and 3D FLE (Fig. 4B). Ocelli generally generates the most robust visualizations in this benchmark across all methods (Supplementary Fig. S3C).

Finally, we developed two benchmarks to compare the accuracy of visualizing developmental trajectories across the methods. The first benchmark quantifies the reconstruction of temporal cell ordering and differentiation direction on the embedding in relation to the inferred transitions by the RNA velocity-based method [26]. The output of the method, RNA velocity vector field, is obtained from relative abundances of immature (unspliced) and mature (spliced) RNA molecules. Usually, it is visualized on data embeddings by arrows, depicting the cell’s projected developmental direction. Descendant cells with the highest transition probabilities should be aligned near the projected developmental direction (Fig. 4C) in case of high-quality data embedding. Across all cells, we quantified this by computing cosine similarities between the projected developmental direction and the direction toward cells with the highest transition probabilities. 2D and 3D data embeddings generated by Ocelli using FLE show the highest scores in the ordering of cells according to RNA velocity vector fields across all tested methods (Fig. 4C, Supplementary Fig. S4A). The single-cell data visualizations that scored the highest vector field score for each method (Supplementary Fig. S3) show that Ocelli generated a clear trajectory to Huxley’s layer cells. Other methods identify Huxley’s layer as a transient state to Medulla (MIRA [55], MOJITOO [49], to some degree MultiVI [39]), Cortex (MOFA+ [45], scMM [37], WNN [44]), or mix the Huxley’s layer cells with TAC-1 (Matilda [38], Cobolt [36]). While MultiVI [39] visualized Huxley’s layer cells as the final state, it failed to predict transitions to Medulla from TACs.

As a second quantitative benchmark, we determined the separability of cell lineages on generated data embeddings. Using CellRank’s [64] determined terminal states based on RNA velocity results, we tested how the cell neighborhoods of terminal states are displaced on the embedding relatively to each other (Fig. 4D). To do so, first, we determined cells’ absorption probabilities to terminal states using CellRank [64] algorithm. Then, we systematically increased the number of nearest neighbors of one terminal state and computed, among the nearest neighbors, the difference between mean absorption probabilities to this terminal state and to others (Fig. 4D, “Materials and methods”). This net absorption score is high if cell trajectories of different lineages are well separated on the embedding. We computed the net absorption scores to four terminal states of hair follicle for all tested methods (Fig. 4D, Supplementary Fig. S4B and C). Regardless of the final data embedding strategy (UMAP or FLE), Ocelli outperformed all other methods in the discussed benchmarks (Fig. 4E).

Finally, we compared the quality of the embeddings generated by Ocelli boosted with transition probabilities obtained from scVelo’ stochastic model (Fig. 2B) and Ocelli embeddings generated using MultiVelo stochastic and dynamical models (Supplementary Fig. S3B). While MultiVelo’s [27] stochastic model produced reasonable results (Fig. 4, Supplementary Figs S3, S4), scVelo [26] produced a better velocity stream, capturing all trajectories to terminal states. MultiVelo’s velocity analysis failed to capture the TAC-to-IRS Huxley’s layer transition (Supplementary Fig. S3B) despite its clear delineation by Ocelli’s diffusion maps. These results do not indicate a universal preference for scVelo [26] over MultiVelo [27]. Rather, they underscore the necessity of robust multimodal embeddings and highlight that, although often effective, current velocity algorithms may yield misleading outcomes. Consequently, users must rigorously evaluate transition probability methods to ensure accurate data interpretation. Ocelli remains fully adaptable to transitions from any method, allowing users complete flexibility and customization to analyze their dataset.

Diffusion-based multimodal exploration of subtrajectories

Building developmental single-cell atlases involves refined analysis of developmental subtrajectories from data consisting of multiple distinct lineages and cell types unrelated directly to a subtrajectory of interest [77]. Currently, such analysis entails clustering all cells, manual annotation of cell clusters, and supervised extraction of cell clusters that may belong to the studied subtrajectory. Here we show that data diffusion can facilitate the analysis of subtrajectories by unsupervised extraction of phenotypically related cells.

To illustrate multimodal diffusion-based extraction and refined reconstruction of a subtrajectory, we analyzed ASAP-seq human native hematopoietic differentiation data [12], obtained from simultaneous measurement of chromatin accessibility and protein epitopes (Fig. 5A) for 10,927 human bone marrow cells. First, consistent with the analysis results for the hair follicle dataset, Ocelli reconstructed the hematopoietic cell lineages with high accuracy, correctly assigning the highest chromatin accessibility weights to hematopoietic stem cells (HSCs), gradually decreasing with more differentiated states of cells for which protein epitope modality becomes more prevalent (Fig. 5B,C). The vast majority of cells present are CD14+ monocytes, B, and T cells, which obstructs visualization of the hematopoietic hierarchy of cells. We sought to refine the analysis of the HSCs differentiation trajectories. We selected a CD34+ positive cluster of 351 cells and assigned the uniform distribution equal to 1 for the selected cells and 0 otherwise (Fig. 5D). Using the same data diffusion strategy as in imputations (Fig. 3A), we performed diffusion to expand the initial cell set to similar cells along the multimodal affinity-based graph structure. Finally, for diffusion time t = 20, the expanded set of 1464 cells has been used as an input in Ocelli for multimodal visualization of the cells. The continuous trajectories from HSCs are readily apparent in the refined visualization of cells (Fig. 5E), which accurately reconstructed the early hematopoietic hierarchy [78, 79]. Moreover, the refined visualization shows distinct pathways of basophilic/mast cells, B cell progenitors, and cDCs that were buried in a coarse visualization of human bone marrow cells (Fig. 5A).

Figure 5.

Cell selection strategy for refined analysis of trajectories. (A) The multimodal visualization of ASAP-seq human bone marrow data with coarse annotation of hematopoietic cell lineages. (B) Violin plots show a distribution of multimodal weights in each cluster from panel (A). (C) A distribution of 15 % of cells with the highest weights. (D) Cells from the HSCs cluster were assigned 1 s (t = 0), and the diffusion was performed for different diffusion times t = 1, 10, 20. (E) Cells selected at t = 20 were reanalyzed with Ocelli. The FLE plots show refined hematopoietic trajectories with cells colored by pseudotime or epitope level of lineage-specific protein markers. HSCs, hematopoietic stem cells; E, erythroid progenitors; pDCs plasmacytoid dendritic cells; Ba, basophilic/mast cell progenitors; B, B cell progenitors; cDCs, classical dendritic cells; M, myeloid progenitors.

Ocelli performance on datasets with distinct sets of modalities

To showcase how Ocelli deals with single-cell data generated by methods profiling distinct modalities in visualizing the same developmental trajectory, we applied Ocelli to multimodal human native hematopoietic differentiation data. We have already analyzed ASAP-seq data (chromatin accessibility and protein epitopes, Fig 5A). In this section, we leverage Ocelli to visualize multimodal data generated by (i) SHARE-seq [80] (chromatin accessibility and transcriptome, Supplementary Fig. S5A); and (ii) NTT-seq [15] (histone modifications H3K27ac and H3K27me3, Supplementary Fig. S5B). The SHARE-seq data involves 842,440 cells and it is the largest single-cell multimodal data that have been generated so far. NTT-seq data consists of 5236 cells. Ocelli-generated single-cell visualizations depict the same hematopoietic cell lineages and correctly reconstruct hematopoietic trajectories for each multimodal method, showing its high robustness to the type of profiled genome-wide features and the number of cells studied. When comparing the computed weights for different cells along the hematopoietic developmental trajectories (Supplementary Fig. S5C), the most notable feature is that HSCs have the strongest weights for the chromatin accessibility modality in the ASAP-seq and SHARE-seq datasets; while the H3K27me3 modality in the NTT-seq dataset. H3K27me3 is a repressive mark found in stem cells to be pivotal for proper cell differentiation or to remain unspecified [81, 82]. In particular, H3K27me3 marks are associated with gene repression for cell type-specific genes [83], and Ocelli correctly predicts the importance of this modification in HSCs. Classical and plasmacytoid dendritic cells (cDCs and pDCs) show a strong signal for chromatin accessibility in both ASAP-seq and SHARE-seq datasets and remarkable change in acetylation of histone H3 lysine 27 (H3K27ac), which mark genomic regions that indicate active both proximal and distal regions of transcription start sites. Interestingly, during dendritic cell development, chromatin structure undergoes global reorganization [84], where thousands of enhancers are activated in pDCs and cDCs [85–88].

Finally, we tested Ocelli on a trimodal dataset of PBMCs generated by Dogma-seq [12] (chromatin accessibility, transcriptome, protein epitopes, 7516 cells, Supplementary Fig. S5D). Ocelli reconstructed with high-resolution different states of T cells (CD4 and CD8 Naive, Memory, Effector, along with MAIT and gdT cells). To chart the contributions of each modality, we showed the modality weights (Supplementary Fig. S5E), which depict the strong contribution of chromatin accessibility in naive CD4 Naive T cells, RNA on regulatory T cells, and protein on Effector T, MAIT, gdT, B, and NK cells.

Utilizing Ocelli for single-modality data visualization

Gene expression programs (GEPs) determine cell identity and activity. GEPs are modules of coregulated genes [89–91]. GEPs maintain specific cell types and perform complex cellular activities such as proliferation, apoptosis, metabolism, differentiation, or responses to environmental cues. Each cell is a mixture of GEPs, and their relative contributions change continuously throughout cellular differentiation. Usually, GEPs are computationally determined in an unsupervised manner from scRNA-seq data by LDA [92] or non-negative matrix factorization [91]. We hypothesized that GEPs, inferred from unimodal scRNA-seq, could be leveraged directly as latent modalities to generate visualizations of scRNA-seq data with Ocelli. In this context, inference of GEPs in the first step would denoise single-cell data, and the multimodal analysis strategy implemented in Ocelli would accurately separate cell lineages and their states of different activities.

Here, we tested multimodal analysis framework on two developmental scRNA-seq datasets: (i) pancreatic endocrinogenesis [93], and (ii) cellular reprogramming to induced Pluripotent Stem Cells (iPSCs) [28]. First, we inferred GEPs using LDA analysis, where each LDA’s topic groups features into programs. Ocelli interprets each program as a modality, using the cell-topic distribution as modality weights. Then, we computed a low-dimensional MDM embedding on log-normalized topic-based latent modalities. Finally, we created a visualization of single cells with ForceAtlas2, leveraging inferred transition probabilities between cells. In the case of pancreatic endocrinogenesis data visualization, we used RNA velocity-derived transition probabilities. Ocelli accurately reconstructed complex developmental lineages (Fig. 6A), rare and transient cellular transitions (Fig. 6B), and different proliferative cell states (Fig. 6C–E). In detail, four hormone-producing endocrine cells, Alpha, Beta, Epsilon, and Delta, arise from the endocrine progenitor cells originating from cycling Ductal cells. The first progenitor branching point leads to the development of Epsilon cells and an intermediate cell population that generates independently Alpha cells [94]. Subsequently, progenitors further develop into Alpha, Beta, and Delta lineages. Delta cells are produced primarily after E15.5, with only a few cells detected at E15.5 [94]. Notably, Ocelli visualized (i) both differentiation pathways to Alpha cells: through intermediate cell population with Epsilon cells and then directly from endocrine progenitor cells; (ii) a distinct pathway to Delta cells despite its scarcity at the embryonic day E15.5; and (iii) a cell cycle of Ductal cells. Ocelli’s visualization shows concise trajectories as compared to competitive approaches [26].

$Sensitivity of Ocelli in reconstructing rare cell transitions. (A–E) Exploration of the pancreatic endocrinogenesis unimodal dataset with Ocelli. (A) The visualization of the cell trajectories generated by Ocelli. (B) scVelo’s [26] RNA velocity vector field projected onto developmental trajectories of endocrine cells reconstructed with Ocelli. (C–E) Reconstruction of the cell cycle of Ductal cells. (C) scVelo’s [26] RNA velocity vector field showing inferred cell cycle trajectory of Ductal cells. (D) A schematic view of cell cycle phases. (E) Scores of cell cycle phases calculated using the list of cell cycle signature genes [100]. (F–L) Reconstruction of cell trajectories from the cell reprogramming dataset [28] with Ocelli. (F) Three scatter plots show projections of a 3D embedding constructed with Ocelli. The arrow highlights a rare developmental trajectory of iPSCs. The table specifies the fraction of iPSCs concerning all cells from a given timestamp. (G–L) Ocelli captures multiple developmental trajectories on a single joint embedding. Plots show the activity of gene signatures related to cell identities [28]: pluripotency (G); keratinocytes at differentiation stages from basal to terminally differentiated cells, where each cell is colored by its dominant differentiation phase based on the highest z-score (H); XEN stem cells (I); spongiotrophoblasts (J); spiral artery trophoblast giant cells (K); trophoblast progenitor cells (L).$

Figure 6.

Sensitivity of Ocelli in reconstructing rare cell transitions. (A–E) Exploration of the pancreatic endocrinogenesis unimodal dataset with Ocelli. (A) The visualization of the cell trajectories generated by Ocelli. (B) scVelo’s [26] RNA velocity vector field projected onto developmental trajectories of endocrine cells reconstructed with Ocelli. (C–E) Reconstruction of the cell cycle of Ductal cells. (C) scVelo’s [26] RNA velocity vector field showing inferred cell cycle trajectory of Ductal cells. (D) A schematic view of cell cycle phases. (E) Scores of cell cycle phases calculated using the list of cell cycle signature genes [100]. (F–L) Reconstruction of cell trajectories from the cell reprogramming dataset [28] with Ocelli. (F) Three scatter plots show projections of a 3D embedding constructed with Ocelli. The arrow highlights a rare developmental trajectory of iPSCs. The table specifies the fraction of iPSCs concerning all cells from a given timestamp. (G–L) Ocelli captures multiple developmental trajectories on a single joint embedding. Plots show the activity of gene signatures related to cell identities [28]: pluripotency (G); keratinocytes at differentiation stages from basal to terminally differentiated cells, where each cell is colored by its dominant differentiation phase based on the highest z-score (H); XEN stem cells (I); spongiotrophoblasts (J); spiral artery trophoblast giant cells (K); trophoblast progenitor cells (L).

10.1007/s00521-013-1362-6

In the second example, reprogramming of cells to iPSCs, we leveraged Waddington-OT inferred cell-cell transition probabilities. The transition probabilities were obtained by the optimal transport modeling of the cell fates between two consecutive time points at which the cells were collected for single-cell profiling. Similarly to previous examples, Ocelli showed high sensitivity to capture precisely developmental branching points and to reconstruct rare cell transitions (Fig. 6F). For example, Ocelli visualized a transient developmental trajectory to iPSCs in serum condition (Fig. 6F and G), comprising as little as 0.45% of all cells at day 11. In addition to all previously annotated cell types [28], Ocelli was able to visualize a distinct population of keratinocytes depicting an asynchronous multiple-stage interfollicular differentiation [95] (Fig. 6H). Moreover, rare populations of extraembryonic endoderm (XEN) and trophoblast-like cells with their differentiation trajectories are clearly distinguishable (Fig. 6I–L). Earlier analyses [28] could not capture such precise trajectories and cell subtypes in a single, joint visualization. Taken together, Ocelli produces highly sensitive and accurate embeddings capable of showing rare cell transitions and multiple differentiation trajectories.

Discussion

Ocelli is an open-source visualization framework that uses new MDMs to enhance multimodal single-cell sequencing data analysis and exploration. The framework was specifically designed to resolve developmental processes and diminish caveats of single-cell genomics data, such as data sparsity. Ocelli generates informative visualizations of developmental multimodal data and outperforms state-of-the-art multimodal algorithms regarding the computational runtime and the quality of reconstructed developmental trajectories. To our knowledge, Ocelli is the only method that learns low-dimensional embedding of the data coalescing information from both highly dimensional multimodal features and modality-specific transition probabilities between cells. As a diffusion-based method, Ocelli enables scalable multimodal imputations of missing feature activity levels and has the potential to facilitate the exploration of trajectories from large datasets, e.g. mammalian organogenesis [77], which typically consist of many unrelated differentiation subtrajectories or in which some of the ancestral states are missing. Finally, Ocelli can be exploited on unimodal single-cell (e.g., scRNA-seq) data to generate more informative visualizations with the prospective to better resolve cell identities, activities, and spare transitions along cell’s differentiation trajectories.

The ability of multimodal algorithms to deconvolute and interpret multi-faceted data, termed explainability, is one of the critical factors that govern their applicability to resolve biological processes. In other words, a method producing highly interpretable results can be preferred over a slightly more accurate but less interpretable method. The existing methods for analysis of single-cell multimodal data vary in terms of explainability. Deep learning models tend to be less explainable. The field of explainable deep learning is rapidly developing, with many attempts at constructing explainable models [96]. However, learned parameter values of neural networks are often impossible to interpret. They represent meta properties of data, understandable by the computer, creating black-box models. New developments are needed for reliable large-scale single-cell applications. The explainability of a model can be increased with the use of interpretable parameters. For example, WNN [44] produces cell-specific modality weights that inform which modality is most informative for each cell. At the same time, MOFA+ [45] computes a distribution of modality features for each latent dimension of the reduced embedding. Applying such methods to multimodal single-cell data creates a joint low-dimensional embedding together with additional interpretable background information about cells or features, increasing the explainability of the output. Ocelli does increase explainability, integrating the advantages of the abovementioned methods to explore and visualize developmental multimodal single-cell data. In applying Ocelli, we preprocess data using probabilistic topic modeling that denoises the data while preserving its continuous structure. We adapt diffusion maps to a multimodal setting that allows us to include local and global information from each modality. Moreover, we directly include transition probabilities, e.g. from RNA velocity or Waddington-OT, when generating low-dimensional embeddings, which improves visualization of cell lineages and branching decisions. Lastly, we reconstruct highly sparse features using multimodal imputations, which allows us to recover feature-to-feature and cell-to-cell correlations.

Methods modeling RNA dynamics such as Velocyto [25] or scVelo [26] model the transcriptional content of cells and predict future states. Based on the relative kinetics of unspliced and spliced RNA, the methods compute a network of transition probabilities between cells and build a ‘velocity’ vector field that portrays cellular trajectories. In the extended velocity framework, MultiVelo [27] incorporates epigenomic information into the computation of transition probabilities. The vector field of mentioned methods was routinely visualized on the UMAP [46] or FLE [53] that were built without including cell transition probabilities. Recently, the VeloViz [97] method utilized cellular transitions to visualize unimodal single-cell data. VeloViz obtains each cell’s direction toward its predicted transcriptional state. Then, it calculates the proximity between all cell pairs by comparing their spatial direction to the predicted transcriptional one. Cells with the smallest proximities produce a nearest neighbor graph that is visualized. However, the drawback of VeloViz is that it requires coordinates of current and projected future transcription states for each cell, which limits a spectrum of applicable methods that produce transitions between cells, e.g. it is unable to leverage Optimal Transport inferred transition probabilities [28]. In addition, VeloViz applies to the visualization of scRNA-seq data only.

MDMs implemented in Ocelli allow us to impute the genome-wide features’ activity in single cells and recover feature-to-feature relationships or modality-specific cell-to-cell correlations. Scalable implementation of the imputation algorithm enables real-time imputations of feature sets of interest for large multimodal datasets that contain a million cells. We envision the single-cell multimodal diffusion-based imputation can enhance the analysis of highly sparse features originating from chromatin state studies, e.g., profiling of chromatin accessibility, histone modifications, or DNA-bound proteins. Single-cell multimodal profiling technologies that leverage scCUT&TAG-based approaches [13–16] use Tn5 and/or Tn5-conjugated proteins (e.g. pA-Tn5 or nano-Tn5) to cut genomic DNA at open chromatin regions or close to sites of histone modifications. In the case of multimodal profiling of chromatin states, only one or two modalities can be captured for the same region of genomic DNA from the same cell. In the case of more modalities studied, multimodal imputations are inevitable to infer the chromatin states at single-nucleosome and single-cell resolution. Moreover, the proposed multimodal imputations can enhance our ability to identify distal elements, e.g., enhancer-promoter connections, on a genome-wide basis from single-cell profiling studies [98] and facilitate the building of single-cell enhancer databases [99]. One apparent shortcoming of diffusion-based imputation is that smoothing feature activity levels results in real numbers of the imputed levels instead of counts. The imputed levels include “averaging” of multiple processes, both technical and biological, such as inefficient capturing of feature activity levels in single-cell sequencing technologies, as well as stochastic events during gene expression related to the dynamic switching of chromatin states at promoters and enhancers and RNA/protein production in bursts. The stochastic nature of the dynamics of chromatin states, transcription, and translation makes it impossible to infer genuine feature activity levels in single cells. Imputations have to involve the ‘averaging’ of multiple stochastic processes.

Overall, we have found that Ocelli employs powerful tools to study and visualize extensive developmental multimodal single-cell data to yield critical insights into the behavior of differentiating cells. We have shown, that Ocelli is a flexible diffusion-based method harnessing information from any number and type of modalities, along with prior knowledge of cell-to-cell transitions. It generates rich low-dimensional representation of continuous high-dimensional single- and mmultimodal data, which facilitates detection of relevant features from genome-wide measurements.

Acknowledgements

We thank members of the Computational Genomics Group for their discussions.

Author contributions: M.T. conceived and supervised the project. M.T. and P.R. designed and developed Ocelli. P.R. implemented Ocelli and wrote its documentation. M.T. and P.R. conducted the multimodal data analysis. M.T. and P.R. wrote the manuscript.

Supplementary data

Supplementary data is available at NAR Genomics & Bioinformatics online.

Conflict of interest

M.T. and P.R. are named inventors on European Patent application EP23169340.9 relating to the work of this manuscript.

Funding

P.R. and M.T. are supported by the International Centre for Translational Eye Research (FENG.02.01-IP.05-T005/23) project, which is carried out within the International Research Agendas programme of the Foundation for Polish Science co-financed by the European Union under the European Funds for Smart Economy 2021-2027 (FENG); and grants funded by National Science Center, Poland: the Sonata Bis 12 grant 2022/46/E/NZ2/00370 (M.T.), and Preludium 21 2022/45/N/NZ2/02311 (P.R.).

Code availability

Ocelli is an open-source resource, available under a BSD-Clause 2 license at https://github.com/TabakaLab/ocelli. The documentation and tutorials are available at https://ocelli.readthedocs.io.

Data availability

Single-cell data are available at the Gene Expression Omnibus under accession numbers: hair follicle data (GSE140203), human bone marrow data (ASAP-seq - GSE156478, NTT-seq - GSE212588, SHARE-seq - GSE216464), peripheral blood mononuclear cells data (GSE166188), pancreatic endocrinogenesis data (GSE132188), cell reprogramming data (GSE115943). Simulated data are available at https://doi.org/10.6084/m9.figshare.22700020.v1. Ocelli v1.0.0 source code is available at https://doi.org/10.5281/zenodo.15078053.

References

Sun

A survey of multi-view machine learning

Neural Comput Appl

2013

;

2031

–

https://arxiv.org/abs/1304.5634.

Tao

A survey on multi-view learning

arXiv

20 April 2013, preprint: not peer reviewed

Stuart

Satija

Integrative single-cell analysis

Nat Rev Genet

2019

;

257

–

10.1038/s41576-019-0093-7

Cao

Cusanovich

Ramani

et al. .

Joint profiling of chromatin accessibility and gene expression in thousands of single cells

Science

2018

;

361

1380

–

10.1126/science.aau0730

Zhu

Huang

et al. .

An ultra high-throughput method for single-cell joint analysis of open chromatin and transcriptome

Nat Struct Mol Biol

2019

;

1063

–

10.1038/s41594-019-0323-x

Zhang

LaFave

et al. .

Chromatin potential identified by shared single-cell profiling of RNA and chromatin

Cell

2020

;

183

1103

–

10.1016/j.cell.2020.09.056

Chen

Lake

Zhang

High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell

Nat Biotechnol

2019

;

1452

–

10.1038/s41587-019-0290-0

Hunt

Burnard

Roper

et al. .

scTEM-seq: Single-cell analysis of transposable element methylation to link global epigenetic heterogeneity with transcriptional programs

Sci Rep

2022

;

5776

10.1038/s41598-022-09765-x

Zhu

Zhang

et al. .

Joint profiling of histone modifications and transcriptome in single cells from mouse brain

Nat Methods

2021

;

283

–

10.1038/s41592-021-01060-3

10.

Pan

Tang

et al. .

scPCOR-seq enables co-profiling of chromatin occupancy and RNAs in single cells

Commun Biol

2022

;

678

10.1038/s42003-022-03584-6

11.

Stoeckius

Hafemeister

Stephenson

et al. .

Simultaneous epitope and transcriptome measurement in single cells

Nat Methods

2017

;

865

–

12.

Mimitou

Lareau

Chen

et al. .

Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells

Nat Biotechnol

2021

;

1246

–

10.1038/s41587-021-00927-2

13.

Bartosovic

Kabbe

Castelo-Branco

Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues

Nat Biotechnol

2021

;

825

–

10.1038/s41587-021-00869-9

14.

Gopalan

Wang

Harper

et al. .

Simultaneous profiling of multiple chromatin proteins in the same cells

Mol Cell

2021

;

4736

–

10.1016/j.molcel.2021.09.019

15.

Stuart

Hao

Zhang

et al. .

Nanobody-tethered transposition enables multifactorial chromatin profiling at single-cell resolution

Nat Biotechnol

2023

;

806

–

10.1038/s41587-022-01588-5

16.

Bartosovic

Castelo-Branco

Multimodal chromatin profiling using nanobody-based single-cell CUT&Tag

Nat Biotechnol

2023

;

794

–

805

10.1038/s41587-022-01535-4

17.

Yeung

Florescu

Zeller

et al. .

scChIX-seq infers dynamic relationships between histone modifications in single cells

Nat Biotechnol

2023

;

813

–

10.1038/s41587-022-01560-3

18.

Tedesco

Giannese

Lazarević

et al. .

Chromatin velocity reveals epigenetic dynamics by single-cell profiling of heterochromatin and euchromatin

Nat Biotechnol

2022

;

235

–

10.1038/s41587-021-01031-1

19.

Van Dijk

Sharma

Nainys

et al. .

Recovering gene interactions from single-cell data using data diffusion

Cell

2018

;

174

716

–

10.1016/j.cell.2018.05.061

20.

Haghverdi

Buettner

Theis

Diffusion maps for high-dimensional single-cell analysis of differentiation data

Bioinformatics

2015

;

2989

–

10.1093/bioinformatics/btv325

21.

Ochab-Marcinek

Tabaka

Bimodal gene expression in noncooperative regulatory systems

Proc Natl Acad Sci USA

2010

;

107

22096

–

101

10.1073/pnas.1008965107

22.

Trapnell

Cacchiarelli

Grimsby

et al. .

The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells

Nat Biotechnol

2014

;

381

–

23.

Mao

Wang

Goodison

et al. .

Dimensionality reduction via graph structure learning

Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining

2015

;

Sydney, Australia

765

–

10.1145/2783258.2783309

24.

Qiu

Mao

Tang

et al. .

Reversed graph embedding resolves complex single-cell trajectories

Nat Methods

2017

;

979

–

25.

Manno G

Soldatov

Zeisel

et al. .

RNA velocity of single cells

Nature

2018

;

560

494

–

10.1038/s41586-018-0414-6

26.

Bergen

Lange

Peidli

et al. .

Generalizing RNA velocity to transient cell states through dynamical modeling

Nat Biotechnol

2020

;

1408

–

10.1038/s41587-020-0591-3

27.

Virgilio

Collins

et al. .

Multi-omic single-cell velocity models epigenome–transcriptome interactions and improves cell fate prediction

Nat Biotechnol

2023

;

387

–

10.1038/s41587-022-01476-y

28.

Schiebinger

Shu

Tabaka

et al. .

Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming

Cell

2019

;

176

928

–

10.1016/j.cell.2019.01.006

29.

Coifman

Lafon

Diffusion Maps

Appl Comput Harmon Anal

2006

;

–

10.1016/j.acha.2006.04.006

10.1093/bioinformatics/btv715

30.

Angerer

Haghverdi

Büttner

et al. .

Destiny: diffusion maps for large-scale single-cell data in R

Bioinformatics

2016

;

1241

–

31.

Haghverdi

Büttner

Wolf

et al. .

Diffusion pseudotime robustly reconstructs lineage branching

Nat Methods

2016

;

845

–

32.

Gould

Yang

et al. .

Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq

Nat Methods

2020

;

793

–

10.1038/s41592-020-0905-x

33.

LeCun

Bengio

Hinton

Deep learning

Nature

2015

;

521

436

–

34.

Krizhevsky

Sutskever

Hinton

Imagenet classification with deep convolutional neural networks

Commun ACM

2017

;

–

35.

Kramer

Nonlinear principal component analysis using autoassociative neural networks

AICHE J

1991

;

233

–

10.1002/aic.690370209

10.1186/s13059-021-02556-z

36.

Gong

Zhou

Purdom

Cobolt: integrative analysis of multimodal single-cell sequencing data

Genome Biol

2021

;

–

37.

Minoura

Abe

Nam

et al. .

A mixture-of-experts deep generative model for integrated analysis of single-cell multiomics data

Cell Rep Meth

2021

;

100071

10.1016/j.crmeth.2021.100071

10.1038/s41592-023-01909-9

38.

Liu

Huang

Yang

Multi-task learning from multimodal single-cell omics with Matilda

Nucleic Acids Res

2023

;

e45

39.

Ashuach

Gabitto

Koodli

et al. .

MultiVI: deep generative model for the integration of multimodal data

Nat Methods

2023

;

1222

–

40.

Zhavoronkov

Ivanenkov

Aliper

et al. .

Deep learning enables rapid identification of potent DDR1 kinase inhibitors

Nat Biotechnol

2019

;

1038

–

10.1038/s41587-019-0224-x

41.

Ballé

Laparra

Simoncelli

End-to-end optimized image compression

arXiv

3 March 2017, preprint: not peer reviewed

10.48550/arXiv.1611.01704

42.

Cho

Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images

Int Conf Machine Learn

2013

;

432

–

43.

Hornik

Stinchcombe

White

Multilayer feedforward networks are universal approximators

Neural Netw

1989

;

359

–

44.

Hao

Andersen-Nissen

et al. .

Integrated analysis of multimodal single-cell data

Cell

2021

;

184

3573

–

45.

Argelaguet

Arnol

Bredikhin

et al. .

MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data

Genome Biol

2020

;

111

46.

McInnes

Healy

Saul

et al. .

UMAP: uniform manifold approximation and projection

J Open Source Softw

2018

;

861

47.

Hoffman

Blei

Wang

et al. .

Stochastic variational inference

J Mach Learn Res

2013

;

1303

–

48.

Blei

Kucukelbir

McAuliffe

Variational inference: a review for statisticians

J Am Stat Assoc

2017

;

112

859

–

49.

Cheng

Costa

MOJITOO: a fast and universal method for integration of multimodal single-cell data

Bioinformatics

2022

;

i282

–

50.

Moritz

Nishihara

Wang

et al. .

Ray: a distributed framework for emerging {AI} applications

13th USENIX symposium on operating systems design and implementation (OSDI 18)

2018

;

Carlsbad, CA, USA

561

–

Google Preview

51.

Pedregosa

Varoquaux

Gramfort

et al. .

Scikit-learn: machine learning in Python

J Mach Learn Res

2011

;

2825

–

10.1371/journal.pone.0098679

52.

Lehoucq

Sorensen

Yang

ARPACK Users’ Guide: Solution of Large-scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods

1998

;

Philadelphia, PA, USA

SIAM

53.

Jacomy

Venturini

Heymann

et al. .

ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software

PloS One

2014

;

e98679

54.

Blei

Jordan

Latent dirichlet allocation

J Mach Learn Res

2003

;

993

–

1022

10.1038/s41592-022-01595-z

55.

Lynch

Theodoris

Long

et al. .

MIRA: joint regulatory modeling of multimodal expression and chromatin accessibility in single cells

Nat Methods

2022

;

1097

–

108

56.

Stuart

Srivastava

Madad

et al. .

Single-cell chromatin state analysis with Signac

Nat Methods

2021

;

1333

–

10.1038/s41592-021-01282-5

57.

Wolf

Angerer

Theis

SCANPY: large-scale single-cell gene expression data analysis

Genome Biol

2018

;

58.

Malkov

Yashunin

Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs

IEEE T Pattern Anal Mach Intell

2018

;

824

–

59.

Ester

Kriegel

Sander

et al. .

A density-based algorithm for discovering clusters in large spatial databases with noise

Proceedings of the Second International Conference on Knowledge Discovery and Data Mining

1996

;

Portland, OR, USA

226

–

Google Preview

10.1140/epjst/e2010-01179-1

60.

Schep

Buenrostro

et al. .

chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data

Nat Methods

2017

;

975

–

61.

Rosvall

Bergstrom

Maps of random walks on complex networks reveal community structure

Proc Natl Acad Sci USA

2008

;

105

1118

–

10.1073/pnas.0706851105

62.

Rosvall

Axelsson

Bergstrom

The map equation

Eur Phys J: Spec Top

2009

;

178

–

63.

Csardi

Nepusz

The igraph software package for complex network research

Int J Complex Syst

2006

;

1695

–

10.1038/s41592-021-01346-6

64.

Lange

Bergen

Klein

et al. .

CellRank for directed single-cell fate mapping

Nat Methods

2022

;

159

–

65.

Duan

Zhou

Zhu

et al. .

Model-based understanding of single-cell CRISPR screening

Nat Commun

2019

;

2233

10.1038/s41467-019-10216-x

66.

McKenna

Findlay

Gagnon

et al. .

Whole-organism lineage tracing by combinatorial and cumulative genome editing

Science

2016

;

353

aaf7907

10.1126/science.aaf7907

67.

Tabaka

Gould

Regev

scSVA: an interactive tool for big data visualization and exploration in single-cell omics

bioRxiv

6 January 2019, preprint: not peer reviewed

10.1101/512582

10.1016/j.cell.2017.03.038

68.

Yang

Adam

et al. .

Epithelial-mesenchymal micro-niches govern stem cell lineage choices

Cell

2017

;

169

483

–

69.

Chen

Lareau

Andreani

et al. .

Assessment of computational methods for the analysis of single-cell ATAC-seq data

Genome Biol

2019

;

241

70.

Hou

et al. .

A systematic evaluation of single-cell RNA-sequencing imputation methods

Genome Biol

2020

;

218

71.

Huang

Wang

Torre

et al. .

SAVER: gene expression recovery for single-cell RNA sequencing

Nat Methods

2018

;

539

–

10.1038/s41592-018-0033-z

72.

An accurate and robust imputation method scImpute for single-cell RNA-seq data

Nat Commun

2018

;

997

10.1038/s41467-018-03405-7

73.

Wagner

Yan

Yanai

K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data

bioRxiv

9 April 2018, preprint: not peer reviewed

10.1101/217737

74.

Talwar

Mongia

Sengupta

et al. .

AutoImpute: autoencoder based imputation of single-cell RNA-seq data

Sci Rep

2018

;

16329

10.1038/s41598-018-34688-x

75.

Lopez

Regier

Cole

et al. .

Deep generative modeling for single-cell transcriptomics

Nat Methods

2018

;

1053

–

10.1038/s41592-018-0229-2

76.

McLean

Bristor

Hiller

et al. .

GREAT improves functional interpretation of cis-regulatory regions

Nat Biotechnol

2010

;

495

–

501

77.

Cao

Spielmann

Qiu

et al. .

The single-cell transcriptional landscape of mammalian organogenesis

Nature

2019

;

566

496

–

502

10.1038/s41586-019-0969-x

78.

Tusi

Wolock

Weinreb

et al. .

Population snapshots predict early haematopoietic and erythroid hierarchies

Nature

2018

;

555

–

79.

Watcham

Kucinski

Gottgens

New insights into hematopoietic differentiation landscapes from single-cell RNA sequencing

Blood

2019

;

133

1415

–

80.

Kartha

et al. .

Single-cell multi-scale footprinting reveals the modular organization of DNA regulatory elements

bioRxiv

29 March 2023, preprint: not peer reviewed

10.1101/2023.03.28.533945

81.

Bernstein

Mikkelsen

Xie

et al. .

A bivalent chromatin structure marks key developmental genes in embryonic stem cells

Cell

2006

;

125

315

–

10.1016/j.cell.2006.02.041

82.

Abraham

Cui

Tang

et al. .

Dynamic regulation of epigenomic landscapes during hematopoiesis

BMC Genom

2013

;

–

10.1038/s41467-021-20940-y

83.

Cai

Zhang

Loh

et al. .

H3K27me3-rich genomic regions can function as silencers to repress gene expression via chromatin interactions

Nat Commun

2021

;

719

84.

Kurotaki

Kikuchi

Cui

et al. .

Chromatin structure undergoes global and local reorganization during murine dendritic cell development and activation

Proc Natl Acad Sci

2022

;

119

e2207009119

10.1073/pnas.2207009119

85.

Tian

Meng

Zhang

Epigenetic regulation of dendritic cell development and function

Cancer J

2017

;

302

–

10.1097/PPO.0000000000000280

86.

Paul

Amit

Plasticity in the transcriptional and epigenetic circuits regulating dendritic cell lineage specification and function

Curr Opin Immunol

2014

;

–

10.1016/j.coi.2014.04.004

87.

Paul

Arkin

Giladi

et al. .

Transcriptional heterogeneity and lineage commitment in myeloid progenitors

Cell

2015

;

163

1663

–

10.1016/j.cell.2015.11.013

88.

Lin

Chauvistré

Costa

et al. .

Epigenetic program and transcription factor circuitry of dendritic cell development

Nucleic Acids Res

2015

;

9680

–

PubMed