Complex hierarchical structures analysis in single-cell data with Poincaré deep manifold transformation

Abstract

Single-cell RNA sequencing (scRNA-seq) offers remarkable insights into cellular development and differentiation by capturing the gene expression profiles of individual cells. The role of dimensionality reduction and visualization in the interpretation of scRNA-seq data has gained wide acceptance. However, current methods face several challenges, including incomplete structure-preserving strategies and high distortion in embeddings, and therefore fail to effectively model complex cell trajectories with multiple branches. To address these issues, we propose the Poincaré deep manifold transformation (PoincaréDMT) method, which maps high-dimensional scRNA-seq data to a hyperbolic Poincaré disk. This approach preserves global structure from a graph Laplacian matrix while achieving local structure correction through a structure module combined with data augmentation. Additionally, PoincaréDMT alleviates batch effects by integrating a batch graph that accounts for batch labels into the low-dimensional embeddings during network training. Furthermore, PoincaréDMT applies the Shapley additive explanations method to the trained model to identify the important marker genes in specific clusters and in the cell differentiation process. PoincaréDMT therefore provides a unified framework for multiple key tasks essential for scRNA-seq analysis, including trajectory inference, pseudotime inference, batch correction, and marker gene selection. We validate PoincaréDMT through extensive evaluations on both simulated and real scRNA-seq datasets, demonstrating its superior performance in preserving global and local data structures compared to existing methods.

dimensionality reduction, data visualization, pseudotime inference, batch correction, deep manifold learning

Introduction

Single-cell RNA sequencing (scRNA-seq) significantly enhances the ability to study cell development and differentiation [1]. Data collected over continuous time points allow the cellular development process to be represented as continuous trajectories, often resembling hierarchical trees with multiple branches [2]. Data visualization in reduced dimensions provides a powerful means to uncover high-level structural information in scRNA-seq data, including cellular heterogeneity and hierarchical patterns [3], and yields more insightful findings than analyzing high-dimensional data directly [4]. However, scRNA-seq data are characterized by high dimensionality, zero inflation (the dropout phenomenon), and high noise levels, which present challenges for dimensionality reduction and visualization [5]. Therefore, an effective method that preserves the geometric structure is necessary to accurately uncover the complex biological hierarchical patterns within cellular developmental data.

Over the past few decades, numerous manifold learning approaches have been developed to address the dimensionality reduction of high-dimensional scRNA-seq data. These methods can be broadly classified according to their ability to preserve global or local structures within high-dimensional data: (1) global structures depict the overall distribution and relationships of all cells, revealing the main differentiation pathways, differentiation states, and interrelationships between different cell types. Therefore, some methods focus on preserving the global information of high-dimensional data within a low-dimensional embedding space, including principal component analysis (PCA) [6], isometric mapping [7], diffusion maps (DiffMap) [8], partition-based graph abstraction (PAGA) [9], and PHATE [10]. (2) Local structures describe the detailed distribution and relationships within specific cell groups or subpopulations and are essential for a deep understanding of the differentiation processes of particular cell groups and for discovering rare cell types or cell state transitions. Therefore, some methods focus on preserving the local geometry information of high-dimensional data, such as locally linear embedding [11], stochastic neighbor embedding (SNE) [12], Laplacian eigenmaps [13], and t-distributed stochastic neighbor embedding (t-SNE) [14]. (3) Several methods have been proposed that integrate the preservation of global and local structures, including uniform manifold approximation and projection (UMAP) [15], pairwise controlled manifold approximation (PaCMap) [16], deep local-flatness manifold embedding [17], PoincaréMaps [18], contrastive PoincaréMaps [19], deep visualization (DV) [20], and single-cell deep hierarchical map (scDHMap) [21]. However, these methods face challenges in the application of cell development due to an unreasonable design that fails to preserve the global structure. 
PoincaréMaps [18] attempts to estimate and preserve the entire global relationship of high-dimensional data directly. This approach is often impractical, necessitating the use of a downsampling strategy for the analysis of large scRNA-seq datasets. scDHMap [21] utilizes a (variational) autoencoder framework, but is highly dependent on the stability of the autoencoder. PHATE preserves global structures using diffusion-based manifold learning [10], but it struggles with scalability when handling new incoming data.

The selection of the low-dimensional subspace is another crucial factor influencing the preservation of the geometric structure in high-dimensional data. Most methods, such as t-SNE, UMAP, and PHATE, operate in locally flat Euclidean space, which distorts high-dimensional pairwise distances and often fails to capture the inherent hierarchical, tree-like structure of differentiation processes. In contrast, hyperbolic space, defined by its constant negative curvature, can be viewed as a continuous model for representing hierarchical trees. It offers distinct advantages in minimizing the distortion of high-dimensional pairwise distances and demonstrates low distortion in low-dimensional embeddings, even in two dimensions [22]. This property has enabled hyperbolic embeddings to effectively represent complex hierarchical structures across diverse types of data [23–26]. In the context of cell differentiation, the cell population can increase exponentially. In hyperbolic space, the volume of a ball expands exponentially with its radius, which aids the visualization of single-cell lineages. In Euclidean space, by contrast, the volume expands only polynomially, which may not provide sufficient space for the rapidly increasing number of cells and leads to distortions of the high-dimensional geometric structure. Recently, some methods have been designed to visualize single-cell trajectories in hyperbolic space. PoincaréMaps [18] leverages hyperbolic geometry to embed scRNA-seq data into a Poincaré disk. scPhere [27] uses a variational autoencoder [28] to embed scRNA-seq data into a low-dimensional hyperbolic space. DV [20] corrects the geometric structure of high-dimensional scRNA-seq data by combining contrastive learning with data augmentation. scDHMap [21] combines the local structure preservation of t-SNE with the global structure preservation capabilities of a variational autoencoder.
However, although existing hyperbolic embedding methods have demonstrated effectiveness, they have shortcomings when processing scRNA-seq data: (1) these methods consider only the output subspace as hyperbolic, neglecting the fact that the subspace of each neural network layer is still Euclidean, which may contribute to distortions of the high-dimensional structure [29–32]. (2) These methods do not incorporate marker gene selection strategies and cannot utilize trained models to interpret the marker genes involved in the cell differentiation process. (3) PoincaréMaps lacks batch correction capability, which is essential for integrating data from diverse sources into a unified embedding while removing batch effects.

The key objective of this study is to propose a general method capable of preserving the geometric structure of high-dimensional scRNA-seq data within a low-dimensional space in order to analyze hierarchical structures in the cell differentiation process, while also being applicable to multiple downstream tasks as shown in Fig. 1(C): (1) trajectory inference is used to reconstruct the developmental pathways or trajectories of cells as they transition from immature states to mature states of specific types [33, 34]. This method elucidates the dynamic changes that cells undergo during different developmental stages or under varying conditions. (2) Pseudotime inference is used to assign a relative temporal order to individual cells along a developmental trajectory [35]. This method facilitates the investigation of the progression of cellular development. (3) Batch correction is used to mitigate technical variability that arises from processing scRNA-seq data across different experimental batches [36–43]. This process ensures that biological variations are captured accurately and are not confounded by technical artifacts. (4) Marker gene selection is used to identify genes that are characteristic of specific cell types or states [44–47]. This process is crucial for annotating cell types and understanding the functional roles of different cell populations.

Figure 1

Overview of the PoincaréDMT framework for scRNA-seq data analysis. (A) PoincaréDMT takes scRNA-seq measurements as input and maps them into a Euclidean structure space (Euc) and a hyperbolic semantic space (Hyp), leveraging data augmentation (Aug.) while preserving geometric structures. (B) PoincaréDMT alleviates batch effects by integrating a batch graph that accounts for batch labels during model training. (C) The low-dimensional embeddings and trained model of PoincaréDMT can be applied to multiple downstream tasks, including data visualization, batch correction, pseudotime inference, and marker gene selection, which assesses gene contributions to intra-cluster samples or inter-cluster transitions (e.g. |$c_{1} \rightarrow c_{2}$|⁠).

To address these challenges, we propose the Poincaré deep manifold transformation (PoincaréDMT) method. This approach leverages hyperbolic neural networks (HNNs) to map high-dimensional data to a Poincaré disk, effectively representing continuous hierarchical structures. PoincaréDMT inherits the strength of global structure preservation from a graph Laplacian of the pairwise distance matrix and achieves local structure correction through a dedicated structure module combined with data augmentation, making it adept at representing complex hierarchical data. To integrate datasets from diverse sources into a unified embedding, we alleviate batch effects by integrating a batch graph that accounts for batch labels into the low-dimensional embeddings during network training. To explain the important marker genes in the cell differentiation process, we introduce the Shapley additive explanations (SHAP) method based on the low-dimensional embeddings. The proposed dimensionality reduction and visualization method can be applied to multiple downstream tasks, including developmental trajectory inference, pseudotime inference, batch correction, and marker gene selection. A comprehensive analysis on simulated and real datasets demonstrates the excellent performance of PoincaréDMT compared with existing methods in various tasks.

Materials and methods

Datasets and data preprocessing

As shown in Supplementary Table S1, we utilize synthetic datasets available in Scanpy [48], along with real scRNA-seq datasets derived from various tissues and organs in humans and mice, to evaluate the performance of the proposed PoincaréDMT. The datasets used in this study are all sourced from publicly available data.

For synthetic datasets: (1) the simple toggle switch model (ToggleSwitch) [49, 50] describes a bifurcating process controlled by the expression levels of two key markers, representing two distinct branches. (2) The MyeloidProgenitors dataset [51] for myeloid differentiation depicts the progress of cell differentiation from a common myeloid progenitor into four distinct branches: erythrocyte, neutrophil, monocyte, and megakaryocyte. (3) The Krumsiek11_blobs dataset builds upon the MyeloidProgenitors dataset by incorporating two Gaussian blobs to introduce additional complexity [9].

For real scRNA-seq datasets: (1) the mouse myelopoiesis dataset (Olsson) tracks the differentiation of myeloid cells [52]. (2) The Paul dataset [53] includes multiple intermediate populations of myeloid progenitors. (3) The Moignard2015 dataset [54] represents early blood development in mice. This dataset contains cells at different developmental stages (Fig. 5A): primitive streak (PS), neural plate, head fold (HF), four somite GFP negative (4SG-), and four somite GFP positive (4SG+). (4) The Caenorhabditis elegans dataset [55] provides a lineage-resolved molecular atlas of C. elegans embryogenesis at various developmental stages. (5) The colon epithelial cells and colon immune cells [56] are collected from a single-cell atlas of the colon mucosa from ulcerative colitis patients and healthy individuals. Both datasets include three batch factors: patient source, disease status, and location.

For data preprocessing, PoincaréDMT requires only quality control of the raw sequencing data. The ToggleSwitch, Olsson, Paul, Moignard2015, MyeloidProgenitors, and Krumsiek11_blobs datasets are preprocessed following the guidelines described in PoincaréMaps [18]. The Caenorhabditis elegans cell atlas, colon epithelial cells, and colon immune cells are preprocessed following the procedures outlined in scPhere [27]. Like other nearest-neighbor manifold learning methods such as t-SNE, UMAP, PHATE, and PoincaréMaps, PoincaréDMT can face challenges due to the curse of dimensionality. To mitigate this for data with more than 100 dimensions, reducing the data to 50 or 100 principal components is a common practice. In the marker gene selection task, PoincaréDMT directly utilizes high-dimensional unique molecular identifier (UMI) counts as input.

Methods

Data definition and augmentation

Let |$\mathbf{X}=\left \{\mathbf{x_{i}}\right \}_{i=1}^{N} \in \mathbb{R}^{d}$| represent a high-dimensional scRNA-seq expression matrix of |$N$| cells, where each cell is characterized by |$d$| genes |$\left \{g_{i}\right \}_{i=1}^{d}$|⁠. Similarly, let |$\mathbf{Y}=\left \{\mathbf{y_{i}}\right \}_{i=1}^{N} \in \mathbb{R}^{b}$| represent a categorical batch label matrix (e.g. multi-hot encoding) associated with |$\mathbf{X}$|⁠, with each cell assigned |$b$| categorical batch labels. The neighborhood function |$\mathcal{N}(i, k)$| denotes the neighborhood of |$\mathbf{{x}_{i}}$|⁠, determined by Euclidean distance.

In scRNA-seq data, cell UMI counts tend to be highly sparse. To enhance model stability and generalization, we introduce a linear data augmentation technique, where augmented cells are produced by combining cells in a convex manner and dynamically sampling cells from their |$k^{s}$| nearest neighbors |$\mathcal{N}(i, k^{s})$| during model training:

$$ \begin{align}& \begin{aligned} & \hat{\mathbf{x}_{i}}=\left(1-r_{u}\right) \cdot \mathbf{x_{i}}+r_{u} \cdot \mathbf{x_{k}}, \end{aligned}\end{align} $$

(1)

where |$\mathbf{x_{k}} \in \mathcal{N}(i, k^{s})$| represents a cell sampled from the neighborhood, |$\boldsymbol{\hat{\mathbf{x}_{i}}}$| is the augmented gene expression for cell |$\mathbf{x_{i}}$|, and |$r_{u}$| is a linear combination parameter drawn from a standard uniform distribution. |$\hat{\mathbf{X}}=\left \{\boldsymbol{\hat{\mathbf{x}_{i}}}\right \}_{i=1}^{a \times N}$| is the augmented scRNA-seq dataset created through the data augmentation process, where |$a$| is the number of augmentations per cell. An updated scRNA-seq dataset |$\widetilde{\mathbf{X}}=\left \{\mathbf{X}, \hat{\mathbf{X}} \right \}$| is then constructed and used as the input data for the model.

This data augmentation method exploits the local neighborhood structure of the data and is applied in real time during the model’s training phase, increasing data randomness while ensuring that the gene expression semantics remain unchanged.
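To make Eq. (1) concrete, the following is a minimal NumPy sketch of neighbor-based convex augmentation. It is an illustration under stated assumptions, not the authors' implementation: the helper names `knn_indices` and `augment`, the toy Poisson count matrix, and the values of `k` and `a` are all hypothetical choices made for this example.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors (Euclidean) of each row, excluding the row itself."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a cell is never its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def augment(X, neighbors, a, rng):
    """Return a * N augmented cells, each a convex mix of a cell and one sampled neighbor."""
    n = X.shape[0]
    idx = np.repeat(np.arange(n), a)                         # every cell is augmented a times
    pick = rng.integers(0, neighbors.shape[1], size=idx.size)
    partner = neighbors[idx, pick]                           # x_k drawn from N(i, k^s)
    r_u = rng.uniform(0.0, 1.0, size=(idx.size, 1))          # r_u ~ U(0, 1)
    return (1.0 - r_u) * X[idx] + r_u * X[partner]           # Eq. (1)

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(50, 20)).astype(float)  # toy sparse count matrix (assumption)
neighbors = knn_indices(X, k=5)
X_hat = augment(X, neighbors, a=2, rng=rng)
X_tilde = np.vstack([X, X_hat])                    # original plus augmented cells
```

Because each augmented cell is a convex combination of two observed cells, it stays inside the range of the observed expression values. In a training loop, `augment` would be called again at every step so that new neighbors and mixing weights are drawn each time.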

Poincaré ball model

The |$d$|-dimensional Poincaré ball with a constant negative curvature |$K~(K<0)$| can be described by the Riemannian manifold |$(\mathcal{B}, g^{\mathcal{B}})$|. The Poincaré ball |$\mathcal{B}$| is defined as

$$ \begin{align}& \begin{aligned} \mathcal{B}=\left\{\mathbf{x} \in \mathbb{R}^{d}:\|\mathbf{x}\|^{2}<-\frac{1}{K}\right\}, \end{aligned}\end{align} $$

(2)

where |$\left \| \cdot \right \|$| is the Euclidean norm. The metric tensor is expressed as |$g^{\mathcal{B}}=(\lambda _{\mathbf{x}}^{K})^{2} g^{\mathcal{E}}$|, where |$\lambda _{\mathbf{x}}^{K}=\frac{2}{1+K\|\mathbf{x}\|^{2}}$| is the conformal factor and |$g^{\mathcal{E}}=\operatorname{diag}([1,1,\ldots,1])$| (i.e. the identity matrix) is the Euclidean metric tensor.

For each point |$\mathbf{x} \in \mathcal{B}$| and a tangent vector |$\mathbf{v} \in \mathcal{T}_{\mathbf{x}} \mathcal{B}$|⁠, the exponential map |$\exp _{\mathbf{x}}^{K}(\mathbf{v}): \mathcal{T}_{\mathbf{x}} \mathcal{B} \rightarrow \mathcal{B}$| is a function that projects a tangent vector |$\mathbf{v}$| to the Poincaré ball manifold. Conversely, for a target point |$\mathbf{t}\neq 0$| on the manifold, the logarithmic map |$\log _{\mathbf{x}}^{K}(\mathbf{t}): \mathcal{B} \rightarrow \mathcal{T}_{\mathbf{x}} \mathcal{B}$| projects it back to the tangent space, which is Euclidean and isomorphic to |$\mathbb{R}^{d}$|⁠. The exponential map |$\exp _{\mathbf{x}}^{K}(\mathbf{v})$| and the logarithmic map |$\log _{\mathbf{x}}^{K}(\mathbf{t})$| are given by

$$ \begin{align} & \begin{aligned} \exp_{\mathbf{x}}^{K}(\mathbf{v}) = \mathbf{x} \oplus_{K}\left(\tanh \left(\frac{\sqrt{-K}\lambda_{\mathbf{x}}^{K}\|\mathbf{v}\|}{2}\right) \frac{\mathbf{v}}{\sqrt{-K}\|\mathbf{v}\|}\right), \end{aligned} \end{align} $$

(3)

$$ \begin{align} & \begin{aligned} \log_{\mathbf{x}}^{K}(\mathbf{t}) = \frac{2}{\sqrt{-K}\lambda_{\mathbf{x}}^{K}} \operatorname{arctanh}(\sqrt{-K}\|\mathbf{u}\|)\frac{\mathbf{u}}{\|\mathbf{u}\|},\qquad\ \ \end{aligned}\end{align} $$

(4)

where |$\mathbf{u}=-\mathbf{x} \oplus _{K} \mathbf{t}$| and |$\oplus _{K}$| is the Möbius addition for any |$\mathbf{x}, \mathbf{t} \in \mathcal{B}$|⁠:

$$ \begin{align}& \mathbf{x} \oplus_{K} \mathbf{t}=\frac{\left(1-2K\langle\mathbf{x}, \mathbf{t}\rangle-K\|\mathbf{t}\|^{2}\right) \mathbf{x}+\left(1+K\|\mathbf{x}\|^{2}\right) \mathbf{t}}{1-2K\langle \mathbf{x}, \mathbf{t}\rangle+K^{2}\|\mathbf{x}\|^{2}\|\mathbf{t}\|^{2}}.\end{align} $$

(5)
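Equations (3)–(5) can be checked numerically with a short NumPy sketch. The function names below are hypothetical, and the code is only a direct transcription of the formulas for single vectors, included to show that the logarithmic map inverts the exponential map at the same base point.

```python
import numpy as np

def mobius_add(x, t, K=-1.0):
    """Möbius addition x (+)_K t on the Poincaré ball, Eq. (5)."""
    xt, x2, t2 = np.dot(x, t), np.dot(x, x), np.dot(t, t)
    num = (1 - 2 * K * xt - K * t2) * x + (1 + K * x2) * t
    den = 1 - 2 * K * xt + K ** 2 * x2 * t2
    return num / den

def conformal_factor(x, K=-1.0):
    """lambda_x^K = 2 / (1 + K ||x||^2)."""
    return 2.0 / (1 + K * np.dot(x, x))

def exp_map(x, v, K=-1.0):
    """Map a tangent vector v at x onto the ball, Eq. (3)."""
    s, nv = np.sqrt(-K), np.linalg.norm(v)
    if nv == 0:
        return x.copy()
    step = np.tanh(s * conformal_factor(x, K) * nv / 2) * v / (s * nv)
    return mobius_add(x, step, K)

def log_map(x, t, K=-1.0):
    """Map a point t on the ball back to the tangent space at x, Eq. (4)."""
    s = np.sqrt(-K)
    u = mobius_add(-x, t, K)
    nu = np.linalg.norm(u)
    if nu == 0:
        return np.zeros_like(x)
    return 2.0 / (s * conformal_factor(x, K)) * np.arctanh(s * nu) * u / nu

x = np.array([0.1, -0.2])
v = np.array([0.3, 0.4])
y = exp_map(x, v, K=-0.5)          # a point on the ball of curvature -0.5
v_back = log_map(x, y, K=-0.5)     # recovers v up to rounding error
```

The round trip works because Möbius addition satisfies the left cancellation law, so `mobius_add(-x, mobius_add(x, w))` returns `w`.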

From local connectivity to global proximities

We begin by estimating local connectivity structures, similar to manifold learning techniques [18]. Using the neighborhood function |$\mathcal{N}(i, k)$|⁠, we construct a symmetric graph |$\mathcal{G}=(\mathcal{V}, \mathcal{E}, w)$| to represent the |$k$|-nearest neighbor relationships. Here, the vertex set |$\mathcal{V}=\{\mathbf{x_{i}}\}_{i=1}^{N}$| represents the sample set (i.e. |$\mathbf{X}$|⁠) and the edge set |$\mathcal{E}=\left \{\mathbf{x_{i}} \sim \mathbf{x_{j}}: i \in \mathcal{N}(j, k) \wedge j \in \mathcal{N}(i, k) \right \}$| represents the connectivity between samples. We employ a greedy procedure to ensure a connected graph |$\mathcal{G}$|⁠: (1) construct a standard graph |$\mathcal{G}$| with a specified hyperparameter nearest neighbor number |$k^{r}$|⁠. (2) Identify the minimum length edge that could connect each pair of disconnected components. (3) Connect the two components using the shortest edge. (4) Repeat this process until graph |$\mathcal{G}$| consists of a single connected component. The nearest-neighbor connections are weighted with a Gaussian kernel based on their Euclidean distances:

$$ \begin{align}& w_{ij}=\left\{ \begin{array}{@{}ll}\exp \left(-\frac{\left\|\mathbf{x_{i}}-\mathbf{x_{j}}\right\|^{2}}{2 \sigma^{2}}\right) & \text{ if } \mathbf{x_{i}} \sim \mathbf{x_{j}} \in \mathcal{E}, \\ 0 & \text{ otherwise,} \end{array}\right.\end{align} $$

(6)

where |$\sigma $| determines the width of the kernel and is set to |$1$|⁠.
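The graph construction can be sketched in NumPy as follows. This is one possible reading of the greedy procedure, in which the globally shortest edge between two different components is added until the graph is connected; the function names, the dense distance matrix, and the toy two-cluster data are assumptions of this example rather than details given in the text.

```python
import numpy as np

def pairwise_sq_dists(X):
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def components(adj):
    """Label connected components of a boolean adjacency matrix by depth-first search."""
    n = adj.shape[0]
    labels = -np.ones(n, dtype=int)
    count = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = count
        stack = [start]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u] & (labels < 0)):
                labels[v] = count
                stack.append(v)
        count += 1
    return labels, count

def knn_graph(X, k, sigma=1.0):
    """Mutual k-NN graph, greedily connected, with Gaussian edge weights as in Eq. (6)."""
    n = X.shape[0]
    d2 = pairwise_sq_dists(X)
    order = np.argsort(d2 + np.diag(np.full(n, np.inf)), axis=1)[:, :k]
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.repeat(np.arange(n), k), order.ravel()] = True
    adj = is_nn & is_nn.T                       # i in N(j, k) and j in N(i, k)
    labels, count = components(adj)
    while count > 1:                            # greedy step: add the shortest bridging edge
        cross = labels[:, None] != labels[None, :]
        i, j = np.unravel_index(np.argmin(np.where(cross, d2, np.inf)), d2.shape)
        adj[i, j] = adj[j, i] = True
        labels, count = components(adj)
    return np.where(adj, np.exp(-d2 / (2.0 * sigma ** 2)), 0.0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(0, 0.3, (10, 2)) + [5.0, 0.0]])  # two separated toy clusters
W = knn_graph(X, k=3)
```

With a fixed kernel width, bridging edges between distant components receive very small weights, so the scale of the input features matters when this sketch is applied to real data.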

We then estimate the underlying manifold structure according to graph |$\mathcal{G}$| using relative forest accessibility (RFA) index [57]. The RFA index is derived from the graph Laplacian |$\mathbf{L}=\mathbf{D}-\mathbf{A}$|⁠, with |$\mathbf{A}_{i j}=w_{ij}$| representing the adjacency matrix and |$\mathbf{D}_{i i}=\sum _{j} w_{ij}$| representing the degree matrix. The RFA matrix |$\mathbf{U^{r}}$| is calculated as follows:

$$ \begin{align}& \mathbf{U^{r}}=(\mathbf{I}+\mathbf{L})^{-1} \text{, }\end{align} $$

(7)

where |$\mathbf{U^{r}}$| is a doubly stochastic matrix. Each entry |$u^{r}_{ij}$| reflects the probability that a spanning forest of graph |$\mathcal{G}$| contains a tree rooted at node |$\mathbf{x_{i}}$| that also contains node |$\mathbf{x_{j}}$| (i.e. |$\mathbf{x_{j}}$| is reachable from the root |$\mathbf{x_{i}}$|). Unlike the shortest-path method, the RFA matrix increases the similarity between nodes that frequently appear in multiple shortest paths, making it useful for uncovering hierarchical structures: nodes that appear on many shortest paths are more likely to be positioned near the root of the hierarchy. The RFA matrix estimates global proximities from local connectivity and provides the initial manifold structure for optimizing PoincaréDMT across all experiments.
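As a small numerical illustration of Eq. (7), the sketch below computes the RFA matrix for a toy weighted path graph and confirms that it is doubly stochastic. The function name `rfa_matrix` and the example weights are assumptions, and a dense matrix inverse is used only because the example is tiny.

```python
import numpy as np

def rfa_matrix(W):
    """Relative forest accessibility: U = (I + L)^{-1} with L = D - A."""
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # graph Laplacian
    return np.linalg.inv(np.eye(W.shape[0]) + L)

# Toy path graph 0 - 1 - 2 - 3 with arbitrary positive edge weights.
W = np.zeros((4, 4))
for i, j, w in [(0, 1, 0.8), (1, 2, 0.5), (2, 3, 0.9)]:
    W[i, j] = W[j, i] = w

U_r = rfa_matrix(W)
row_sums = U_r.sum(axis=1)
col_sums = U_r.sum(axis=0)
```

Every row of the Laplacian sums to zero, so the all-ones vector is fixed by `I + L` and therefore by its inverse, which is why the rows (and, by symmetry, the columns) of `U_r` sum to one.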

Framework of PoincaréDMT

As illustrated in Fig. 1(A), PoincaréDMT comprises a structure module utilizing a multilayer perceptron (MLP) and a semantic module employing HNNs to achieve dimensionality reduction and hierarchical structure preservation. The MLP is used to learn embeddings that reveal the underlying manifold structure of the data topology, while the HNNs, based on the Poincaré ball model, are employed to minimize information distortion during the preservation of hierarchical structures.

$$ \begin{align}& \begin{aligned} \mathbf{z^{s}} & = \text{MLP}_{\phi}(\mathbf{x}) \\[1mm] \mathbf{z^{l}} & = \text{HNN}_{\theta}(\mathbf{z^{s}}) \\ \end{aligned}\end{align} $$

(8)

where |$\phi $| and |$\theta $| represent the network parameters of the structure module and semantic module, respectively. Taking the input data |$\mathbf{x}$| as an example, |$\mathbf{z^{s}}$| denotes the high-dimensional structure embeddings produced by the structure module, while |$\mathbf{z^{l}}$| represents the low-dimensional hyperbolic embeddings generated by the semantic module.

The structure embeddings |$\mathbf{z^{s}}$| are produced by the MLP, which can be regarded as a Euclidean neural network. Let |$\mathbf{o}=(0,..., 0)$| denote the origin in |$\mathcal{B}$|, which we use as a reference point for performing tangent space operations. We treat the structure embeddings as Euclidean embeddings (i.e. |$\mathbf{z^{\mathcal{E}}}=\mathbf{z^{s}}$|), interpret them as tangent vectors at the origin, and first map them from the Euclidean space to the Poincaré ball manifold via the exponential map |$\mathbf{z^{\mathcal{B}}}=\exp _{\mathbf{o}}^{K}(\mathbf{z^{s}})$|.

The HNNs [23] learn hyperbolic embeddings by stacking multiple hyperbolic neural network layers, each of which includes a linear transformation and a nonlinear activation. Hyperbolic points are first projected to the tangent space |$\mathcal{T}_{\mathbf{o}} \mathcal{B}$| using the logarithmic map, followed by multiplying the embedding vector with the weight matrix |$\mathbf{W_{\ell }}$| at the |$\ell $|-th layer, and subsequently the vector in the tangent space is projected back to the Poincaré ball manifold using the exponential map. The hyperbolic linear transformation is defined as follows:

$$ \begin{align}& \mathbf{W_{\ell}} \otimes^{K_{\ell}} \mathbf{z^{\mathcal{B}}}:=\exp_{\mathbf{o}}^{K_{\ell}}\left(\mathbf{W_{\ell}} \log_{\mathbf{o}}^{K_{\ell}}\left(\mathbf{z^{\mathcal{B}}}\right)\right).\end{align} $$

(9)

The nonlinear activation |$\sigma $| employs a mechanism similar to the linear transformation, but with adjustable curvature values for each layer. Hyperbolic points are first mapped to the tangent space |$\mathcal{T}_{\mathbf{o}} \mathcal{B}$| of the current layer using the logarithmic map. These points are then transferred from the tangent space to the hyperbolic space of the next layer using the exponential map of the subsequent layer. The hyperbolic nonlinear activation is defined as follows:

$$ \begin{align}& \sigma^{\otimes^{K_{\ell}, K_{\ell+1}}}\left(\mathbf{z^{\mathcal{B}}}\right)=\exp_{\mathbf{o}}^{K_{\ell+1}}\left(\sigma\left(\log_{\mathbf{o}}^{K_{\ell}}\left(\mathbf{z^{\mathcal{B}}}\right)\right)\right).\end{align} $$

(10)
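Equations (9) and (10) can be sketched with the origin-based maps, for which the conformal factor equals 2 and Eqs (3) and (4) simplify accordingly. The code below is a forward pass only, with random weights standing in for trained parameters; the ReLU activation, the layer sizes, and the curvature values are assumptions of this example, since the text does not fix them.

```python
import numpy as np

def exp0(v, K):
    """Exponential map at the origin: tanh(sqrt(-K)||v||) v / (sqrt(-K)||v||)."""
    s = np.sqrt(-K)
    n = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-12)
    return np.tanh(s * n) * v / (s * n)

def log0(z, K):
    """Logarithmic map at the origin: artanh(sqrt(-K)||z||) z / (sqrt(-K)||z||)."""
    s = np.sqrt(-K)
    n = np.maximum(np.linalg.norm(z, axis=-1, keepdims=True), 1e-12)
    return np.arctanh(np.minimum(s * n, 1 - 1e-7)) * z / (s * n)

def hyp_linear(z, W, K):
    """Eq. (9): apply W in the tangent space at the origin, then map back."""
    return exp0(log0(z, K) @ W.T, K)

def hyp_activation(z, K_in, K_out, act=lambda x: np.maximum(x, 0.0)):
    """Eq. (10): activation in the tangent space, output on the ball of curvature K_out."""
    return exp0(act(log0(z, K_in)), K_out)

rng = np.random.default_rng(0)
z_s = 0.1 * rng.normal(size=(8, 16))   # stand-in for the MLP output z^s
K1, K2 = -1.0, -0.5                    # per-layer curvatures (assumed values)
W1 = 0.1 * rng.normal(size=(2, 16))    # untrained weight matrix, 16 -> 2 dimensions

z_B = exp0(z_s, K1)                    # lift structure embeddings onto the ball
h = hyp_linear(z_B, W1, K1)            # hyperbolic linear transformation
z_l = hyp_activation(h, K1, K2)        # two-dimensional hyperbolic embedding
```

The two curvature arguments of `hyp_activation` show how a point can leave one ball and enter another of different curvature between layers, as described for Eq. (10).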

Loss function

The objective is to learn a low-dimensional embedding |$\mathbf{z_{i}^{l}}$| for each sample |$\mathbf{x_{i}}$|⁠, which emphasizes the hierarchical structure among the data. To achieve this, the input data are embedded into a two-dimensional Poincaré disk. The hyperbolic distance between embeddings |$\mathbf{z^{\mathcal{B}}_{i}}, \mathbf{z^{\mathcal{B}}_{j}} \in \mathcal{B}$| is given by

$$ \begin{align}& \mathcal{D_{B}}(\mathbf{z^{\mathcal{B}}_{i}}, \mathbf{z^{\mathcal{B}}_{j}})=\frac{1}{\sqrt{-K}}\operatorname{arcosh}\left(1- \frac{2K\|\mathbf{z^{\mathcal{B}}_{i}}-\mathbf{z^{\mathcal{B}}_{j}}\|^{2}}{\left(1+K\|\mathbf{z^{\mathcal{B}}_{i}}\|^{2}\right)\left(1+K\|\mathbf{z^{\mathcal{B}}_{j}}\|^{2}\right)}\right).\end{align} $$

(11)

The Euclidean distance in |$\mathcal{B}$| is smoothly amplified based on the norms of |$\mathbf{z^{\mathcal{B}}_{i}}$| and |$\mathbf{z^{\mathcal{B}}_{j}}$|⁠, making it particularly suitable for learning hierarchical embeddings. By positioning the root node of a tree at the origin in |$\mathcal{B}$|⁠, it remains relatively close to other nodes due to its zero norm. In contrast, leaf nodes can be positioned near the disk’s boundary, as distances grow rapidly when norms approach one.
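A short NumPy sketch of Eq. (11) illustrates how distances grow toward the boundary. The function name and the example points are assumptions chosen for illustration.

```python
import numpy as np

def poincare_dist(zi, zj, K=-1.0):
    """Hyperbolic distance on the Poincaré ball of curvature K, Eq. (11)."""
    diff2 = np.sum((zi - zj) ** 2, axis=-1)
    den = (1 + K * np.sum(zi ** 2, axis=-1)) * (1 + K * np.sum(zj ** 2, axis=-1))
    arg = 1 - 2 * K * diff2 / den
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(-K)  # clamp guards rounding below 1

o = np.array([0.0, 0.0])     # origin, e.g. the root of a tree
a = np.array([0.5, 0.0])     # interior point
b = np.array([0.95, 0.0])    # near the boundary
c = np.array([0.0, 0.95])    # near the boundary, different direction

d_oa = poincare_dist(o, a)   # equals 2 * artanh(0.5) = ln 3 for K = -1
d_ob = poincare_dist(o, b)
d_bc = poincare_dist(b, c)
d_via_root = d_ob + poincare_dist(o, c)
```

For these points, `d_bc` is much larger than the Euclidean distance between `b` and `c` and is close to `d_via_root`, the length of the path through the origin. This mirrors the tree-like behavior described above, where the shortest route between two leaves passes near the root.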

Manifold learning aims to preserve the geometric structure of the data in embeddings. To maintain the hierarchical relationships between samples, we optimize PoincaréDMT using a geometric structure preservation (GSP) loss, as suggested in [58]. This loss minimizes the discrepancy between the distributions of geometric structure in the high-dimensional data (or structure embeddings) and in the low-dimensional embeddings using a fuzzy set cross-entropy loss:

$$ \begin{align}& \mathcal{L}_{GSP}=\sum_{i,j=1}^{N} u_{i j}^{h} \log \frac{u_{i j}^{h}}{u_{i j}^{l}}+\left(1-u_{i j}^{h}\right) \log \frac{\left(1-u_{i j}^{h}\right)}{\left(1-u_{i j}^{l}\right)},\end{align} $$

(12)

where |$u_{i j}^{h}$| represents the undirectional similarity between high-dimensional input data (i.e. |$u^{r}_{ij}$|⁠) or structure embeddings (i.e. |$u^{s}_{ij}$|⁠), and |$u_{i j}^{l}$| represents the undirectional similarity between low-dimensional hyperbolic embeddings. The |$u^{s}_{ij}$| and |$u^{l}_{ij}$| are defined as follows:

$$ \begin{align}& u_{i j}=u_{i \mid j}+u_{j \mid i}-2 u_{i \mid j} u_{j \mid i},\end{align} $$

(13)

where |$u^{s}_{j \mid i}$| is a directional similarity derived from the Euclidean distance |$\mathcal{D_{E}}(\mathbf{z^{s}_{i}}, \mathbf{z^{s}_{j}})$|⁠, and |$u^{l}_{j \mid i}$| is a directional similarity derived from the hyperbolic distance |$\mathcal{D_{B}}(\mathbf{z^{l}_{i}}, \mathbf{z^{l}_{j}})$|⁠. The |$u^{s}_{j \mid i}$| and |$u^{l}_{j \mid i}$| are weighted using a normalized squared |$t$|-distribution:

$$ \begin{align}& u_{j \mid i}=C_{\nu}\left(1+\frac{\mathcal{D}\left(\mathbf{z_{i}}, \mathbf{z_{j}}\right)}{\nu}\right)^{-(\nu+1)},\end{align} $$

(14)

where |$\nu $| denotes the degrees of freedom associated with the t-distribution. We set |$\nu ^{s}$| for |$u^{s}_{j \mid i}$| to |$100$|⁠, while |$\nu ^{l}$| for |$u^{l}_{j \mid i}$| is treated as a hyperparameter. The normalizing function |$C_{\nu }$| is defined as follows:

$$ \begin{align}& C_{\nu}=2 \pi\left(\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi} \Gamma\left(\frac{\nu}{2}\right)}\right)^{2}.\end{align} $$

(15)
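Equations (12)–(15) can be combined into a compact NumPy sketch. The code transcribes the formulas as written above, with the distance entering Eq. (14) unsquared; the function names, the clipping constant `eps`, and the random toy distance matrices are assumptions made for this illustration.

```python
import math
import numpy as np

def c_nu(nu):
    """Normalizing function C_nu of Eq. (15), evaluated through log-gamma for stability."""
    log_ratio = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
    return 2 * math.pi * (math.exp(log_ratio) / math.sqrt(nu * math.pi)) ** 2

def directional_sim(D, nu):
    """Directional similarity u_{j|i} of Eq. (14) from a pairwise distance matrix D."""
    return c_nu(nu) * (1 + D / nu) ** (-(nu + 1))

def symmetrize(U):
    """Eq. (13): u_ij = u_{i|j} + u_{j|i} - 2 u_{i|j} u_{j|i}."""
    return U + U.T - 2 * U * U.T

def gsp_loss(U_h, U_l, eps=1e-12):
    """Fuzzy set cross-entropy of Eq. (12), summed over all pairs."""
    U_h = np.clip(U_h, eps, 1 - eps)
    U_l = np.clip(U_l, eps, 1 - eps)
    return np.sum(U_h * np.log(U_h / U_l) + (1 - U_h) * np.log((1 - U_h) / (1 - U_l)))

def pairwise_dist(Z):
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

rng = np.random.default_rng(0)
D_high = pairwise_dist(rng.normal(size=(6, 10)))   # toy high-dimensional distances
D_low = pairwise_dist(rng.normal(size=(6, 2)))     # toy low-dimensional distances

U_h = symmetrize(directional_sim(D_high, nu=100.0))
U_l = symmetrize(directional_sim(D_low, nu=1.0))
loss = gsp_loss(U_h, U_l)
loss_same = gsp_loss(U_h, U_h)
```

Each summand is a Kullback–Leibler divergence between two Bernoulli distributions, so the loss is non-negative and vanishes only when the two similarity matrices agree, as `loss_same` shows.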

We design global and local structure preservation losses based on |$\mathcal{L}_{GSP}$| to optimize PoincaréDMT. After applying local connectivity and global proximities operators, we assume that the hierarchical information of input data is stored in the RFA probability |$u_{i j}^{r} \in \mathbf{U^{r}}$|⁠. Therefore, the global loss |$\mathcal{L}_{global}$| is defined as follows:

$$ \begin{align}& \mathcal{L}_{global}=\sum_{i,j=1}^{N} u_{i j}^{r} \log \frac{u_{i j}^{r}}{u_{i j}^{l}}+\left(1-u_{i j}^{r}\right) \log \frac{\left(1-u_{i j}^{r}\right)}{\left(1-u_{i j}^{l}\right)},\end{align} $$

(16)

where |$u_{i j}^{l} \in \mathbf{U^{l}}$| is calculated using the low-dimensional embeddings |$\mathbf{Z^{l}}$| of input data and their nearest neighbors with a hyperparameter |$\nu ^{l}_{global}$|⁠.

Due to the high sparsity of observed UMI counts in cells, using vector similarity measures such as Euclidean distance to define neighbor relationships between cells is challenging. To overcome this, we adopt a structure module to estimate the local topological properties of high-dimensional manifolds and correct the local geometric structures in the RFA matrix. As a result, the local loss |$\mathcal{L}_{local}$| is defined as follows:

$$ \begin{align}& \mathcal{L}_{local}=\sum_{i,j=1}^{N} u_{i j}^{s} \log \frac{u_{i j}^{s}}{u_{i j}^{l}}+\left(1-u_{i j}^{s}\right) \log \frac{\left(1-u_{i j}^{s}\right)}{\left(1-u_{i j}^{l}\right)},\end{align} $$

(17)

where |$u_{i j}^{s} \in \mathbf{U^{s}}$| is calculated using the structure embeddings |$\mathbf{Z^{s}}$|⁠, and |$u_{i j}^{l} \in \mathbf{U^{l}}$| is calculated between the low-dimensional embeddings |$\mathbf{Z^{l}}$| and |$\hat{\mathbf{Z}^{l}}$| with a hyperparameter |$\nu ^{l}_{local}$|⁠.

Finally, the total loss can be described as follows:

$$ \begin{align}& \mathcal{L}_{total} = \mathcal{L}_{global} + \alpha \cdot \mathcal{L}_{local},\end{align} $$

(18)

where |$\alpha $| is a hyperparameter that controls the weight of the local loss, and the total loss is optimized per mini-batch.

Batch correction

In this section, we construct a batch graph based on the multi-hot encoding matrix |$\mathbf{Y}$| of categorical batch labels, using the Euclidean distance |$\mathcal{D_{E}}\left (\mathbf{y_{i}}, \mathbf{y_{j}}\right )$|⁠. This graph is integrated with the low-dimensional embeddings during model training, as illustrated in Fig. 1(B). Specifically, the |$u_{j \mid i}^{l}$| is redefined as follows:

$$ \begin{align}& u_{j \mid i}^{l}=C_{\nu^{l}}\left(1+\frac{\mathcal{D_{B}}\left(\mathbf{z^{l}_{i}}, \mathbf{z^{l}_{j}}\right)+\beta \cdot \mathcal{D_{E}}\left(\mathbf{y_{i}}, \mathbf{y_{j}}\right)}{\nu^{l}}\right)^{-(\nu^{l}+1)},\end{align} $$

(19)

where |$\beta $| is a hyperparameter that controls the weight of batch information. It is worth noting that due to the lack of batch labels, the augmented dataset |${\hat{\mathbf{X}}}$| does not participate in constructing the batch graph.
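The effect of Eq. (19) can be illustrated with a small NumPy sketch in which a batch distance is added to the hyperbolic distance before the similarity kernel is applied. The function names, the one-hot toy labels, the constant placeholder for the hyperbolic distances, and the values of `beta` and `nu` are assumptions of this example.

```python
import math
import numpy as np

def c_nu(nu):
    """Normalizing function C_nu of Eq. (15)."""
    log_ratio = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
    return 2 * math.pi * (math.exp(log_ratio) / math.sqrt(nu * math.pi)) ** 2

def batch_aware_sim(D_hyp, Y, beta, nu):
    """Eq. (19): similarity from hyperbolic distance plus beta times batch-label distance."""
    D_batch = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return c_nu(nu) * (1 + (D_hyp + beta * D_batch) / nu) ** (-(nu + 1))

# Four cells: the first two from batch A, the last two from batch B (one-hot labels).
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# Placeholder hyperbolic distances between embeddings (all pairs equally far apart).
D_hyp = np.ones((4, 4)) - np.eye(4)

U_plain = batch_aware_sim(D_hyp, Y, beta=0.0, nu=1.0)
U_batch = batch_aware_sim(D_hyp, Y, beta=2.0, nu=1.0)
```

Pairs from the same batch keep their similarity, while pairs from different batches appear less similar at the same embedding distance. When the resulting similarities are matched to the high-dimensional targets, the optimizer therefore has to place cross-batch pairs closer together in the embedding, which is one way to read the batch-mixing effect of this term.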

Translation in Poincaré space and Poincaré pseudotime

Hyperbolic space, represented by the Poincaré ball manifold, is a metric space that enables low-distortion embeddings of tree-structured data and measures hierarchical relationships between points through hyperbolic distances. The total loss function |$\mathcal{L}_{total}$| is designed to encourage embeddings in which nodes with small overall distances are positioned near the center of the disk. Although nodes near the origin are often close to the tree’s root, this does not ensure that the root is the nearest one. An isometric transformation can be applied to reposition it at the origin without altering distances between points when the root node is known. To translate the disk’s origin to the root node |$\mathbf{p}$|⁠, the translation of any node |$\mathbf{z}$| is given by

$$ \begin{align}& \tau(\mathbf{z}, \mathbf{p}) = \frac{(1 + 2 \langle \mathbf{p}, \mathbf{z} \rangle + \|\mathbf{z}\|^{2})\mathbf{p} + (1 - \|\mathbf{p}\|^{2})\mathbf{z}}{1 + 2 \langle \mathbf{p}, \mathbf{z} \rangle + \|\mathbf{p}\|^{2} \|\mathbf{z}\|^{2}},\end{align} $$

(20)

where |$\langle \mathbf{p}, \mathbf{z} \rangle $| denotes the inner product. In the Poincaré ball, spatial resolution is higher near the origin, making this translation act like a zoom-in tool for focusing on specific areas within the embedding.
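A minimal sketch of Equation (20) for 2-D points is given here, with hypothetical function names. Evaluating the formula as written gives |$\tau(\mathbf{0}, \mathbf{p}) = \mathbf{p}$|, so in this sketch a known root is brought to the origin by passing the negated root as the second argument. The standard Poincaré distance is included only to check that the map preserves distances.

```python
import math

def poincare_translation(z, p):
    """Eq. (20) applied to a 2-D point z, with translation parameter p."""
    dot = p[0] * z[0] + p[1] * z[1]
    pp = p[0] ** 2 + p[1] ** 2
    zz = z[0] ** 2 + z[1] ** 2
    weight_p = 1.0 + 2.0 * dot + zz
    weight_z = 1.0 - pp
    denom = 1.0 + 2.0 * dot + pp * zz
    return tuple((weight_p * pi + weight_z * zi) / denom for pi, zi in zip(p, z))

def poincare_distance(a, b):
    """Standard hyperbolic distance in the unit disk."""
    sq_diff = sum((x - y) ** 2 for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_a) * (1.0 - norm_b)))

def recenter_on_root(embedding, root):
    """Move `root` to the origin by translating every point with p = -root."""
    shift = (-root[0], -root[1])
    return [poincare_translation(z, shift) for z in embedding]
```

Because the transformation is an isometry, hyperbolic distances between any two cells are unchanged by `recenter_on_root`; only the point of view within the disk moves.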

Pseudotime measures the progression of a cell through biological processes, such as differentiation. In PoincaréDMT, pseudotime is represented by the hyperbolic distance of a cell’s low-dimensional embedding from the root node within the Poincaré disk.
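The pseudotime definition can be sketched as follows, assuming the standard Poincaré distance and a root chosen by index. The function names are illustrative only. When the root sits at the origin, the distance reduces to |$2\,\mathrm{artanh}(\|\mathbf{z}\|)$|, so pseudotime grows without bound as a cell approaches the border of the disk.

```python
import math

def poincare_distance(a, b):
    """Standard hyperbolic distance in the unit disk."""
    sq_diff = sum((x - y) ** 2 for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_a) * (1.0 - norm_b)))

def poincare_pseudotime(embedding, root_index):
    """Pseudotime of each cell = hyperbolic distance to the chosen root cell."""
    root = embedding[root_index]
    return [poincare_distance(z, root) for z in embedding]
```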

Workflow of PoincaréDMT

In this paper, we present PoincaréDMT, a novel method for scRNA-seq data, as illustrated in the flowchart in Fig. 1(A). The detailed steps are described in Algorithm 1.

Results and discussion

Comparison methods and evaluation metrics

We evaluate PoincaréDMT against existing methods across a range of single-cell data analysis tasks, including (1) dimensionality reduction and visualization techniques, such as PCA [6], t-SNE [14], UMAP [15], PaCMap [16], DiffMap [50], ForceAtlas2 [59], PHATE [10], PoincaréMaps [18], and DV_Poin [20]; (2) batch correction methods, including Harmony [41] and scVI [42]; and (3) the pseudotime inference approach diffusion pseudotime (DPT) [35]. The evaluation is performed on synthetic datasets derived from established dynamical systems, as well as real scRNA-seq datasets of varying sizes and complexities, encompassing different numbers of cell types and branching structures. We specifically evaluate PoincaréDMT in the context of cell lineage trees and developmental hierarchies. In addition, we validate the precision of PoincaréDMT in facilitating the identification of marker genes.

To evaluate the effectiveness of various visualization methods quantitatively, we adopt the scale-independent quality metrics |$Q_{global}$| and |$Q_{local}$| [60], which evaluate how well global and local structures are preserved in the low-dimensional embeddings, respectively. These metrics range from 0 (indicating poor preservation) to 1 (indicating excellent preservation). The core idea behind these metrics is that an effective dimensionality reduction method should faithfully preserve both local and global distances of the high-dimensional data manifold. This means that points that are close neighbors in the high-dimensional space should remain close in the low-dimensional embeddings, while points that are far apart should maintain large separations.
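As a rough illustration of how such neighborhood-preservation scores can be computed, the sketch below follows the co-ranking formulation that is commonly associated with these metrics: |$Q_{NX}(K)$| is the mean fraction of |$K$|-nearest neighbors shared between the two spaces, and the split between local and global scales is placed at the maximizer of |$Q_{NX}(K) - K/(N-1)$|. This is an assumption about the definitions in [60], not a transcription of the authors' evaluation code, and the function names are hypothetical. Inputs are precomputed pairwise distance matrices, so any metric (Euclidean, geodesic, hyperbolic) can be used on either side.

```python
def neighbor_order(dist_row, i):
    """Indices of all other points, sorted from nearest to farthest."""
    others = (j for j in range(len(dist_row)) if j != i)
    return sorted(others, key=lambda j: dist_row[j])

def q_nx_curve(dist_high, dist_low):
    """Q_NX(K) for K = 1..N-1: mean share of K-nearest neighbors kept."""
    n = len(dist_high)
    order_high = [neighbor_order(dist_high[i], i) for i in range(n)]
    order_low = [neighbor_order(dist_low[i], i) for i in range(n)]
    curve = []
    for k in range(1, n):
        shared = sum(
            len(set(order_high[i][:k]) & set(order_low[i][:k])) for i in range(n)
        )
        curve.append(shared / (k * n))
    return curve

def q_local_global(dist_high, dist_low):
    """Average Q_NX below and above the scale K_max that maximizes LCMC."""
    n = len(dist_high)
    curve = q_nx_curve(dist_high, dist_low)
    lcmc = [q - k / (n - 1) for k, q in enumerate(curve, start=1)]
    k_max = max(range(1, n), key=lambda k: lcmc[k - 1])
    q_local = sum(curve[:k_max]) / k_max
    q_global = sum(curve[k_max - 1:]) / (n - k_max)
    return q_local, q_global
```

The cost is quadratic in the number of cells per value of |$K$|, so this plain-Python version is only suitable for small examples.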

For the batch correction task, we separately evaluate both geometric structure preservation and clustering performance for each batch, presenting the statistical results using boxplots. This allows us to analyze the performance of dimensionality reduction methods on a complex multi-batch scRNA-seq dataset composed of various smooth manifolds. An effective dimensionality reduction method should maintain both global and local structures within each manifold after eliminating batch effects. We compute the |$Q_{global}$| and |$Q_{local}$| metrics by comparing the input data before batch correction with the resulting embeddings after batch correction, thus measuring the preservation of global and local structures after batch correction [60]. In addition, the average silhouette width (ASW) is used to assess the clustering quality of the embeddings after batch correction [61].

In high-dimensional space, the path length in a fully connected graph is computed using Euclidean distance. For PoincaréMaps, geodesic distances are calculated as the shortest path length within a |$k$|-nearest neighbors graph. In low-dimensional space, path lengths are typically computed using Euclidean distance. For hyperbolic embedding methods such as DV_Poin, PoincaréMaps, and PoincaréDMT, hyperbolic distances are applied.

Model implementation and hyperparameter setting

The PoincaréDMT model is developed using Python 3 with PyTorch Lightning. The layer configurations for the structure and semantic modules are set as (p-200-200-200-200-200-200) and (200-500-80-2), respectively. Each layer utilizes the LeakyReLU activation function alongside batch normalization. The model is trained using the Adam optimizer with a learning rate of |$0.001$| and a weight decay of |$10^{-9}$|. Initially, the model undergoes pretraining without the local loss |$\mathcal{L}_{local}$| for |$200$| epochs, followed by training to optimize the total loss |$\mathcal{L}_{total}$| for another |$200$| epochs. For small datasets, the mini-batch size is determined by |$n/10$|, while for larger datasets with over |$10,000$| cells, the batch size is fixed at |$512$|. All experiments run on an Ubuntu server equipped with a single 32GB V100 GPU.
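The mini-batch rule and the two-stage schedule described above can be written down compactly. This is a framework-independent sketch with hypothetical helper names; rounding |$n/10$| down to an integer is an assumption, as the text does not state how fractional sizes are handled.

```python
def mini_batch_size(n_cells, large_threshold=10_000, large_batch=512):
    """n/10 for smaller datasets, a fixed 512 above 10,000 cells."""
    if n_cells > large_threshold:
        return large_batch
    # Assumption: integer division, with at least one cell per batch.
    return max(1, n_cells // 10)

def loss_for_epoch(epoch, loss_global, loss_local, alpha, pretrain_epochs=200):
    """Global loss only during pretraining, then the total loss of Eq. (18)."""
    if epoch < pretrain_epochs:
        return loss_global
    return loss_global + alpha * loss_local
```

For a dataset of 2,730 cells, for example, `mini_batch_size` gives 273, and epochs 0 to 199 optimize the global loss only before the local term is switched on.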

To ensure optimal performance of each method, we perform a grid search to determine the best hyperparameters as shown in Supplementary Table S2. For t-SNE, UMAP, DiffMap, ForceAtlas2, and PaCMap, the ‘perplexity’ parameter is explored within the range |$\{5, 10, 15, 20, 30, 40\}$|. For UMAP, the ‘min_dist’ parameter search space is |$\{0.1, 0.3, 0.5, 0.7, 0.9\}$|. For PaCMap, the ‘MN_rate’ parameter is varied within |$\{0.2, 0.4, 0.6, 0.8, 1.0\}$|, and the ‘FP_rate’ parameter is varied within |$\{1.0, 1.5, 2.0, 2.5, 3.0\}$|. For PoincaréMaps, the search space spans ‘knn’ values of |$\{15, 20, 30\}$|, ‘sigma’ values of |$\{1, 2\}$|, and ‘gamma’ values of |$\{1, 2\}$|. The default settings are used for PHATE, while DV_Poin follows the search space specified in its original publication [20].

Figure 2

Analysis of synthetic and real scRNA-seq datasets with PoincaréDMT. Comparison of embedding quality metrics |$Q_{global}$| and |$Q_{local}$| (right) with visual inspection (middle) for various datasets, including (A) MyeloidProgenitors, (B) Krumsiek11_blobs, (C) Olsson, and (D) Paul datasets. PoincaréDMT performs consistently well compared with the cell lineage tree (left).

Below, we provide an overview of the roles played by various hyperparameters in PoincaréDMT and offer recommendations for typical value ranges. The parameter |$k^{r}$| reflects the average connectivity of the clusters in |$\mathcal{L}_{global}$|⁠, with a search space of |$\{5, 10, 20, 30\}$|⁠. The parameter |$k^{s}$| controls the size of the local region in |$\mathcal{L}_{local}$|⁠, with a search space of |$\{5, 10, 15, 20\}$|⁠. The parameters |$\nu ^{l}_{global}$| and |$\nu ^{l}_{local}$| govern the degrees of freedom in the t-distribution, with a search space of |$\{0.005, 0.01, 0.03, 0.1\}$|⁠. The parameter |$\alpha $| controls the importance of |$\mathcal{L}_{local}$|⁠, with a search space of |$\{0.1, 1, 10\}$|⁠. The parameter |$\beta $| controls the strength of batch correction, with a search space of |$\{1, 10\}$|⁠. The hyperparameters of PoincaréDMT used for each experimental dataset are shown in Supplementary Table S3.
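The recommended ranges above can be enumerated as a plain grid. The sketch below uses hypothetical key names and assumes that |$\nu ^{l}_{global}$| and |$\nu ^{l}_{local}$| are searched independently over the same set of values, which the text does not state explicitly.

```python
from itertools import product

# Search spaces as listed in the text; key names are illustrative only.
SEARCH_SPACE = {
    "k_r": [5, 10, 20, 30],
    "k_s": [5, 10, 15, 20],
    "nu_global": [0.005, 0.01, 0.03, 0.1],
    "nu_local": [0.005, 0.01, 0.03, 0.1],
    "alpha": [0.1, 1, 10],
    "beta": [1, 10],
}

def iter_configs(space):
    """Yield one dict per point of the Cartesian grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))
```

Under these assumptions the full grid has 4 × 4 × 4 × 4 × 3 × 2 = 1536 configurations, so in practice a subset of the parameters would likely be fixed per dataset.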

PoincaréDMT for dimensionality reduction and visualization

A direct comparison of the quality metrics |$Q_{global}$| and |$Q_{local}$| with visual inspection offers a comprehensive view of the performance of different dimensionality reduction methods, as illustrated in Fig. 2. For datasets with relatively simple trajectories, such as MyeloidProgenitors (Fig. 2A) and Krumsiek11_blobs (Fig. 2B), PoincaréDMT performs strongly, effectively preserving both global and local structures. While PCA also preserves global structures well in these simpler datasets, its ability to capture local structures is comparatively weaker, as indicated by the lower |$Q_{local}$| scores. For more complex datasets with multiple distinct cell types and branching structures, such as Olsson (Fig. 2C) and Paul (Fig. 2D), the limitations of PCA become more apparent, with a significant decline in both |$Q_{global}$| and |$Q_{local}$| metrics. t-SNE often provides good local clustering but fails to capture global relationships accurately (Supplementary Fig. S4D). UMAP generally performs better than t-SNE in preserving global structures in complex datasets. PaCMap and ForceAtlas2 attempt to balance both local and global structure preservation, but they fail to accurately capture the branching structures and continuous trajectories (Supplementary Figs S3D and S5D). PHATE, which is designed to handle both local and global structures, exhibits limitations by distorting certain branches (Supplementary Fig. S3D) and failing to clearly resolve some trajectories (Supplementary Fig. S5D). DiffMap’s representations tend to appear entangled, making it challenging to distinguish separate developmental paths (Supplementary Figs S3D and S5D). DV_Poin aims to handle hierarchical and continuous data better than Euclidean methods, but its effectiveness is inconsistent across synthetic datasets (Supplementary Figs S2D and S3D) and real scRNA-seq datasets (Supplementary Figs S3D and S5D).
In contrast, PoincaréDMT and PoincaréMaps consistently outperform other methods in these scenarios, providing superior preservation of both local and global structures. This is particularly evident in their ability to accurately capture the complex branching patterns in hematopoietic cell lineages, such as better capturing the local structure of Meg and Eryth cells in the Olsson dataset, and more accurately depicting the global structure of CMP, GMP, and LMPP branches in the Paul dataset.

We present a large-scale summary of the scRNA-seq dataset for the entire C. elegans cell atlas using the PoincaréDMT embedding within the Poincaré disk (Fig. 3A). Unlike PoincaréMaps, which is computationally expensive for such tasks, PoincaréDMT provides a feasible solution for this large-scale high-dimensional dataset. Analysis reveals that cells sampled from embryo times earlier than 100 min are mostly unfertilized germline cells (Fig. 3A, B). These germline cells, positioned close to the border of the disk and nearer to mature cell types, potentially reflect their transcriptional diversity compared to other early-stage cells [18]. By randomly selecting a cell from the embryo time 100–130 min as the root, we enable a meaningful trajectory analysis. A comparative analysis of embryo age and cell type within the PoincaréDMT embedding reveals that cells are ordered by embryo time in a continuous trajectory along the Poincaré disk, with distinct trajectories occupying different angular sectors (Fig. 3D). For instance, body wall muscle cells (the most abundant cell type in this dataset) first appear around embryo time 130–170 min and continuously progress towards the border of the Poincaré disk, aligning with embryo time (Supplementary Fig. S7). Similarly, other cell types, such as ciliated amphid neurons, ciliated non-amphid neurons, hypodermis, and seam cells, which appear at slightly different embryo time points, form distinct and continuous trajectories on the disk. These trajectories vary in both angular position and radial distance from the center, effectively capturing their unique developmental timelines [27]. When comparing the pseudotime inferred from the PoincaréDMT embedding with embryo age (Fig. 3D), there is generally strong concordance, except during very early embryonic stages (before 130 min). Conversely, baseline methods show limitations in representing this complex dataset (Supplementary Fig. S6C, D).
For example, PHATE and ForceAtlas2 exhibit a mixing of cells from different types, failing to clearly separate distinct trajectories. Similarly, UMAP, PHATE, and ForceAtlas2 suffer from overcrowded centers in the embedding space, making it challenging to discern distinct trajectories. The t-SNE method displays significant crowding of cells (Fig. 3C), a phenomenon that undermines global structure preservation. In contrast, the superior performance of PoincaréDMT in preserving global structure, as evidenced by the quality metric |$Q_{global}$| (Fig. 3E), highlights its advantage over baseline methods. Visual inspection indicates that the poorer global performance of t-SNE and UMAP might be due to poor separation of seam cells and pharynx cells.

Figure 3

Analysis of C. elegans cell atlas with PoincaréDMT. (A) PoincaréDMT embeddings of the C. elegans cell atlas, where the major cell types are labeled with text. (B) PoincaréDMT embeddings are labeled by embryo time and the germline cell type is positioned near the mature cell types. (C) The t-SNE and UMAP embeddings annotated by cell type and embryo time. (D) Rotation of PoincaréDMT with respect to a randomly picked root cell from the embryo time 100–130 min, and the corresponding PoincaréDMT pseudotime inference. (E) Comparison of embedding quality metrics |$Q_{global}$| and |$Q_{local}$| for different dimensionality reduction methods.

PoincaréDMT for batch correction

scRNA-seq data in real-world biological datasets are affected by a range of factors, including batch-to-batch technical effects resulting from variations in experimental setups and laboratory workflows, as well as biological influences such as disease status, tissue origin, and individual variability. Because most existing embedding methods do not account for batch effects directly, batch correction methods such as Harmony [41] and scVI [42] are often employed as pre-processing steps before dimensionality reduction. However, most current batch correction methods are limited to handling a single batch variable, making them less effective for increasingly complex datasets with multiple confounding factors. In contrast, PoincaréDMT integrates a batch graph that accounts for batch labels into the low-dimensional embedding during model training, thereby achieving a unified optimization of both dimensionality reduction and batch correction. We apply PoincaréDMT to colon epithelial cells and colon immune cells, which are influenced by multiple known confounding factors such as patient origin, disease status, and location, and systematically compare its embeddings with those of various visualization methods. These methods include t-SNE, UMAP, DiffMap, ForceAtlas2, and PHATE, each combined with a batch correction method, either Harmony (Harmony-corrected 50 PCs) or scVI.

For colon epithelial cells, we analyze the hierarchical structures within the developmental trajectories. In the Poincaré disk, we can distinctly observe two developmental trajectories originating from intestinal stem cells and leading to terminally differentiated cells, with the stem cells positioned in the center of the disk for intuitive interpretation (Fig. 4B). The identified trajectories are as follows (Fig. 4A): (1) Stem cells differentiate into cycling transit-amplifying (TA) cells, which then transition into secretory TA cells, followed by immature goblet cells, and finally mature goblet cells. (2) Stem cells progress into TA2 cells, then differentiate into immature enterocytes, and ultimately become mature enterocytes. Euclidean space-based methods show distortions in cell developmental order (Supplementary Fig. S8C). Specifically, DiffMap and PHATE exhibit significant distortions, t-SNE embeddings contain noise, and developmental trajectories are less discernible with ForceAtlas2. Although UMAP reveals the two trajectories, it relies heavily on effective batch correction, as the results from combining scVI with UMAP are less satisfactory. In contrast, PoincaréDMT clearly reveals both trajectories and suggests a potential differentiation pathway from enteroendocrine cells to Tuft cells (Fig. 4B). PoincaréDMT shows a significant advantage over baseline methods in Euclidean space in preserving both global and local geometric structures and in clustering performance (Fig. 4D).

Figure 4

Analysis of colon cells with PoincaréDMT. (A) The developmental hierarchy as suggested by Ding et al. [27]. (B) Rotation of PoincaréDMT with respect to a randomly picked root cell from the stem cell type. (C) PoincaréDMT pseudotime. (D) Statistics of embedding quality metrics |$Q_{global}$|, |$Q_{local}$| and average silhouette width across multiple patient origin batches for colon epithelial cells. (E) Embeddings produced by PoincaréDMT for colon immune cells, T cells and B cells. (F) Statistics of embedding quality metrics across multiple patient origin batches for colon immune cells.

For colon immune cells, we analyze the hierarchical structures among cell subtypes. In the Poincaré disk, three main cell types (B cells, T cells, and myeloid cells) are clearly distinguishable (Fig. 4E). For T cells, there is a discernible trend from cycling T cells (proliferating early-stage cells) towards various functional cell groups, such as CD4+ T cells, CD8+ T cells, and Tregs. For B cells, a clear progression from cycling B cells, through the germinal center reaction, to differentiation into follicular B cells, and finally into plasma cells, marks the completion of B cell maturation and functional specialization. Myeloid cells exhibit a hierarchical structure, progressing from immature stages (cycling monocytes) to mature effector cells (macrophages, DC1 and DC2). The transition of mast cells between activation states (CD69- Mast to CD69+ Mast) emphasizes their role in immune responses (Supplementary Fig. S9A). Euclidean space-based methods show distortions similar to those observed in colon epithelial cells (Supplementary Fig. S9B). DiffMap and PHATE, although typically well-suited for capturing dynamic changes during cell development, and t-SNE and ForceAtlas2, which produce crowded results, are all less effective at revealing the hierarchical structure of cell subtypes. PoincaréDMT shows a significant advantage in preserving both global and local geometric structures compared to other methods in Euclidean space (Fig. 4F), and achieves competitive clustering performance compared to baseline methods combined with Harmony.

PoincaréDMT for pseudotime inference and marker gene analysis

The proposed PoincaréDMT method enhances the interpretability of scRNA-seq data by integrating complementary approaches for pseudotime inference, lineage detection and marker gene analysis. These enhancements provide a comprehensive toolkit for understanding dynamic cellular processes and identifying key regulatory genes in scRNA-seq datasets. Specifically, PoincaréDMT combines low-dimensional embeddings and hyperbolic distance for pseudotime inference, integrates embeddings with agglomerative clustering based on angular coordinates for lineage detection, and employs the SHAP method with the trained model for marker gene selection.

Figure 5

Analysis of mouse hematopoiesis with PoincaréDMT. (A) The developmental hierarchy as suggested by Moignard et al. [54]. (B) Rotation of PoincaréDMT with respect to the reassigned root. (C) Lineages identified by clustering based on angle within the Poincaré disk. (D) Composition of the identified lineages, consisting of cells from various developmental stages. (E) Diffusion and PoincaréDMT pseudotime with respect to the root from Haghverdi et al. and the reassigned root. (F) Violin plots comparing diffusion and PoincaréDMT pseudotime for each embryo development stage. (G) Gene contribution (mean SHAP absolute value) in cells at the 4SFG stage. (H) Expression of the main genes selected in cells at the 4SFG stage.

An appropriate root node is crucial for accurate pseudotime analysis. For MyeloidProgenitors (Supplementary Fig. S2B), Krumsiek11_blobs (Supplementary Fig. S3B), Olsson (Supplementary Fig. S4B), Paul (Supplementary Fig. S5B), and colon epithelial cells (Fig. 4C), we select either stem cells or a predefined root cell as the root node and present the pseudotime inference results derived from the PoincaréDMT embeddings. Our analysis reveals a strong concordance between the PoincaréDMT pseudotime and the corresponding cell lineage trees, indicating an effective alignment with the developmental trajectories. To further validate this, we compare the pseudotime inferred from PoincaréDMT embeddings with Haghverdi’s DPT reconstruction for the Moignard2015 dataset (Fig. 5E). According to PoincaréDMT, there is a distinct cluster composed of cells from various developmental stages (Supplementary Fig. S11). Moignard et al. [54] describe this cluster as ‘mesodermal’, while Haghverdi et al. [35] view it as the origin point of the developmental process. However, assigning this cluster as the root contradicts the temporal progression of development (Supplementary Fig. S12A). To address this, we reassign the root to the most distal PS cell outside the ‘mesodermal’ cluster. Figure 5F and Supplementary Fig. S13(E) show a significantly enhanced agreement between the PoincaréDMT pseudotime (with the reassigned root) and development stages, as opposed to the pseudotime ordering suggested by Haghverdi et al.

We also compare the results from PoincaréDMT with those obtained using Moignard’s DiffMap study [54]. Our findings indicate that PCA, DiffMap and ForceAtlas2 are insufficient for capturing the full range of heterogeneity present at the onset of development and fail to effectively reveal developmental asynchrony (Supplementary Fig. S10C), as suspected by Moignard’s and Haghverdi’s analyses. To further elucidate this phenomenon, we employ agglomerative clustering based on the angular distance between points in the Poincaré disk after rotation with respect to the reassigned root for lineage detection (Supplementary Fig. S13). This method identifies six potential lineages, providing insights into the developmental asynchrony through marker gene expression analysis (Fig. 5C). The analysis of cell composition in each lineage shows that erythroid cells are exclusively found in lineage |$2$| and lineage |$4$| (Fig. 5D), which contain no endothelial cells.

Previous research suggests that the division into endothelial and erythroid sub-populations occurs during the HF stage. In contrast, PoincaréDMT indicates that the fate of these sub-populations is predetermined at the PS stage [18]. Gene expression analysis of key endothelial and hemogenic markers aligns with known activation patterns of genes in both endothelial and erythroid branches (Supplementary Fig. S14). This analysis shows that primary hemogenic genes related to the erythroid population begin expressing at the PS stage, with distinct gene expression differences evident across all stages between the lineages. The marker genes selected for 4SFG stage, based on the trained PoincaréDMT model, are consistent with known hemogenic markers such as HbbbH1, Ikaros, Gfi1b, and Myb (Fig. 5G, H), which highlights the model’s effectiveness in accurately identifying key regulatory genes associated with hematopoietic differentiation.
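The gene ranking behind Fig. 5G, the mean absolute SHAP value per gene, can be sketched without any SHAP library once a matrix of SHAP values is available. The sketch assumes one row per cell and one column per gene, computed beforehand from the trained model; the function name is hypothetical.

```python
def rank_genes_by_mean_abs_shap(shap_values, gene_names, top_k=5):
    """Rank genes by their mean |SHAP| over a group of cells.

    shap_values: list of rows (one per cell), each with one value per gene.
    Returns the top_k (gene, score) pairs, highest contribution first.
    """
    n_cells = len(shap_values)
    scores = [
        sum(abs(row[g]) for row in shap_values) / n_cells
        for g in range(len(gene_names))
    ]
    ranked = sorted(zip(gene_names, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

Restricting the rows to the cells of one stage or cluster (for example, the 4SFG stage) gives a ranking specific to that group.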

Conclusion

In this paper, we propose PoincaréDMT, a method specifically designed for scRNA-seq data. By mapping high-dimensional scRNA-seq data to a hyperbolic Poincaré disk, PoincaréDMT effectively preserves the complex hierarchical structures and continuous trajectories intrinsic to cellular development. This method overcomes the limitations of existing methods by leveraging the global structure preservation strength of a graph Laplacian derived from the pairwise distance matrix and incorporating a dedicated local structure correction module combined with data augmentation. Additionally, PoincaréDMT integrates a batch graph for batch correction during the embedding process and utilizes the SHAP method to interpret important marker genes within specific clusters and cell differentiation processes. Therefore, the unified framework provided by PoincaréDMT supports several critical tasks in scRNA-seq analysis, including trajectory analysis, batch correction, pseudotime inference, and marker gene selection. Through extensive evaluations on simulated and real scRNA-seq datasets, we demonstrate that PoincaréDMT surpasses current methods in preserving global and local data structures. These capabilities highlight the potential of PoincaréDMT as a powerful tool for advancing our understanding of cellular processes and improving the analysis of scRNA-seq data.

Key Points

We describe PoincaréDMT, a universal deep manifold learning framework for dimensionality reduction, data visualization, batch correction, pseudotime inference, and marker gene selection.
PoincaréDMT preserves the global structure from the graph Laplacian matrix and achieves the local structure correction via a structure module combined with data augmentation to learn low-dimensional embeddings in the hyperbolic space.
PoincaréDMT is a versatile, efficient, easy-to-use, and robust framework for scRNA-seq data analysis.

Acknowledgments

The authors thank Dr. Xingnan Huang for valuable discussions and suggestions. The authors thank the anonymous reviewers for their valuable suggestions.

Author contributions

Yongjie Xu and Zelin Zang contributed equally to this work. Stan Z. Li proposed this research. Yongjie Xu and Zelin Zang developed the method. Yongjie Xu, Bozhen Hu, Yue Yuan, Cheng Tan, and Jun Xia collected the datasets. Yongjie Xu and Zelin Zang conceived the experiments. Yongjie Xu conducted the experiments and analysed the results. Yongjie Xu and Zelin Zang wrote the manuscript with guidance from Stan Z. Li. All authors discussed the results, revised the draft manuscript, and read and approved the final manuscript.

Conflict of interest

None declared.

Funding

This work was supported by the National Science and Technology Major Project of China (No. 2022ZD0115100), the National Natural Science Foundation of China Project (No. U21A20427, No. 624B2115), and Project (No. WU2022A009) from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University. We thank the Westlake University HPC Center for providing computational resources.

Data and software availability

Source code is freely available at https://github.com/Westlake-AI/PoincareDMT. To make the results presented in this study reproducible, all processed data are available in Single Cell Portal SCP2757.

References

Tanay

Regev

Scaling single-cell genomics from phenomenology to mechanism

Nature

2017

;

541

331

–

Goldberg

David Allis

Bernstein

Epigenetics: A landscape takes shape

Cell

2007

;

128

635

–

10.1016/j.cell.2007.02.006

Liu

Zeng

Kan

. et al. .

CAKE: a flexible self-supervised framework for enhancing cell visualization, clustering and rare cell identification

Brief Bioinform

2024

;

bbad475

Google Scholar

Crossref

WorldCat

Luecken

Theis

Current best practices in single-cell RNA-seq analysis: A tutorial

Mol Syst Biol

2019

;

e8746

10.15252/msb.20188746

Stegle

Teichmann

Marioni

Computational and analytical challenges in single-cell transcriptomics

Nat Rev Genet

2015

;

133

–

Wold

Esbensen

Geladi

Principal component analysis

Chemom Intel Lab Syst

1987

;

–

10.1016/0169-7439(87)80084-9

Google Scholar

Crossref

WorldCat

Tenenbaum

De Silva

Langford

A global geometric framework for nonlinear dimensionality reduction

Science

2000

;

290

2319

–

10.1126/science.290.5500.2319

Coifman

Lafon

Diffusion maps

Appl Comput Harmon Anal

2006

;

–

10.1016/j.acha.2006.04.006

Google Scholar

Crossref

WorldCat

Alexander Wolf

Hamey

Plass

. et al. .

PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells

Genome Biol

2019

;

–

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

10.

Moon

van Dijk

Wang

. et al. .

Visualizing structure and transitions in high-dimensional biological data

Nat Biotechnol

2019

;

1482

–

10.1038/s41587-019-0336-3

11.

Roweis

Saul

Nonlinear dimensionality reduction by locally linear embedding

Science

2000

;

290

2323

–

10.1126/science.290.5500.2323

12.

Hinton

Roweis

Stochastic neighbor embedding

Adv Neural Inf Process Syst

2002;

:833–40.

13.

Belkin

Niyogi

Laplacian eigenmaps for dimensionality reduction and data representation

Neural Comput

2003

;

1373

–

10.1162/089976603321780317

Google Scholar

Crossref

WorldCat

14.

Van der Maaten

Hinton

Visualizing data using t-sne

J Mach Learn Res

2008

;

:2579–2605.

Google Scholar

OpenURL Placeholder Text

WorldCat

15.

McInnes

Healy

Melville

Umap: Uniform manifold approximation and projection for dimension reduction.

arXiv preprint arXiv:1802.03426

2018

16.

Wang

Huang

Rudin

. et al. .

Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMap for data visualization

J Mach Learn Res

2021

;

9129

–

201

Google Scholar

OpenURL Placeholder Text

WorldCat

17.

Zang

. et al. .

Dlme: Deep local-flatness manifold embedding

. In:

European Conference on Computer Vision

, pp.

576

–

. Springer Nature Switzerland, 2022.

10.1007/978-3-031-19803-8_34

18.

Klimovskaia

Lopez-Paz

Bottou

. et al. .

Poincaré maps for analyzing complex hierarchies in single-cell data

Nat Commun

2020

;

2966

10.1038/s41467-020-16822-4

19.

Bhasker

Chung

Boucherie

. et al. .

Contrastive poincaré maps for single-cell data analysis

. In:

ICLR 2024 Workshop on Machine Learning for Genomics Explorations

. Vienna, Austria: Conference on Learning Representations; 2024.

20.

Yongjie

Zang

Xia

. et al. .

Structure-preserving visualization for single-cell RNA-seq profiles using deep manifold transformation with batch-correction

Communications Biology

2023

;

:369.

10.1038/s42003-023-04662-z

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

21.

Tian

Zhong

Lin

. et al. .

Complex hierarchical structures in single-cell genomics data unveiled by deep hyperbolic manifold learning

Genome Res

2023

;

232

–

10.1101/gr.277068.122

22.

Gromov

Katz

Pansu

. et al. .

Metric Structures for Riemannian and Non-Riemannian Spaces

, Vol.

152

. Basel, Switzerland, and Boston, MA: Birkhäuser; 1999.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

23. Chami, Ying, Ré, et al. Hyperbolic graph convolutional neural networks. Adv Neural Inf Process Syst 2019;4869–.

24. Mathieu, Le Lan, Maddison, et al. Continuous hierarchical representations with Poincaré variational auto-encoders. Adv Neural Inf Process Syst 2019;:12565–76.

25. Nickel, Kiela. Poincaré embeddings for learning hierarchical representations. Adv Neural Inf Process Syst 2017;:6341–50.

26. Ovinnikov. Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427, 2019.

27. Ding, Regev. Deep generative model embedding of single-cell RNA-seq profiles on hyperspheres and hyperbolic spaces. Nat Commun 2021;2554. https://doi.org/10.1038/s41467-021-22851-4

28. Kingma, Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

29. Ganea, Bécigneul, Hofmann. Hyperbolic neural networks. Adv Neural Inf Process Syst 2018;:5350–60.

30. Park, Cho, Chang, et al. Unsupervised hyperbolic representation learning via message passing auto-encoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5516–. Nashville, TN, USA: IEEE, 2021.

31. Peng, Varanka, Mostafa, et al. Hyperbolic deep neural networks: A survey. IEEE Trans Pattern Anal Mach Intell 2021;10023–.

32. Shimizu, Mukuta, Harada. Hyperbolic neural networks++. In: International Conference on Learning Representations. 2021.

33. Qiu, Mao, Tang, et al. Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 2017;979–.

34. Zang, Duan, et al. A review of artificial intelligence based biological-tree construction: priorities, methods, applications and trends. arXiv preprint arXiv:2410.04815, 2024.

35. Haghverdi, Maren Büttner, Wolf, et al. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 2016;845–.

36. Amodio, Van Dijk, Srinivasan, et al. Exploring single-cell data with deep multitasking neural networks. Nat Methods 2019;1139–. https://doi.org/10.1038/s41592-019-0576-7

37. Barkas, Petukhov, Nikolaeva, et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat Methods 2019;695–. https://doi.org/10.1038/s41592-019-0466-z

38. Butler, Hoffman, Smibert, et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 2018;411–.

39. Haghverdi, Lun ATL, Morgan, et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018;421–.

40. Hie, Bryson, Berger. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol 2019;685–. https://doi.org/10.1038/s41587-019-0113-3

41. Korsunsky, Millard, Fan, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 2019;1289–. https://doi.org/10.1038/s41592-019-0619-0

42. Lopez, Regier, Cole, et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;1053–. https://doi.org/10.1038/s41592-018-0229-2

43. Welch, Kozareva, Ferreira, et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 2019;177:1873–1887.e17. https://doi.org/10.1016/j.cell.2019.05.006

44. Zang, Cheng, Xia, et al. DMT-EV: an explainable deep network for dimension reduction. IEEE Trans Vis Comput Graph 2022;1710–.

45. Zang, Yongjie, Linyan, et al. UDRN: unified dimensional reduction neural network for feature selection and feature projection. Neural Netw 2023;161:626–. https://doi.org/10.1016/j.neunet.2023.02.018

46. Peike, Sun, Fahira, et al. DROEG: a method for cancer drug response prediction based on omics and essential genes integration. Brief Bioinform 2023;bbad003.

47. Zang, Wang, et al. DMT-HI: MOE-based hyperbolic interpretable deep manifold transformation for unspervised dimensionality reduction. arXiv preprint arXiv:2410.19504, 2024.

48. Wolf, Angerer, Theis. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 2018;–.

49. Gardner, Cantor, Collins. Construction of a genetic toggle switch in Escherichia coli. Nature 2000;403:339–.

50. Haghverdi, Buettner, Theis. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 2015;2989–. https://doi.org/10.1093/bioinformatics/btv325

51. Krumsiek, Marr, Schroeder, et al. Hierarchical differentiation of myeloid progenitors is encoded in the transcription factor network. PloS One 2011;e22649. https://doi.org/10.1371/journal.pone.0022649

52. Olsson, Venkatasubramanian, Chaudhri, et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature 2016;537:698–702.

53. Paul, Arkin Y’a, Giladi, et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 2015;163:1663–. https://doi.org/10.1016/j.cell.2015.11.013

54. Moignard, Woodhouse, Haghverdi, et al. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat Biotechnol 2015;269–.

55. Packer, Zhu, Huynh, et al. A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. Science 2019;365:eaax1971. https://doi.org/10.1126/science.aax1971

56. Smillie, Biton, Ordovas-Montanes, et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 2019;178:714–730.e22. https://doi.org/10.1016/j.cell.2019.06.029

57. Chebotarev, Shamis. The matrix-forest theorem and measuring relations in small social groups. arXiv preprint math/0602070, 2006.

58. Zang. Deep manifold transformation for dimension reduction. arXiv preprint arXiv:2010.14831, 372:373, 2020.

59. Jacomy, Venturini, Heymann, et al. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PloS One 2014;e98679. https://doi.org/10.1371/journal.pone.0098679

60. Lee, Verleysen. Scale-independent quality criteria for dimensionality reduction. Pattern Recogn Lett 2010;2248–. https://doi.org/10.1016/j.patrec.2010.04.013

61. Tran HTN, Ang, Chevrier, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 2020;–. https://doi.org/10.1186/s13059-019-1850-9

Author notes

Yongjie Xu and Zelin Zang contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
