Zhenhao Zhang, Yuxi Liu, Meichen Xiao, Kun Wang, Yu Huang, Jiang Bian, Ruolin Yang, Fuyi Li, Graph contrastive learning as a versatile foundation for advanced scRNA-seq data analysis, Briefings in Bioinformatics, Volume 25, Issue 6, November 2024, bbae558, https://doi.org/10.1093/bib/bbae558
Abstract
Single-cell RNA sequencing (scRNA-seq) offers unprecedented insights into transcriptome-wide gene expression at the single-cell level. Cell clustering has long been established in the analysis of scRNA-seq data to identify groups of cells with similar expression profiles. However, cell clustering is technically challenging, as raw scRNA-seq data suffer from various analytical issues, including high dimensionality and dropout values. Existing research has developed deep learning models, such as graph machine learning models and contrastive learning-based models, for cell clustering using scRNA-seq data and has summarized the unsupervised learning of cell clustering into a human-interpretable format. While advances in cell clustering have been profound, a simple yet effective framework for learning the high-quality representations necessary for robust clustering is still lacking. In this study, we propose scSimGCL, a novel framework based on the graph contrastive learning paradigm for self-supervised pretraining of graph neural networks. This framework facilitates the generation of high-quality representations crucial for cell clustering. Our scSimGCL incorporates cell-cell graph structure learning and contrastive learning to enhance the performance of cell clustering. Extensive experimental results on simulated and real scRNA-seq datasets suggest the superiority of the proposed scSimGCL. Moreover, clustering assignment analysis confirms the general applicability of scSimGCL with state-of-the-art clustering algorithms. Further, the ablation study and hyperparameter analysis suggest the efficacy of our network architecture and the robustness of its decisions in the self-supervised learning setting. The proposed scSimGCL can serve as a robust framework for practitioners developing tools for cell clustering. The source code of scSimGCL is publicly available at https://github.com/zhangzh1328/scSimGCL.
Introduction
Cells are the essential building blocks of all living organisms and play a pivotal role in a myriad of biological functions. The heterogeneity in gene expression across individual cells has emerged as a significant field of study, especially with the development of single-cell RNA sequencing (scRNA-seq). scRNA-seq, perhaps the most widely used method for transcriptome-wide analysis at the single-cell level [1], allows deeper insight into cell classification, characterization, and differentiation. Such analyses have successfully uncovered complex associations among cellular functions and improved the performance of various downstream applications. These applications include profiling genetic diversity within populations [2], identifying and annotating cell subtypes [3], and facilitating advances in drug discovery and development [4].
However, raw scRNA-seq data present their own set of challenges, such as high dimensionality and measurement noise [5]. Notably, measurement noise often causes dropout events, where genes appear to be unexpressed due to technical artifacts rather than biological reality. Recent advancements in deep learning techniques have shown promise in addressing these challenges by reducing the dimensionality of data and filtering out the noise to uncover biological signals. Deep learning techniques have been widely used in bioinformatics, including scRNA-seq data analysis, to provide new insights into cellular functions.
In the realm of scRNA-seq data analysis, deep neural networks have been fruitfully explored. Prior studies [6–13] have employed autoencoders, a type of neural network, to effectively learn condensed representations of scRNA-seq data in a lower-dimensional space. Representative examples include scDeepCluster, DESC, and scDCC [6, 7, 10], which are specifically designed to produce clustering assignments from scRNA-seq data. Other approaches, such as DCA and scGMAI [8, 9], integrate data imputation into their clustering pipelines to mitigate the impact of dropout events, thereby facilitating the formation of informative gene expression matrices, which is essential for improving cell clustering.
More recent studies have seen increasingly rapid advances in the field of scRNA-seq data analysis [14–19]. On the one hand, graph-sc, scASGC, and scGAC [14–16] employ graph autoencoders to encode scRNA-seq data as cell graphs and capture between-cell associations. On the other hand, contrastive-sc, scDCCA, and scDECL [17–19] combine autoencoders with contrastive learning. The intuition behind contrastive learning is to compare input samples in order to learn desirable representations from the similarity between positive pairs and the dissimilarity between negative pairs. Accordingly, contrastive learning is recognized as an alternative to reconstruction-based approaches, focusing on learning an embedding space in which similar samples are pulled closer together while dissimilar ones are pushed farther apart [20–23].
Despite these promising advances, research has yet to systematically develop a simple yet effective framework for learning high-quality representations crucial for robust cell clustering. Learning high-quality representations builds upon the success of pretraining generalizable models, which aligns with the promise of current foundation models (i.e. large-scale pretrained models that can be applied to various downstream use cases and tasks) [24–26]. Foundation models have been instrumental in our understanding of the role of deep learning in the biological context. However, the transition to practical, actionable insights in scRNA-seq remains challenging, especially in analyzing and interpreting complex biological data.
To bridge this gap, we incorporate the graph contrastive learning (GCL) paradigm into scRNA-seq data analysis. Several key aspects of modeling need to be carefully considered. First, the construction of cell-cell graphs from scRNA-seq data presents a fundamental challenge. Second, there is an urgent need to design a neural network architecture that leverages the GCL paradigm to learn representations desirable for cell-cell graphs. Third, the development of pretraining strategies as a pretext task in the self-supervised learning setting is critical for improving the effectiveness and robustness of the model decisions.
In this study, we propose scSimGCL, a simple and effective framework that combines graph neural networks with contrastive learning, aligning with the GCL paradigm, specifically tailored for scRNA-seq data analysis. The GCL paradigm facilitates self-supervised pretraining of graph neural networks, which enables the generation of high-quality representations crucial for robust cell clustering. A critical component of our scSimGCL is the innovative construction of cell-cell graphs using scRNA-seq data. We develop a cell-cell graph structure learning mechanism that pays attention to the critical parts of the input data using a multi-head attention module [27] for improving the accuracy and relevance of graphs. This module is used for the purpose of deriving a nuanced cell-cell graph, where individual cells form nodes and their biological associations are represented as edges.
Moreover, our study addresses a major issue in existing methods that combine graph machine learning with contrastive learning for scRNA-seq data. Prior studies [28–30] often directly integrate preprocessed cell graphs (e.g. built via K-Means) into the contrastive learning scheme. However, this practice is likely to disrupt the intrinsic homophily of the graph, where connected nodes are expected to share similar properties, thus resulting in inferior representation quality. In response to this issue, our approach adopts an effective strategy for constructing contrastive pairs using the homophily of the original and augmented cell-cell graphs (generated by gene masking and edge dropping [31]). This strategy preserves the internal structure of the graphs, facilitating the generation of informative contrastive pairs. Further, our approach introduces a well-designed pretraining mechanism to explore the topological nuances of cell-cell graphs. This mechanism implements contrastive learning with data imputation in a joint learning strategy, enhancing model performance across various evaluation metrics.
We conduct rigorous evaluations of scSimGCL against deep learning approaches on both simulated and real scRNA-seq datasets. Our framework outperforms competing state-of-the-art models in scRNA-seq data imputation and cell clustering by a considerable margin. Further experimental investigations into the clustering assignments reveal that scSimGCL can seamlessly integrate state-of-the-art clustering algorithms into its architecture. Notably, the model can make informed decisions either by indicating a predefined number of clusters or by automatically determining the number of clusters (i.e. with K-Means and Affinity Propagation, respectively), which is a key aspect in unsupervised learning scenarios such as scRNA-seq data analysis. The ablation study and hyperparameter analysis further confirm the efficacy and robustness of scSimGCL in the self-supervised learning setting.
Materials and methods
The framework of scSimGCL
The overall structure of the proposed scSimGCL is illustrated in Fig. 1. First, we incorporate a multi-head attention module [27] to learn a cell-cell graph from scRNA-seq data. Second, we adopt feature/gene masking and edge dropping in the learned cell-cell graph to generate its counterpart. Subsequently, the learned cell-cell graph and its counterpart are used to create contrastive pairs. Last, we design a pretraining mechanism to generate high-quality representations for cell clustering, where contrastive learning and data imputation are implemented with a joint learning strategy.

Figure 1: The overall framework of scSimGCL, which includes cell-cell graph structure learning and contrastive learning, taking into account the contributions of cell-cell graph representations for cell clustering.
Cell-cell graph structure learning
Let |$\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathrm{X}) = (\mathrm{A}, \mathrm{X})$| be a cell-cell graph. |$\mathcal{V}=\left \{v_{1}, v_{2}, \cdots , v_{N}\right \}$| is a collection of nodes, where each node corresponds to a cell and |$N$| is the number of cells. |$\mathcal{E}$| is a collection of edges between nodes. |$\mathrm{X} \in \mathbb{R}^{N \times g}$| is the gene expression matrix obtained after preprocessing, where |$g$| is the number of genes. The genes with zero values in |$X$| are indicated by a mask matrix |$\mathrm{M} \in \mathbb{R}^{N \times g}$|. The aim of cell-cell graph structure learning is to obtain an adjacency matrix |$A \in \{0, 1\}^{N \times N}$| from scRNA-seq data, where |$A_{ij}$| denotes the edge between cells |$i$| and |$j$|. To obtain the adjacency matrix |$A$|, we incorporate a multi-head attention module [27] into the network architecture. Specifically, query |$Q$| and key |$K$| vectors are generated by a linear transformation of |$X$|. The dot product between |$Q$| and |$K$| is passed through the Softmax function to obtain |$m$| attention heads. All heads are concatenated and then projected by a linear transformation back to the original space. The mathematical formulation can be written as follows:
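(A plausible form, following the standard multi-head attention formulation of [27]; the projection matrices |$\mathrm{W}_{i}^{Q}$|, |$\mathrm{W}_{i}^{K}$|, and |$\mathrm{W}^{O}$| denote the linear transformations described above.)

$$head_{i} = \operatorname{Softmax}\left(\frac{(\mathrm{X}\mathrm{W}_{i}^{Q})(\mathrm{X}\mathrm{W}_{i}^{K})^{\top}}{\sqrt{d_{K}}}\right), \qquad \mathrm{A}^{\prime} = \left(head_{1} \oplus head_{2} \oplus \cdots \oplus head_{m}\right)\mathrm{W}^{O}$$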
where |$head_{i}$| is the |$i$|th attention head. |$\oplus $| is the concatenation. |$d_{K}$| is the dimension of |$K$|. Accordingly, a sparse and non-negative adjacency matrix |$A$| can be obtained using the intermediate matrix |$\mathrm{A}^{\prime }$| with numerical constraints as follows:
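(A plausible form of this constraint, consistent with the thresholding described below, is:)

$$\mathrm{A}_{ij} = \begin{cases} \mathrm{A}^{\prime}_{ij}, & \text{if } \mathrm{A}^{\prime}_{ij} \geq \eta \\ 0, & \text{otherwise} \end{cases}$$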
where |$\eta $| is a learnable threshold below which values are excluded. Note that the proposed scSimGCL is an end-to-end cell representation learning framework in which cell-cell graph structure learning and cell-cell GCL work closely together. Accordingly, the proposed cell-cell graph structure learning module is not trained independently of scSimGCL.
Cell-cell GCL
Data augmentation techniques allow the construction of counterparts from data samples. A data augmentation module is an important component in the GCL paradigm and plays a vital role in increasing the diversity of input graph samples [32]. We adopt attribute/gene masking and edge dropping [31] in the learned cell-cell graph to generate its counterpart. Specifically, we implement edge dropping by randomly dropping a collection of edges from the cell-cell graph. Given the adjacency matrix |$A$|, a masking matrix |$\mathrm{M}^{(a)} \in \{0, 1\}^{N \times N}$| is used, where each |$M_{ij}^{(a)}$| is drawn independently from a Bernoulli distribution with probability |$p^{(a)}$|. Accordingly, a new |$\tilde{\mathrm{A}}$| is derived from |$A$| and |$\mathrm{M}^{(a)}$| as follows:
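(A plausible element-wise form is:)

$$\tilde{\mathrm{A}}_{ij} = \mathrm{A}_{ij} \cdot M_{ij}^{(a)},$$

so that the edge between cells |$i$| and |$j$| is dropped whenever |$M_{ij}^{(a)} = 0$|.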
We implement gene masking by randomly masking a collection of genes to zero values in the cell-cell graph. In the same vein, a random vector |$\mathrm{M}^{(x)} \in \{0,1\}^{g}$| is used, where each entry is drawn independently from a Bernoulli distribution with probability |$p^{(x)}$|. Subsequently, the gene vector of each cell is masked with |$\mathrm{M}^{(x)}$| and the gene matrix |$\tilde{\mathrm{X}}$| is generated as follows:
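(Assuming the same element-wise masking convention, a plausible form is:)

$$\tilde{\mathrm{x}}_{i} = \mathrm{x}_{i} \odot \mathrm{M}^{(x)}, \quad i = 1, \ldots, N,$$

where |$\mathrm{x}_{i}$| and |$\tilde{\mathrm{x}}_{i}$| are the |$i$|th rows of |$\mathrm{X}$| and |$\tilde{\mathrm{X}}$|, respectively, and |$\odot $| denotes the element-wise product.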
The outputs of the above process are |$\tilde{\mathrm{A}}$| and |$\tilde{\mathrm{X}}$|. The two cell-cell graphs |$\mathcal{G} = (\mathrm{A}, \mathrm{X})$| and |$\tilde{\mathcal{G}} = (\tilde{\mathrm{A}}, \tilde{\mathrm{X}})$| can be regarded as two graph views |$View_{\mathcal{G}}$| and |$View_{\tilde{\mathcal{G}}}$| for contrastive learning. Before implementing contrastive learning, we employ a graph neural network-based encoder |$f_{\theta }(\cdot )$| to generate node/cell representations as follows:
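(With the encoder weights shared across the two views, a plausible form is:)

$$\mathrm{Z} = f_{\theta}(\mathrm{A}, \mathrm{X}), \qquad \tilde{\mathrm{Z}} = f_{\theta}(\tilde{\mathrm{A}}, \tilde{\mathrm{X}})$$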
where |$\theta $| is the parameter of |$f_{\theta }(\cdot )$|. |$Z$| and |$\tilde{Z}$| are the cell representations for |$View_{\mathcal{G}}$| and |$View_{\tilde{\mathcal{G}}}$|, respectively. The graph neural network-based encoder employed here is a graph convolutional network [33]. Subsequently, a projector |$f_{\phi }(\cdot )$| is used to map the generated cell representations |$\mathrm{Z}$| and |$\tilde{\mathrm{Z}}$| into a lower-dimensional latent space as follows:
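(Applying the shared projector to both sets of representations, a plausible form is:)

$$\mathrm{H} = f_{\phi}(\mathrm{Z}), \qquad \tilde{\mathrm{H}} = f_{\phi}(\tilde{\mathrm{Z}})$$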
where |$\phi $| is the parameter of |$f_{\phi }(\cdot )$|. |$H$| and |$\tilde{H}$| are the cell representations obtained after projection. The projector used here is a standard multi-layer perceptron. The cell representations |$H$| and |$\tilde{H}$| obtained through the above steps are used for contrastive learning.
Contrastive learning aims to maximize the mutual information between |$H$| and |$\tilde{H}$|. A node/cell is arbitrarily chosen from the learned cell-cell graph view as an anchor. The positive sample pairs for contrastive learning are as follows: (i) the nodes connected to the anchor, (ii) the anchor's counterpart in the counterpart view, and (iii) the nodes connected to the anchor in the counterpart view, while the negative samples are the remaining samples. Taking the |$i$|th cell in |$View_{\mathcal{G}}$| as the anchor, the contrastive loss |$\mathcal{L}_{i}^{\text{(gcl)}}\left (\mathrm{H}, \text{View}_{\mathcal{G}}\right )$| is defined as follows:
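(A plausible InfoNCE-style instantiation with the positive set described above is shown here; the exact form in the original derivation may differ.)

$$\mathcal{L}_{i}^{\text{(gcl)}}\left(\mathrm{H}, View_{\mathcal{G}}\right) = -\log \frac{\sum_{j \in \mathcal{N}(i)} e^{f_{\varphi}(\mathrm{h}_{i}, \mathrm{h}_{j})/\tau} + e^{f_{\varphi}(\mathrm{h}_{i}, \tilde{\mathrm{h}}_{i})/\tau} + \sum_{j \in \mathcal{N}(i)} e^{f_{\varphi}(\mathrm{h}_{i}, \tilde{\mathrm{h}}_{j})/\tau}}{\sum_{k \neq i} e^{f_{\varphi}(\mathrm{h}_{i}, \mathrm{h}_{k})/\tau} + \sum_{k} e^{f_{\varphi}(\mathrm{h}_{i}, \tilde{\mathrm{h}}_{k})/\tau}}$$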
where |$\mathcal{N}(i)$| is the collection of neighbors of the |$i$|th cell in |$\mathcal{G}$|. |$f_{\varphi }(\cdot )$| is the similarity measure; here the dot product is used, |$f_{\varphi }(a, b) = a \cdot b$|. |$\tau $| is a temperature parameter that controls the penalty on negative pairs.
Since the two graph views |$View_{\mathcal{G}}$| and |$View_{\tilde{\mathcal{G}}}$| are both used for contrastive learning, the |$i$|th cell in |$View_{\tilde{\mathcal{G}}}$| is chosen as an anchor in turn. The contrastive loss |$\mathcal{L}_{i}^{\text{(gcl)}}(\tilde{\mathrm{H}}, \text{View}_{\tilde{\mathcal{G}}} )$| is computed in the same way as in Equation (7). Accordingly, the contrastive loss between |$H$| and |$\tilde{H}$| is defined as follows:
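(A plausible combined form, averaging the per-anchor losses over all |$N$| cells, is:)

$$\mathcal{L}^{\text{(gcl)}} = \frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, \mathcal{L}_{i}^{\text{(gcl)}}\left(\mathrm{H}, View_{\mathcal{G}}\right) + \beta\, \mathcal{L}_{i}^{\text{(gcl)}}\left(\tilde{\mathrm{H}}, View_{\tilde{\mathcal{G}}}\right)\right]$$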
where |$\alpha $| and |$\beta $| are two scaling parameters that take into account the contributions of the two losses.
Training and prediction
We design a pretraining mechanism to explore the topological nuances of cell-cell graphs. This mechanism implements contrastive learning with data imputation in a joint learning strategy. The contrastive learning loss |$\mathcal{L}^{(gcl)}$| has been obtained in the above section; the imputation loss |$\mathcal{L}^{(imp)}$| now needs to be formulated. The cell representation |$Z$| for |$View_{\mathcal{G}}$| is used for scRNA-seq data imputation as follows:
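(Sketched here with a generic learned decoder |$f_{\omega }(\cdot )$|, an assumed symbol, mapping the representation back to gene expression space:)

$$\hat{\mathrm{X}} = f_{\omega}(\mathrm{Z}), \qquad \hat{\mathrm{X}} \in \mathbb{R}^{N \times g}$$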
where |$\hat{X}$| is the imputed scRNA-seq data. The imputation of scRNA-seq data is a regression task, and thus the mean absolute error between |$X$| and |$\hat{X}$| is computed for each cell as follows:
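(A plausible form, averaging the per-cell mean absolute error over all cells and genes, is:)

$$\mathcal{L}^{(imp)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{g}\left\lVert \mathrm{x}_{i} - \hat{\mathrm{x}}_{i}\right\rVert_{1},$$

where |$\mathrm{x}_{i}$| and |$\hat{\mathrm{x}}_{i}$| are the observed and imputed gene vectors of the |$i$|th cell; the mask matrix |$\mathrm{M}$| may additionally restrict the error to observed (non-zero) entries.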
Next, we integrate both |$\mathcal{L}^{(gcl)}$| and |$\mathcal{L}^{(imp)}$| losses into an overall loss as follows:
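(A plausible weighted combination is:)

$$\mathcal{L} = \mathcal{L}^{(gcl)} + \lambda\, \mathcal{L}^{(imp)}$$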
where |$\lambda $| is a scaling parameter that takes into account the contributions of |$\mathcal{L}^{(imp)}$| and |$\mathcal{L}^{(gcl)}$|.
In summary, given the obtained cell representation |$Z$|, a general clustering method can be employed to carry out clustering assignments. Accordingly, the two stages, cell representation learning and clustering assignment analysis, are decoupled in the proposed scSimGCL framework. By doing this, our scSimGCL framework provides more flexibility in implementing different clustering algorithms (e.g. K-Means and Affinity Propagation).
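To illustrate this decoupling, the following minimal scikit-learn sketch runs two clustering algorithms on a precomputed representation matrix; the matrix, cluster count, and parameter values are illustrative placeholders rather than the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation

# Cell representations produced by a pretrained encoder
# (random placeholder of shape n_cells x latent_dim for illustration).
Z = np.random.rand(1000, 80)

# K-Means requires a predefined number of clusters (e.g. known cell types).
kmeans_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(Z)

# Affinity Propagation determines the number of clusters automatically.
ap_labels = AffinityPropagation(random_state=0).fit_predict(Z)

print("K-Means clusters:", len(np.unique(kmeans_labels)))
print("Affinity Propagation clusters:", len(np.unique(ap_labels)))
```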
Data
We evaluate the proposed scSimGCL against competing deep learning approaches on simulated and real scRNA-seq datasets, comparing it with state-of-the-art models on scRNA-seq data imputation and cell clustering tasks. The 10 real scRNA-seq datasets used are Shekhar mouse retina cells [34], Baron [35], 10X PBMC [36], Camp [37], Mouse bladder cells [38], Zeisel [39], Tabula Sapiens—Heart [40], Tabula Sapiens—Fat [40], Tabula Sapiens—Tongue [40], and Chien [41]. Each dataset is a two-dimensional matrix in which rows correspond to cells and columns to genes, and each cell has an available label. These datasets are processed using the Scanpy toolkit [42]: genes with zero expression across all cells are excluded first, and the gene expression matrix is then normalized and logarithmically transformed. Note that scSimGCL is not an approach designed to correct batch effects; for example, when analyzing the Shekhar mouse retina cells dataset, we used all the data and did not discard any batches. The results obtained from the preliminary analysis of the 10 datasets are set out in Table 1.
Table 1: Summary of the 10 real scRNA-seq datasets.

| Dataset | No. of cells | No. of genes | No. of cell types | Sequencing platform |
|---|---|---|---|---|
| Shekhar mouse retina cells | 27499 | 13166 | 19 | Drop-seq |
| Baron | 8569 | 20125 | 14 | inDrop |
| 10X PBMC | 4340 | 33694 | 8 | 10x |
| Camp | 777 | 16270 | 7 | SMARTer |
| Mouse bladder cells | 2746 | 20670 | 16 | Microwell-seq |
| Zeisel | 3005 | 19972 | 9 | STRT-seq UMI |
| Tabula Sapiens—Heart | 10188 | 58482 | 5 | 10x 3’ v3 |
| Tabula Sapiens—Fat | 19612 | 58482 | 12 | 10x 3’ v3 |
| Tabula Sapiens—Tongue | 13629 | 58482 | 12 | 10x 3’ v3 |
| Chien | 37121 | 60448 | 15 | mCT-seq |
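A minimal Scanpy sketch of the preprocessing steps described above (excluding all-zero genes, then normalizing and log-transforming the expression matrix); the input file name and the normalization target are illustrative assumptions.

```python
import scanpy as sc

# Load a cells-by-genes count matrix (the file name is illustrative).
adata = sc.read_h5ad("dataset_counts.h5ad")

# Exclude genes with zero expression across all cells.
sc.pp.filter_genes(adata, min_cells=1)

# Normalize counts per cell, then log-transform the expression matrix.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

print(adata)
```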
The simulated scRNA-seq data are generated with the Splatter [43] toolkit. The Splatter package is designed for simulating scRNA-seq data, offering a user-friendly interface for creating reproducible simulations; it allows the estimation of parameters from real data and provides functions for comparing real and simulated datasets. A total of 12 simulated datasets are created for three case studies, which simulate (i) dropout events in the gene expression matrix, (ii) the strength of low signals for clustering assignments, and (iii) the high dimensionality of the gene expression matrix under an extremely sparse setting. The three case studies are designed according to the previous study [6], and de.fracScale, dropout.mid, and dropout.shape mentioned below are all parameters of the Splatter package. For the case study with varied dropout rates, we set the value of dropout.mid from 0 to 2, i.e. 0.0, 0.5, 1.0, 1.5, and 2.0; meanwhile, the number of cells is set to 6000 with up to four clusters, the number of genes is set to 5000, and dropout.shape and de.fracScale are set to −1 and 0.3, respectively. For the case study with different strengths of low signals (Sigma), we set the value of de.fracScale from 0.1 to 0.25, i.e. 0.1, 0.15, 0.2, and 0.25; meanwhile, the number of cells is set to 6000 with up to four clusters, the number of genes is set to 5000, and dropout.shape and dropout.mid are set to −1 and 0, respectively. For the case study with varied dimension sizes under an extremely sparse setting, we set the number of genes from 10 000 to 20 000, i.e. 10 000, 15 000, and 20 000, and the value of dropout.mid to 2; meanwhile, the number of cells is set to 6000 with up to four clusters, and dropout.shape and de.fracScale are set to −1 and 0.3, respectively.
Implementation and parameter settings
The proposed scSimGCL is developed using Python 3.8.17 with PyTorch 1.8.1. The scRNA-seq data are divided into two groups for analysis: 80% for model development and 20% for testing. For cell-cell graph structure learning, the number of attention heads |$m$| is 5, and the initial value of the learnable threshold |$\eta $| is 0.5. For cell-cell GCL, the Bernoulli masking probability is 0.5; the node size of the graph neural network-based encoder is 275; the node size of the projector is 80; the temperature parameter |$\tau $| is 0.4; and the two scaling parameters |$\alpha $| and |$\beta $| are 0.6 and 0.75, respectively. For training and prediction, the scaling parameter |$\lambda $| is 0.59. The Adam optimizer with a learning rate of 1e-3 is employed to train the proposed approach. We use the default values of K-Means in scikit-learn and specify the corresponding number of classes for each dataset; for other clustering algorithms, we use the default values provided in the relevant packages. The batch size for training depends on the sample size of the dataset: a batch size of 128 is applied for datasets with fewer than 4000 samples, 512 for those with between 4000 and 8000 samples, and 1024 for datasets exceeding 8000 samples. All parameters are obtained using grid search. Training is performed on a machine with an Intel Xeon Silver 4210 CPU, 256 GB of RAM, and an Nvidia Titan RTX GPU.
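As a concrete illustration, the batch-size rule described above can be expressed as the following helper (the function name is hypothetical and not part of the released code):

```python
def select_batch_size(n_cells: int) -> int:
    """Batch size grows with dataset size, following the rule described above."""
    if n_cells < 4000:
        return 128
    if n_cells <= 8000:
        return 512
    return 1024
```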
Evaluation
The clustering accuracy (CA) [44], normalized mutual information (NMI) [45], and adjusted Rand index (ARI) [46] are utilized to assess the performance of clustering assignments. Accordingly, the CA between the predicted cell types |$\hat{Y}$| with |$N_{\hat{Y}}$| clusters and the ground truth |$Y$| with |$N_{Y}$| clusters is defined as follows:
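(The standard definition, with |$y_{i}$| and |$\hat{y}_{i}$| denoting the ground-truth and predicted labels of cell |$i$|, is:)

$$\mathrm{CA} = \max_{\psi}\; \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}_{\left[y_{i} = \psi(\hat{y}_{i})\right]}$$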
where |$N$| is the number of cells, |$\psi $| is a mapping function that matches predicted cluster labels to ground-truth labels, and |$\mathbb{1}_{[\cdot ]}$| is an indicator function. The NMI between |$\hat{Y}$| and |$Y$| is defined as follows:
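(One common normalization, by the average of the cluster entropies, is shown here; other normalizations exist:)

$$\mathrm{NMI} = \frac{2\, I(Y; \hat{Y})}{H(Y) + H(\hat{Y})},$$

where |$I(\cdot ;\cdot )$| denotes mutual information and |$H(\cdot )$| denotes entropy.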
The ARI between |$\hat{Y}$| and |$Y$| is defined as follows:
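(The standard contingency-table form is:)

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{N}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2} + \sum_{j}\binom{b_{j}}{2}\right] - \left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{N}{2}},$$

where |$n_{ij}$| is the number of cells assigned to cluster |$i$| in |$Y$| and cluster |$j$| in |$\hat{Y}$|, with row and column sums |$a_{i}$| and |$b_{j}$|.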
The Pearson correlation coefficients (PCCs) and L1 distance are employed to assess the performance of scRNA-seq data imputation as follows:
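(Plausible forms consistent with the symbols defined below are shown here; |$\sigma $| denotes the standard deviation and |$\Omega $| the set of evaluated entries, e.g. artificially masked dropout positions.)

$$\mathrm{PCC} = \frac{E\left[(\hat{X} - E(\hat{X}))(X - E(X))\right]}{\sigma_{\hat{X}}\,\sigma_{X}}, \qquad \mathrm{L1} = \frac{1}{|\Omega|}\sum_{(i,j) \in \Omega}\left|\hat{X}_{ij} - X_{ij}\right|$$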
where |$E(\cdot )$| is the expected value. |$\hat{X}$| is the imputed value. |$X$| is the ground truth.
Results and discussion
Performance comparison with baselines
We evaluate our scSimGCL against competing clustering approaches on 10 real scRNA-seq datasets. The clustering approaches employed in this study are scDCCA [18], scGCC [29], scGNN [47], graph-sc [14], CIDR [48], SIMLR [49], K-Means, Leiden [50], and Louvain [51]. scDCCA combines an autoencoder module with a contrastive learning module and focuses particularly on cell clustering. scGCC comprises a representation learning module and a clustering module and likewise focuses on cell clustering. scGNN comprises three stacked autoencoders and provides both scRNA-seq data imputation and cell clustering. graph-sc is built upon a graph autoencoder and focuses particularly on cell clustering. CIDR is a statistical model that reduces the impact of the high dimensionality of scRNA-seq data using principal component analysis before cell clustering. SIMLR is built upon multi-kernel learning and performs dimension reduction and cell clustering by measuring cell similarity in scRNA-seq data. Louvain and Leiden are graph-based community detection algorithms: Louvain merges communities into single nodes and performs modularity-based clustering on the compressed graph, while Leiden improves on Louvain and solves the problem of disconnected communities. We run all methods 10 times and present the average CA, NMI, and ARI scores.
Figure 2 compares the CA, NMI, and ARI scores of each approach. The proposed scSimGCL consistently achieves competitive CA, NMI, and ARI scores compared with the baselines, with a significant improvement over K-Means. This result verifies the efficacy of our network architecture, which can generate high-quality representations and improve the performance of cell clustering. Looking at Fig. 2, scDCCA is the best baseline on both the Shekhar mouse retina cells and Zeisel datasets. Interestingly, K-Means achieves competitive CA, NMI, and ARI scores on the Camp dataset compared with the other baselines. This result may partly be explained by the fact that the performance of deep learning-based models is influenced by the input data size: as can be seen from Table 1, the Camp dataset has the smallest sample size among the 10 real scRNA-seq datasets.

Figure 2: Clustering performance comparison using CA, NMI, and ARI on 10 real scRNA-seq datasets. Higher values indicate better performance.
Figure 3 compares the clustering assignments of all baselines and our scSimGCL on 12 simulated scRNA-seq datasets. These results are average values obtained by running all approaches 10 times. Our scSimGCL consistently achieves competitive CA, NMI, and ARI scores compared with the baselines. Zooming in on the subplots with varied dropout rates, we observed an overlap between graph-sc and our scSimGCL; graph-sc is thus the best baseline among the deep learning-based models, reporting significantly higher CA, NMI, and ARI scores than the other approaches. Moreover, graph-sc achieves the best accuracy in the case study with different strengths of low signals. Despite its efficacy, there is a significant difference between graph-sc and our scSimGCL, which is evident at de.fracScale values of 0.1 and 0.15. Zooming in on the subplots with different dimension sizes, no difference in the CA, NMI, and ARI scores of K-Means was observed. This rather intriguing result might be explained by the fact that K-Means loses its power as the number of dimensions increases. Together, these results provide important insights into the flexibility of deep learning-based models, i.e. their clustering performance is superior to that of statistical models under extreme learning settings.

Figure 3: Clustering performance comparison using CA, NMI, and ARI on 12 simulated scRNA-seq datasets. Higher values indicate better performance.
Figure 4 displays the visualization analysis of all baselines and our scSimGCL on the Baron and Zeisel datasets. Specifically, Fig. 4 is produced as follows: the intermediate representation is obtained from each model and fed into the UMAP method [52] to generate a 2D embedding, while the raw data are directly fed into the UMAP method to generate a 2D embedding for comparison. The difference between scSimGCL and the baselines is significant: the clusters identified by our proposed scSimGCL are clearly more visible than those of the baselines. These findings enhance our understanding of model decisions, i.e. the transparency of deep learning models. However, caution must be applied to these identified clusters, as the findings might not extrapolate to all datasets and learning settings. For example, the clusters identified by deep learning-based models are derived from intermediate representations and are therefore largely affected by model parameters and datasets, as illustrated by the results of scGNN on the two datasets. A note of caution is due here: the input used by Leiden and Louvain in Figs 2, 3, and 4 is the scRNA-seq data.

Figure 4: Visualization analysis results of baselines and scSimGCL on the (A) Baron and (B) Zeisel datasets.
Figure 5 displays the results of different clustering algorithms on 10 real scRNA-seq datasets. These results are average values obtained by running all approaches 10 times. We feed the cell representation |$Z$| into each clustering algorithm; our scSimGCL is implemented with K-Means. K-Means, BIRCH, Leiden, Louvain, and Agglomerative Clustering make decisions given a predefined number of clusters, whereas Affinity Propagation automatically determines the number of clusters. Louvain achieves the best results on the Shekhar mouse retina cells dataset. There is a significant difference between the performance of Affinity Propagation and the other methods, largely illustrated by its weak performance on four datasets: Tabula Sapiens—Heart, Tabula Sapiens—Fat, Tabula Sapiens—Tongue, and Chien. The most interesting aspect of this figure is the competitive performance of Agglomerative Clustering on the Camp dataset, which may be due to the size of the input data.

Figure 5: Performance comparison results of clustering algorithms on 10 real scRNA-seq datasets using CA, NMI, and ARI. Higher values indicate better performance.
We evaluate our scSimGCL against competing scRNA-seq data imputation approaches on 10 real scRNA-seq datasets. The imputation approaches employed in this study are GE-Impute [53], scGNN [47], scGCL [30], AutoClass [54], MAGIC [55], SAVER [56], and scImpute [57]. GE-Impute is built upon graph neural networks and focuses particularly on scRNA-seq data imputation; notably, it incorporates cell-cell similarity calculation into the imputation process, which is a crucial factor for improving imputation performance. scGCL combines GCL with a zero-inflated negative binomial distribution and focuses particularly on scRNA-seq data imputation. AutoClass comprises two neural networks, an autoencoder and a classifier, and provides both scRNA-seq data imputation and cell clustering. MAGIC is a Markov affinity-based graph imputation method for cells that includes a low-rank assumption for data propagation; accordingly, MAGIC can retain data points in the low-frequency space and filter out noise/dropout points in the high-frequency space. SAVER is a negative binomial model that imputes dropout values by estimating the distribution of the input scRNA-seq data. scImpute is a statistical model that focuses particularly on scRNA-seq data imputation; its core idea is similar to that of GE-Impute, imputing dropout values in the gene expression matrix by computing cell-cell similarity. We run all methods 10 times and present the average PCC and L1 distance scores.
Figure 6 compares the PCCs and L1 distances of each approach. The proposed scSimGCL consistently achieves competitive PCCs and L1 distances compared with the baselines. Looking at Fig. 6, it is apparent that AutoClass is the best baseline on the Camp dataset. Besides, we observed that scGNN, scGCL, and AutoClass achieve lower PCCs but higher L1 distances on the Baron and Zeisel datasets. These performance variations of deep learning-based models are interesting but not surprising; they may be related to the dropout rate of the input data or to the sensitivity and specificity of deep learning-based models.

Figure 6: Imputation performance comparison on 10 real scRNA-seq datasets using PCC and L1 distance. Higher PCC values and lower L1 distances indicate better performance.
Ablation study
We propose three variants of scSimGCL to determine the efficacy of the designed GCL framework. These three variants are termed scSimGCL|$_{\alpha }$|, scSimGCL|$_{\beta }$|, and scSimGCL|$_{\gamma }$|. In particular, the positive samples for the anchor in scSimGCL|$_{\alpha }$| are the nodes connected to the anchor and the nodes connected to the anchor in its counterpart view; the positive samples in scSimGCL|$_{\beta }$| are the nodes connected to the anchor and the anchor's counterpart in its counterpart view; and the positive sample in scSimGCL|$_{\gamma }$| is only the anchor's counterpart in its counterpart view. We run all methods 10 times and present the average CA, NMI, ARI, PCCs, and L1 distance. The results of the ablation study are summarized in Fig. 7A. On average, scSimGCL achieves the most satisfactory performance on cell clustering and scRNA-seq data imputation. These results provide further support for the homophily assumption in the graph, which facilitates the formation of informative positive and negative sample pairs, thereby improving the performance of GCL.

Figure 7: Performance comparison results between scSimGCL and its four variants.
We further propose a variant of scSimGCL to determine the robustness of decisions in the self-supervised learning setting. The proposed scSimGCL|$_{\delta }$| includes two implementation settings, in which |$\mathcal{L}^{(gcl)}$| and |$\mathcal{L}^{(imp)}$| are separately utilized as the sole objective function for cluster assignment and data imputation, respectively. Note that the percentages used in PCCs (10%), PCCs (20%), PCCs (30%), L1 (10%), L1 (20%), and L1 (30%) are dropout rates. We run all methods 10 times and present the average CA, NMI, ARI, PCCs, and L1 distance. Figure 7B compares the CA, NMI, ARI, PCCs, and L1 distance of scSimGCL and scSimGCL|$_{\delta }$|. Compared with scSimGCL, the performance of scSimGCL|$_{\delta }$| on both cluster assignment and data imputation shows a clear decrease. Accordingly, contrastive learning and data imputation in the designed pretraining mechanism are inseparable, and jointly optimizing them improves the overall performance of our network architecture.
Analysis of hyperparameters
We present a hyperparameter analysis to determine the efficacy of our network architecture. We are interested in the batch size, temperature |$\tau $|, learnable threshold |$\eta $|, dropout rate |$p^{(drop)}$|, and scaling parameter |$\lambda $|. A note of caution is due here: the dropout rate in neural networks differs from the dropout phenomenon in scRNA-seq data. We apply the neural network dropout rates to the obtained cell representation |$Z$|. Batch size and dropout rate have long been questions of great interest in neural networks, the temperature parameter is often studied in research on contrastive learning, and the scaling parameter trades off the contributions of |$\mathcal{L}^{(imp)}$| and |$\mathcal{L}^{(gcl)}$|. Note that varied batch sizes are evaluated using three datasets with relatively large sample sizes, as shown in Fig. 8. These results are average values obtained by running scSimGCL with different parameters 10 times. Zooming in on the subplots with varied batch sizes (Fig. 8A), we observed that the PCCs on the three datasets reach their optimal values at the same batch size; in the same vein, the CA, NMI, and ARI on the Shekhar mouse retina cells and 10X PBMC datasets reach their optimal values at the same batch size. As can be seen from the subplots below (Fig. 8B), our approach can make decisions based on the varied parameters |$\eta $|, |$\tau $|, |$p^{(drop)}$|, and |$\lambda $|. We have also examined the impact of |$\alpha $| and |$\beta $| (i.e. the two scaling parameters in Equation (8)) and of |$p^{(a)}$| and |$p^{(x)}$| (i.e. edge dropping and gene masking) on the model decisions. The results, as shown in Fig. 9, were obtained from the Zeisel dataset. This analysis identified consistent patterns and trends in how these parameters affect performance.

Figure 8: Hyperparameter analysis results using CA, NMI, ARI, and PCCs. The batch size, learnable threshold |$\eta $|, dropout rate |$p^{(drop)}$|, and scaling parameter |$\lambda $| were varied in the analysis. Higher values indicate better performance.

Figure 9: Hyperparameter analysis results using CA, NMI, ARI, and PCCs. The edge dropping probability |$p^{(a)}$| and gene masking probability |$p^{(x)}$| were varied in the analysis. Higher values indicate better performance.
Cell trajectory inference
Trajectory inference in scRNA-seq data analysis allows researchers to identify critical stages and transitions in cell development. We assess the effectiveness of the representations generated by scSimGCL using the Yan [58] dataset, focusing on the developmental process of mouse embryos. The UMAP graph in Fig. 10 is generated using the UMAP method [52], which maps high-dimensional gene expression data onto a 2D plane for visualization analysis. In particular, the representation generated by scSimGCL (|$Z$| in Equation 5) and the PAGA method [59] are used to generate cell developmental trajectories, as shown in the PAGA graph. We can see that the entire cell development trajectory starts from the zygote cell, passes through the 2-, 4-, 8-, and 16-cell stages, and ends with the blast cells. Overall, this result agrees with the actual mouse embryo development process. The findings of this trajectory inference are subject to one limitation: there is a disconnect between the two-cell and four-cell clusters.
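A minimal Scanpy sketch of this visualization pipeline, assuming the learned representation |$Z$| and per-cell stage labels are available (the placeholder arrays and parameter values below are illustrative assumptions, not the authors' settings):

```python
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad

# Z: cell representations learned by scSimGCL (n_cells x latent_dim);
# a random placeholder is used here so the sketch runs standalone.
Z = np.random.rand(90, 80).astype(np.float32)
# Developmental-stage labels per cell (illustrative placeholder values).
stages = np.repeat(["zygote", "2cell", "4cell", "8cell", "16cell", "blast"], 15)

adata = ad.AnnData(Z)
adata.obs["stage"] = pd.Categorical(stages)

# Build a neighborhood graph directly on the learned representation.
sc.pp.neighbors(adata, n_neighbors=15, use_rep="X")

# Compute the 2D UMAP embedding and the PAGA trajectory abstraction.
sc.tl.umap(adata)
sc.tl.paga(adata, groups="stage")

# Plot the embedding colored by stage and the abstracted trajectory graph.
sc.pl.umap(adata, color="stage")
sc.pl.paga(adata)
```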

The run-time analysis of each model
Figure 11 compares the clustering and imputation performance on the Camp dataset and presents the training time of scSimGCL on the 10 datasets. From the figure, we can see that the proposed scSimGCL achieves higher NMI and PCC scores. On average, the run time of K-Means, Leiden, and Louvain is lower than that of deep neural network models such as scDCCA and scGNN. Interestingly, MAGIC achieves competitive PCC scores, and its run time is lower than that of the other models. These results are likely due to the computational complexity of deep learning, i.e. the computational complexity of deep neural network models is usually higher than that of traditional machine learning or statistical models.

Figure 11: Clustering and imputation performance of baselines and our scSimGCL on the Camp dataset, and the training time of scSimGCL on 10 datasets.
Conclusion
scRNA-seq has become the most popular method for investigating transcriptome-wide gene expression at the single-cell level. Cell clustering on scRNA-seq data is a well-established approach to gaining a detailed understanding of groups of cells with similar expression profiles. To achieve this aim, we need to address the challenges stemming from raw scRNA-seq data, including high dimensionality and the dropout phenomenon. Existing deep learning-based solutions to these challenges are extensive and focus particularly on graph machine learning models and contrastive learning-based models, but there is still much work to be done to achieve desirable representations and accuracy simultaneously.
This study set out to marry graph neural networks with contrastive learning for scRNA-seq data analysis. We propose scSimGCL, a simple and effective framework based on the GCL paradigm that leverages self-supervised pretraining of graph neural networks to generate high-quality representations critical for cell clustering. Extensive experiments were carried out on simulated and real scRNA-seq datasets, in which scSimGCL demonstrated significant improvements in scRNA-seq data imputation and cell clustering. The clustering assignment analysis indicated that scSimGCL is a general approach that can incorporate competing clustering algorithms. The ablation study and hyperparameter analysis further confirmed the efficacy of our network architecture and the robustness of its decisions in the self-supervised learning setting.
With regard to the research method, a major limitation needs to be acknowledged: the lack of uncertainty estimation in model decisions (associated with the over-confidence problem in deep learning approaches) adds caution regarding the generalizability of the findings. Further modeling work will have to be conducted to improve the transparency and safety of scSimGCL, such as developing interpretable and reliable analytical approaches. Batch effects in scRNA-seq data may also lead to false conclusions, as they arise when variations among sample groups are due to technical factors rather than biological realities; a further study with more focus on batch effects in scRNA-seq data is therefore suggested. Another possible area of future research would be to explore the potential use of scSimGCL on scATAC-seq datasets. Since the dropout effects in scATAC-seq data are higher than those in scRNA-seq data, such a study should help assess the effectiveness and robustness of our scSimGCL.
Key Points

- scSimGCL is a simple and effective graph contrastive learning framework for learning high-quality representations crucial for robust cell clustering.
- scSimGCL outperforms competing state-of-the-art models in scRNA-seq data imputation and cell clustering by a significant margin.
- scSimGCL is capable of making informed decisions by indicating a predefined number of clusters or automatically determining the number of clusters, which is a critical aspect in unsupervised learning settings for scRNA-seq data analysis.
- scSimGCL represents a further step towards developing a foundation model for cell clustering.
Conflict of interest: None declared.
Funding
This work is supported by the National Key Research and Development Program of China (No. 2022YFF1000100), the National Natural Science Foundation of China (No. 62202388), the Qin Chuangyuan Innovation and Entrepreneurship Talent Project (No. QCYRCXM-2022-230), and the Chinese Universities Scientific Fund (No. 2452024407).
Data availability
The scRNA-seq data that support the findings of this paper are publicly available at GitHub: https://github.com/zhangzh1328/scSimGCL.
Code availability
The source code of scSimGCL is publicly available at https://github.com/zhangzh1328/scSimGCL.
References
Author notes
Zhenhao and Yuxi are equal contributors to this work.