Abstract

In single-cell RNA-seq (scRNA-seq) data analysis, a fundamental problem is to determine the number of cell clusters based on the gene expression profiles. However, the performance of current methods is still far from satisfactory, presumably due to their limitations in capturing the expression variability among cell clusters. Batch effects represent the undesired variability between data measured in different batches. When data are obtained from different labs or protocols batch effects occur. Motivated by the practice of batch effect removal, we considered cell clusters as batches. We hypothesized that the number of cell clusters (i.e. batches) could be correctly determined if the variances among clusters (i.e. batch effects) were removed. We developed a new method, namely, removal of batch effect and testing (REBET), for determining the number of cell clusters. In this method, cells are first partitioned into k clusters. Second, the batch effects among these k clusters are then removed. Third, the quality of batch effect removal is evaluated with the average range of normalized mutual information (ARNMI), which measures how uniformly the cells with batch-effects-removal are mixed. By testing a range of k values, the k value that corresponds to the lowest ARNMI is determined to be the optimal number of clusters. We compared REBET with state-of-the-art methods on 32 simulated datasets and 14 published scRNA-seq datasets. The results show that REBET can accurately and robustly estimate the number of cell clusters and outperform existing methods. Contact: H.D.L. ([email protected]) or Q.S.X. ([email protected])

Introduction

Single-cell RNA sequencing (scRNA-seq) is a powerful technology in generating gene expression profiles at single-cell resolution [1, 2]. scRNA-seq enables unprecedented biological discoveries [3]. For example, researchers can use gene expression profiles to determine cell clusters and predict cell fates. They have been applied to many areas such as personalized medicine for cancer and biomarker recognition. They play an important role in many fields such as oncology, developmental biology, microbiology and neuroscience [4, 5]. scRNA-seq is gradually becoming a focus of life science research.

Clustering analysis is widely utilized in scRNA-seq studies. Clustering is a classic unsupervised machine learning problem that has been extensively studied in recent years. A wide range of clustering algorithms have been proposed, including hierarchical clustering (HC), k-means, density-based spatial clustering of applications with noise [6], self-organizing maps [7], spectral clustering [8] and deep learning methods [9]. In particular, many clustering methods based on gene expression data have been proposed, such as consensus clustering (CC) [10], shared nearest neighbor (SNN)-Cliq [11], single-cell consensus clustering (SC3) [12], single-cell interpretation via multi-kernel learning (SIMLR) [13] and Seurat [14, 15]. CC is a method that clusters the input data repeatedly through resampling, and it evaluates the stability of the discovered clusters [10]. The main idea of SNN-Cliq is to construct an SNN map by using the k-nearest-neighbors algorithm, which takes into account the effect of surrounding neighboring data points. SC3 is a consensus method that combines multiple clustering solutions. It is a user-friendly clustering method that works well for small datasets [16]. SIMLR is an analytic framework that learns an appropriate cell-to-cell similarity metric from the inputted single-cell data to perform dimensional reduction, clustering and visualization [13]. It first performs feature construction based on scRNA-seq data, and then performs similarity learning. Seurat is a toolkit for the analysis of sc-RNAseq. The clustering method in Seurat is based on SNN modularization optimization [14]. Other methods include clustering through imputation and dimensionality reduction [17] and deep learning methods [18].

A single tissue may contain multiple subpopulations of cells with different structures and functions. Understanding the development of an organ requires characteristic cell subpopulations, so it is important to quantify single-cell subpopulations accurately [19]. At present, the determination of the number of cell subpopulations is mainly based on clustering algorithms. Most clustering algorithms also need to provide the number of clusters in advance. However, the number of clusters is often the key to determining the performance of clustering algorithms [20].

Many methods have been developed recently for determining the number of cell clusters for scRNA-seq data [21]. SNN-Cliq is a graph partitioning algorithm that can identify the number of clusters in the SNN graph by finding the maximal number of cliques [11]. SC3 is a CC method with high accuracy. They first perform gene filtering on the original gene expression data. Next, they calculate the distance between the cells and transform it. Thirdly, they perform k-means clusterings for various parameters on transformed distance matrices. Then, a consensus matrix is calculated by averaging all similarity matrices of individual clusterings. Finally, they perform HC on the consensus matrix [12]. The number of clusters in SC3 is determined by the Tracy–Widom method [22]. SIMLR provides two methods to estimate the optimal number of clusters. One method, called Eigengap, is to analyze the eigenvalues of the Laplace matrix and seek the maximum reduction of the eigenvalue gap. The number of clusters determined by another Separation Cost method is the one with the largest drop in the value of separation cost [13]. Single-cell aggregated (from ensemble) clustering (SAFE-clustering) uses output results from multiple clustering methods to build one CC map [23]. This method selects the number of clusters based on the maximum value of the average normalized mutual information from the comparison between the output results and the ensemble label. ROGUE is an entropy-based statistic used to quantify the purity of a cell cluster. ROGUE can also be used to identify cell subpopulations and guide clustering [24]. In recent years, some methods have been developed to determine the number of clusters for multiple types of gene expression data, such as CC [10], integrative clustering (iCluster+) [25], similarity network fusion (SNF) [26] and perturbation clustering for data integration and disease subtyping (PINS) [27]. CC is a model-independent resampling-based approach that has been widely used for subtype discovery [10]. The algorithm begins with subsampling a proportion of items and a proportion of features from data. Then each subsample is clustered and the consensus matrix is obtained for each k. The number of clusters is determined so that the area under the empirical cumulative distribution curve of the consensus matrix levels off and |$\Delta k$| approaches zero. iCluster+ uses generalized linear regression for the formulation of a joint model and determines its number of clusters at a transition point of the deviation ratiometric, which can be interpreted as a percentage of the total change in the current model interpretation. Other divisions larger than this transition point will no longer provide significant improvement. However, iCluster+ is based on the Gaussian assumption, which is inadequate when the data are too heterogeneous in terms of the signal distribution [25]. SNF is a theoretical graph approach that can discover disease subtypes based on data from a single data type or by integrating data from multiple data types [26]. SNF is sensitive to minor changes in molecular measurements or parameter settings because of the unstable nature of kernel-based clustering. For PINS, the original connectivity matrix is first constructed for all possible cluster numbers, then it perturbs the data by adding Gaussian noise. It finally evaluates the stability of the clustering results. The number of clusters estimated by PINS is that its clustering result is the most stable to perturbation. [27].

Batch effects generally consist of undesired variability and system errors in high-throughput data, which are due to differences in results caused by different groupings of sources of variability [28]. Specifically, batch effects occur when cells from one biological group or with one condition are cultured, isolated and sequenced separately from cells with a second condition [29]. There are many tools for batch correction of gene expression data, such as limma and ComBat. limma uses a linear model to capture batch effects [30]. There is a unique strategy in limma that all genes are restricted to share the same intrablock correlations, and these correlation structures are incorporated into the linear model fitting [31]. ComBat is a method to adjust batch effects using an empirical Bayesian approach [32]. The method is often used to adjust for known batches [33].

Cell clusters can be regarded as batches and batch effects (i.e. between-cluster variances) are optimally removed only if the number of batches (i.e. cell clusters) is correctly specified. Based on the analysis above, we propose a method to determine the number of cell clusters via REmoving Batch Effect and Testing (REBET). We hypothesize that clustering results are regarded as batch variables, and the batch effects can be completely removed if and only if the number of clusters is correctly defined. After removing the batch effect, different clusters of cells are evenly mixed. The first step of REBET is to use SC3 to perform clustering for a range of potential cluster numbers k (by default |$ k\in [2,K] $|⁠; K is the maximum number of clusters), then we treat the clusters as batches and remove these batch effects. We believe that the number of clusters is the optimal number of clusters when the cells with different labels in the data after removing the batch effect are mixed evenly. We define an indicator to describe the degree of the uniformity of mixing between different cells. We applied REBET to simulated and real scRNA-seq datasets, and the experimental results showed the advantages of REBET in determining the number of clusters.

Materials and methods

Method

The input is an |${n \times p}$| gene expression matrix |$ \textbf{D}$|⁠, where n is the number of cells and p is the number of genes or features. In brief, the rows of matrix D represent cells, and the columns represent the gene expression features (Fig. 1A). We need to provide the maximum number of clusters K. In general, we first set K to 10, and if the estimated number of clusters happens to be 10, we then increase the value of K. REBET is carried out in the following steps (Fig. 1).

The workflow of REBET. (A). Inputs of this method: a gene expression data matrix of $n$ cells in rows and $p$ genes in columns. (B). SC3 was used to cluster the input data with a range of potential cluster numbers (From top to bottom, $k $ from 2 to 10). Small squares, triangles and circles all represent cells, and different shapes indicate different cell clusters (there are three cell clusters in total). When the cells are in the same blue background color circle, it means they are partitioned into the same cluster. (C). Each clustering result was treated as a batch variable to remove the batch effect from the input data. To illustrate the effect of removing batch effects, we visualized the batch-effect-removed data by applying t-distributed Stochastic Neighbor Embedding(t-SNE) for dimensionality reduction. When the $k $ value is less than the true number of cell clusters, the batch-effect-removed data $\textbf{D}_{k}$ still contains the information of the cell cluster label. Such as Dataset $\textbf{D}_{2}$ can still be separated by the first principal component. (D). The batch-effect-removed data were clustered separately again, and the mixing degree coefficient, ARNMI, was calculated. The number of clusters corresponding to the minimum ARNMI value is the number of cell clusters determined by REBET.
Figure 1

The workflow of REBET. (A). Inputs of this method: a gene expression data matrix of |$n$| cells in rows and |$p$| genes in columns. (B). SC3 was used to cluster the input data with a range of potential cluster numbers (From top to bottom, |$k $| from 2 to 10). Small squares, triangles and circles all represent cells, and different shapes indicate different cell clusters (there are three cell clusters in total). When the cells are in the same blue background color circle, it means they are partitioned into the same cluster. (C). Each clustering result was treated as a batch variable to remove the batch effect from the input data. To illustrate the effect of removing batch effects, we visualized the batch-effect-removed data by applying t-distributed Stochastic Neighbor Embedding(t-SNE) for dimensionality reduction. When the |$k $| value is less than the true number of cell clusters, the batch-effect-removed data |$\textbf{D}_{k}$| still contains the information of the cell cluster label. Such as Dataset |$\textbf{D}_{2}$| can still be separated by the first principal component. (D). The batch-effect-removed data were clustered separately again, and the mixing degree coefficient, ARNMI, was calculated. The number of clusters corresponding to the minimum ARNMI value is the number of cell clusters determined by REBET.

Generation of initial clusters

We first applied median centralization to gene expression data. We partition the cells using all possible numbers of clusters |$ k\in [2,K] $|⁠. Then for each k, we can obtain labels |$\textbf{y}_{k}$| for each of the cells being clustered, and |$\textbf{y}_{k}$| is a vector of the resulting integers indicating the cluster to which each cell is allocated.

For each possible number of clusters k, each cell will be allocated a label. After the first step, we generate K-1 initial clusters.

Removal of between-cluster variance

Cells from different clusters can be regarded as cells from different batches, and the batch effect removal is equivalent to eliminating differences between clusters. Therefore, in the second step, we treat each |$\textbf{y}_{k}$| as a batch variable, and remove the batch effects from the input data D to obtain K-1 datasets |$ \lbrace \textbf{D}_{2},...,\textbf{D}_{K}\rbrace $|⁠. We specify |$\textbf{y}_{k}$| as the known batch variable and remove batch effects using the ComBat function in the R package sva.

ComBat assumes that for gene |$j$|⁠, the sample |$i$| of batch |$b$| follows the following model: |$D_{bij}=\alpha _{j}+Y_{bi} \beta _{b}+\gamma _{jb}+\delta _{jb} \varepsilon _{bij}$|⁠, where |$\alpha _{j}$| is the overall gene expression, |$Y_{bi}$| is a design matrix for sample conditions (In our case, it is |$\textbf{y}_{k}$|⁠), and |$ \beta _{b}$| is the vector of regression coefficients corresponding to |$Y_{bi}$|⁠. |$\gamma _{jb}$| and |$\delta _{jb}$| represent additive and multiplicative batch effects of batch |$b$| on the gene |$j$|⁠, affecting the mean and variance of gene expression in batch |$b$|⁠, respectively. The error terms |$\varepsilon _{bij}$| are assumed to follow a normal distribution with an expected value of zero and the variance of |$ \sigma _{b}^2$|⁠. ComBat then uses the empirical Bayesian method to estimate the parameters [32, 34].

We assume that cells from different clusters are evenly mixed after batch effect is removed if and only if k corresponds to a true number of clusters of |$k_t$|⁠. We then identified the number of cell clusters by evaluating the mixed uniformity of batch-effect-removed data |$\textbf{D}_{k}$|⁠. The closer |$k$| is to the true number of cell clusters |$k_t$|⁠, the more thoroughly the between-cluster variance is removed. The batch-effect-removed data will allow the cells of different clusters to mix more uniformly.

Identifying the number of cell types based on clustering of batch-effect-removed data

If k is equal to the true number of cell clusters |$ k_{t} $|⁠, the batch effect will be completely removed. There will be no obvious clusters in the batch-effect-removed data |$\textbf{D}_{k}$|⁠. Therefore, we again cluster |$\textbf{D}_{k}$| into |$k^{\prime }$| classes, where |$k^{\prime }$| still belongs to 2 to K. No matter what |$k^{\prime }$| is, the clustering labels will be randomly assigned. Obtain the predicted label of the batch-effect-removed data. To obtain stable clustering results, we generate H perturbed datasets by perturbing the batch-effect-removed data according to Nguyen et al [27]. One way to do so is to add Gaussian noise to the batch-effect-removed data |$\textbf{D}_{k}$|⁠. We perturbed the data with noise that had a variance equal to the variance of the data. The noise variance is calculated as follows:
(1)
where |$ \sigma _{j}^2 = var\lbrace \textbf{D}_{(i,j)}\rbrace $|⁠, |$i\in [1,n]$| and |$j \in [1,p] $|⁠.
For each batch-effect-removed dataset |$\textbf{D}_{k}$|⁠, we first generate H|$n\times p$| perturbed datasets |$\textbf{D}_{k}^{(h)}, h\in [1,H]$|⁠, by adding Gaussian noise |$\mathcal N(0,\sigma ^2) $|⁠:
(2)
We then cluster each of the H perturbed datasets using SC3 with varying values of |$ k^{\prime }\in [2,K] $|⁠. The corresponding clustering result is called the predicted label and denoted as |$ y_{kk^{\prime }}^{(h)}$|⁠, |$h\in [1,H] $|⁠. When clustering perturbed data that does not contain any cluster information, no matter how many clusters the data are partitioned into (that is, no matter what |$k^{\prime }$| is), it will be randomly divided into |$k^{\prime }$| clusters. In the third step, we use these labels to evaluate the uniformity of the mix for the batch-effect-removed data. Therefore, we define the mixing degree coefficient based on this analysis.
Calculate mixing degree coefficient. For the mixing degree coefficient, we first calculate the Normalized Mutual Information (NMI) metric |$\textrm{NMI}_{kk^{\prime }} ^ {(h)}$| between |$\textbf{y}_{kk^{\prime }}^{(h)}$| and |$\textbf{y}_{k}$| for each |$k^{\prime }$| and h, which is defined as follows:
(3)
where |$H(\textbf{y}_{k})$| and |$H(\textbf{y}_{kk^{\prime }}^{(h)})$| represent the entropy between |$\textbf{y}_{k}$| and |$\textbf{y}_{kk^{\prime }}^{(h)}$|⁠, |$I(\textbf{y}_{k}; \textbf{y}_{kk^{\prime }}^{(h)})$| represents the mutual information of |$H(\textbf{y}_{k})$| and |$H(\textbf{y}_{kk^{\prime }}^{(h)})$| [35]. Mutual information means the reduction in the entropy of the true class label when the predicted label is known. NMI is often used to compare two cluster partitions, and can also be applied when they have different numbers of clusters.
For |$ k^{\prime} \in [2,K] $|⁠, if |$ y_{kk^{\prime }}^{(h)}$| is randomly assigned, the NMI between |$y_{kk^{\prime }}^{(h)}$| and |$\textbf{y}_{k}$| should be close. This means that the fluctuation of |$\textrm{NMI} _{kk^{\prime }} ^ {(h)} $| will be small. In this paper, we use the maximum value of |$\textrm{NMI}_ {kk^{\prime }} ^ {(h)}$| minus the minimum value to evaluate the fluctuation of |$\textrm{NMI}_ {kk^{\prime }} ^ {(h)}$|⁠. To assess the stability of the test partitioning performed by this algorithm, we calculate the average range NMI between the inital labels and predicted labels by averaging the fluctuation of |$\textrm{NMI}_{kk^{\prime }}^{(h)}$|⁠.
(4)
For each batch-effect-removed dataset |$\textbf{D}_{k}$|⁠, we repeat the above process and finally obtain the |${K-1}$| mixing degree coefficient |$ARNMI_k$|⁠. In the ideal case involving perfectly stable clusters, when k is the true number of clusters |$k_t$|⁠, the batch-effect-removed data do not contain any information on cell clusters. Therefore, no matter how many clusters the batch-effect-removed data are clustered into, the clustering labels are randomly generated. Based on this criterion, we choose the number of clusters |$ \hat{k}$| for which the ARNMI is minimized as follows:
(5)
|$ \hat{k}$| is the optimal number of clusters found by the algorithm.

Datasets

Simulated data

To demonstrate that REBET can find the number of clusters, we used the Splatter package to simulate scRNA-seq data with known cluster information [36]. In Splatter, the dispersion between clusters is determined by two parameters, de.prob and de.facLoc. De.prob represents the probability of a gene being selected as a differentially expressed gene, and de.facLoc represents the expression level of differentially expressed genes. A higher de.prob means that more genes are differentially expressed, but changing de.facLoc would change the differentially expressed level of the same number of genes. The larger the values of de.prob and de.facLoc, the greater the difference between clusters. We set the parameters de.prob and de.facLoc to 0.1 and 0.2, respectively to generate four different parameter settings, which were denoted as Exp1, Exp2, Exp3and Exp4 (Table 1). Each parameter setting generated eight datasets, and the number of their clusters ranged from 2 to 9 (Supplementary Tables S1–S4). Each dataset has 200 cells, with each cell measured on 10 000 genes. The probability of cells being divided into each cluster was the same.

Table 1

The performance of SC3-TW, CC, PINS, SNF, Seurat, SIMLR, ROGUE and REBET in determining the number of cell clusters of simulated datasets with different dispersion between clusters. Eight datasets were simulated for each parameter setting*

ParametersNumber of datasets accurately estimated
de. probde. facLocSC3- TWCCPINSSNFSeuratSIMLRROGUEREBET
Exp10.10.101015134
Exp20.20.114014348
Exp30.10.201014027
Exp40.20.215016256
All211041961425
ParametersNumber of datasets accurately estimated
de. probde. facLocSC3- TWCCPINSSNFSeuratSIMLRROGUEREBET
Exp10.10.101015134
Exp20.20.114014348
Exp30.10.201014027
Exp40.20.215016256
All211041961425

*The number in bold indicates the maximum number of datasets that accurately estimated.

Table 1

The performance of SC3-TW, CC, PINS, SNF, Seurat, SIMLR, ROGUE and REBET in determining the number of cell clusters of simulated datasets with different dispersion between clusters. Eight datasets were simulated for each parameter setting*

ParametersNumber of datasets accurately estimated
de. probde. facLocSC3- TWCCPINSSNFSeuratSIMLRROGUEREBET
Exp10.10.101015134
Exp20.20.114014348
Exp30.10.201014027
Exp40.20.215016256
All211041961425
ParametersNumber of datasets accurately estimated
de. probde. facLocSC3- TWCCPINSSNFSeuratSIMLRROGUEREBET
Exp10.10.101015134
Exp20.20.114014348
Exp30.10.201014027
Exp40.20.215016256
All211041961425

*The number in bold indicates the maximum number of datasets that accurately estimated.

scRNA-seq data

To test and verify the effectiveness of our method, we applied the REBET and seven other classical subtyping methods to fourteen published scRNA-seq datasets with known cell subtypes [4, 5, 37–50]. All datasets have defined cell type information. For each dataset, we used the expression units and cell types provided by the authors (Table 2). The t-distributed Stochastic Neighbor Embedding (t-SNE) plots of the original gene expression data of Engel dataset and Zhou dataset is shown in Figure 2 and Supplementary Figure S1A, and the t-SNE plots of other datasets are shown in Supplementary Figure S2. The real scRNA-seq datasets can be accessed through the GEO database.

Table 2

Summary of the 14 real benchmarking datasets

Datasets# Cells# GenesUnitsOrganismsReferences
Zhou18123624FPKMMouseZhou et al., 2016 [38]
Yan9020214RPKMHumanYan et al., 2013 [40]
Treutlein8023129FPKMMouseTreultein et al., 2014 [45]
Ramskold3321042RPKMMouseRamskold et al., 2012 [5]
Patel4305948TPMHumanPatel et al., 2014 [44]
Leng24719080TPMHumanLeng et al., 2015 [48]
Grover13515158RPKMMouseGrover et al., 2016 [43]
Goolam12440315CPMMouseGoolam et al., 2016 [41]
Engel20323337RPMMouseEngel et al., 2014 [46]
Deng25922147RPKMMouseDeng et al., 2014 [42]
Chung51841821TPMHumanChung et al., 2017 [4]
Biase4925384FPKMMouseBiase et al., 2014 [39]
Usoskin62219153RPMMouseUsoskin et al., 2015 [50]
Cirstea153421785TPMHumanKaraayvaz et al., 2018 [49]
Datasets# Cells# GenesUnitsOrganismsReferences
Zhou18123624FPKMMouseZhou et al., 2016 [38]
Yan9020214RPKMHumanYan et al., 2013 [40]
Treutlein8023129FPKMMouseTreultein et al., 2014 [45]
Ramskold3321042RPKMMouseRamskold et al., 2012 [5]
Patel4305948TPMHumanPatel et al., 2014 [44]
Leng24719080TPMHumanLeng et al., 2015 [48]
Grover13515158RPKMMouseGrover et al., 2016 [43]
Goolam12440315CPMMouseGoolam et al., 2016 [41]
Engel20323337RPMMouseEngel et al., 2014 [46]
Deng25922147RPKMMouseDeng et al., 2014 [42]
Chung51841821TPMHumanChung et al., 2017 [4]
Biase4925384FPKMMouseBiase et al., 2014 [39]
Usoskin62219153RPMMouseUsoskin et al., 2015 [50]
Cirstea153421785TPMHumanKaraayvaz et al., 2018 [49]
Table 2

Summary of the 14 real benchmarking datasets

Datasets# Cells# GenesUnitsOrganismsReferences
Zhou18123624FPKMMouseZhou et al., 2016 [38]
Yan9020214RPKMHumanYan et al., 2013 [40]
Treutlein8023129FPKMMouseTreultein et al., 2014 [45]
Ramskold3321042RPKMMouseRamskold et al., 2012 [5]
Patel4305948TPMHumanPatel et al., 2014 [44]
Leng24719080TPMHumanLeng et al., 2015 [48]
Grover13515158RPKMMouseGrover et al., 2016 [43]
Goolam12440315CPMMouseGoolam et al., 2016 [41]
Engel20323337RPMMouseEngel et al., 2014 [46]
Deng25922147RPKMMouseDeng et al., 2014 [42]
Chung51841821TPMHumanChung et al., 2017 [4]
Biase4925384FPKMMouseBiase et al., 2014 [39]
Usoskin62219153RPMMouseUsoskin et al., 2015 [50]
Cirstea153421785TPMHumanKaraayvaz et al., 2018 [49]
Datasets# Cells# GenesUnitsOrganismsReferences
Zhou18123624FPKMMouseZhou et al., 2016 [38]
Yan9020214RPKMHumanYan et al., 2013 [40]
Treutlein8023129FPKMMouseTreultein et al., 2014 [45]
Ramskold3321042RPKMMouseRamskold et al., 2012 [5]
Patel4305948TPMHumanPatel et al., 2014 [44]
Leng24719080TPMHumanLeng et al., 2015 [48]
Grover13515158RPKMMouseGrover et al., 2016 [43]
Goolam12440315CPMMouseGoolam et al., 2016 [41]
Engel20323337RPMMouseEngel et al., 2014 [46]
Deng25922147RPKMMouseDeng et al., 2014 [42]
Chung51841821TPMHumanChung et al., 2017 [4]
Biase4925384FPKMMouseBiase et al., 2014 [39]
Usoskin62219153RPMMouseUsoskin et al., 2015 [50]
Cirstea153421785TPMHumanKaraayvaz et al., 2018 [49]
REBET illustrated on the Engel dataset (The true number of clusters is 4). (A) The t-SNE plot of the original gene expression data with cells colored according to the true label. (B–E). The scatter plots of batch-effect-removed data $\textbf{D}_{k}$ ($ k\in [2,5] $) with the first dimension reduction component as the x-axis and the second dimension reduction component as the y-axis after t-SNE dimensionality reduction. Cells are colored according to the true label. (B) Batch-effect-removed data still have obvious clusters. (C) The cells of NKT0 subpopulation shown in the lower-left corner and the top still clustered together. (D) All cells are mixed together, containing cells from the NKTO subpopulation and the NTK2 subpopulation. (E) Except for the cells of NTK2 subpopulation, other cell subpopulations are also mixed together. (F) Error bar plot of the mixing degree coefficient ARNMI values in the range $k=2,3,...,9$. When $k$ is between 2 and 4, ARNMI goes down. From (B) to (D), the different subpopulations of cells are also increasingly uniformly mixed together. REBET accurately estimated the number of cell subpopulations.
Figure 2

REBET illustrated on the Engel dataset (The true number of clusters is 4). (A) The t-SNE plot of the original gene expression data with cells colored according to the true label. (BE). The scatter plots of batch-effect-removed data |$\textbf{D}_{k}$| (⁠|$ k\in [2,5] $|⁠) with the first dimension reduction component as the x-axis and the second dimension reduction component as the y-axis after t-SNE dimensionality reduction. Cells are colored according to the true label. (B) Batch-effect-removed data still have obvious clusters. (C) The cells of NKT0 subpopulation shown in the lower-left corner and the top still clustered together. (D) All cells are mixed together, containing cells from the NKTO subpopulation and the NTK2 subpopulation. (E) Except for the cells of NTK2 subpopulation, other cell subpopulations are also mixed together. (F) Error bar plot of the mixing degree coefficient ARNMI values in the range |$k=2,3,...,9$|⁠. When |$k$| is between 2 and 4, ARNMI goes down. From (B) to (D), the different subpopulations of cells are also increasingly uniformly mixed together. REBET accurately estimated the number of cell subpopulations.

Data preprocessing

Before applying these methods, we preprocessed the data with reference to SC3, including gene filtering and logarithmic transformation [12]. Gene filters remove genes that are expressed in cells (expression value >2) less than 6%, or those that have a variance of 0. Then logarithmic transformation is performed on the filtered expression matrix after adding pseudo-count 1. To avoid changing the distribution of the data, we do not apply logarithmic transformation when we use the ROGUE method.

Benchmark methods

We applied the SC3, CC, PINS, SNF, Seurat, SIMLR and ROGUE methods to the above twelve published scRNA-seq datasets for evaluation and comparison of determining the number of cell clusters. SC3 is a consensus method that combines multiple clustering solutions. The first step is to use z-score to normalize the input data and record the normalized data as matrix Z. The second step is to calculate the eigenvalue of |$\textbf{Z} ^{\prime }\textbf{Z} $|⁠. The final number of clusters is the number of eigenvalues that are significantly different from the Tracy–Widom distribution at a significance level of 0.001 [12]. To distinguish the cluster number determination from the clustering method SC3, we denote this method for determining the number of clusters as SC3–TW. In the CC method, a proportion of samples and a proportion of features are subsampled from the data to form subsamples. Then, each subsample is clustered and the consensus matrix of each k is calculated. When the area under the curve leveling off and |$ \Delta k$| approaching zero, the corresponding k value is the number of clusters [51]. For PINS, the data are perturbed by adding Gaussian noise, and then a connectivity matrix is constructed based on the original data and perturbed data for each k (⁠|$ k\in [2,K] $|⁠) to calculate the absolute difference. Next, the empirical cumulative distribution functions of the connectivity difference matrices are calculated. It performs the partitioning based on the highest AUC [27]. The main idea of SNF is to construct a sample network and then effectively fuse the patient similarity network representing the full spectrum of the underlying data [26]. SNF determines the number of clusters by using eigengaps based on network connections. SIMLR provides two methods to estimate the number of clusters. The first method is the Eigengap method. It first calculates the eigenvalues of the Laplace matrix and sorts them in descending order (i.e. |$\lambda _{1},\lambda _{2},\cdots ,\lambda _{n}$|⁠). The number of clusters is estimated as |$\max _{i>1}\left (\lambda _{i+1}-\lambda _{i}\right )$| [13]. The second method is Separation Cost, it determines the number of clusters which results in the greatest largest drop in the value of the separation cost [52]. We get most of the same results using Eigengap method and Separation Cost method, so our results below only show the results of Separation Cost method. Seurat first calculates k-nearest neighbors, constructs the SNN graph, and then uses a smart local moving algorithm to identify community structures [14, 53]. ROGUE first uses differential entropy to describe the distribution of gene expression in single-cell data, and then builds an S-E model to describe the relationship between differential entropy and the mean of gene expression. Based on this model, a ROGUE uniform measurement is designed to evaluate the purity of a cell cluster [24]. For ROUGE, we first used SC3 to cluster gene expression data for all possible cluster numbers (⁠|$ k\in [2,K] $|⁠), and then used ROGUE to select the number of cell clusters according to the clustering results.

Evaluation criteria

Accuracy evaluation of the estimated number of cell clusters

For each dataset, we can calculate the deviation between the estimated number of cell clusters and the true number of cell clusters. To compare the accuracy of each method in estimating the number of cell clusters, we use the median of absolute deviations of the 14 datasets as the evaluation index of accuracy.

Evaluation metric for clustering

After determining the number of clusters with REBET, we used SC3 to cluster the gene expression data and denoted it as REBET-SC3. After using SC3-TW, CC, PINS, SNF, Seurat, SIMLR and ROGUE to estimate the number of clusters, we use SC3, CC, PINS, SNF, Seurat, SIMLR and SC3 to cluster the original data. The adjusted Rand index (ARI) is used to calculate the similarity between the clustering results and the known true clusters, which can be used to evaluate the clustering result. If the estimated label is the same as the true label, the ARI is 1, and the ARI decreases with increasing inconsistency [54].

Results

Simulated Datasets

To evaluate the accuracy of our method, we applied REBET and other compared methods to 32 simulation datasets with the different number of clusters and dispersion between clusters.The median of the absolute deviation between the number of cell clusters estimated by REBET, SIMLR and Seurat methods and the true cell clusters is 0 across the 32 simulated datasets (Fig. 3). The deviation of REBET was less variable than other methods (Figure 3). REBET accurately estimated the number of cell clusters in 25 datasets, which was the most accurate estimate among the above methods (Table 1). Followed by Seurat and ROGUE, which accurately estimated 19 and 14 datasets, respectively (Table 1).

The deviations between the estimated number of cell clusters and the true number of cell clusters for the 32 simulated datasets.
Figure 3

The deviations between the estimated number of cell clusters and the true number of cell clusters for the 32 simulated datasets.

For the datasets in Exp1 with a small dispersion between clusters, REBET accurately estimated the number of clusters in four datasets (Table 1, Supplementary Table S1). Except that Seurat accurately estimated the number of clusters of five datasets, other methods performed worse (Table 1, Supplementary Table S1). REBET accurately estimated the number of clusters in all eight datasets of Exp2. All of the other seven methods performed worse (Table 1, Supplementary Table S2). For all the datasets in Exp3, REBET only incorrectly estimated the dataset with eight clusters (Supplementary Table S3). REBET estimated the number of clusters to 7, and other methods had not accurately estimated (Supplementary Table S3). For the datasets in Exp4 with a large dispersion between clusters, both REBET and Seurat accurately estimated the number of clusters in six datasets (Table 1). The number of clusters in the dataset with the true number of clusters 5 and 6 were estimated by REBET to be 6 and 7 (Supplementary Table S4). This may be because even if the correct batch effect is defined, the removal of the batch effect may be incomplete. The number of clusters determined by REBET is still close to the true number of clusters. No matter what the true number of clusters |$k$| is, REBET can robustly determine the number of clusters in the majority datasets.

We use the simulated dataset with a true number of clusters of 3 in Exp2 to illustrates how REBET determines the number of clusters (Figure 4). As mentioned above, the input data were first partitioned into k classes. In the second step, for each k (⁠|$ k\in [2, K] $|⁠), we removed the differences between different clusters by treating the cluster label as a batch variable and then removing the effect of this batch variable. When k is equal to the true number of clusters, the batch-effect-removed data for each true cluster should be mixed uniformly. To explain more clearly, we applied t-SNE to the batch-effect-removed data |$\textbf{D}_{k}$|⁠. Figure 4B- 4D show the t-SNE plots for the batch-effect-removed data |$\textbf{D}_{k}$| with cells colored according to the true label for |$k=2$|⁠, |$k=3$| and |$k=4$|⁠. When |$k=2$|⁠, we can see that the first dimension component can divide the cells into three classes, indicating that the batch-effect-removed data |$\textbf{D}_{2}$| still contains information of the label. When |$k>2$|⁠, no clear cluster is seen based on the dimension reduction components. When |$k$| is the true number of clusters 3, the average range of normalized mutual information (ARNMI) reaches its minimum (Figure 4E). The result shows that REBET can correctly estimate the true clusters of this dataset.

The REBET algorithm applied on the simulated dataset with a true number of clusters of 3 in Exp2. The dataset consists of 200 samples and 10 000 genes. Each of the second and third classes has 65 samples whereas the first class has 70 samples (a total of 200 samples). (A) The t-SNE plot of the original gene expression data with cells colored according to the true label. (B–D) The t-SNE plots of the batch-effect-removed data $ \textbf{D}_{k}$ with cells colored according to the true label for $k=2$, $k=3$ and $k=4$. (E) Error bar plot of the ARNMI values for the dataset. REBET correctly estimated the true number of clusters.
Figure 4

The REBET algorithm applied on the simulated dataset with a true number of clusters of 3 in Exp2. The dataset consists of 200 samples and 10 000 genes. Each of the second and third classes has 65 samples whereas the first class has 70 samples (a total of 200 samples). (A) The t-SNE plot of the original gene expression data with cells colored according to the true label. (BD) The t-SNE plots of the batch-effect-removed data |$ \textbf{D}_{k}$| with cells colored according to the true label for |$k=2$|⁠, |$k=3$| and |$k=4$|⁠. (E) Error bar plot of the ARNMI values for the dataset. REBET correctly estimated the true number of clusters.

For the 32 simulated datasets, we found that the ARI of REBET-SC3 of 25 datasets was the highest or the same as the ARI obtained by the other seven methods (Supplementary Figure S3). Table 3 shows the median ARI for the simulation datasets under the same parameter settings. The method with the highest ARI in each row is in boldface. We find that REBET can improve the clustering performance (Table 3).

Table 3

The median ARI values for SC3, CC, PINS, SNF, Seurat, SIMLR, ROGUE-SC3 and REBET-SC3 of simulated datasets with different dispersion between clusters. Eight datasets were simulated for each parameter setting*

Parameters Median ARI
de. probde. facLocSC3CCPINSSNFSeuratSIMLRROGUE -SC3REBET -SC3
Exp10.10.100000.850.190.960.92
Exp20.20.100.990.430.010.950.750.951
Exp30.10.200000.890.390.921
Exp40.20.2010.470.010.990.7311
Parameters Median ARI
de. probde. facLocSC3CCPINSSNFSeuratSIMLRROGUE -SC3REBET -SC3
Exp10.10.100000.850.190.960.92
Exp20.20.100.990.430.010.950.750.951
Exp30.10.200000.890.390.921
Exp40.20.2010.470.010.990.7311

*The method with the highest ARI in each row is in boldface.

Table 3

The median ARI values for SC3, CC, PINS, SNF, Seurat, SIMLR, ROGUE-SC3 and REBET-SC3 of simulated datasets with different dispersion between clusters. Eight datasets were simulated for each parameter setting*

Parameters Median ARI
de. probde. facLocSC3CCPINSSNFSeuratSIMLRROGUE -SC3REBET -SC3
Exp10.10.100000.850.190.960.92
Exp20.20.100.990.430.010.950.750.951
Exp30.10.200000.890.390.921
Exp40.20.2010.470.010.990.7311
Parameters Median ARI
de. probde. facLocSC3CCPINSSNFSeuratSIMLRROGUE -SC3REBET -SC3
Exp10.10.100000.850.190.960.92
Exp20.20.100.990.430.010.950.750.951
Exp30.10.200000.890.390.921
Exp40.20.2010.470.010.990.7311

*The method with the highest ARI in each row is in boldface.

We changed the method of removing batch effect in the second step to limma [31]. We found that the number of clusters can be estimated accurately on the majority of datasets. It accurately estimated the number of clusters for the five datasets of Exp1. For the datasets in Exp2 and Exp3, the number of clusters of seven datasets can be accurately estimated. It accurately estimated the number of clusters for all eight datasets in Exp4 (Supplementary Table S5).

scRNA-seq Datasets

We applied REBET and the other seven methods to 14 scRNA-seq datasets. By default, we set the maximum number of cell clusters K to 10. As shown in Table 4, the absolute deviation between the true number of clusters and the number of clusters estimated by REBET was less than or equal to 2 for all datasets except the Patel and Deng datasets. The estimated number of clusters for the eight datasets (Yan, Grover, Biase, Ramskold, Patel, Engel, Leng and Usoskin) were the same as the true number of clusters, whereas the deviation of the Cristea dataset was only one. For the Chung, Treutlein and Goolam datasets, the number of cell clusters estimated by REBET differed from the true number of cell clusters by 2. Furthermore, the number of cell clusters estimated by REBET were still closest to the true number of cell clusters in the Deng and Chung datasets.

Table 4

The performance of SC3-TW, CC, PINS, SNF, Seurat, SIMLR, ROGUE and REBET in determining the number of cell clusters from scRNA-seq data*

Datasets# True clustersEstimated number of cell clusters
SC3-TWCCPINSSNFSeuratSIMLRROGUEREBET
Zhou8610324642
Yan642223246
Treutlein54102324NA3
Ramskold732241427
Patel5177726425
Leng323234243
Grover232324322
Goolam532344467
Engel4332254NA4
Deng1062327367
Chung41592211192
Biase332322623
Usoskin4927273104
Cristea63934213987
Datasets# True clustersEstimated number of cell clusters
SC3-TWCCPINSSNFSeuratSIMLRROGUEREBET
Zhou8610324642
Yan642223246
Treutlein54102324NA3
Ramskold732241427
Patel5177726425
Leng323234243
Grover232324322
Goolam532344467
Engel4332254NA4
Deng1062327367
Chung41592211192
Biase332322623
Usoskin4927273104
Cristea63934213987

*For each dataset, the numbers highlighted in bold are the most accurate estimates of the number of cell clusters. ROGUE produced errors for Treutlein and Engel datasets, shown as an NA value.

Table 4

The performance of SC3-TW, CC, PINS, SNF, Seurat, SIMLR, ROGUE and REBET in determining the number of cell clusters from scRNA-seq data*

Datasets# True clustersEstimated number of cell clusters
SC3-TWCCPINSSNFSeuratSIMLRROGUEREBET
Zhou8610324642
Yan642223246
Treutlein54102324NA3
Ramskold732241427
Patel5177726425
Leng323234243
Grover232324322
Goolam532344467
Engel4332254NA4
Deng1062327367
Chung41592211192
Biase332322623
Usoskin4927273104
Cristea63934213987
Datasets# True clustersEstimated number of cell clusters
SC3-TWCCPINSSNFSeuratSIMLRROGUEREBET
Zhou8610324642
Yan642223246
Treutlein54102324NA3
Ramskold732241427
Patel5177726425
Leng323234243
Grover232324322
Goolam532344467
Engel4332254NA4
Deng1062327367
Chung41592211192
Biase332322623
Usoskin4927273104
Cristea63934213987

*For each dataset, the numbers highlighted in bold are the most accurate estimates of the number of cell clusters. ROGUE produced errors for Treutlein and Engel datasets, shown as an NA value.

The cell clusters are treated as batches, and we assumed that when the number of clusters is the same as the number of batches, the batch effect is removed from the original gene expression data, and the cells from different subpopulations will mix most uniformly. To illustrate the effect of removing batch effects, we applied t-SNE to reduce the dimensionality of batch-effect-removed data. Here, we take the Engel dataset as an example. First, the Engel dataset was clustered into k classes (⁠|$ k\in [2,10] $|⁠) and the clustering labels were recorded as |$\textbf{y}_{k}$|⁠. Then we used |$\textbf{y}_{k}$| as the batch variable to remove the batch effect from the input data for the Engel dataset and obtained the batch-effect-removed data |$\textbf{D}_{k}$|⁠. We visualized the batch-effect-removed data |$\textbf{D}_{k}$| with t-SNE (Figure 2B–2E). Cells are colored according to the true label. There are obvious clusters in data |$\textbf{D}_{2}$| (Figure 2B). As |$k$| goes from 2 to 4, the cells from different subpopulations are increasingly uniformly mixed together (Fig. 2B–2D). And ARNMI goes down (Figure 2F). When |$k=3$|⁠, the cells of the NKT0 subpopulation shown in the lower-left corner and the top still clustered together (Figure 2C). When |$k=5$|⁠, the cells from different subpopulations are mixed except for the cells from the NTK2 subpopulation with the color purple (Figure 2E). This is because |$k$| is larger than the true number of cell subpopulations, and the cells of the NTK2 subpopulation are initially partitioned into two clusters. Although the NTK2 subpopulation and NTK0 subpopulation are mixed in each cell subpopulation in Figure 2D. These findings validated our assumptions. Our proposed mixing degree coefficient ARNMI is minimum when |$k$| is equal to 4 (Figure 2F). Therefore, REBET accurately estimated that the number of cell subpopulations of the Engel dataset was 4.

Accuracy of the estimated number of cell clusters. (A–H) Correlations between the cluster numbers estimated by SC3-TW, CC, PINS, SNF, Seurat, SIMLR, ROGUE and REBET and the true cluster numbers, for the 14 benchmarking datasets. (I) The deviations between the estimated number of cell clusters and the true number of cell clusters. The median absolute deviation for the estimate by REBET of the number of cell clusters is the smallest of the eight methods.
Figure 5

Accuracy of the estimated number of cell clusters. (AH) Correlations between the cluster numbers estimated by SC3-TW, CC, PINS, SNF, Seurat, SIMLR, ROGUE and REBET and the true cluster numbers, for the 14 benchmarking datasets. (I) The deviations between the estimated number of cell clusters and the true number of cell clusters. The median absolute deviation for the estimate by REBET of the number of cell clusters is the smallest of the eight methods.

We compared our method with other methods in terms of the accuracy of the number of clusters estimation and the clustering results. We first compared the accuracy of the estimated number of cell clusters. Table 4 shows the performance of the SC3-TW, CC, PINS, SNF, Seurat, SIMLR, ROGUE and REBET methods in determining the number of cell clusters based on the gene expression data. The numbers highlighted in bold are the most accurate estimates of the number of cell clusters (The estimated number of cell clusters is the closest to the true number of cell clusters) in their respective rows. As shown in Table 4, REBET performed robustly across various datasets in estimating the number of cell clusters. For the 14 datasets, the number of cell clusters estimated by the REBET in 11 datasets was the most accurate. SNF and SIMLR made the most accurate estimation for four datasets, whereas SC3-TW only made the most accurate estimation for three datasets. To view the accuracy of the estimation by each method of the number of clusters more clearly, we drew a scatter plot with the number of clusters estimated by each method as the y-axis and the true number of clusters as the x-axis (Figure 5). SC3-TW, PINS and SNF tend to underestimate the number of clusters (Figure 5). For the fourteen datasets, the median absolute deviation of the number of cell clusters estimated by REBET was 0, which was the smallest among the eight methods (Figure 5I). And the variance of REBET was smaller. In short, REBET outperformed existing solutions according to the median absolute deviation.

For the outlier of REBET in Figure 5I, REBET incorrectly estimated the number of clusters of the Zhou dataset as 2 (Table 4). No matter what |$k$| is, after removing batch effect, the cells of different clusters do not mix, especially the cells colored by yellow (Supplementary Figure S1). The reason why REBET incorrectly estimated the number of clusters of the Zhou dataset may be that the initial partition obtained by clustering was not accurate enough, or that the batch effect is complex so that it can not be captured by ComBat.

Figure 6 shows the final clustering result evaluation index ARI of each method for the fourteen datasets. REBET-SC3 performed better than all of the other seven methods for eight datasets: Usoskin, Biase, Deng, Engel, Grover, Leng, Patel and Ramskold (Supplementary Table S6). Furthermore, REBET-SC3 performed better than at least five methods for the other two datasets (Yan and Treutlein) (Figure 6). Then, we observed that the ARI values of REBET-SC3 were higher than that of the original SC3, except for the Cristea, Yan, Treutlein, Goolam, Chung and Zhou datasets. For the Yan dataset, although the original SC3 had a higher ARI than REBET-SC3, REBET accurately estimated the number of cell clusters. The mean and median ARI of REBET-SC3 were both the highest among the compared methods (Figure 6).

Comparison of clustering performances of SC3, CC, PINS, SNF, Seurat, SIMLR, ROGUE-SC3 and REBET-SC3, measured by Adjusted Rand Index (ARI). ARI is employed to measure the consistency between cluster labels and true labels. The labels “Median” and “Mean” represent the median and mean values across all 14 datasets. The exact ARI values of the 14 datasets are shown in Supplementary Table S6.
Figure 6

Comparison of clustering performances of SC3, CC, PINS, SNF, Seurat, SIMLR, ROGUE-SC3 and REBET-SC3, measured by Adjusted Rand Index (ARI). ARI is employed to measure the consistency between cluster labels and true labels. The labels “Median” and “Mean” represent the median and mean values across all 14 datasets. The exact ARI values of the 14 datasets are shown in Supplementary Table S6.

Robustness

To test the robustness of REBET, we randomly selected 95% of the cells from each cluster in each dataset and reran the REBET and the other methods 10 times [55]. We found that REBET estimated a relatively stable and accurate number of cell clusters. For the 32 simulated datasets, there are 25 datasets in which the median absolute deviation between the true number of clusters and the number of clusters estimated by REBET is 0 (Figure 7). For the remaining seven datasets, the median absolute deviation of REBET for 5 datasets is still the smallest among the eight methods. When the median absolute deviation of the number of clusters estimated by REBET is 0, the variance of the number of clusters estimated by REBET is also close to 0 (Figure 7).

Accuracy and robustness of the estimated number of cell clusters in the simulated datasets. Panels are box plots of the deviation between the true number of cell clusters and the estimated number of cell numbers for the 32 simulated datasets, across 10 times of randomly sampling 95% of cells. Panels in the same row indicate the same separation parameter setting between clusters (Exp1, Exp2, Exp3 and Exp4), and panels in the same column indicate the true number of cell clusters $k$.
Figure 7

Accuracy and robustness of the estimated number of cell clusters in the simulated datasets. Panels are box plots of the deviation between the true number of cell clusters and the estimated number of cell numbers for the 32 simulated datasets, across 10 times of randomly sampling 95% of cells. Panels in the same row indicate the same separation parameter setting between clusters (Exp1, Exp2, Exp3 and Exp4), and panels in the same column indicate the true number of cell clusters |$k$|⁠.

For the 14 real scRNA-seq datasets, the median absolute deviation of the estimated number of clusters of REBET after 10 samplings is the minimum on the 8 datasets (Yan, Ramskold, Patel, Grover, Chung, Biase, Usoskin and Cristea) (Figure 8, Supplementary Table S7). Especially for Yan, Ramskold, Patel, Usoskin and Cristea datasets, REBET performed best in terms of robustness and accuracy (Figure 8). For the Grover dataset, REBET, ROGUE and SNF accurately estimated the number of cell clusters each time (Figure 8).

Discussion

In this study we proposed a new method, REBET, to estimate the number of clusters of cells based on clustering of batch-effect-removed data. We treated different clusters as different batches, and then removed the batch effects, that is, the between-cluster variance was removed. When the number of cell clusters is correctly estimated and the clustering result is treated as a batch variable, the batch effect can be completely removed. From the scatter plot of t-SNE dimensionality reduction for batch-effect-removed data, it can be seen that the data will not have obvious clustering and will be evenly mixed. The optimal number of clusters can be achieved by minimizing the mixed degree coefficient that we defined. The source codes for the REBET are freely available at https://github.com/genemine/REBET.

Determining the number of cell clusters is the main challenge in scRNA-seq data analysis. Whereas, both simulation studies and real data applications indicated that REBET can accurately estimate the number of cell clusters in the majority of data. We generated four sets of simulated datasets, each consisting of eight datasets. The median absolute deviation of the number of cell clusters estimated by REBET was 0. REBET accurately estimated the number of cell clusters in 25 datasets, and the estimated deviations for the other six datasets were small. No matter what the true number of clusters |$k$| is, REBET can robustly determine the number of clusters in the majority of data.

We also benchmarked REBET along with other seven subtyping methods (SC3-TW, CC, SNF, PINS, SIMLR, Seruat and ROGUE) for 14 published scRNA-Seq datasets. REBET achieves consistently satisfactory performance in the majority of datasets. The median deviation of the number of cell clusters estimated by REBET was 0, and the variance was the smallest among the eight methods. The resampling experiment evaluated the robustness of REBET in determining the number of clusters. We found that REBET estimated a relatively robust and accurate number of clusters. After determining the number of clusters with REBET, we used SC3 to cluster the gene expression data. The results of this study show that REBET can improve the clustering performance.

We tested whether our method can correctly estimate the number of clusters for data with batch effect. We generated eight simulation datasets with the true number of cell clusters ranging from 2 to 9. The cells of each cluster of these eight datasets are from 2 batches (Supplementary Figure S4). We first used ComBat to remove batch effects from the original data, and then use REBET to estimate the number of cell clusters. REBET accurately estimated the number of clusters in the all datasets (Supplementary Figure S5). For data with batch effect, we suggest removing batch effects before applying REBET.

Accuracy and robustness of the estimated number of cell clusters. Panels are box plots of the deviation between the true number of cell clusters and the estimated number of cell numbers for the 14 real datasets, across 10 times of randomly sampling 95% of cells.
Figure 8

Accuracy and robustness of the estimated number of cell clusters. Panels are box plots of the deviation between the true number of cell clusters and the estimated number of cell numbers for the 14 real datasets, across 10 times of randomly sampling 95% of cells.

REBET does not have any other parameters (except for the maximum number of clusters K). REBET relies on methods to remove batch effects and clustering methods. By default, the clustering method we choose is SC3, and the method used for removing batch effects is Combat. We can also use the limma method to remove batch effects (Supplementary Table S5). The eight methods consume almost the same memory. For example, when the number of cells is 200, the running time is about 30 minutes (Supplementary Table S8). However, other methods (SC3-TW, CC, Seurat, SIMLR and PINS) are as fast as a few seconds, and ROGUE only takes a few minutes (Supplementary Table S8). The SC3 method does not run fast, and SC3 needs to be run multiple times in REBET. So REBET is time-consuming.

With the explosive growth of scRNA-seq data, efficient analysis methods are demanding. The determination of the number of cell clusters is fundamental to scRNA-seq analysis [55]. REBET can accurately and robustly estimate the number of cell clusters, which will facilitate further analysis of scRNA-seq data. Indeed, there are still some topics that can be further expanded. First, we can improve the method to remove the batch effect, to remove the batch effect better. Second, we can select genes that accurately contain the information of cell clusters, which will contribute to more accurate initial partitioning and more thorough removal of batch effects. Thirdly, we can analyze the difference of isoform between different clusters, so we can consider adding isoform information when determining the number of cell clusters.

Key Points
  • Understanding the development of an organ requires characteristic cell subpopulations, so it is important to quantify single-cell subpopulations accurately. Motivated by the practice of batch effect removal, we considered cell clusters as batches. We developed REBET (REmoving Batch Effect and Testing), a new method for determining the number of cell clusters based on single-cell RNA-seq (scRNA-seq) data.

  • The performances of REBET and other seven state-of-the-art methods are compared on simulated datasets with different number of clusters and dispersion between clusters. No matter what the true number of clusters is, REBET can robustly determine the number of clusters in the majority of data.

  • The performances of REBET and other seven state-of-the-art methods are compared on 14 published scRNA-seq datasets. REBET can accurately and robustly estimate the number of cell clusters and outperform existing methods. By comparing the clustering evaluation metric adjusted Rand index (ARI), we find that REBET can improve the clustering performance.

Funding

This work is supported by National Natural Science Foundation of China (61972185, U1909208), 111 Project (No. B18059) and Hunan Provincial Science and Technology Program (2018WK4001).

Zhao-Yu Fang is a graduate student at the School of Mathematics and Statistics, Central South University, Hunan, China. Her research interests include bioinformatics and machine learning.

Cui-Xiang Lin is a graduate student in the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering at Central South University, Hunan, China. Her research interests include bioinformatics and systems biology.

Yun-Pei Xu is a graduate student in the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering at Central South University, Hunan, China. His research interests include single-cell RNA-seq data analysis and machine learning.

Hong-Dong Li is an associate professor in the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering at Central South University, Hunan, China. His research interests include alternative splicing, computational medicine and systems biology.

Qing-Song Xu is a professor at the School of Mathematics and Statistics, Central South University, Hunan, China. His research interests include bioinformatics, computational medicine and chemometrics.

References

1.

Ting
DT
,
Wittner
BS
,
Ligorio
M
, et al.
Single-Cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells
.
Cell Rep
2004
;
8
(
6
):
1905
18
.

2.

Tang
H
,
Zeng
T
,
Chen
L
.
High-order correlation integration for single-cell or bulk RNA-seq data analysis
.
Front Genet
2019
;
10
:
371
.

3.

Darmanis
S
,
Sloan
SA
,
Zhang
Y
, et al.
A survey of human brain transcriptome diversity at the single cell level
.
Proc Natl Acad Sci U S A
2015
;
112
(
23
):
7285
90
.

4.

Chung
W
,
Eum
HH
,
Lee
HO
, et al.
Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer
.
Nat Commun
2017
;
8
:15081.

5.

Ramsköld
D
,
Luo
S
,
Wang
YC
, et al.
Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells
.
Nat Biotechnol
2012
;
30
(
8
):
777
82
.

6.

Ester
M
,
Kriegel
HP
,
Sander
J
, et al.
A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining(KDD-96)
.
AAAI Press
1996
:
226
31
.

7.

Kohonen
T
.
The self-organizing map
.
Proc IEEE
1990
;
78
(
9
):
1464
80
.

8.

Luxburg
UV
.
A tutorial on spectral clustering
.
Stat Comput
2007
;
17
(
4
):
395
416
.

9.

Xie
J
,
Girshick
R
,
Farhadi
A
.
Unsupervised deep embedding for clustering analysis
. In:
33rd International Conference on Machine Learning, ICML 2016
.
International Machine Learning Society
,
2016
,
478
87
.

10.

Monti
S
,
Tamayo
P
,
Mesirov
J
, et al.
Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data
.
Mach Learn
2003
;
52
:
91
118
.

11.

Xu
C
,
Su
Z
.
Identification of cell types from single-cell transcriptomes using a novel clustering method
.
Bioinformatics
2015
;
31
(
12
):
1974
80
.

12.

Kiselev
VY
,
Andrews
TS
,
Hemberg
M
.
SC3: consensus clustering of single-cell RNA-seq data
.
Nat Methods
2017
;
14
(
5
):
483
6
.

13.

Wang
B
, et al.
Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning
.
Nat Methods
2017
;
14
(
4
):
414
6
.

14.

Butler
A
,
Hoffman
P
,
Smibert
P
, et al.
Integrating single-cell transcriptomic data across different conditions, technologies, and species
.
Nat Biotechnol
2018
;
36
(
5
):
411
20
.

15.

Stuart
T
,
Butler
A
,
Hoffman
P
, et al.
Comprehensive integration of single-cell data
.
Cell
2019
;
177
(
7
):
1888
1902.e21
.

16.

Kiselev
VY
,
Kirschner
K
,
Schaub
MT
, et al.
Challenges in unsupervised clustering of single-cell RNA-seq data
.
Nat Rev Genet
2019
;
20
(
5
):
273
82
.

17.

Lin
P
,
Troup
M
,
Ho
JWCIDR
.
Ultrafast and accurate clustering through imputation for single-cell RNA-seq data
.
Genome Biol
2017
;
18
:
59
.

18.

Tian
T
,
Wan
J
,
Song
Q
, et al.
Clustering single-cell RNA-seq data with a model-based deep learning approach
.
Nat Mach Intell
2019
;
1
(
4
):
191
8
.

19.

Dudoit
S
,
Fridlyand
J
.
A prediction-based resampling method for estimating the number of clusters in a dataset
.
Genome Biol
2002
;
3
(
7
):
research0036–1
.

20.

Tibshirani
R
,
Walther
G
,
Hastie
T
.
Estimating the number of clusters in a data set via the gap statistic
.
J R STAT SOC B
2001
;
63
(
2
):
411
23
.

21.

Chen
R
,
Yang
L
,
Goodison
S
, et al.
Deep-learning approach to identifying cancer subtypes using high-dimensional genomic data
.
Bioinformatics
2019
;
35
(
8
):
1269
77
.

22.

Tracy
CA
,
Widom
H
.
Level spacing distributions and the bessel kernel
.
Commun Math Phys
1994
;
161
(
2
):
289
309
.

23.

Yang
Y
,
Huh
R
,
Culpepper
HW
, et al.
SAFE-clustering: Single-cell aggregated (from ensemble) clustering for single-cell RNA-seq data
.
Bioinformatics
2018
;
35
(
8
):
1269
77
.

24.

Liu
B
,
Li
C
,
Li
Z
, et al.
An entropy-based metric for assessing the purity of single cell populations
.
Nat Commun
2020
;
11
:
3155
.

25.

Mo
Q
,
Wang
S
,
Seshan
VE
, et al.
Pattern discovery and cancer gene identification in integrated cancer genomic data
.
Proc Natl Acad Sci U S A
2013
;
110
(
11
):
4245
50
.

26.

Wang
B
,
Mezlini
AM
,
Demir
F
, et al.
Similarity network fusion for aggregating data types on a genomic scale
.
Nat Methods
2014
;
11
(
3
):
333
7
.

27.

Nguyen
T
,
Tagett
R
,
Diaz
D
, et al.
A novel approach for data integration and disease subtyping
.
Genome Res
2017
;
27
(
12
):
2025
39
.

28.

Price
EM
,
Robinson
WP
.
Adjusting for batch effects in DNA methylation microarray data, a lesson learned
.
Front Genet
2018
;
9
:
83
.

29.

Leek
JT
,
Storey
JD
.
Storey JDCapturing heterogeneity in gene expression studies by surrogate variable analysis
.
PLoS Genet
2007
;
3
(
9
):
1724
35
.

30.

Smyth
GK
,
Speed
T
.
Normalization of cDNA microarray data
.
Methods
2003
;
31
(
4
):
265
73
.

31.

Ritchie
ME
,
Phipson
B
,
Wu
D
, et al.
Limma powers differential expression analyses for RNA-sequencing and microarray studies
.
Nucleic Acids Res
2015
;
43
(
7
):
e47
7
.

32.

Johnson
WE
,
Li
C
,
Rabinovic
A
.
Adjusting batch effects in microarray expression data using empirical Bayes methods
.
Biostatistics
2007
;
8
(
1
):
118
27
.

33.

Tran
HTN
,
Ang
KS
,
Chevrier
M
, et al.
A benchmark of batch-effect correction methods for single-cell RNA sequencing data
.
Genome Biol
2020
;
21
:12.

34.

Leek
JT
,
Johnson
WE
,
Parker
HS
, et al.
The sva package for removing batch effects and other unwanted variation in high-throughput experiments
.
Bioinformatics
2012
;
28
(
6
):
882
3
.

35.

Ghosh
J
,
Acharya
A
.
Cluster ensembles
.
WIREs Data Mining Knowl Discov
2011
;
1
(
4
):
305
15
.

36.

Zappia
L
,
Phipson
B
,
Oshlack
A
.
Splatter: simulation of single-cell RNA sequencing data
.
Genome Biol
2017
;
18
:174.

37.

Xu
Y
,
Li
HD
,
Pan
Y
, et al.
BioRank: A similarity assessment method for single cell clustering
. In:
In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
IEEE
,
2018
,
157
62
.

38.

Zhou
F
,
Li
X
,
Wang
W
, et al.
Tracing haematopoietic stem cell formation at single-cell resolution
.
Nature
2016
;
533
(
7604
):
487
92
.

39.

Biase
FH
,
Cao
X
,
Zhong
S
.
Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing
.
Genome Res
2014
;
24
(
11
):
1787
96
.

40.

Yan
L
,
Yang
M
,
Guo
H
, et al.
Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells
.
Nat Struct Mol Biol
2013
;
20
(
9
):
1131
9
.

41.

Goolam
M
,
Scialdone
A
,
Graham
SJ
, et al.
Heterogeneity in Oct4 and Sox2 targets Biases cell fate in 4-cell mouse embryos
.
Cell
2016
;
165
(
1
):
61
74
.

42.

Deng
Q
,
Ramsköld
D
,
Reinius
B
, et al.
Single-cell RNA-Seq reveals dynamic, random monoallelic gene expression in mammalian cells
.
Science
2014
;
343
(
6167
):
193
6
.

43.

Grover
A
,
Sanjuan-Pla
A
,
Thongjuea
S
, et al.
Single-cell RNA sequencing reveals molecular and functional platelet bias of aged haematopoietic stem cells
.
Nat Commun
2016
;
7
:11075.

44.

Patel
AP
,
Tirosh
I
,
Trombetta
JJ
, et al.
Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma
.
Science
2014
;
344
(
6190
):
1396
401
.

45.

Treutlein
B
,
Brownfield
DG
,
Wu
AR
, et al.
Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq
.
Nature
2014
;
509
(
7500
):
371
5
.

46.

Engel
I
,
Seumois
G
,
Chavez
L
, et al.
Innate-like functions of natural killer T cell subsets result from highly divergent gene programs
.
Nat Immunol
2016
;
17
(
6
):
728
39
.

47.

Song
Y
,
Botvinnik
OB
,
Lovci
MT
, et al.
Single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation
.
Mol Cell
2017
;
67
(
1
):
148
61
.

48.

Leng
N
,
Chu
LF
,
Barry
C
, et al.
Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments
.
Nat Methods
2015
;
12
(
10
):
947
50
.

49.

Karaayvaz
M
,
Cristea
S
,
Gillespie
SM
, et al.
Unravelling subclonal heterogeneity and aggressive disease states in TNBC through single-cell RNA-seq
.
Nat Commun
2018
;
9
:3588.

50.

Usoskin
D
, et al.
Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing
.
Nat Neurosci
2015
;
18
(
1
):
145
.

51.

Wilkerson
MD
,
Hayes
DN
.
ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking
.
Bioinformatics
2010
;
26
(
12
):
1572
3
.

52.

Zelnik-Manor
L
,
Perona
P
.
Self-tuning spectral clustering
. In:
NIPS’04: Proceedings of the 17th International Conference on Neural Information Processing Systems
,
2004
,
1601
8
.

53.

Ludo
W
,
Jan
VEN
.
A smart local moving algorithm for large-scale modularity-based community detection
.
Eur Phys J B
2013
;
86
(
11
):
1
14
.

54.

Hubert
L
,
Arabie
P
.
Comparing partitions
.
J Classif
1985
;
2
:
193
218
.

55.

Baran
PM
,
Dennis
K
.
Random forest based similarity learning for single cell RNA sequencing data
.
Bioinformatics
2018
;
34
(
13
):
i79
88
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data