scGAD: a new task and end-to-end framework for generalized cell type annotation and discovery

Relationship between our annotation goal and the annotation goal of other methods

Setting	Task	Seen cell types	Novel cell types	Method
\|$\mathcal{C}_{t}\subset \mathcal{C}_{r}$\|(closed set)	Supervised classification	Cell type label	N/A	singleR
	Semi-supervised classification	Cell type label	N/A	scANVI
\|$\mathcal{C}_{r}\subseteq \mathcal{C}_{t}$\|(open set)	Semi-supervised clustering	Cluster label	Cluster label	scCNC
	Unsupervised clustering	Cluster label	Cluster label	scziDesk; scNAME
	Semi-supervised classification	Cluster label	Cluster label	MARS
	Semi-supervised classification	Cell type label	‘Unassigned’	scNym
	Transfer learning classification	Cell type label		scArches
Ours	Semi-supervised and transfer learning classification	Cell type label	Cluster label	scGAD

Setting	Task	Seen cell types	Novel cell types	Method
\|$\mathcal{C}_{t}\subset \mathcal{C}_{r}$\|(closed set)	Supervised classification	Cell type label	N/A	singleR
	Semi-supervised classification	Cell type label	N/A	scANVI
\|$\mathcal{C}_{r}\subseteq \mathcal{C}_{t}$\|(open set)	Semi-supervised clustering	Cluster label	Cluster label	scCNC
	Unsupervised clustering	Cluster label	Cluster label	scziDesk; scNAME
	Semi-supervised classification	Cluster label	Cluster label	MARS
	Semi-supervised classification	Cell type label	‘Unassigned’	scNym
	Transfer learning classification	Cell type label		scArches
Ours	Semi-supervised and transfer learning classification	Cell type label	Cluster label	scGAD

Table 1

Relationship between our annotation goal and the annotation goal of other methods

Setting	Task	Seen cell types	Novel cell types	Method
\|$\mathcal{C}_{t}\subset \mathcal{C}_{r}$\|(closed set)	Supervised classification	Cell type label	N/A	singleR
	Semi-supervised classification	Cell type label	N/A	scANVI
\|$\mathcal{C}_{r}\subseteq \mathcal{C}_{t}$\|(open set)	Semi-supervised clustering	Cluster label	Cluster label	scCNC
	Unsupervised clustering	Cluster label	Cluster label	scziDesk; scNAME
	Semi-supervised classification	Cluster label	Cluster label	MARS
	Semi-supervised classification	Cell type label	‘Unassigned’	scNym
	Transfer learning classification	Cell type label		scArches
Ours	Semi-supervised and transfer learning classification	Cell type label	Cluster label	scGAD

Setting	Task	Seen cell types	Novel cell types	Method
\|$\mathcal{C}_{t}\subset \mathcal{C}_{r}$\|(closed set)	Supervised classification	Cell type label	N/A	singleR
	Semi-supervised classification	Cell type label	N/A	scANVI
\|$\mathcal{C}_{r}\subseteq \mathcal{C}_{t}$\|(open set)	Semi-supervised clustering	Cluster label	Cluster label	scCNC
	Unsupervised clustering	Cluster label	Cluster label	scziDesk; scNAME
	Semi-supervised classification	Cluster label	Cluster label	MARS
	Semi-supervised classification	Cell type label	‘Unassigned’	scNym
	Transfer learning classification	Cell type label		scArches
Ours	Semi-supervised and transfer learning classification	Cell type label	Cluster label	scGAD

To evaluate the performance of scGAD fairly, we select various comparison baselines and carefully construct single and cross-data benchmarks based on simulation data and massive, highly imbalanced real scRNA-seq data. In a variety of well-designed simulation experiments, single-data annotation experiments, cross-data annotation experiments and biological interpretations, scGAD exhibits excellent across-the-board performance in annotation accuracy, clustering accuracy and overall accuracy, compared with three other competitive clustering methods and four competitive annotation methods. Not only that, but the experimental results also prove robustness and scalability, showing that scGAD can be widely used in various scenarios. Furthermore, by exploring how the clustering accuracy on reference data varies with |$|\mathcal{C}_{t}|$| on real datasets, we can validate that the accuracy reaches the maximum when |$|\mathcal{C}_{t}|$| takes the true value, demonstrating the effectiveness of our estimation method.

Methods

We first introduce some notation. Assuming that scRNA-seq data are divided into reference data and target data, they can come from the same scRNA-seq dataset or different scRNA-seq datasets. Reference data are recorded as |$\mathcal{D}_{r}=\{(x_{i}^{r},y_{i}^{r})_{i=1}^{n_{r}}\}$| where the label |$y_{i}^{r}$| is known and constitutes the class set |$\mathcal{C}_{r}$|⁠. Similarly, target data are recorded as |$\mathcal{D}_{t}=\{(x_{i}^{t},y_{i}^{t})_{i=1}^{n_{t}}\}$| where the label |$y_{i}^{t}$| is unknown and constitutes the class set |$\mathcal{C}_{t}$|⁠. Moreover, the gene dimension of every cell is m. In our problem, we assume that |$\mathcal{C}_{r} \subset \mathcal{C}_{t}$|⁠; furthermore, the seen label set is defined as |$\mathcal{C}_{s}=\mathcal{C}_{r} \cap \mathcal{C}_{t}$|⁠, and the novel label set is defined as |$\mathcal{C}_{n}=\mathcal{C}_{t}\backslash \mathcal{C}_{r}$|⁠. Unlike previous clustering and annotation methods, which always classify the cells of novel cell types as ‘unassigned’, our goal is to assign cell labels to all cells in target data, using cell types that may or may not be observed in the reference data.

First, considering the discrete, sparse, and large variance characteristics of scRNA-seq data, we use ZINB distribution to model this gene expression pattern, that is:

$$ \begin{align} p_{ZINB}(x_{ij}^{*}\lvert \pi_{ij},\mu_{ij},\theta_{ij})=\pi_{ij}\delta_{x_{ij}^{*}=0}+(1-\pi_{ij})p_{NB}(x_{ij}^{*}\lvert \mu_{ij},\theta_{ij}), \end{align} $$

(1)

where |$\delta _{x_{ij}^{*}=0}$| means that if |$x_{ij}^{*}=0$| then |$\delta _{x_{ij}^{*}=0}=1$|⁠, otherwise |$\delta _{x_{ij}^{*}=0}=0$|⁠, the definition of |$\Gamma (x)$| can be expressed as |$\Gamma (x)=\int _{0}^{+\infty }t^{x-1}e^{-t}dt$| and |$p_{NB}(x_{ij}^{*}\lvert \mu _{ij},\theta _{ij})$| is:

$$ \begin{align} p_{NB}(x_{ij}^{*}\lvert \mu_{ij},\theta_{ij})=\dfrac{\Gamma(x_{ij}^{*}+\theta_{ij})}{\Gamma(x_{ij}^{*}+1)\Gamma(\theta_{ij})}\left(\dfrac{\theta_{ij}}{\theta_{ij}+\mu_{ij}}\right)^{\theta_{ij}}\left(\dfrac{\mu_{ij}}{\theta_{ij}+\mu_{ij}}\right)^{x_{ij}^{*}}. \end{align} $$

(2)

Among them, |$x_{ij}^{*}\ (1\leq i \leq n_{r}+n_{t},1\leq j\leq m)$| represents the expression level of the |$i$|-th cell on the |$j$|-th gene, m represents the number of genes, and, |$\pi _{ij}$|⁠, |$\mu _{ij}$|⁠, and |$\theta _{ij}$| represent zero-inflated parameters, mean parameters and dispersion parameters, respectively, together constituting the parameters to be estimated for the model. Owing to the complex interaction between genes, these three sets of parameters are not independent of each other, but actually fall in a low-dimensional manifold. Therefore, we use the DCA model to estimate the parameters, and, at the same time, approximate the manifold so as to effectively reduce the dimension and denoise the scRNA-seq data [29]. Specifically, the preprocessed data matrix is denoted as X, and |$x_{ij}\ (1\leq i \leq n_{r}+n_{t},1\leq j\leq m)$| are elements of it. The DCA model is as follows. Let |$f_{e}(x):R^{m}\rightarrow R^{d}$| be the encoder function that maps the cells into the low-dimensional embedding space to get the embedding representation |$Z=f_{e}(X)$|⁠. Similarly, let |$f_{d}(x):R^{d}\rightarrow R^{m}$| be the decoder function to get the reconstructed variable |$X_{r}=f_{d}(Z)$|⁠. Then, we use the reconstructed variable |$X_{r}$| to estimate the parameters:

$$ \begin{align} \hat{\pi}&=sigmoid(w_{\pi}^{\prime}X_{r});\ \hat{\theta}=exp(w_{\theta}^{\prime}X_{r});\ \hat{\mu}=exp(w_{\mu}^{\prime}X_{r}) \end{align} $$

(3)

where |$w_{\pi }$|⁠, |$w_{\theta }$|⁠, and |$w_{\mu }$| are the corresponding weights. To accomplish data denoising, we naturally use the negative log-likelihood of ZINB distribution as the training loss [29], namely

$$ \begin{align} L_{zinb}=-\sum_{i=1}^{n_{r}+n_{t}}\sum_{j=1}^{m}p(x_{ij}^{*}\lvert \hat{\pi}_{ij},\hat{\mu}_{ij},\hat{\theta}_{ij}). \end{align} $$

(4)

As shown in previous work [30], the ZINB-based denoising network is less capable of capturing correlations across genes. Therefore, inspired by the recent progress in semi-supervised learning [31, 32], we, instead, use the same data augmentation strategy as that in scNAME to generate a different gene expression matrix. Specifically, we first construct two auxiliary matrices: a binary mask matrix |$B$| that samples from the Bernoulli distribution and a shuffled expression matrix |$X^{\prime}$| obtained by randomly shuffling the original data within each feature column. Then the augmented data matrix |$\tilde{X}$| can be generated as:

$$ \begin{align} \tilde{X}=B\ \odot X^{\prime} + (1 - B)\ \odot X, \end{align} $$

(5)

where |$\odot $| represents element-wise multiplication. After |$\tilde{X}$| passes through the denoising autoencoder network, we can get the estimation value |$\hat{B}$| of the mask matrix |$B$|⁠. To account for the dependencies of genes, binary cross-entropy loss is applied to train the model, that is,

$$ \begin{align} L_{mask} = -\sum_{i=1}^{n_{r}+n_{t}}\sum_{j=1}^{m} (B_{ij}\log \hat{B}_{ij} + (1 - B_{ij})\log (1 - \hat{B}_{ij})). \end{align} $$

(6)

To sum up, the overall loss of our basic framework is

$$ \begin{align} L_{den}&=L_{zinb}+L_{mask}. \end{align} $$

(7)

In order to perform a preliminary classification of each cell, we attach a classifier |$\Phi $| to the hidden layer, thereby mapping each embedding representation |$z_{i}$| to one of the |$\mathcal{C}_{r}\cup \mathcal{C}_{t}$| cell types together with a predictive probability vector |$p_{i}$|⁠. Without loss of generality, we assume that the first |$\mathcal{C}_{s}$| classification heads correspond to |$\mathcal{C}_{s}$| seen cell types, while the remaining |$\mathcal{C}_{n}$| heads correspond to novel cell types. The number of |$\mathcal{C}_{n}$| can be estimated and entered into the model as known information. The specific estimation method will be introduced in Section 3.3. Since we also take the corrupted data matrix as input, its embedding representation and predictive probability can be written as |$\tilde{z}_{i}$| and |$\tilde{p}_{i}$|⁠, respectively. In the next several sections, our proposed modules will depend on these pre-defined variables from different sample views.

Anchor-based self-supervised learning

Having known labels of the reference data allows us to directly train the classifier |$\Phi $| on the reference data and use standard cross-entropy loss as the objective function for training:

$$ \begin{align} L_{sce}=-\dfrac{1}{2n_{r}}\sum_{i=1}^{n_{r}}\sum_{j=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}(y_{ij}^{r}\log p_{ij}^{r}+y_{ij}^{r}\log \tilde{p}_{ij}^{r}), \end{align} $$

(8)

where |$y_{i}^{r}$| is written as the |$|\mathcal{C}_{r}\cup \mathcal{C}_{t}|$|-dimensional one-hot vector to facilitate calculations.

Since the classifier is only trained on seen cell types and lacks label supervision for novel cell types, this training discrepancy would lead to a prediction imbalance between the two kinds of cell types, which will directly cause the model to misclassify novel cells into seen cell types. To address this challenge, we find that cells from the same cell types are expected to be geometrically and semantically close to each other based on the manifold assumption, even though target data possess batch effect and cell type shift. Thus, we perform classification on both novel and seen cell types by mixing target and reference data together and exploiting the pairwise affinity between homogeneous cells. This leads us to the concept of anchor pairs whereby each unlabeled cell is either close to a labeled cell or aligned with the adjacent unlabeled cell, so that we can balance the discrimination state between seen cell types and novel cell types by perceiving the intrinsic discrimination correspondence across the whole dataset.

Here, we introduce MNN as geometric anchors between reference and target data. Assume that the embedding features of labeled and unlabeled data are |$\{z_{i}^{r}\}_{i=1}^{n_{r}}$| and |$\{z_{i}^{t}\}_{i=1}^{n_{t}}$|⁠, respectively. Then, for every cell |$z_{i}^{r} \in \mathcal{D}_{r}$|⁠, we find g closest cells to cell |$z_{i}^{r}$| from |$\mathcal{D}_{t}$| in Euclidean distance and denote this set as |$N_{g}^{i}$|⁠. Similarly, we construct |$M_{g}^{j}$|⁠; that is, we find g closest cells to |$z_{j}^{t}$| from |$\mathcal{D}_{r}$|⁠. For a pair of cells |$(z_{i}^{r},z_{j}^{t})$|⁠, if they are in the neighborhood of each other, namely

$$ \begin{align} \{z_{i}^{r}\in M_{g}^{j}\} \cap \{z_{j}^{t}\in N_{g}^{i}\}, \end{align} $$

(9)

then we call such pair an anchor pair between |$\mathcal{D}_{r}$| and |$\mathcal{D}_{t}$|⁠. However, the above definition of anchor pair is only based on geometric distance in the embedding space, while ignoring the semantic information of cells expressed by the classifier, which may lead to negative knowledge transfer. Therefore, we further restrict the anchor pair semantically, requiring that it not only be geometrically closest, but also semantically consistent, that is:

$$ \begin{align} \{z_{i}^{r}\in M_{g}^{j}\} \cap \{z_{j}^{t}\in N_{g}^{i}\} \ \& \ \mathop{\arg\max}\limits_{c}p_{ic}^{r}=\mathop{\arg\max}\limits_{c}p_{jc}^{t}, \end{align} $$

(10)

where the value range of c is |$1\leq c\leq |\mathcal{C}_{r}\cup \mathcal{C}_{t}|$|⁠, |$p_{ic}^{r}$| represents the probability that |$x_{i}^{r}$| belongs to class c, and |$p_{jc}^{t}$| is the same.

In order to represent the anchor pairs between |$\mathcal{D}_{r}$| and |$\mathcal{D}_{t}$| as a whole, we introduce the anchor adjacency matrix |$A_{rt}$|⁠, which is defined as:

$$ \begin{align} A_{rt}(i,j)=\left\{ \begin{array}{@{}rcl} 1 & & (z_{i}^{r},z_{j}^{t}) \ is \ anchor \ pair\\ 0 & & otherwise. \end{array} \right. \end{align} $$

(11)

Based on symmetry of the anchor pair, we can obtain the conclusion that |$A_{rt}=A_{tr}^{\prime }$|⁠. Similarly, we can also define anchor pairs within |$\mathcal{D}_{t}$|⁠; that is |$(z_{i}^{t},z_{j}^{t})$| is an anchor pair if and only if it meets the following conditions:

$$ \begin{align} \{z_{i}^{t}\in N_{g}^{j}\} \cap \{z_{j}^{t}\in N_{g}^{i}\} \ \& \ \mathop{\arg\max}\limits_{c}p_{ic}^{t}=\mathop{\arg\max}\limits_{c}p_{jc}^{t}. \end{align} $$

(12)

The anchor adjacency matrix within |$\mathcal{D}_{t}$| can be defined as |$A_{tt}$|⁠, and we have:

$$ \begin{align} A_{tt}(i,j)=\left\{ \begin{array}{@{}rcl} 1 & & (z_{i}^{t},z_{j}^{t}) \ is \ anchor \ pair\\ 0 & & otherwise. \end{array} \right. \end{align} $$

(13)

Although the anchor adjacency matrix within |$\mathcal{D}_{r}$||$A_{rr}$| can be directly obtained by the labels of reference samples, this information asymmetry may lead to an imbalance in the prediction of anchor pairs, so we still need to add geometric constraints on |$A_{rr}$|⁠, namely

$$ \begin{align} A_{rr}(i,j)=\left\{ \begin{array}{@{}rcl} 1 & & \{z_{i}^{r}\in M_{g}^{j}\} \cap \{z_{j}^{r}\in M_{g}^{i}\} \\SplitEq\& \ y_{i}^{r}=y_{j}^{r}\\ 0 & & otherwise \end{array} \right. \end{align} $$

(14)

where |$y_{i}^{r}$| denoted the label of cell |$x_{i}^{r}$| and is known.

Robust identification of anchor pairs is essential for improving the discrimination on unlabeled data. However, a cell may form anchor pairs with multiple cells. This means that, an incorrect anchor pair will affect judgment of the cell type. In order to reduce the influence of false anchor pairs on prediction, we normalize the anchor adjacency matrix with a connectivity-aware weight. First, we concatenate different kinds of anchor adjacency matrices together to get the global adjacency matrix:

$$ \begin{align} A= \begin{pmatrix} A_{rr}&A_{rt}\\ A_{tr}&A_{tt} \end{pmatrix}. \end{align} $$

(15)

Let |$r_{i}$| be the degree of the |$i$|-th cell, i.e., the number of anchor pairs where the |$i$|-th cell is located, and then we can get a diagonal matrix |$R=diag\{r_{1},r_{2},\cdots ,r_{n_{r}+n_{t}}\}$|⁠. Thus, the normalized symmetric Laplace matrix |$W$| can be denoted as |$W=I-R^{-\frac{1}{2}}AR^{-\frac{1}{2}}$|⁠, where |$w_{i,j}=-\dfrac{A(i,j)}{\sqrt{r_{i}r_{j}}} \quad i\neq j$|⁠. We regard |$-w_{i,j}$| as the weighted affinity values of anchor pairs, such that the larger the |$-w_{i,j}$|⁠, the stronger the connectivity between the |$i$|-th and |$j$|-th cells, thereby increasing reliability and, hence, the belief that they belong to the same cell type. Similar to anchor adjacency matrix A, we can also write W in the form of a block matrix to facilitate the following discussion:

$$ \begin{align} W= \begin{pmatrix} W_{rr}&W_{rt}\\ W_{tr}&W_{tt} \end{pmatrix}. \end{align} $$

(16)

To propagate the known label message from |$\mathcal{D}_{r}$| to |$\mathcal{D}_{t}$| and aggregate the novel semantic knowledge within |$\mathcal{D}_{t}$| itself, we use the prediction consistency between anchor pairs to supervise the discrimination training on |$\mathcal{D}_{t}$|⁠. Specifically, define the two prediction vector sets as |$P=\{p_{1}^{r},\cdots ,p_{n_{r}}^{r},p_{1}^{t},\cdots ,p_{n_{t}}^{t}\}$| and |$\tilde{P}=\{\tilde{p}_{1}^{r},\cdots ,\tilde{p}_{n_{r}}^{r},\tilde{p}_{1}^{t},\cdots ,\tilde{p}_{n_{t}}^{t}\}$|⁠, respectively. Then, the anchor-guided, self-supervised learning objective function can be given as:

$$ \begin{align} &L_{ssl}=-\dfrac{1}{2n_{t}}\sum_{i=1}^{n_{t}}\Big[\sum_{\{j:-W_{rt}(i,j)> 0\}}-W_{rt}(i,j)(\tilde{P}_{j}^{r\prime}P_{i}^{t}+P_{j}^{r\prime}\tilde{P}_{i}^{t})+\\ \nonumber &\sum_{\{j:-W_{tt}(i,j)> 0\}}-W_{tt}(i,j)(\tilde{P}_{j}^{t\prime}P_{i}^{t}+P_{j}^{t\prime}\tilde{P}_{i}^{t})\Big], \end{align} $$

(17)

where |$P_{j}$| and |$\tilde{P}_{j}$| are the |$j$|-th item in |$P$| and |$\tilde{P}$|⁠, respectively. With the affinity value as weight, we push the prediction consistent to their associated anchors with strong connectivity and, to a lesser degree, to those with weak connectivity. As stated above, we can get the training loss function in the discrimination space, namely

$$ \begin{align} L_{anc}=L_{sce}+L_{ssl}. \end{align} $$

(18)

Confidential prototypical self-supervised learning

Prediction consistency constraint on matched anchors can be regarded as cellular level self-supervised learning regularization. Clustering is mainly done from the perspective of paired cells by exploiting the attractive forces between the same cell types, which effectively solves the problem of imbalanced discriminative states between seen cell types and novel cell types. Nonetheless, cell annotation may still encounter some potential risks. First, since cells are clustered at the cellular level, individually deviating cells can have a severe impact on the overall result; that is, our model may be sensitive to abnormal samples. Second, since the weights of the classification head for novel cell types could be arbitrary, loss function |$L_{ssl}$| may not be able to very accurately separate seen and novel cell types in target data. Therefore, we propose performing prototypical self-supervised learning to implicitly encode the consensus category structure of labeled and unlabeled data into the embedding space by deep generative clustering. Compared with cellular level alignment, matching a sample to a prototype is more robust to abnormal samples and makes the optimization converge faster and smoother.

In order to perform prototypical self-supervised learning on |$\mathcal{D}_{t}$|⁠, we first perform k-means clustering on |$\{z_{i}^{t}\}_{i=1}^{n_{t}}$| in the embedding space to obtain the corresponding class prototypes |$\{\mu _{i}^{t}\}_{i=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}$| and the cluster label |$\{c_{i}\}_{i=1}^{n_{t}}$|⁠. However, since the embedding features are not fully discriminative in the early stages of training, the cluster labels are highly noisy. Relying on them to supervise model training can easily lead to error accumulation. Therefore, to avoid error propagation, we introduce a normalized confidence score to divide the target data into reliable set |$\mathcal{D}_{t}^{r}$| and fuzzy set |$\mathcal{D}_{t}^{f}$|⁠. This score is obtained by calculating the ratio of the distance between the sample and its own cluster center and the distance between the sample and the nearest non self-cluster center. The smaller the score, the more reliable the cluster label of the sample. In each training epoch, we default to select the top |$\alpha $|% samples with the lowest score in each cluster into |$\mathcal{D}_{t}^{r}$|⁠, otherwise into |$\mathcal{D}_{t}^{f}$|⁠. Then, for samples in |$\mathcal{D}_{t}^{r}$|⁠, we use their clustering labels to supervise learning their representations, whereas for samples in |$\mathcal{D}_{t}^{f}$|⁠, we perform assignment entropy minimization to move them near the intimate prototypes. Specifically, we use the t-distribution kernel function to measure the similarity between each cell and the prototypes. Assuming that |$q_{ij}^{t}$|⁠, |$\tilde{q}_{ij}^{t}$| represents the affinity between the i-th prototype and the j-th cell, we have

$$ \begin{align} q_{ij}^{t} = \frac{(1 + ||z_{i}^{t} - \mu_{j}^{t}||_{2}^{2})^{-1}}{\sum_{l=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}(1 + ||z_{i}^{t} - \mu_{l}^{t}||_{2}^{2})^{-1}} \end{align} $$

(19)

$$ \begin{align} \tilde{q}_{ij}^{t} = \frac{(1 + ||\tilde{z}_{i}^{t} - \mu_{j}^{t}||_{2}^{2})^{-1}}{\sum_{l=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}(1 + ||\tilde{z}_{i}^{t} - \mu_{l}^{t}||_{2}^{2})^{-1}}. \end{align} $$

(20)

Then the within |$\mathcal{D}_{t}$| prototypical self-supervised learning objective can be written as

$$ \begin{align} L_{pro}^{t} &= -\dfrac{1}{2n_{t}}\sum_{i=1}^{n_{t}}[(\log q_{ic_{i}}^{t}+\log \tilde{q}_{ic_{i}}^{t})I_{i\in \mathcal{D}_{t}^{r}} \\ \nonumber & + \sum_{j=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}(q_{ij}^{t}\log q_{ij}^{t}+\tilde{q}_{ij}^{t}\log\tilde{q}_{ij}^{t})I_{i\in \mathcal{D}_{t}^{f}}]. \end{align} $$

(21)

By minimizing the objective function, the model can improve the compactness within clusters and the separability between clusters in the embedding space. Moreover, using |$L_{pro}^{t}$| as regularization can effectively satisfy the issue of blurred classification boundaries across cell types that may be caused by the connection of paired samples.

To further ensure the stability of model against potential batch effect, the following dual alignment process is proposed to serve as a domain-invariant regularizer. The discrimination learning of target data under the classifier |$\Phi $| can be regarded as the process of aligning target data to some known prototypes in reference data. Similarly, we can also perform cell-prototype dual alignment from reference data |$\mathcal{D}_{r}$| to target data |$\mathcal{D}_{t}$| to implicitly enforce learning cell-type-aligned and discriminative features. Specifically, we discover positive matching, as well as negative matching, between cells in |$\mathcal{D}_{r}$| and cluster prototypes in |$\mathcal{D}_{t}$|⁠. Similarly, |$q_{i}^{r}$| and |$\tilde{q}_{i}^{r}$|⁠, the similarity distribution vectors on |$\mathcal{D}_{r}$|⁠, can be computed as:

$$ \begin{align} q_{ij}^{r} = \frac{(1 + ||z_{i}^{r} - \mu_{j}^{t}||_{2}^{2})^{-1}}{\sum_{l=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}(1 + ||z_{i}^{r} - \mu_{l}^{t}||_{2}^{2})^{-1}} \end{align} $$

(22)

$$ \begin{align} \tilde{q}_{ij}^{r} = \frac{(1 + ||\tilde{z}_{i}^{r} - \mu_{j}^{t}||_{2}^{2})^{-1}}{\sum_{l=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}(1 + ||\tilde{z}_{i}^{r} - \mu_{l}^{t}||_{2}^{2})^{-1}}. \end{align} $$

(23)

Then, to find a match for every reference sample, we can perform assignment entropy minimization on |$q_{ij}^{r}$| and |$\tilde{q}_{ij}^{r}$|⁠. Moreover, since the reference sample comes with accurate label information, we naturally anticipate that similarity distributions in the same labels will be consistent. So, in the case of consistent ground truth labels, we use the bi-norm to measure the discrepancy between predicted probability vectors and let the discrepancy be as small as possible. Therefore, our cross-data prototypical self-supervised objective for |$\mathcal{D}_{r}$| is defined as:

$$ \begin{align} L_{pro}^{r}=-\dfrac{1}{2n_{r}}\sum_{i=1}^{n_{r}}\sum_{j=1}^{|\mathcal{C}_{r}\cup \mathcal{C}_{t}|}(q_{ij}^{r}\log q_{ij}^{r}+\tilde{q}_{ij}^{r}\log\tilde{q}_{ij}^{r})\\ \nonumber +\dfrac{1}{n_{r}^{2}}\sum_{i=1}^{n_{r}}\sum_{i=s}^{n_{r}}\|q_{i}^{r}-\tilde{q}_{s}^{r}\|_{2}^{2}I_{\{y_{i}^{r}=y_{s}^{r}\}}. \end{align} $$

(24)

By minimizing |$L_{pro}^{r}$|⁠, the alignment between reference samples and target prototypes can be achieved, allowing the model to better learn discriminative features. Such bidirectional dual alignment facilitates better resistance to batch effect and cell type bias. Combining |$L_{pro}^{t}$| and |$L_{pro}^{r}$|⁠, we give the total prototypical self-supervised learning loss as:

$$ \begin{align} {L_{pro}}=L_{pro}^{t}+L_{pro}^{r}. \end{align} $$

(25)

Overall, together with the data denoising loss |$L_{den}$|⁠, we give the overall training objective as

$$ \begin{align} L_{tol}=L_{den}+L_{anc}+L_{pro} \end{align} $$

(26)

and the entire workflow of scGAD is displayed in Figure 1.

$Schematics of scGAD. (A) Overall model consists of an autoencoder and a classifier. Anchor-based loss $L_{anc}$ and Prototype-based loss $L_{pro}$ are designed to train the model. (B) The definition of geometic and semantic anchor pair. (C) We add confidence splitting procedure into prototypical learning to avoid error propagation.$

Figure 1

Schematics of scGAD. (A) Overall model consists of an autoencoder and a classifier. Anchor-based loss |$L_{anc}$| and Prototype-based loss |$L_{pro}$| are designed to train the model. (B) The definition of geometic and semantic anchor pair. (C) We add confidence splitting procedure into prototypical learning to avoid error propagation.

Estimation of |$|\mathcal{C}_{t}|$| value

In this section, we propose a solution to a challenging and under-investigated problem in cell annotation: estimating cell type number |$|\mathcal{C}_{t}|$| in target data. Almost all annotation methods assume that the number of |$|\mathcal{C}_{t}|$| is a prior. However, this assumption is unrealistic in the real world. This calls on the community to develop a method for estimating |$|\mathcal{C}_{t}|$|⁠. Our main idea derives from the information available in |$\mathcal{D}_{r}$|⁠. Specifically, we perform k-means clustering on the whole dataset |$\mathcal{D}_{r}\bigcup \mathcal{D}_{t}$| and then evaluate clustering accuracy only on the reference data |$\mathcal{D}_{r}$|⁠. Let |$|\hat{\mathcal{C}}_{t}|$| represent the estimated value. If |$|\hat{\mathcal{C}}_{t}|> |\mathcal{C}_{t}|$|⁠, then |$\hat{\mathcal{C}}_{t}-\mathcal{C}_{t}$| can be called the extra cell types, and all cells assigned to extra cell types are mispredicted. Similarly, if |$|\hat{\mathcal{C}}_{t}|< |\mathcal{C}_{t}|$|⁠, then |$\mathcal{C}_{t}-\hat{\mathcal{C}}_{t}$| can be called the extra true cell types, and all cells with those cell types are predicted incorrectly. Based on this analysis, whether |$|\hat{\mathcal{C}}_{t}|$| is higher or lower than |$|\mathcal{C}_{t}|$| will have a negative impact on the clustering accuracy of |$\mathcal{D}_{r}$|⁠. In other words, the clustering accuracy of |$\mathcal{D}_{r}$| will be maximized when |$ |\hat{\mathcal{C}}_{t}|=|\mathcal{C}_{t}|$|⁠. According to this intuition, we use |$AC=f(|\hat{\mathcal{C}}_{t}|,\mathcal{D}_{r})$| to measure the clustering accuracy of |$\mathcal{D}_{r}$|⁠, that is:

$$ \begin{align} AC=f(|\hat{\mathcal{C}}_{t}|,\mathcal{D}_{r})=\max_{\gamma\in \Gamma(\mathcal{C}_{r})}\dfrac{1}{n_{r}}\sum_{i=1}^{n_{r}}1_{\{y_{i}=\gamma(\hat{y}_{i})\}}, \end{align} $$

(27)

where |$n_{r}$| represents the number of the cells in the reference data, |$y_{i}$| and |$\hat{y}_{i}$| represent the true cell label and estimated cell label of the |$i$|-th sample, respectively, and |$\Gamma (\mathcal{C}_{r})$| is the set of all permutations of cell labels in |$\mathcal{C}_{r}$|⁠. Specifically, we optimize with Brent’s algorithm to find optimal |$|\hat{\mathcal{C}}_{t}|$| [33].

Experiment

To evaluate the annotation and clustering performance of scGAD, we apply it to both simulations and real data with various biological scenarios. The implementation details of scGAD can be seen in Supplementary Materials (Supplementary Section S5). Our task is to establish a new cell annotation setting for which no ready-to-use baselines exist. Thus, we compare scGAD with recently developed scRNA-seq clustering and annotation algorithms, including three clustering methods (scziDesk [34], scCNC [35] and scNAME [30]) and four annotation methods (MARS [24], ItClust [36], scNym [25] and scArches [19]). The execution of these methods along with their parameters are described in Supplementary Section 8. For clustering methods, only scCNC participates in training with both |$\mathcal{D}_{r}$| and |$\mathcal{D}_{t}$|⁠, whereas the other two train only on |$\mathcal{D}_{t}$|⁠. We report their clustering performance on seen and novel cell types. For annotation methods, we first use them to classify target cells into seen cell types and identify the ‘unassigned’ group. Next, we apply k-means clustering on the ‘unassigned’ group to obtain novel clusters. The detailed information of these baselines can be seen in Supplementary Section S1. To assess the model effect, we use two widely adopted metrics: annotation accuracy and clustering accuracy. We report classification accuracy on seen cell types and clustering accuracy on novel cell types for scGAD and annotation baselines, whereas we report clustering accuracy on both seen and novel cell types for clustering baselines. Specifically, to compute clustering accuracy, we apply the Hungarian algorithm to solve the optimal assignment problem [37]. When reporting accuracy on all cell types, we solve the optimal assignment problem on both seen and novel cell types. The reported accuracies are the mean values of three runs. Moreover, we also use Silhouette score (see Supplementary Section S3 for details) to measure the compactness within cluster and the separability between clusters.

Data preprocessing

For data preprocessing, all datasets we used have passed through quality control and are in the formats of counts. We unify raw cell type annotation by Cell Ontology [38], a structured controlled vocabulary for cell types. We next discard the genes that have expression values in less than one cell and then filter the cells without gene expression. Considering neural network numerical optimization stability, we need to transform discrete data into continuous smooth data. Specifically, we first normalize the total expression value of each cell to its median value and then take a natural log-transformation on data. Since most genes have little information to identify and describe cell types, we select the top 2000 highly variable genes according to their normalized dispersion value ranking. Finally, we transform the logarithm data into z-score data, which implies that each selected gene has zero mean and unit variance. All these steps are completed by using the Scanpy package[39]. Furthermore, we take the preprocessed data as neural network input and use its corresponding original count data for modeling.

Simulation study

In terms of existence of batch effect, or not, the simulation experiments can be divided into intra-data and inter-data experiments. We use the R package splatter [40] to generate simulated data. The splatter parameters are set to dropout.shape = -1, dropout.mid = 0.5 and de.facScale = 0.25. For intra-data experiment, we generate the dataset of which the cluster number is set to 8, with total 8000 cells and 2500 genes. According to identical cluster sizes, or not, we further divide the experiment settings into balance and imbalance settings. For balance setting, the size of each cell type is same in both reference data and target data with 1000 cells in it. For imbalance setting, the size of each cell type is arranged in a proportional sequence with a ratio of 0.8. Unless otherwise noted, the number of seen cell types is equal to the number of novel cell types, and the ratio of labeled data in seen cell types is 0.5, so that reference and target data can be determined. For inter-data experiment, there are two batches contained in the simulation data, one representing the reference data and the other representing the target data. The total number of cells in every batch is 4000 with 2500 genes and every batch has eight cell types. Considering the general situation, the number of seen cell types and the number of novel cell types are equal and the labeled ratio in reference data is set to 0.5. Especially, redundant cell types are removed from one of the batches to form the reference data after the seen cell types are identified. Similarly, the size of each cell type is same with 500 cells in it in balance setting, whereas the size of each cell type is arranged in a proportional sequence with a ratio of 0.8 in imbalance setting. For each setting, three datasets are generated with random seed from 1 to 3, and we record the average accuracy to measure annotation and clustering results.

All in all, scGAD generally outperforms the other seven methods, regardless of using any accuracy, which can be seen in the boxplot shown in Figure 2. In both the classification of seen cell types and the clustering of novel cell types, scGAD has a greater advantage. Its superiority is particularly reflected in superior clustering accuracy, while ensuring annotation accuracy, suggesting the effectiveness of scGAD in discovering novel cell types. Compared with the other four annotation methods, this result also fully demonstrates that the two-step method of first annotation and then clustering is not feasible for the new annotation task that we propose. scArches gains the worst performance among the eight methods. This can be attributed to its assumption that low-dimensional latent space follows a Gaussian mixture model, which limits the representation ability of the model. scCNC, Itclust and scNym achieve impressive results in annotation accuracy, but they catastrophically fail in clustering accuracy owing to their tendency to classify cells with ambiguous cell types as seen cell types. Although MARS, scNAME and scziDesk have excellent performance in both annotation and clustering accuracy, they only assign cells cluster labels, rather than true cell types. Besides, as the representative methods for unsupervised clustering, scNAME and scziDesk do not utilize label information in the reference dataset, causing them to lose competitiveness in annotation accuracy.

Figure 2

Boxplot of accuracy in simulation data. (A) Boxplot of annotation accuracy in data without batch effect. (B) Boxplot of clustering accuracy in data without batch effect. (C) Boxplot of annotation accuracy in data with batch effect. (D) Boxplot of clustering accuracy in data with batch effect.

When reference and target data come from the same dataset, we first evaluate the performance of scGAD and seven other competitive methods in balance experiments. For balance experiments, Figure 3A and B show the trend of annotation and clustering accuracy of the eight methods as the number of novel cell types changes, respectively. It can be seen from the line graph that scGAD retains substantial predominance, irrespective of the number of novel cell types. Especially, the annotation accuracy of scGAD is smooth, even with an increasing number of novel cell types, indicating its robustness. In contrast, significant fluctuations exist in the annotation accuracy of the other methods. Among them, MARS, scNym and scArches show a significant upward trend with increasing novel cell types. This result is in line with our speculation because these three methods focus on the annotation of seen cell types, and any increase in the number of novel cell types reduces their workload. However, Itclust shows a downward trend since a decrease in the number of seen cell types makes less information available, thereby affecting its judgment on target data. Although scCNC, scziDesk and scNAME gain a much smoother performance, their annotation accuracy is suboptimal, showing the superiority of scGAD in handling variation in the number of novel cell types. Similarly, when the number of novel cell types increases, only the clustering accuracy of scGAD, but not that of its competitors, remains excellent and stable, indicating that scGAD is superior to other methods in solving this new task we proposed. We also fixed the number of seen cell types and novel cell types in order to study method performance when changing the ratio of labeled data on seen cell types. This step allowed us to see the effects on annotation and clustering performance of the eight methods from another perspective. As shown in Figure 3C and D, when the proportion of labeled data increases, the annotation and clustering accuracy of scGAD remained stable, far outperforming seven other seven methods, so it can be concluded that scGAD is robust with substantial advantages over its competitors.

Figure 3

Line graph of accuracy without batch effect in ‘balance’ setting. (A,B) Line graph of annotation accuracy and clustering accuracy as the variation of number of novel cell types, respectively. (C,D) Line graph of annotation accuracy and clustering accuracy as the variation of labeled ratio, respectively.

The line graph of the imbalanced experiment is shown in Figure 4. In general, scGAD maintains substantial superiority over different settings of parameters, and its accuracy remains stable as parameters change, demonstrating its effectiveness in annotation and clustering. This result is in line with our expectation because the introduction of anchor pairs allows us to establish intrinsic correspondence between different cells unaffected by the different settings. In addition, only scGAD can deliver satisfactory results on both annotation and clustering accuracy since existing annotation methods always guarantee that annotation accuracy and existing clustering methods can not give cell type labels. However, compared with the balance setting, the clustering accuracy of scGAD varies, to some extent, with the change of labeled ratio under the imbalance setting, which illustrates that the imbalance setting adds a certain challenge to the discovery and clustering of novel cell types.

Figure 4

Line graph of accuracy without batch effect in ‘imbalance’ setting. (A,B) Line graph of annotation accuracy and clustering accuracy as the variation of number of novel cell types respectively; (C,D) Line graph of annotation accuracy and clustering accuracy as the variation of labeled ratio, respectively.

When reference and target data come from different datasets, all eight methods may be subject to batch effects. In the balance setting, it can be concluded from the line graphs in Figure 5 that scGAD performs best among all other methods, no matter which accuracy is used. Although the performance of each method decreases compared to the case without batch effect, scGAD has the smallest decrease among them, reflecting the effectiveness of scGAD in handling different batches of data. Especially, we can conclude that scGAD is relatively stable, irrespective of change in number of novel cell types from 2 to 6 or change in the ratio of labeled data from 0.2 to 1.0, without extreme fluctuations. These results demonstrate the ability of scGAD to withstand a variety of parameters and batch effects. However, other methods gain much worse performance and vary drastically with changing parameters. In particular, since scNAME and scziDesk are unsupervised clustering methods that do not use the reference dataset, they are not affected by the labeled ratio. However, they do not make full use of the information in reference data and their performance is,consequently, unsatisfactory. Compared with the balance setting, the accuracy and stability of each method are somewhat affected by the imbalance setting, which is reasonable because it is a complex setting, the results of which are shown in Figure 6. Despite this, scGAD maintains its dominant position in both annotation accuracy and clustering accuracy, indicating its strength in resisting disturbance. In contrast, other methods provide suboptimal results, at best, and cannot take into account either annotation accuracy or clustering accuracy, so they are not suitable for solving this new annotation task. In summary, scGAD still has the best performance in the presence of batch effects.

Figure 5

Line graph of accuracy with batch effect in ‘balance’ setting. (A,B) Line graph of annotation accuracy and clustering accuracy as the variation of number of novel cell types respectively; (C,D) line graph of annotation accuracy and clustering accuracy as the variation of labeled ratio, respectively.

Figure 6

Line graph of accuracy with batch effect in ‘imbalance’ setting. (A,B) Line graph of annotation accuracy and clustering accuracy as the variation of number of novel cell types respectively. (C,D) Line graph of annotation accuracy and clustering accuracy as the variation of labeled ratio, respectively.

To better demonstrate the predominance of scGAD, we display the results of scGAD against three other competing annotation methods under inter-data and balance setting via UMAP plot. Of the eight cell types, group 4 to group 7 are novel cell types, and the others are seen cell types. From Figure 7, we can easily see that only scGAD can separate all eight cell types thoroughly, which indicates its effectiveness in correcting batch effects and annotating unlabeled cells. For its failure to learn low-dimensional representation and intrinsic discrimination of data, MARS can only separate seen and novel cell types roughly, without further subdivision. scNym cannot separate seen and novel cell types clearly at all, because it focuses on annotation of seen cell types and ignores the discovery of novel cell types. Owing to the large difference between reference and target datasets, scArches cannot perform effective transfer learning and fails catastrophically. Therefore, it is easy to conclude that the new annotation task we propose is very challenging and that scGAD can solve it very well.

Figure 7

Two dimensional UMAP visualization plots of cell type ID on simulation data by MARS, scNym, scArches and scGAD under inter-data and balance setting, exploring the performance of four annotation methods in annotation and clustering.

To better explore the effectiveness of scGAD in correcting batch effects, we also draw UMAP plots with two groups representing reference and target data, respectively. From Figure 8, we can conclude that scGAD is unaffected by batch effects, fully demonstrating its superiority in handling multiple batches of data. Meanwhile, MARS and scArches are catastrophically hit by batch effects and can not complete annotations. Although scNym is somewhat resistant to batch effects, it can only provide suboptimal results by its inadequacy in discovering novel cell types, as noted above.

Figure 8

Two dimensional UMAP visualization plots of batch ID on simulation data by MARS, scNym, scArches and scGAD under inter-data and balance setting, exploring the effectiveness of four annotation methods in correcting batch effects.

To further evaluate the performance of each method, a Sankey diagram is also drawn to display the results of scGAD against three competing methods under inter-data and balance setting more intuitively. It can be clearly seen from Figure 9 that scGAD delivers satisfactory results, since it can assign cluster labels to cells of novel cell types, while annotating cells of seen cell types, showing, again, its superiority for this more practical task we proposed. In contrast, whether it is annotation or clustering accuracy, MARS and scArches perform poorly, indicating that they are not only unable to align cells of seen cell types but also have no predominance in discovering novel cell types. Although scNym shows its strengths in annotating cells of seen cell types, it shows very poor performance in discovering cells of novel cell types. Overall, scGAD is superior to all other methods on the simulated data and it can accurately complete clustering and classification tasks at the same time, which is currently impossible for other methods.

Figure 9

Sankey plot on simulation data by MARS, scNym, scArches and scGAD under inter-data and balance setting.

Real data analysis

We next tackle real data divided into intra-data and cross-data,.. For the former, we adopt 10 scRNA-seq datasets from different tissues and organs, which are obtained by different sequencing platforms and have their own characteristics. The cell numbers range from 6462 to 110704, and the cell type numbers vary from 9 to 45. Unless otherwise noted, we first divide the total number of cell types as evenly as possible so that the number of seen cell types and the number of novel cell types are equal. If this is not possible, we make the number of novel cell types one more. Then, we select 50% samples in seen cell types as |$\mathcal{D}_{r}$| and the rest as |$\mathcal{D}_{t}$|⁠. For the latter, we select 10 groups of datasets. Each group consists of a reference dataset and a target dataset, and batch effect exists between them. Their cell numbers range from a few thousand to tens of thousands, and the cell type number in the reference data is nearly half that of the target data. The basic information of these datasets can be seen in Supplementary Section S2. For each experiments, we set ten random seeds to repeatedly run each method three times and take the average of the three results as our final result.

Table 2 summarizes the performance of scGAD and other baselines on ten datasets. From the table, we can see that scGAD outperforms all other competing methods on most accuracies and datasets, indicating the effectiveness of our strategy. Especially, scGAD achieves maximum annotation accuracy and clustering accuracy on most datasets, showing the advantages of scGAD on both classification and clustering problems. In other words, the results indicate that the two-step method of first classifying and then clustering gives suboptimal results. For example, scNym always achieves impressive results in annotation accuracy, even surpassing scGAD on individual datasets. Such result is not surprising since scNym prefers to recognize some confusing cells as seen cell types, and the catastrophic failure of scNym in clustering accuracy is enough to illustrate this point. MARS can also annotate novel cells with clustering label, but scGAD still beats it with clear margins. This evidence again emphasizes the necessity of scGAD to balance the prediction states of cells in seen and novel cell types by introducing anchor pairs. The clustering methods scziDesk and scNAME do not utilize label information of the reference data and thus lose competitiveness in annotation accuracy on seen cell types. Furthermore, they only label each cluster with clustering label, rather than true cell-type label, which is a common limitation of clustering methods in real-world scenarios. In addition, we compared the overall accuracy of scGAD with other eight clustering and annotation methods, including CellTypist [50] on 10 real datasets in closed set setting, demonstrating the basic ability of scGAD to annotate seen cell types using reference datasets. The detailed results are in Supplementary Section 7. In summary, we can conclude that scGAD outperforms other baselines on three evaluation indexes in the intra-data annotation task.

Table 2

Performance comparison on ten real datasets in intra-data annotation experiments

	Cao [41]			Hochane [42]			Park [43]			Quake 10x [44]			Quake Smart-seq2 [44]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	overall
scziDesk	85.2	74.1	63.8	91.0	83.9	84.4	97.3	72.9	85.0	84.1	58.5	73.3	76.7	72.5	70.7
scNAME	79.1	78.5	75.1	91.0	85.7	84.3	56.6	79.5	73.4	82.2	62.0	69.8	76.5	61.2	63.5
scCNC	50.2	60.9	52.7	94.1	70.0	70.4	92.4	61.0	76.6	85.0	49.8	61.3	65.0	40.8	39.0
MARS	88.6	75.8	64.3	96.9	74.5	78.8	61.6	78.2	68.3	92.1	52.8	68.9	80.3	70.6	69.2
ItClust	14.5	62.3	56.6	33.1	49.5	45.3	76.3	42.6	62.4	70.5	47.3	52.3	32.7	55.5	49.4
scNym	99.2	69.4	66.2	98.9	49.8	46.0	99.8	48.9	45.2	98.4	52.8	60.8	96.9	59.2	56.4
scArches	73.4	46.5	52.2	82.6	91.5	89.3	86.6	36.8	65.7	88.3	56.6	69.1	72.3	54.7	57.2
scGAD	92.4	81.0	78.3	99.5	93.8	84.9	97.7	80.9	88.4	95.8	62.1	83.7	91.3	76.3	75.7
	Wagner [45]			Zeisel [46]			Zheng [47]			Chen [48]			Guo [49]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall
scziDesk	72.1	48.2	54.6	78.0	89.1	82.5	57.7	52.0	45.7	78.8	92.2	90.8	99.6	80.0	75.1
scNAME	74.4	48.4	54.8	93.4	88.1	84.3	57.7	52.0	45.7	79.8	92.1	91.1	99.8	76.4	72.0
scCNC	85.8	51.4	55.0	77.2	50.8	55.0	61.5	56.6	48.6	79.8	92.1	91.1	99.8	76.4	72.0
MARS	81.6	42.6	50.9	98.9	83.3	84.1	72.5	59.5	50.6	80.1	94.1	90.4	99.7	72.6	68.8
ItClust	18.5	32.2	36.4	52.7	57.3	54.1	20.7	50.9	43.8	20.8	82.5	72.2	19.5	74.7	67.8
scNym	96.5	42.3	44.2	99.6	64.6	62.7	98.8	56.5	51.4	97.4	77.7	72.2	99.8	60.4	56.8
scArches	58.1	35.9	41.7	78.1	60.0	63.3	60.4	72.9	68.4	74.4	85.6	82.9	61.0	78.9	74.8
scGAD	92.1	49.6	56.4	99.7	89.7	88.1	97.6	74.8	66.3	98.3	93.0	94.1	99.9	82.4	75.2

	Cao [41]			Hochane [42]			Park [43]			Quake 10x [44]			Quake Smart-seq2 [44]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	overall
scziDesk	85.2	74.1	63.8	91.0	83.9	84.4	97.3	72.9	85.0	84.1	58.5	73.3	76.7	72.5	70.7
scNAME	79.1	78.5	75.1	91.0	85.7	84.3	56.6	79.5	73.4	82.2	62.0	69.8	76.5	61.2	63.5
scCNC	50.2	60.9	52.7	94.1	70.0	70.4	92.4	61.0	76.6	85.0	49.8	61.3	65.0	40.8	39.0
MARS	88.6	75.8	64.3	96.9	74.5	78.8	61.6	78.2	68.3	92.1	52.8	68.9	80.3	70.6	69.2
ItClust	14.5	62.3	56.6	33.1	49.5	45.3	76.3	42.6	62.4	70.5	47.3	52.3	32.7	55.5	49.4
scNym	99.2	69.4	66.2	98.9	49.8	46.0	99.8	48.9	45.2	98.4	52.8	60.8	96.9	59.2	56.4
scArches	73.4	46.5	52.2	82.6	91.5	89.3	86.6	36.8	65.7	88.3	56.6	69.1	72.3	54.7	57.2
scGAD	92.4	81.0	78.3	99.5	93.8	84.9	97.7	80.9	88.4	95.8	62.1	83.7	91.3	76.3	75.7
	Wagner [45]			Zeisel [46]			Zheng [47]			Chen [48]			Guo [49]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall
scziDesk	72.1	48.2	54.6	78.0	89.1	82.5	57.7	52.0	45.7	78.8	92.2	90.8	99.6	80.0	75.1
scNAME	74.4	48.4	54.8	93.4	88.1	84.3	57.7	52.0	45.7	79.8	92.1	91.1	99.8	76.4	72.0
scCNC	85.8	51.4	55.0	77.2	50.8	55.0	61.5	56.6	48.6	79.8	92.1	91.1	99.8	76.4	72.0
MARS	81.6	42.6	50.9	98.9	83.3	84.1	72.5	59.5	50.6	80.1	94.1	90.4	99.7	72.6	68.8
ItClust	18.5	32.2	36.4	52.7	57.3	54.1	20.7	50.9	43.8	20.8	82.5	72.2	19.5	74.7	67.8
scNym	96.5	42.3	44.2	99.6	64.6	62.7	98.8	56.5	51.4	97.4	77.7	72.2	99.8	60.4	56.8
scArches	58.1	35.9	41.7	78.1	60.0	63.3	60.4	72.9	68.4	74.4	85.6	82.9	61.0	78.9	74.8
scGAD	92.1	49.6	56.4	99.7	89.7	88.1	97.6	74.8	66.3	98.3	93.0	94.1	99.9	82.4	75.2

Table 2

Performance comparison on ten real datasets in intra-data annotation experiments

	Cao [41]			Hochane [42]			Park [43]			Quake 10x [44]			Quake Smart-seq2 [44]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	overall
scziDesk	85.2	74.1	63.8	91.0	83.9	84.4	97.3	72.9	85.0	84.1	58.5	73.3	76.7	72.5	70.7
scNAME	79.1	78.5	75.1	91.0	85.7	84.3	56.6	79.5	73.4	82.2	62.0	69.8	76.5	61.2	63.5
scCNC	50.2	60.9	52.7	94.1	70.0	70.4	92.4	61.0	76.6	85.0	49.8	61.3	65.0	40.8	39.0
MARS	88.6	75.8	64.3	96.9	74.5	78.8	61.6	78.2	68.3	92.1	52.8	68.9	80.3	70.6	69.2
ItClust	14.5	62.3	56.6	33.1	49.5	45.3	76.3	42.6	62.4	70.5	47.3	52.3	32.7	55.5	49.4
scNym	99.2	69.4	66.2	98.9	49.8	46.0	99.8	48.9	45.2	98.4	52.8	60.8	96.9	59.2	56.4
scArches	73.4	46.5	52.2	82.6	91.5	89.3	86.6	36.8	65.7	88.3	56.6	69.1	72.3	54.7	57.2
scGAD	92.4	81.0	78.3	99.5	93.8	84.9	97.7	80.9	88.4	95.8	62.1	83.7	91.3	76.3	75.7
	Wagner [45]			Zeisel [46]			Zheng [47]			Chen [48]			Guo [49]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall
scziDesk	72.1	48.2	54.6	78.0	89.1	82.5	57.7	52.0	45.7	78.8	92.2	90.8	99.6	80.0	75.1
scNAME	74.4	48.4	54.8	93.4	88.1	84.3	57.7	52.0	45.7	79.8	92.1	91.1	99.8	76.4	72.0
scCNC	85.8	51.4	55.0	77.2	50.8	55.0	61.5	56.6	48.6	79.8	92.1	91.1	99.8	76.4	72.0
MARS	81.6	42.6	50.9	98.9	83.3	84.1	72.5	59.5	50.6	80.1	94.1	90.4	99.7	72.6	68.8
ItClust	18.5	32.2	36.4	52.7	57.3	54.1	20.7	50.9	43.8	20.8	82.5	72.2	19.5	74.7	67.8
scNym	96.5	42.3	44.2	99.6	64.6	62.7	98.8	56.5	51.4	97.4	77.7	72.2	99.8	60.4	56.8
scArches	58.1	35.9	41.7	78.1	60.0	63.3	60.4	72.9	68.4	74.4	85.6	82.9	61.0	78.9	74.8
scGAD	92.1	49.6	56.4	99.7	89.7	88.1	97.6	74.8	66.3	98.3	93.0	94.1	99.9	82.4	75.2

	Cao [41]			Hochane [42]			Park [43]			Quake 10x [44]			Quake Smart-seq2 [44]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	overall
scziDesk	85.2	74.1	63.8	91.0	83.9	84.4	97.3	72.9	85.0	84.1	58.5	73.3	76.7	72.5	70.7
scNAME	79.1	78.5	75.1	91.0	85.7	84.3	56.6	79.5	73.4	82.2	62.0	69.8	76.5	61.2	63.5
scCNC	50.2	60.9	52.7	94.1	70.0	70.4	92.4	61.0	76.6	85.0	49.8	61.3	65.0	40.8	39.0
MARS	88.6	75.8	64.3	96.9	74.5	78.8	61.6	78.2	68.3	92.1	52.8	68.9	80.3	70.6	69.2
ItClust	14.5	62.3	56.6	33.1	49.5	45.3	76.3	42.6	62.4	70.5	47.3	52.3	32.7	55.5	49.4
scNym	99.2	69.4	66.2	98.9	49.8	46.0	99.8	48.9	45.2	98.4	52.8	60.8	96.9	59.2	56.4
scArches	73.4	46.5	52.2	82.6	91.5	89.3	86.6	36.8	65.7	88.3	56.6	69.1	72.3	54.7	57.2
scGAD	92.4	81.0	78.3	99.5	93.8	84.9	97.7	80.9	88.4	95.8	62.1	83.7	91.3	76.3	75.7
	Wagner [45]			Zeisel [46]			Zheng [47]			Chen [48]			Guo [49]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall
scziDesk	72.1	48.2	54.6	78.0	89.1	82.5	57.7	52.0	45.7	78.8	92.2	90.8	99.6	80.0	75.1
scNAME	74.4	48.4	54.8	93.4	88.1	84.3	57.7	52.0	45.7	79.8	92.1	91.1	99.8	76.4	72.0
scCNC	85.8	51.4	55.0	77.2	50.8	55.0	61.5	56.6	48.6	79.8	92.1	91.1	99.8	76.4	72.0
MARS	81.6	42.6	50.9	98.9	83.3	84.1	72.5	59.5	50.6	80.1	94.1	90.4	99.7	72.6	68.8
ItClust	18.5	32.2	36.4	52.7	57.3	54.1	20.7	50.9	43.8	20.8	82.5	72.2	19.5	74.7	67.8
scNym	96.5	42.3	44.2	99.6	64.6	62.7	98.8	56.5	51.4	97.4	77.7	72.2	99.8	60.4	56.8
scArches	58.1	35.9	41.7	78.1	60.0	63.3	60.4	72.9	68.4	74.4	85.6	82.9	61.0	78.9	74.8
scGAD	92.1	49.6	56.4	99.7	89.7	88.1	97.6	74.8	66.3	98.3	93.0	94.1	99.9	82.4	75.2

For cross-data experiments with batch effect, we select 14 sets of data to conduct experiments. Each set of data consists of two scRNA-seq datasets, representing reference and target datasets, respectively. As shown in Table ??, the batch effect does, indeed, have a certain influence on the accuracy of all methods. This is reasonable since batch effect makes cells of the same cell type move away from each other, thus leading to a more challenging annotation task. Even so, scGAD still achieves an impressive performance that is only slightly affected by batch effect, indicating that our confidential prototypical self-supervised learning strategy removes batch effect to a certain extent implicitly. However, the performance of other methods is severely degraded. Especially, MARS and Itclust separate reference and target data in training, elevating susceptibility to batch effect and, in turn, leading to model overfitting and false cell type annotations. Moreover, Sankey and UMAP diagrams are introduced to see the performance of each method on annotation more intuitively. From the Sankey diagram in Figure 10, we can conclude that scGAD can cluster cells of novel cell types and assign them with cluster labels, rather than ‘unassigned’, while classifying the cells of seen cell types. However, compared to scGAD, the performance of other methods is much worse, fully demonstrating that the strategy of scGAD is more effective than the two-stage method. Meanwhile, the performance of MARS has a full-blown collapse, since it cannot perform either annotation or clustering on any cell type and always aligns cells from the same cell type into different groups. scNym and scArches can only label cells of some cell types with the correct cell type, likely resulting from the subjectivity of threshold setting and wrong assumptions about embedding space. We next draw Sankey diagrams on Vento Tormo Smart-seq2, as reference data, and Vento Tormo 10x, as target data. From Supplementary Figure 1, we see that scGAD beats competitive methods with clear margins, validating its superiority. Furthermore, to observe the annotation and clustering performance of scGAD more intuitively, we use UMAP to visualize the learned embedding features of scGAD and three other competitive baselines in the two-dimensional space, as depicted in Figure 11. The visualization plots provide clear evidence to confirm our judgment that scGAD can give novel cells with clustering labels, while annotating the cells of seen cell types accurately and outperform the other methods accordingly. To better display the ability of methods to remove batch effects, we utilize UMAP to embed the latent features extracted by scGAD, MARS, scNym and scArches into a shared two-dimensional plane. As shown in Figure 12, even if the two sets of data are from different batches, scGAD can group cells of the same cell type together, while the other three methods misclassify them into different groups, owing to the batch effect. Similar patterns are repeated for other datasets (Supplementary Figure 2). These results demonstrate that scGAD can effectively overcome the impact of batch effects and discover novel cell types, while performing cell annotation, implying its superiority. Moreover, we conduct significance test of the improvements in results and use autorank package [51] to rank different methods with statistical significance by comparing the overall accuracy results in both Tables 2 and ??. The detailed results can be seen in Supplementary Section S6. Overall, scGAD still achieves consistently better results than the other methods for real data, validating its dominance in this new task.

Figure 10

Sankey plot for four annotation methods on Quake Smart-seq2 Limb Muscle as reference data and Quake 10x Limb Muscle as target data.

Figure 11

UMAP feature visualization for four methods on one cross-data annotation task: from Vento Smart-seq2 to Vento 10x.

Figure 12

UMAP to explore the effectiveness of four annotation methods in correcting batch effects: from Vento Smart-seq2 to Vento 10x.

Table 3

Performance comparison in cross-data annotation experiments. ‘R’: reference data; ‘T’: target data

	Enge (R) [52] Baron_human (T) [53]			Lawlor (R) [54] Baron_human (T)			Muraro (R) [55] Baron_human (T)			Xin (R) [56] Baron_human (T)			Vento _10x (R) [57] Vento_Smart-seq2 (T) [57]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	81.6	81.5	81.6	81.3	80.3	81.2	81.6	81.5	81.6	75.1	84.2	81.3	81.7	79.0	81.5
scNAME	81.0	81.9	81.2	80.7	79.4	79.9	95.8	71.4	91.2	73.6	85.3	77.7	87.4	80.3	86.0
scCNC	47.6	38.5	38.8	54.0	43.9	40.9	75.0	40.8	61.1	46.6	54.7	36.5	92.1	63.4	84.8
MARS	90.3	86.2	79.8	80.9	90.7	80.3	79.5	82.3	80.0	93.6	78.0	88.6	71.3	78.6	70.3
ItClust	83.4	52.3	72.7	88.5	48.9	77.1	80.9	56.4	69.2	84.5	80.7	84.1	79.8	50.7	70.4
scNym	97.7	71.9	84.7	90.2	52.2	82.8	88.2	55.5	63.9	97.9	40.0	52.3	98.7	66.5	75.9
scArches	89.2	58.0	80.3	47.3	66.8	52.5	89.3	52.8	80.9	61.5	52.2	52.7	87.6	52.9	78.2
scGAD	96.3	82.6	93.8	96.6	82.6	90.3	96.5	83.2	93.7	93.6	86.0	91.3	98.8	80.5	92.4
	Vento Smart-seq2 (R) Vento 10x (T)			Plasschaert (R) [58] Montoro 10x (T) [59]			Mammary Smart-seq2 (R) [44] Mammary 10x (T) [44]			Haber largecell (R) [60] Haber region (T) [60]			Haber region (R) Haber largecell (T)
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	88.4	98.4	90.9	67.9	74.6	68.3	94.0	89.3	91.2	43.9	60.9	53.0	85.3	80.8	71.0
scNAME	86.5	98.2	92.8	95.1	90.2	96.0	93.7	99.0	96.8	46.0	62.6	54.3	89.1	80.9	71.6
scCNC	83.4	47.1	43.7	79.7	73.1	73.0	92.4	65.5	76.2	62.7	69.4	55.9	75.7	50.4	51.6
MARS	94.5	78.6	83.8	88.6	94.5	89.1	81.5	97.5	86.9	57.1	75.1	68.2	83.8	64.1	67.1
ItClust	64.3	75.0	58.2	90.1	75.1	83.2	36.8	70.5	67.2	53.4	58.2	56.4	6.2	64.5	53.6
scNym	98.1	70.4	80.6	96.1	77.7	83.1	95.1	48.6	49.8	95.8	44.4	51.2	84.2	53.7	53.0
scArches	83.4	66.8	75.2	91.4	67.4	85.3	62.0	55.5	59.0	72.3	51.7	59.6	71.9	45.4	50.4
scGAD	98.4	99.2	97.4	93.6	94.0	96.2	94.1	99.1	97.1	86.2	97.4	93.6	89.8	81.5	72.2

	Enge (R) [52] Baron_human (T) [53]			Lawlor (R) [54] Baron_human (T)			Muraro (R) [55] Baron_human (T)			Xin (R) [56] Baron_human (T)			Vento _10x (R) [57] Vento_Smart-seq2 (T) [57]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	81.6	81.5	81.6	81.3	80.3	81.2	81.6	81.5	81.6	75.1	84.2	81.3	81.7	79.0	81.5
scNAME	81.0	81.9	81.2	80.7	79.4	79.9	95.8	71.4	91.2	73.6	85.3	77.7	87.4	80.3	86.0
scCNC	47.6	38.5	38.8	54.0	43.9	40.9	75.0	40.8	61.1	46.6	54.7	36.5	92.1	63.4	84.8
MARS	90.3	86.2	79.8	80.9	90.7	80.3	79.5	82.3	80.0	93.6	78.0	88.6	71.3	78.6	70.3
ItClust	83.4	52.3	72.7	88.5	48.9	77.1	80.9	56.4	69.2	84.5	80.7	84.1	79.8	50.7	70.4
scNym	97.7	71.9	84.7	90.2	52.2	82.8	88.2	55.5	63.9	97.9	40.0	52.3	98.7	66.5	75.9
scArches	89.2	58.0	80.3	47.3	66.8	52.5	89.3	52.8	80.9	61.5	52.2	52.7	87.6	52.9	78.2
scGAD	96.3	82.6	93.8	96.6	82.6	90.3	96.5	83.2	93.7	93.6	86.0	91.3	98.8	80.5	92.4
	Vento Smart-seq2 (R) Vento 10x (T)			Plasschaert (R) [58] Montoro 10x (T) [59]			Mammary Smart-seq2 (R) [44] Mammary 10x (T) [44]			Haber largecell (R) [60] Haber region (T) [60]			Haber region (R) Haber largecell (T)
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	88.4	98.4	90.9	67.9	74.6	68.3	94.0	89.3	91.2	43.9	60.9	53.0	85.3	80.8	71.0
scNAME	86.5	98.2	92.8	95.1	90.2	96.0	93.7	99.0	96.8	46.0	62.6	54.3	89.1	80.9	71.6
scCNC	83.4	47.1	43.7	79.7	73.1	73.0	92.4	65.5	76.2	62.7	69.4	55.9	75.7	50.4	51.6
MARS	94.5	78.6	83.8	88.6	94.5	89.1	81.5	97.5	86.9	57.1	75.1	68.2	83.8	64.1	67.1
ItClust	64.3	75.0	58.2	90.1	75.1	83.2	36.8	70.5	67.2	53.4	58.2	56.4	6.2	64.5	53.6
scNym	98.1	70.4	80.6	96.1	77.7	83.1	95.1	48.6	49.8	95.8	44.4	51.2	84.2	53.7	53.0
scArches	83.4	66.8	75.2	91.4	67.4	85.3	62.0	55.5	59.0	72.3	51.7	59.6	71.9	45.4	50.4
scGAD	98.4	99.2	97.4	93.6	94.0	96.2	94.1	99.1	97.1	86.2	97.4	93.6	89.8	81.5	72.2

Table 3

Performance comparison in cross-data annotation experiments. ‘R’: reference data; ‘T’: target data

	Enge (R) [52] Baron_human (T) [53]			Lawlor (R) [54] Baron_human (T)			Muraro (R) [55] Baron_human (T)			Xin (R) [56] Baron_human (T)			Vento _10x (R) [57] Vento_Smart-seq2 (T) [57]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	81.6	81.5	81.6	81.3	80.3	81.2	81.6	81.5	81.6	75.1	84.2	81.3	81.7	79.0	81.5
scNAME	81.0	81.9	81.2	80.7	79.4	79.9	95.8	71.4	91.2	73.6	85.3	77.7	87.4	80.3	86.0
scCNC	47.6	38.5	38.8	54.0	43.9	40.9	75.0	40.8	61.1	46.6	54.7	36.5	92.1	63.4	84.8
MARS	90.3	86.2	79.8	80.9	90.7	80.3	79.5	82.3	80.0	93.6	78.0	88.6	71.3	78.6	70.3
ItClust	83.4	52.3	72.7	88.5	48.9	77.1	80.9	56.4	69.2	84.5	80.7	84.1	79.8	50.7	70.4
scNym	97.7	71.9	84.7	90.2	52.2	82.8	88.2	55.5	63.9	97.9	40.0	52.3	98.7	66.5	75.9
scArches	89.2	58.0	80.3	47.3	66.8	52.5	89.3	52.8	80.9	61.5	52.2	52.7	87.6	52.9	78.2
scGAD	96.3	82.6	93.8	96.6	82.6	90.3	96.5	83.2	93.7	93.6	86.0	91.3	98.8	80.5	92.4
	Vento Smart-seq2 (R) Vento 10x (T)			Plasschaert (R) [58] Montoro 10x (T) [59]			Mammary Smart-seq2 (R) [44] Mammary 10x (T) [44]			Haber largecell (R) [60] Haber region (T) [60]			Haber region (R) Haber largecell (T)
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	88.4	98.4	90.9	67.9	74.6	68.3	94.0	89.3	91.2	43.9	60.9	53.0	85.3	80.8	71.0
scNAME	86.5	98.2	92.8	95.1	90.2	96.0	93.7	99.0	96.8	46.0	62.6	54.3	89.1	80.9	71.6
scCNC	83.4	47.1	43.7	79.7	73.1	73.0	92.4	65.5	76.2	62.7	69.4	55.9	75.7	50.4	51.6
MARS	94.5	78.6	83.8	88.6	94.5	89.1	81.5	97.5	86.9	57.1	75.1	68.2	83.8	64.1	67.1
ItClust	64.3	75.0	58.2	90.1	75.1	83.2	36.8	70.5	67.2	53.4	58.2	56.4	6.2	64.5	53.6
scNym	98.1	70.4	80.6	96.1	77.7	83.1	95.1	48.6	49.8	95.8	44.4	51.2	84.2	53.7	53.0
scArches	83.4	66.8	75.2	91.4	67.4	85.3	62.0	55.5	59.0	72.3	51.7	59.6	71.9	45.4	50.4
scGAD	98.4	99.2	97.4	93.6	94.0	96.2	94.1	99.1	97.1	86.2	97.4	93.6	89.8	81.5	72.2

	Enge (R) [52] Baron_human (T) [53]			Lawlor (R) [54] Baron_human (T)			Muraro (R) [55] Baron_human (T)			Xin (R) [56] Baron_human (T)			Vento _10x (R) [57] Vento_Smart-seq2 (T) [57]
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	81.6	81.5	81.6	81.3	80.3	81.2	81.6	81.5	81.6	75.1	84.2	81.3	81.7	79.0	81.5
scNAME	81.0	81.9	81.2	80.7	79.4	79.9	95.8	71.4	91.2	73.6	85.3	77.7	87.4	80.3	86.0
scCNC	47.6	38.5	38.8	54.0	43.9	40.9	75.0	40.8	61.1	46.6	54.7	36.5	92.1	63.4	84.8
MARS	90.3	86.2	79.8	80.9	90.7	80.3	79.5	82.3	80.0	93.6	78.0	88.6	71.3	78.6	70.3
ItClust	83.4	52.3	72.7	88.5	48.9	77.1	80.9	56.4	69.2	84.5	80.7	84.1	79.8	50.7	70.4
scNym	97.7	71.9	84.7	90.2	52.2	82.8	88.2	55.5	63.9	97.9	40.0	52.3	98.7	66.5	75.9
scArches	89.2	58.0	80.3	47.3	66.8	52.5	89.3	52.8	80.9	61.5	52.2	52.7	87.6	52.9	78.2
scGAD	96.3	82.6	93.8	96.6	82.6	90.3	96.5	83.2	93.7	93.6	86.0	91.3	98.8	80.5	92.4
	Vento Smart-seq2 (R) Vento 10x (T)			Plasschaert (R) [58] Montoro 10x (T) [59]			Mammary Smart-seq2 (R) [44] Mammary 10x (T) [44]			Haber largecell (R) [60] Haber region (T) [60]			Haber region (R) Haber largecell (T)
	Seen	Novel	Overall	Seen	Novel	Overall	Seen	Novel	Overall	seen	novel	overall	seen	novel	overall
scziDesk	88.4	98.4	90.9	67.9	74.6	68.3	94.0	89.3	91.2	43.9	60.9	53.0	85.3	80.8	71.0
scNAME	86.5	98.2	92.8	95.1	90.2	96.0	93.7	99.0	96.8	46.0	62.6	54.3	89.1	80.9	71.6
scCNC	83.4	47.1	43.7	79.7	73.1	73.0	92.4	65.5	76.2	62.7	69.4	55.9	75.7	50.4	51.6
MARS	94.5	78.6	83.8	88.6	94.5	89.1	81.5	97.5	86.9	57.1	75.1	68.2	83.8	64.1	67.1
ItClust	64.3	75.0	58.2	90.1	75.1	83.2	36.8	70.5	67.2	53.4	58.2	56.4	6.2	64.5	53.6
scNym	98.1	70.4	80.6	96.1	77.7	83.1	95.1	48.6	49.8	95.8	44.4	51.2	84.2	53.7	53.0
scArches	83.4	66.8	75.2	91.4	67.4	85.3	62.0	55.5	59.0	72.3	51.7	59.6	71.9	45.4	50.4
scGAD	98.4	99.2	97.4	93.6	94.0	96.2	94.1	99.1	97.1	86.2	97.4	93.6	89.8	81.5	72.2

Validity of the |$|\mathcal{C}_{t}|$| value estimation method

The |$|\mathcal{C}_{t}|$| value represents the cell type number in target data, and its estimation plays an important role in discovering novel cell types. To verify the validity of our estimation method, we conduct experiments on Quake 10x and Quake Smart-seq2 datasets, which have 36 and 45 cell types, respectively. We study the case where the increment of |$|\mathcal{C}_{t}|$| varies in the range of |$[-12,-6,0,6,12]$|⁠, and increment=0 means that the estimation of |$|\mathcal{C}_{t}|$| gets the true value. The results on these two datasets are shown in Table ??. We see that the accuracy rises to maximum value at increment=0, regardless of the datasets, indicating the validity of our estimation method.

Table 4

Clustering accuracy on reference data when vary |$|\mathcal{C}_{t}|$|-value on Quake 10x and Quake Smart-seq2 datasets.

increment	−12	−6	0	6	12
Quake 10x	92.2	94.7	95.1	94.9	93.6
Quake Smart-seq2	78.1	90.5	91.6	91.2	90.1

Table 4

Clustering accuracy on reference data when vary |$|\mathcal{C}_{t}|$|-value on Quake 10x and Quake Smart-seq2 datasets.

increment	−12	−6	0	6	12
Quake 10x	92.2	94.7	95.1	94.9	93.6
Quake Smart-seq2	78.1	90.5	91.6	91.2	90.1

Robustness of scGAD

We explore the robustness of scGAD by changing the value of some variables. Specifically, for the reason that our task is transductive learning task rather than inductive learning task, we use the reference and target dataset as training dataset and use the target dataset as testing dataset. Therefore, the training and testing dataset are all determined by the real dataset we used.

Since novel cell type number |$|\mathcal{C}_{n}|$| determines the difficulty of clustering novel cells, it is imperative to explore the impact of |$|\mathcal{C}_{n}|$|⁠. We chose to conduct experiments on Quake 10x and Quake Smart-seq2 datasets, which have the most cell types. Here, |$|\mathcal{C}_{n}|$| varies in the range of |$[4,11,18,25,32]$| for Quake 10x and |$[5,14,23,32,41]$| for Quake Smart-seq2. In other words, the novel cell type number was varied in the test dataset (target dataset). Specifically, the total number of cell types in the real dataset is invariable, only the ratio of the number of novel cell types and seen cell types is changed. For ease of illustration, the results are shown in Figure 13A and B in the form of line graphs. No matter what value |$|\mathcal{C}_{n}|$| takes, it is easy to see that scGAD always performs better than other baselines, validating its superiority. Moreover, the polyline of scGAD is smoother, which demonstrates its robustness. The overall accuracy of scNym, scArches and scCNC drops catastrophically with increasing |$|\mathcal{C}_{n}|$|⁠, which is reasonable because they focus more on annotating cells of seen cell types, and increasing |$|\mathcal{C}_{n}|$| will give them more interference. The result for ItClust rises dramatically on Quake 10x and drops significantly on Quake Smart-seq2, indicating its instability. MARS, scziDesk and scNAME are relatively stable for various |$|\mathcal{C}_{n}|$|⁠, but their overall accuracy is lower than that of scGAD. These results show that scGAD is more robust and stable than other baselines.

Figure 13

Accuracy on all cell types. (A, B) Changing novel cell type number in Quake 10x and Quake Smart-seq2 datasets, respectively. (C, D) Changing labeled ratio in Quake 10x and Quake Smart-seq2 datasets, respectively.

Since the ratio of labeled data determines how much information can be provided from the reference dataset, we also explored its impact by again conducting experiments on Quake 10x and Quake Smart-seq2 datasets. We change the labeled ratio to explore the impact of the ratio of labeled data in seen cell types on scGAD, so the novel cell type number is invariable in the test dataset (target dataset) and only the ratio of labeled data in seen cell types is changed. Figures 13C and D show the line graphs of overall accuracy with change in the ratio of labeled data, which varies in the range of |$[0.1,0.3,0.5,0.7,0.9]$|⁠. We find that scGAD still achieves consistently better results than the other baselines and can maintain its excellent performance without being affected by the ratio of labeled data. However, the other methods are affected by the varied ratio of labeled data and, hence, tend toward providing suboptimal results. In conclusion, scGAD can provide reliable and remarkable performance, even with a few labeled data. Because of limited space, we put other experimental results in Supplementary Figure 3, and the conclusion is the same. Furthermore, for a better illustration, we provide detailed tables of the results about robustness analysis on the tested datasets. Supplementary Table 12 shows the results for the three kinds of accuracy, i.e. annotation accuracy, clustering accuracy and overall accuracy, as novel cell type number |$|\mathcal{C}_{n}|$| varies in the Quake 10x and Quake Smart-seq2 datasets. Irrespective of accuracy, this table shows that the performance of scGAD is always stable and excellent without being affected by changes in |$|\mathcal{C}_{n}|$|⁠. In contrast, the three kinds of accuracy of the other methods all show relatively large fluctuations, which validate the superiority and robustness of scGAD. Moreover, Supplementary Table 13 shows the results for the three kinds of accuracy by varying the ratio of labeled data, and, again, we can get the same conclusion. We also conduct experiments on the sensitivity of scGAD to hyperparameters |$g$| and |$\alpha $|⁠, and the specific experimental results can be found in Supplementary Figure 3.

Effect of |$\mathcal{L}_{ssl}$| and |$\mathcal{L}_{pro}$|

In this section, we perform ablation studies on ten real datasets to further investigate the effect of introducing anchor-guided self-supervised learning objective |$\mathcal{L}_{ssl}$| and prototypical self-supervised learning paradigm |$\mathcal{L}_{pro}$| in scGAD on clustering and annotation. The scatter plots in Figure 14A show the corresponding overall accuracy of scGAD with and without the anchor-guided self-supervised learning objective. When |$\mathcal{L}_{ssl}$| is dropped, we can see that the performance of scGAD decreases significantly for all datasets, which fully verifies the roles of |$\mathcal{L}_{ssl}$| in the whole framework. This improvement is especially visible for the datasets ‘Park’, ‘Quake 10x’ and ‘Zeisel’. Moreover, in Figure 14B, we exhibit the overall accuracy with and without the prototypical self-supervised learning paradigm. It can be clearly seen that |$\mathcal{L}_{pro}$| has a positive effect on annotation and clustering performance on all datasets, proving that this improvement is very meaningful. Moreover, we also perform ablation study on 10 real datasets by eliminating both |$\mathcal{L}_{ssl}$| and |$\mathcal{L}_{pro}$| to demonstrate the importance of them. The experimental results are shown in the Supplementary Table 15. In conclusion, the results above fully show that our two improvements to the training function are essential, which is consistent with our expectations. First, the anchor-guided self-supervised learning objective propagates the known cell types label message from labeled data to unlabeled data and aggregates the knowledge of novel cell types within unlabeled data itself. Second, the prototypical self-supervised learning paradigm enhances inter-class separation and intra-class compactness for both seen and novel cell types effectively to make the boundaries between different cell types clearer.

$(A, B) Ablation study. Comparing the accuracy on all cell types with or without $\mathcal{L}_{ssl}$ and $\mathcal{L}_{pro}$, respectively. (C) Comparing the compactness within cluster and the separability between clusters with or without $\mathcal{L}_{ssl}$ and $\mathcal{L}_{pro}$, respectively. (D) Comparison of various algorithms in terms of running time.$

Figure 14

(A, B) Ablation study. Comparing the accuracy on all cell types with or without |$\mathcal{L}_{ssl}$| and |$\mathcal{L}_{pro}$|⁠, respectively. (C) Comparing the compactness within cluster and the separability between clusters with or without |$\mathcal{L}_{ssl}$| and |$\mathcal{L}_{pro}$|⁠, respectively. (D) Comparison of various algorithms in terms of running time.

Moreover, as stated above, the prototypical self-supervised learning paradigm |$\mathcal{L}_{pro}$| can improve compactness within cluster and the separability between clusters in the embedding space. To further confirm this claim, we introduce the Silhouette score [61] to compare clustering performance with or without |$\mathcal{L}_{pro}$|⁠, and the result is shown in Figure 14C where the Silhouette score of scGAD with |$\mathcal{L}_{pro}$| is significantly higher than without |$\mathcal{L}_{pro}$|⁠. Therefore, we can conclude that |$\mathcal{L}_{pro}$| plays an indispensable role in clustering performance, and it is vitally important for us to add it in scGAD.

Furthermore, since scGAD is proposed to address a more challenging task, it is natural to test its efficiency, and to validate the efficiency of algorithms. We compared scGAD with seven other methods on four datasets, ‘Cao’, ‘Quake Smart-seq2’, ‘Quake 10x’ and ‘Zeisel’, and the result is shown in Figure 14D. Although MARS, scNym and Itclust are faster than scGAD, it can be concluded from the figure that scGAD is the fastest of the remaining five methods, demonstrating that scGAD is efficient for this new task in comparison with the other competitive annotation and clustering methods.

Discussion

Based on the fact that existing methods can only provide uniform ‘unassigned’ labels rather than specific cluster labels for novel cell types, which is not conducive to subsequent downstream analysis, we propose a practical and challenging annotation scenario named generalized cell type annotation and discovery and propose a novel end-to-end framework called scGAD for it.

In the previous section, we have shown the accuracy of scGAD on both simulated and real data, validating its superiority on this new annotation setting. However, to further demonstrate the effectiveness of scGAD for clustering cells of novel cell types, we perform marker gene identification on the basis of cluster labels, which demonstrates the effectiveness of scGAD in giving cluster labels through further biological analysis. Specifically, gene expression matrix and predicted labels are utilized to find differentially expressed genes (DEGs) and, hence, identify marker genes in each cluster. We first compare the overlap among the top 50 DEGs in each cluster, as obtained by four annotation methods, with those in the gold standard cell types and the similarity between them is calculated by dividing the overlap by 50.

We conduct further experiments in two cases: intra-data and cross-data. For intra-data experiments without batch effect, we conduct marker gene identification for the ‘chen’ dataset, and the resulting heatmap is shown in Figure 15. Each row represents an actual cell type, which is mapped to a novel cluster, regarded as gold standard (macrophage, microglial cell, oligodendrocyte, oligodendrocyte precursor cell and tanycyte, respectively), whereas each column represents a cluster acquired by a certain method. It is easy to see that scGAD can accurately and uniquely associate each novel cell cluster with a certain cell type by calculating the similarity, which demonstrates the effectiveness of scGAD for subdividing novel cell types and facilitating subsequent downstream analysis. However, MARS can only assign specific cell types to individual clusters, and the overall similarity is low, which means that MARS cannot cluster novel cell types accurately. No cluster match is found for macrophage by scNym, and scNym cannot assign a cell type to cluster 4. Although scArches can align most clusters with specific cell types, it fails to assign a cell type to cluster 5, and no cluster match is found for macrophage. Therefore, it is easy to conclude that only scGAD can cluster cells of novel cell types clearly and performs the annotation task we propose well. In addition, we conduct marker gene identification on cross-data, with the Vento Smart-seq2 dataset, as reference data, and the Vento 10x dataset, as target data. Similarly, each row represents an actual cell type, and the actual cell types are natural killer cells, placental villous trophoblasts, stromal cells and trophoblast cells, respectively, which are cell types not present in the reference dataset, whereas each column represents a cluster acquired by a certain method. From Figure 16, we can see that scGAD can assign one cell type to a unique cluster with highest similarity calculated by the most important top 50 DEGs, demonstrating the superiority of scGAD in discovering and clustering novel cell types. However, MARS provides a confusing result, aligning only a few clusters with cell types. Although scNym also can give each cluster a unique cell type by similarity, the similarity between trophoblast cells and cluster 4 is relatively low, indicating that the cluster result by scNym is not accurate. scArches fails to assign a cell type to cluster 1 and cluster 2, and no cluster match was found for trophoblast cells and placental villous trophoblasts by scGAD. In conclusion, scGAD can discover and cluster novel cell types accurately, demonstrating that scGAD can provide excellent results for subsequent downstream analysis.

Figure 15

Marker gene analysis in the ‘chen’ dataset. Overlap of top 50 DEGs in clusters detected by four methods with gold standard cell types.

Figure 16

Marker gene analysis in the ‘Vento Smart-seq2’ dataset as reference data and ‘Vento 10x’ dataset as target data. Overlap of top 50 DEGs in clusters detected by four methods with gold standard cell types.

Limitations of the present study

In the previous discussion, we never addressed the deficiency of our work. In fact, there are some limitations in our study. First, we consider an open-set scenario in our study, that is |$\mathcal{C}_{r} \subset \mathcal{C}_{t}$|⁠. However, there exist another situation in the real world, that is, there is an overlap between the reference and target dataset, which can be expressed as |$\mathcal{C}_{r} \cap \mathcal{C}_{t}\neq \emptyset $|⁠, |$\mathcal{C}_{r} \notin{C}_{t}$|⁠. We did not explore the effect of this overlap situation on scGAD and we will consider extending our model to accommodate this scenario later. Second, we have a basic assumption when dealing with batch effects, that is, even though the target data possess the cell type shift, cells from the same cell types are still expected to be geometrically and semantically close to each other based on the manifold assumption. Although this assumption holds in most cases, it will falls apart when batch effects are too large. In this case, Confidential prototypical self-supervised learning technique may be not so effective to remove batch effect. Third, from the results of simulation data, it can be seen that although our method beat other representation methods with clear margins in any experimental setting, the accuracy of our method still fluctuates to a certain degree with the change of the number of novel cell types and the ratio of labeled data in imbalance setting, which needs to be further studied and improved in the future.

Conclusion

In this paper, considering the limitations of the existing annotation methods, we propose a new annotation task. While annotating the cells of seen cell types in the target data, we give cluster labels to the cells of novel cell types, rather than ‘unassigned’. To achieve this goal, we propose a new end-to-end deep learning annotation method named scGAD. First, to alleviate the problem of unbalanced predictions between seen and novel cell types caused by the lack of label supervision for novel cell types, anchor pairs are proposed to sense the intrinsic relationship between the two kinds of cell types so that discriminative states between seen cell types and novel cell types can be balanced. Second, a cell-level, anchor-guided, self-supervised learning paradigm is introduced to transfer label information from labeled data to unlabeled data and aggregate information for novel cell types in unlabeled data. Third, to enhance intra-class compactness and inter-class separability, a prototypical self-supervised learning paradigm is performed to implicitly exploit the consensus category structure of the labeled and unlabeled data.

We apply scGAD on both simulated and real datasets, respectively, to evaluate its performance. On simulated datasets, scGAD outperforms other state-of-the-art methods, including MARS, scNym and scArches, which not only proves the excellent performance of scGAD, but also illustrates that the two-step approach of annotation first and then clustering does not work well for our proposed new annotation goal. In addition, we found that scGAD shows excellent stability and robustness against dramatic changes in the number of seen cell types and the ratio of labeled data. On real datasets, scGAD also has outstanding performance, both in the annotation of seen cell types and clustering of novel cell types. Using real data from different sequencing platforms, we find that scGAD outperforms other competitive methods in different biological scenarios, while eliminating batch effects. From the experimental results, it can be seen that scGAD achieves the newly proposed annotation goal effectively and is more practical than existing state-of-the-art methods. To better verify the accuracy of ScGAD for novel cell-type clustering results and their biological significance, we also conducted marker genes identification based on clustering results.

Future implications of scGAD

Compared with other annotation methods, scGAD can provides the fine-grained semantic knowledge of novel cell types absent from the reference data, which is of significant value for subsequent functional analysis. Since scRNA-seq data can captures only a static snapshot at a point in time, not information that changes over time, it is hard to reveal the trajectory of cell development along important biological process. In the future, scGAD can be applied to time series data to help explore trajectory path of cell development. Furthermore, considering the rapid growth of high-throughput single-cell multi-omics sequencing technologies, we plan to adapt this new annotation task and our proposed framework to multiomics data study in the future.

Key Points

We propose a new, realistic and challenging task called generalized cell type annotation and discovery in the single-cell annotation field. To effectively tackle this task, we propose a novel end-to-end algorithmic framework called scGAD.
We introduce an anchor-based self-supervised learning module with a confidential prototypical self-supervised learning paradigm to achieve seen cell type annotation and novel cell type clustering simultaneously.
We propose a bidirectional dual alignment mechanism between embedding space and prediction space to better handle batch effect and cell type shift.
We propose an easy, yet effective, solution to the challenging problem of estimating the total cell type number in target data.
We design the comprehensive comparison baselines and evaluation benchmarks to validate the practicality of scGAD, and the experimental results on simulation and real datasets demonstrate that scGAD outperforms seven state-of-the-art annotation and clustering methods under the three kinds of accuracy. Furthermore, we also implement marker gene identification to validate the biological significance of scGAD.

Funding

National Key Research and Development Program of China (2021YFF1200902); National Natural Science Foundation of China (32270689, 12126305).

Author Biographies

Yuyao Zhai, a doctoral student at Peking University’s School of Mathematical Sciences.

Liang Chen, an algorithm researcher in Huawei.

References

1.

Ding

J

,

Adiconis

X

,

Simmons

SK

, et al.

Systematic comparison of single-cell and single-nucleus rna-sequencing methods

.

Nat Biotechnol

2020

;

38

(

6

):

737

–

46

.

2.

Mereu

E

,

Lafzi

A

,

Moutinho

C

, et al.

Benchmarking single-cell rna-sequencing protocols for cell atlas projects

.

Nat Biotechnol

2020

;

38

(

6

):

747

–

55

.

3.

Kiselev

VY

,

Andrews

TS

,

Hemberg

M

.

Challenges in unsupervised clustering of single-cell rna-seq data

.

Nat Rev Genet

2019

;

20

(

5

):

273

–

82

.

4.

Soneson

C

,

Robinson

MD

.

Bias, robustness and scalability in single-cell differential expression analysis

.

Nat Methods

2018

;

15

(

4

):

255

–

61

.

5.

Vieth

B

,

Parekh

S

,

Ziegenhain

C

, et al.

A systematic evaluation of single cell rna-seq analysis pipelines

.

Nat Commun

2019

;

10

(

1

):

1

–

11

.

6.

Shao

X

,

Liao

J

,

Xiaoyan

L

, et al.

Sccatch: automatic annotation on cell types of clusters from single-cell rna sequencing data

.

Iscience

2020

;

23

(

3

):100882.

7.

Lazić

P

.

Cellmatch: combining two unit cells into a common supercell with minimal strain

.

Comput Phys Commun

2015

;

197

:

324

–

34

.

8.

Cao

Y

,

Wang

X

,

Peng

G

.

Scsa: a cell type annotation tool for single-cell rna-seq data

.

Front Genet

2020

;

11

:

490

.

9.

Zhang

X

,

Lan

Y

,

Jinyuan

X

, et al.

Cellmarker: a manually curated resource of cell markers in human and mouse

.

Nucleic Acids Res

2019

;

47

(

D1

):

D721

–

8

.

10.

Yuan

H

,

Yan

M

,

Zhang

G

, et al.

Cancersea: a cancer single-cell state atlas

.

Nucleic Acids Res

2019

;

47

(

D1

):

D900

–

8

.

11.

Pliner

HA

,

Shendure

J

,

Trapnell

C

.

Supervised classification enables rapid annotation of cell atlases

.

Nat Methods

2019

;

16

(

10

):

983

–

6

.

12.

Abdelaal

T

,

Michielsen

L

,

Cats

D

, et al.

A comparison of automatic cell identification methods for single-cell rna sequencing data

.

Genome Biol

2019

;

20

(

1

):

1

–

19

.

13.

Qi

R

,

Ma

A

,

Ma

Q

, et al.

Clustering and classification methods for single-cell rna-sequencing data

.

Brief Bioinform

2020

;

21

(

4

):

1196

–

208

.

14.

Cao

Z-J

,

Wei

L

,

Shen

L

, et al.

Cell blast: searching large-scale scrna-seq databases via unbiased cell embedding

.

BioRxiv

.

2019

;

587360

.

15.

Aran

D

,

Looney

AP

,

Liu

L

, et al.

Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage

.

Nat Immunol

2019

;

20

(

2

):

163

–

72

.

16.

Hou

R

,

Denisenko

E

,

Forrest

ARR

.

Scmatch: a single-cell gene expression profile annotation tool using reference datasets

.

Bioinformatics

2019

;

35

(

22

):

4688

–

95

.

17.

Chen

L

,

He

Q

,

Zhai

Y

, et al.

Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaptation

.

Bioinformatics

2021

;

37

(

6

):

775

–

84

.

18.

Chenling

X

,

Lopez

R

,

Mehlman

E

, et al.

Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models

.

Mol Syst Biol

2021

;

17

(

1

):e9620.

19.

Lotfollahi

M

,

Naghipourfar

M

,

Luecken

MD

, et al.

Mapping single-cell data to reference atlases by transfer learning

.

Nat Biotechnol

2022

;

40

(

1

):

121

–

30

.

20.

De Kanter

JK

,

Lijnzaad

P

,

Candelli

T

, et al.

Chetah: a selective, hierarchical cell type identification method for single-cell RNA sequencing

.

Nucleic Acids Res

2019

;

47

(

16

):

e95

–

5

.

21.

Alquicira-Hernandez

J

,

Sathe

A

,

Ji

HP

, et al.

Scpred: accurate supervised method for cell-type classification from single-cell RNA-seq data

.

Genome Biol

2019

;

20

(

1

):

1

–

17

.

22.

Eraslan

G

,

Avsec

ž

,

Gagneur

J

, et al.

Deep learning: new computational modelling techniques for genomics

.

Nat Rev Genet

2019

;

20

(

7

):

389

–

403

.

23.

Chen

L

,

Zhai

Y

,

He

Q

, et al.

Integrating deep supervised, self-supervised and unsupervised learning for single-cell RNA-seq clustering and annotation

.

Genes

2020

;

11

(

7

):

792

.

24.

Brbić

M

,

Zitnik

M

,

Wang

S

, et al.

Mars: discovering novel cell types across heterogeneous single-cell experiments

.

Nat Methods

2020

;

17

(

12

):

1200

–

6

.

25.

Kimmel

JC

,

Kelley

DR

.

scnym: semi-supervised adversarial neural networks for single cell classification

bioRxiv

,

2020

.

26.

Shaham

U

,

Stanton

KP

,

Zhao

J

, et al.

Removal of batch effects using distribution-matching residual networks

.

Bioinformatics

2017

;

33

(

16

):

2539

–

46

.

27.

Wang

T

,

Johnson

TS

,

Shao

W

, et al.

Bermuda: a novel deep transfer learning method for single-cell rna sequencing batch correction reveals hidden high-resolution cellular subtypes

.

Genome Biol

2019

;

20

(

1

):

1

–

15

.

28.

Lakkis

J

,

Wang

D

,

Zhang

Y

, et al.

A joint deep learning model enables simultaneous batch effect correction, denoising, and clustering in single-cell transcriptomics

.

Genome Res

2021

;

31

(

10

):

1753

–

66

.

29.

Eraslan

G

,

Simon

LM

,

Mircea

M

, et al.

Single-cell rna-seq denoising using a deep count autoencoder

.

Nat Commun

2019

;

10

(

1

):

1

–

14

.

30.

Wan

H

,

Chen

L

,

Deng

M

.

Scname: neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data

.

Bioinformatics

2022

;

38

(

6

):

1575

–

83

.

31.

He

K

,

Fan

H

,

Wu

Y

, et al.

Momentum contrast for unsupervised visual representation learning

. In:

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

, pp.

9729

–

38

,

2020

.

32.

Chen

T

,

Kornblith

S

,

Norouzi

M

,

Hinton

G

.

A simple framework for contrastive learning of visual representations

. In

International Conference on Machine Learning

, pp.

1597

–

607

.

PMLR

,

2020

.

33.

Brent

RP

.

Algorithms for Minimization Without Derivatives

.

Courier Corporation

,

2013

.

Google Preview

34.

Chen

L

,

Wang

W

,

Zhai

Y

, et al.

Deep soft k-means clustering with self-training for single-cell rna sequence data

.

NAR Genomics Bioinform

2020

;

2

(

2

):

lqaa039

.

35.

Wang

H-Y

,

Zhao

J-P

,

Zheng

C-H

, et al.

Sccnc: a method based on capsule network for clustering scrna-seq data

.

Bioinformatics

2022

;

38

(

15

):

3703

–

9

.

36.

Jian

H

,

Li

X

,

Gang

H

, et al.

Iterative transfer learning with neural network for clustering and cell type classification in single-cell rna-seq analysis

.

Nat Mach Intell

2020

;

2

(

10

):

607

–

18

.

37.

Kuhn

HW

.

The hungarian method for the assignment problem

.

Naval Res Logistics Q

1955

;

2

(

1-2

):

83

–

97

.

38.

Cao

ZJ

,

Wei

L

Lu

S

, et al.

Searching large-scale scrna-seq databases via unbiased cell embedding with cell blast.

Nat Commun

,

11

(

1

):

1

–

13

,

2020

.

39.

Wolf

F.A.

,

Angerer

P

,

Theis

F.J.

Scanpy: large-scale single-cell gene expression data analysis.

.

Genome Biol

2018

;

19

(

1

):

1

–

5

.

40.

Zappia

L

,

Phipson

B

,

Oshlack

A

.

Splatter: simulation of single-cell rna sequencing data

.

Genome Biol

2017

;

18

(

1

):

1

–

15

.

41.

Cao

J

,

Packer

JS

,

Ramani

V

, et al.

Comprehensive single-cell transcriptional profiling of a multicellular organism

.

Science

2017

;

357

(

6352

):

661

–

7

.

42.

Hochane

M

,

Berg van den

PR

,

Fan

X

, et al.

Single-cell transcriptomics reveals gene expression dynamics of human fetal kidney development

.

PLoS Biol

2019

;

17

(

2

):

e3000152

.

43.

Park

J

,

Shrestha

R

,

Qiu

C

, et al.

Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease

.

Science

2018

;

360

(

6390

):

758

–

63

.

44.

Tabula Muris Consortium

,

Overall coordination

,

Logistical coordination

, et al.

Single-cell transcriptomics of 20 mouse organs creates a tabula muris

.

Nature

2018

;

562

(

7727

):

367

–

72

.

45.

Wagner

DE

,

Weinreb

C

,

Collins

ZM

, et al.

Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo

.

Science

2018

;

360

(

6392

):

981

–

7

.

46.

Zeisel

A

,

Hochgerner

H

,

Lönnerberg

P

, et al.

Molecular architecture of the mouse nervous system

.

Cell

2018

;

174

(

4

):

999

–

1014

.

47.

Zheng

GXY

,

Terry

JM

,

Belgrader

P

, et al.

Massively parallel digital transcriptional profiling of single cells

.

Nat Commun

2017

;

8

(

1

):

1

–

12

.

48.

Chen

R

,

Xiaoji

W

,

Jiang

L

, et al.

Single-cell RNA-seq reveals hypothalamic cell diversity

.

Cell Rep

2017

;

18

(

13

):

3227

–

41

.

49.

Guo

J

,

Grow

EJ

,

Mlcochova

H

, et al.

The adult human testis transcriptional cell atlas

.

Cell Res

2018

;

28

(

12

):

1141

–

57

.

50.

Domínguez Conde

C

,

Xu

C

,

Jarvis

LB

, et al.

Cross-tissue immune cell analysis reveals tissue-specific features in humans

.

Science

2022

;

376

(

6594

):

eabl5197

.

51.

Demšar

J

Statistical comparisons of classifiers over multiple data sets

.

J Mach Learn Res

2006

;

7

:

1

–

30

.