SMURF: embedding single-cell RNA-seq data with matrix factorization preserving self-consistency

Pu, Juhua; Wang, Bingchen; Liu, Xingwu; Chen, Lingxi; Li, Shuai Cheng

doi:10.1093/bib/bbad026

Abstract

The advance in single-cell RNA-sequencing (scRNA-seq) sheds light on cell-specific transcriptomic studies of cell developments, complex diseases and cancers. Nevertheless, scRNA-seq techniques suffer from ‘dropout’ events, and imputation tools are proposed to address the sparsity. Here, rather than imputation, we propose a tool, SMURF, to extract the low-dimensional embeddings from cells and genes utilizing matrix factorization with a mixture of Poisson-Gamma divergent as objective while preserving self-consistency. SMURF exhibits feasible cell subpopulation discovery efficacy with obtained cell embeddings on replicated in silico and eight web lab scRNA datasets with ground truth cell types. Furthermore, SMURF can reduce the cell embedding to a 1D-oval space to recover the time course of cell cycle. SMURF can also serve as an imputation tool; the in silico data assessment shows that SMURF parades the most robust gene expression recovery power with low root mean square error and high Pearson correlation. Moreover, SMURF recovers the gene distribution for the WM989 Drop-seq data. SMURF is available at https://github.com/deepomicslab/SMURF.

scRNA-seq, matrix factorization, imputation, embedding, cell cycle

Introduction

The emergence of high-throughput single-cell RNA-sequencing (scRNA-seq) flourishes the cell-specific transcriptomes applications [1–3], including cell cycles [4–6], complex diseases [7–9] and cancers [6, 10–14]. scRNA-seq allows us to obtain a gene-cell count matrix, where the rows represent the genes and the columns signify the cells. Nevertheless, this technique has a drawback—it could only catch a small portion of genes during the sequencing process, and such missing entries will be recorded as zero in the count matrix. We call this phenomenon the ‘dropout’ event and such a matrix the zero-inflated matrix [15]. As such, a zero value in a zero-inflated count matrix contains two possibilities—the first is that the gene expression in a cell is in fact zero, and the second is that it is a fake zero caused by ‘dropout’ events. The sparse gene-cell expression profile might influence the accuracy of downstream analyses and mask potential biological insights [16]. Thus, we need to determine the two possibilities concerning zero entries and amend the missing values from false zero to a more statistically realistic value. In computational biology, we termed this data wrangling process as ‘imputation’.

With the intent of making it more appropriate and valuable for a variety of downstream purposes such as clustering [17], cell type recognition [17], dimension reduction [18], differential gene expression analysis [19], identification of cell-specific genes [19] and reconstruction of differentiation trajectory [20] on zero-inflated single-cell gene expression data, several scRNA-seq imputation tools have been developed. A systematic benchmarking of 18 scRNA-seq imputation methods suggested that MAGIC [21], kNN-smoothing [22] and SAVER [23] were state-of-the-art tools [24]. MAGIC conducts imputation utilizing information across similar cells with data diffusion [21]. kNN-smoothing imputes the missing entries by collapsing gene-specific UMI counts from the k nearest neighbors [22]. SAVER estimates the gene expression level by borrowing information across similar genes utilizing the Bayesian approach [23]. Also, in a previous work [25], we introduced a self-consistent method I-Impute and demonstrated its robustness compared with SAVER. I-Impute optimizes continuous similarities and dropout probabilities iteratively to reach a self-consistent state. However, these tools focus on imputation only. Directly applying the imputed expression profile to downstream tasks like cell subpopulation detection and cell-cycle inference might fail due to the curse of dimensionality [26, 27].

In a previous work, we displayed matrix-factorization-based model could conduct robust imputation ability alongside cancer subtype preserved dimension reduction in bulk cancer omics data [28]. Thus, in this study, we develop SMURF to impute scRNA data based on matrix factorization. SMURF leverages a mixture of Poisson-Gamma distribution to deal with the zero-inflated problem. SMURF also generates cell and gene embeddings in their latent space that facilities cell type identification and visualization. We first access SMURF’s gene expression recovery ability with the state-of-the-art imputation tools I-Impute, SAVER, MAGIC and kNN-smoothing. In in silico data assessment, SMURF models parade the most robust gene expression recovery power with low root mean square error (RMSE) and high Pearson correlation. Moreover, SMURF’s cell embedding successfully preserves the underlying cell-subpopulation structures against the full-dimensional imputed gene-cell expression data for different sparsity. As for the wet lab dataset evaluation, we demonstrate that SMURF recovered the genes’ distribution from WM989 Drop-seq data. We further collect eight wet lab cell line datasets to benchmark the tools. SMURF exhibits feasible cell subpopulation discovery efficacy on all the eight cell line mixtures. Furthermore, we illustrate that SMURF could embed the cells into a 1D-oval latent space and preserve the time course of the cell cycle.

Materials and methods

Algorithm of SMURF

Problem formulation

Let |$\boldsymbol{X}\in \mathbb{R}^{M \times N}$| be the count gene-cell matrix, where |$M$| and |$N$| are the number of genes and cells, respectively. Here, we use a library size normalization defined as the library size divided by the mean library size across all cells, but SMURF can also be applied to prenormalized dataset using different normalization method. SMURF assumes that the element |$x_{ij}$| in |$\boldsymbol{X}$| follows a negative binomial distribution, which can be decomposed into a mixture of Poisson and Gamma distribution [23],

$$ \begin{align} { {x}_{ij} \sim\textrm{NB}(\alpha_{i j},\beta_{i j}) \sim\textrm{Poisson}(\textrm{Gamma}(\alpha_{i j},\beta_{i j})), } \end{align} $$

(1)

where |$\alpha _{i j}$| is the shape parameter and |$\beta _{i j}$| is the scale parameter of Gamma distribution. These two parameters actually reflect the mean |$\mu _{i j}$| and variance |$\nu _{i j}$| of the distribution,

$$ \begin{align} \begin{split} \mu_{i j}=\frac{\alpha_{i j}}{\beta_{i j}}, \nu_{i j}=\frac{\alpha_{i j}}{\beta_{i j}}^{2}. \end{split} \end{align} $$

(2)

Our goal is to make a reasonable estimate of the prior mean |$\boldsymbol{{U}}$| and prior variance |$\boldsymbol{{V}}$| through matrix factorization based on the negative binomial distribution and use the calculated prior mean |$\boldsymbol{{U}}$| as the final estimate of the gene expression level of this model (Supplementary Method—Algorithm 1).

Model initialization based on self-consistency

In order to improve the accuracy of the model, it is necessary to initialize the parameters. The initialization process not only needs to determine the initial values of the parameters, but also needs to disassemble the parameters into two matrices to facilitate subsequent matrix decomposition.

First, we initialize the mean |$\boldsymbol{{U}}$| based on self-consistency. As stated in I-Impute [25], a self-consistent functional mapping |$f: x \rightarrow x$| given by input data |$\boldsymbol{X} \in \mathbb{R}^{M \times N}$| is

$$ \begin{align} \begin{split} \text{self-consistency cost }(f; \boldsymbol{X}, n) &= \frac{||\boldsymbol{X} - f^n(\boldsymbol{X})||^2_F}{M \times N}, \\ \text{self-consistency cost }(f; \boldsymbol{X}, n) &\lt \theta \rightarrow f \text{ is self-consistent}, \end{split} \end{align} $$

(3)

where |$n$| is the function iteration times and |$\theta $| stands for the tolerant noise.

Here, we assume that |$\mu _{i j}$| in |$\boldsymbol{U}$| is a prediction derived by other genes in the same cell defined by the inter-gene relationship matrix|$\boldsymbol{R} \in \mathbb{R}^{M \times M}$|⁠, where |$r_{i j}$| in |$\boldsymbol{R}$| represents the relation coefficient of gene |$j$| for gene |$i$|⁠. Obviously, |$r_{i j} = 0$| when |$i = j$|⁠. So ideally, the following equation should exist:

$$ \begin{align} \boldsymbol{RX} = \boldsymbol{X}. \end{align} $$

(4)

We call the above equation as the first-order equation, from which we can continue to obtain the following high-order equations:

$$ \begin{align} \begin{split} \boldsymbol{R}(\boldsymbol{RX}) &= \boldsymbol{RX}, \\ \boldsymbol{R}(\boldsymbol{R}(\boldsymbol{RX})) &= \boldsymbol{R}(\boldsymbol{RX}), \\ &\vdots \\ \boldsymbol{R}^{n}\boldsymbol{X} &= \boldsymbol{R}^{n-1}\boldsymbol{X}. \end{split} \end{align} $$

(5)

It is the inherent property of the relation matrix |$\boldsymbol{R}$| to make the above first to |$n$|-order equations true. The |$n$|-order equations is the iterative function mapping |$f(\boldsymbol{X}) = \boldsymbol{R}\boldsymbol{X}$| refining |$\boldsymbol{X}$| to achieve self-consistency. Based on this property, we can calculate the approximate matrix |$\boldsymbol{\hat{R}}$| of the relationship matrix that satisfies the self-consistency nature, and make coarse imputation of the cell gene matrix to get the initialization of |$\boldsymbol{{U}}$|⁠.

In order to make the size of all parameter values in |$\boldsymbol{\hat{R}}$| reasonable, and to enhance the stability and expressive ability of the model, we first initialize a matrix |$\boldsymbol{D} \in \mathbb{R}^{M \times M}$|⁠, where

$$ \begin{align} {d_{i j}} = \begin{cases} -\infty & {i = j},\\ -\ln{(M - 2)} & {i \neq j}. \end{cases} \end{align} $$

(6)

We can initialize the relation matrix |$\hat{\boldsymbol{R}}$| as |$r_{ij} = \frac{1}{1 + \exp ^{-d_{ij}}}$|⁠. Then we design the objective function based on self-consistency,

$$ \begin{align} {\min_{\boldsymbol{D}}} \frac{\lambda_{0}}{2}||\boldsymbol{\hat{R}}||^{2} + \sum_{i=1}^{n} \frac{\lambda_{i}}{2}||\hat{\boldsymbol{R}}^{i}\boldsymbol{{X}} - \hat{\boldsymbol{R}}^{i-1}\boldsymbol{{X}}||^{2}. \end{align} $$

(7)

We can get a reasonable approximation of the relationship matrix by optimizing the above objective function. In practice, |$n$| does not need to be too large, generally 3 to 5 can be used to obtain satisfactory results. Finally, we can get a preliminary prediction of |$\boldsymbol{{U}}$|⁠,

$$ \begin{align} \boldsymbol{{U}} = \hat{\boldsymbol{R}}\boldsymbol{{X}}. \end{align} $$

(8)

Above, we have applied the inter-gene relationship matrix for discussion. In fact, this idea is also applicable to the inter-cell relationship matrix. However, we have found in actual experiments that since the number of cells in the relevant data set is far less than the number of genes, when we initialize with the self-consistency of the inter-cell relationship matrix, although it consumes less computing resources, the case of underfitting occurs more often.

Based on singular value decomposition, we decompose the matrix |$\boldsymbol{{U}}$| into two matrices |$\boldsymbol{G}$| and |$\boldsymbol{H}$|⁠, where |$\boldsymbol{G} \in \mathbb{R}^{M \times K}, \boldsymbol{H} \in \mathbb{R}^{ K \times N}$|⁠, |$ K \ll M, N$|⁠.

For the prior variance |$\boldsymbol{{V}}$|⁠, we assume that for each gene, there is an implicit connection between its mean and variance, and introduce three different noise models to estimate the variance: constant variance model, constant Fano model and constant coefficient of variation model; detailed formulation is given in Supplementary Methods.

Objectives and training

After initialization, SMURF uses the matrix decomposition based on the negative binomial distribution and different noise model to denoise the original expression matrix |$\boldsymbol{X}$| and impute in the dropouts in the matrix. First, we need to obtain the set |$P$| composed of the coordinates of cells and genes according to the matrix |$\boldsymbol{X}$|⁠, and make |$P$| satisfy |$\forall (i,j)\in P$|⁠, |${x}_{ij}>0$|⁠.

For the constant variance model (Supplementary Methods), the objective function is

$$ \begin{align} \begin{split} \min_{\boldsymbol{G},\boldsymbol{H},\boldsymbol{{V}}}- \sum_{(i, j)\in P} \frac{\mu_{i j}^{2}}{\nu_i} \log \mu_{i j}-\frac{\mu^2_{i j}}{\nu_i}\log \nu_i-\log\Gamma\left(\frac{\mu^2_{i j}}{\nu_i}\right)\\+\log \Gamma\left(x_{ij}+\frac{\mu^2_{i j}}{\nu_i}\right)-\left(x_{ij}+\frac{\mu^2_{i j}}{\nu_i}\right) \log \left(1+\frac{\mu_{i j}}{\nu_i}\right), \end{split} \end{align} $$

(9)

with |${\boldsymbol{U}}=\boldsymbol{G}\times \boldsymbol{H}$|⁠. Let the loss function be, we separately |$f_{c v}$| calculate |$\frac{\partial{f_{c v}}}{\partial g_i}$|⁠, |$\frac{\partial f_{c v}}{{\partial }h_j}$|⁠, |$\frac{\partial f_{c v}}{{\partial }\nu _i}$|⁠, through the gradient descent method. We continuously update the mean |$\boldsymbol{{U}}$| and variance |$\boldsymbol{{V}}$| until the loss function finally converges.

Under the assumption of the constant Fano model where |$\beta _i = \frac{\mu _{ij}}{\nu _{ij}}$| (Supplementary Methods), the objective function is

$$ \begin{align} & \min_{\boldsymbol{G},\boldsymbol{H},\boldsymbol{{B}}}-\sum_{(i, j)\in P}\mu_{i j} \beta_{i} \log \beta_{i}-\log \Gamma\left(\mu_{i j} \beta_{i}\right) \nonumber \\ & \quad +\log \Gamma\left(x_{ij}+\mu_{i j} \beta_{i}\right)-\left(x_{ij}+\mu_{i j} \beta_{i}\right) \log \left(1+\beta_{i}\right). \end{align} $$

(10)

Similarly, for the constant coefficient of variance model where |$\alpha _i = \frac{\mu ^2_{ij}}{\nu _{ij}}$| (Supplementary Methods), we can get the objective function

$$ \begin{align} & \min_{\boldsymbol{G},\boldsymbol{H},\boldsymbol{{A}}}-\sum_{(i, j)\in P}\alpha_{i} \log \alpha_{i}-\alpha_{i} \log \mu_{i j}-\log \Gamma\left(\alpha_{i}\right) \nonumber \\ &\quad +\log \Gamma\left(x_{ij}+\alpha_{i}\right)-\left(x_{ij}+\alpha_{i}\right)\log\left(1+\frac{\alpha_{i}}{\mu_{i j}}\right). \end{align} $$

(11)

1D-oval cell-cycle visualization

When we use SMURF to process the gene-cell matrix, we can get the cell embedding |$\boldsymbol{H}$|⁠, which holds the underlying relationship between cells. For cells at different stages in the life cycle, we can think that the interrelationship contained in the cell characteristic matrix reflects the cell’s phases in the cell life cycle.

In order to extract the cell information from the cell feature matrix, to restore the time courses of each cell in the life cycle and to make a visual display, we reduce the dimension of the high-dimensional cell features, and make the target low-dimensional space a closed curve, that is, a one-dimensional space, to facilitate our observation.

The widely used linear dimension reduction methods such as PCA [29] are not suitable for our situation because they will destroy the global structure of the data, which is contrary to our purpose of dimension reduction. The method UMAP (Uniform Manifold Approximation and Projection) [18] based on manifold learning gives us new inspiration.

Like UMAP, we made the following three assumptions: (1) The data are uniformly distributed on the Riemannian manifold; (2) The Riemann metric is locally constant (or can be approximated in this way); (3) Manifolds are locally connected. It essentially constructs a weighted map from high-dimensional data, uses edge strength to indicate the ‘closeness’ of one point to another point and then projects this map to a two-dimensional space. Unlike UMAP, we restricted the mapping to 1D space, and the shape of the space should be a closed curve.

We can understand the 1D embedding of cell features in the 1D space as an angle. As the angle rotates, the cell experiences different moments in its life. We chose the oval curve as the target space for visualization:

$$ \begin{align} \frac{{x^{2}}}{{a^{2}}} + \frac{y^{2}}{b^{2}t(x)} = 1, \end{align} $$

(12)

where |$t(x)=\exp{(kx)}$|⁠, and |$a=3$|⁠, |$b=2$| and |$k=0.2$| are hyper-parameters.

Simulation and benchmark settings

We use Splatter [30] to simulate five scRNA-seq datasets. The parameter configurations are nGenes=2000, batchCells=500, seeds=2021, group.prob = c(0.2, 0.2, 0.2, 0.2, 0.2), dropout.type= ‘experiment’, dropout.shape = -1 and dropout.mid = 1, 3, 4, 5 and 7 for dropout rate 28.38%, 63.14%, 78.00%, 88.06% and 97.22%, respectively. For replication experiments, we fixed dropout.shape = -1, and dropout.mid=1, 3, 4, 5 and 7, and set the seeds = 2020, 2022, 2023 and 2024. As a result, we get five replications for dropout.mid = 1, 3, 4, 5 and 7 with average dropout rates 27.94%, 62.83%, 77.83%, 87.98% and 97.24%, respectively.

We selected I-Impute, SAVER, MAGIC and kNN-smoothing as state-of-the-art imputation tools. For the SAVER R package, we used the ‘saver’ function with the parameters ncores=12 and estimates.only=TRUE to perform the imputation tasks. The parameters for scImpute are drop_thre=0.5, ncores=10, Kclusters=(number of true clusters in input data). For I-Impute, we used n=20, c_drop=0.5, p_pca=0.4, normalize=False for synthetic data and True for real data sets, iteration=True. For MAGIC, genes=‘all_genes’, knn=5, delay=1,t=3. The parameters for kNN-smoothing are k=10, d=10, dither=0.03.

Results

Overview of SMURF

The workflow of SMURF is summarized in Figure 1A. In brief, with zero-inflated gene-cell expression matrix |$\boldsymbol{X}$| as input, SMURF first iteratively refine the expression matrix to |$\boldsymbol{U}$| with the principle of self-consistency. Then, SMURF initializes the gene and cell embeddings (⁠|$\boldsymbol{G}$| and |$\boldsymbol{H}$|⁠) from the refined expression profile by singular value decomposition. Then, we assume the gene dispersion under three variance models (constant variance, constant Fano and constant coefficient of variance) and initialize them with the value of 1. With the gene embedding |$\boldsymbol{G}$|⁠, cell embedding |$\boldsymbol{H}$| and gene dispersion |$\boldsymbol{\varphi }$| as parameters, we model and optimize the underlying expression profile |$\boldsymbol{U}$| with a Poisson-Gamma divergent objective to deal with the zero-inflated issues. After iterative gradient descent, SMURF outputs the optimized gene-cell matrix |$\boldsymbol{U}$|⁠, gene embedding |$\boldsymbol{G}$|⁠, cell embedding |$\boldsymbol{H}$| and gene dispersion |$\boldsymbol{\varphi }$|⁠. The downstream analyses that SMURF could aid a better performance are listed in Figure 1B: (i) With the aid of Poisson-Gamma divergent, the final gene-cell expression profile |$\boldsymbol{U}$| best recovers the gene distribution; we demonstrate it with in silico benchmarking and melanoma Drop-seq dataset. (ii) SMURF generates cell embedding |$\boldsymbol{H}$| in the latent space that facilitates cell type identification and visualization. We illustrate it with in silico benchmarking and cell line mixture datasets. (iii) SMURF could further embed cycling cells into a 1D-oval latent space and preserve the time course of the cell cycle. We verify it with human and mouse embryonic stem cell data.

Figure 1

Overview of SMURF. (A) The workflow of SMURF. The red circle means the data are optimized by Poisson-Gamma divergent. (B) The downstream analysis of SMURF. SVD: singular value decomposition. hESCs: human embryonic stem cells. mESCs: mouse embryonic stem cells.

Open in new tab Download slide

SMURF recovers the dropout gene expression in in silico trials

We check out whether SMURF and baselines can recover the dropout gene expression in in silico scRNA-seq data generated by Splatter [30]. We simulate a raw read count matrix with 2000 genes and five cell subpopulations, each subpopulation having an average of 100 cells. Next, we randomly replace 28.38%, 63.14%, 78.00%, 88.06% and 97.22% of nonzero values with zero, yielding five zero-inflated matrices with sparsity of 42.86%, 70.59%, 82.45%, 90.47% and 97.79%.

We evaluate the consistency of imputed values with the raw ones using RMSE and Pearson correlation (Cor.). kNN-smoothing exhibits high RMSE errors, while I-Impute and three SMURF variance modes (CV, F and V) have low RMSE. SMURF achieves the lowest RMSE when the sparsity reaches 80% (Figure 2A and Supplementary Table S1). Then, we measure the Pearson correlation between raw and recovered matrix entries (Figure 2A and Supplementary Table S1). For 28.38% dropout rate, all tools achieve satisfactory performance (Cor. > 0.92, |$p <$| 2.2e-16). In Supplementary Figure S1, the correlation scatterplots demonstrate that MAGIC and SMURF models outperform other tools when sparsity goes up with no severe drop-down or pull-up estimation. SMURF models have slightly larger correlations than MAGIC. For 63.14% dropout rate, MAGIC, SMURF_CV, SMURF_F and SMURF_V had Cor. value of 0.926, 0.933, 0.9323 and 0.933, respectively (⁠|$p <$| 2.2e-16). For 78.00% dropout rate, MAGIC, SMURF_CV, SMURF_F and SMURF_V had Cor. value of 0.900, 0.913, 0.913 and 0.913, respectively (⁠|$p <$| 2.2e-16). For 88.06% dropout rate, MAGIC, SMURF_CV, SMURF_F and SMURF_V had Cor. value of 0.856, 0.873, 0.874 and 0.873, respectively (⁠|$p <$| 2.2e-16). For 97.22% dropout rate, MAGIC, SMURF_CV, SMURF_F and SMURF_V had Cor. value of 0.727, 0.746, 0.745 and 0.744, respectively (⁠|$p <$| 2.2e-16). I-Impute, kNN-smoothing and SAVER tend to pull down some highly expressed genes, hence producing relatively low correlation values. Furthermore, we inspect the group marker genes heatmaps using raw, dropout and recovered expression levels (Supplementary Figure S2). MAGIC and SMURF’s marker gene matrices are closest to the ground truth. On 78.00% and 88.06% dropout sets, I-Impute, SAVER and kNN-smoothing fail to reproduce a clear marker expression pattern. However, all tools were unable to recover the marker expression pattern when the dropout rate reaches 97.22%.

Figure 2

Benchmarking result on five in silico scRNA data with dropout rate of 28.38%, 63.14%, 78.00%, 88.06% and 97.22%. (A) The imputation consistency measured by RMSE and Pearson correlation (Cor.). (B) UMAP plot. Cells from the same group share the same annotation color. From left to right are datasets with the dropout rate of 28.38%, 63.14%, 78.00%, 88.06% and 97.22%. From top and bottom are tools: raw, dropout, I-Impute, SAVER, MAGIC, kNN-smoothing, SMURF_CV, SMURF_F and SMURF_V. (C) The clustering accuracy measured by adjusted rand index (ARI), adjusted mutual information (AMI), normalized mutual information (NMI), homogeneity score (Homogeneity), completeness score (Completeness) and V-measure score (V-measure). SMURF_CV, SMURF_V and SMURF_F stand for SMURF with variance mode constant coefficient of variance (CV), constant Fano (F) and constant variance (V), respectively. SMURF models with suffix ‘H’ mean the clustering was conducted with the cell embedding yielded by SMURF.

Open in new tab Download slide

We then assessed whether SMURF maintains the cell subpopulation structures well while retrieving the missing entries. Figure 2B displays the UMAP plots of the raw, dropout and recovered matrices annotated with five cell subpopulations. The UMAP plot of expression data recovered by MAGIC and SMURF separated the five cell subpopulations even when the dropout rate reached 88.06%. The UMAP plot of kNN-smoothing blurred at the dropout rate of 78%. I-Impute and SAVER performed worst; they failed at 63.14%. Furthermore, we employ spectral clustering to the dropout, recovered and SMURF’s cell embeddings. We evaluate the clustering accuracy using six metrics: adjusted Rand index (ARI) [31], adjusted mutual information (AMI) [32], normalized mutual information (NMI) [33], homogeneity score (Homogeneity) [34], completeness score (Completeness) [34] and V-measure score (V-measure) [34]. These metrics measure the degree of overlap between the estimation and ground truth clusters. The value of 0 signifies random clustering, while 1 indicates perfectly complete labeling. In Figure 2C and Supplementary Table S2, the overall clustering accuracy declines as the dropout rate increases. kNN-smoothing and MAGIC exhibited the lowest clustering accuracy in all settings, sometimes even worse than the dropout case. The expression data recovered by I-Impute, SAVER, SMURF_CV, SMURF_F and SMURF_V displayed moderate improvements in clustering accuracy against the dropout data. The conventional imputation tools and SMURF output the full-dimensional imputed matrix |$\boldsymbol{U}$|⁠. SMURF also generates the cell embedding |$\boldsymbol{H}$|⁠. Both |$\boldsymbol{U}$| and |$\boldsymbol{H}$| can be taken as the input of downstream cell subpopupopulation identification task. Notably, the clustering accuracy yielded by the cell embedding |$\boldsymbol{H}$| outperformed all imputed full-dimensional cell expression |$\boldsymbol{U}$| and manifested the most desirable cell subpopulation classification power. Especially, SMURF_CV_H, SMURF_F_H and SMURF_V_H obtained the most noticeable clustering accuracy enhancements against their corresponding full matrices (SMURF_CV, SMURF_F and SMURF_V) in 78%, 88.06% and 97.22% datasets.

We further conducted replication experiments. We utilized Splatter to simulate five cohorts with average dropout rates of 27.94%, 62.83%, 77.83%, 87.98% and 97.24% (Methods, Supplementary Figure S3 and Supplementary Table S3). Each cohort has five replications. After that, we run SMURF on the replication data in three different variance modes (CV, F and V), and the imputation power and clustering accuracy are repeatable, as shown in Supplementary Figure S4 and Supplementary Tables S4–S5. To check the effectiveness of self-consistency initialization, we compared the RMSE and Cor. of SMURF without and with self-consistency initialization (Supplementary Figure S5 and Supplementary Table S6). Notably, the self-consistency initialization promotes the gene expression recovering power with RMSE falls and Cor. rises. With embedding dimensionality |$K$| set at 10, 20 and 50, SMURF runs for 1 min or less for in silico replication benchmarking (Supplementary Figure S6 and Supplementary Table S7)

Overall, the in silico assessment indicated that SMURF models parade the most robust gene expression recovery power. Moreover, SMURF’s cell embeddings successfully preserve the underlying cell subpopulation structures.

SMURF recovers the distribution of genes from WM989 Drop-seq data

We benchmarked SMURF with the melanoma cell line data, WM989, where Torre et al. [35] profiled 8498 WM989 single cells employing Drop-seq and Shaffer et al. [36] measured the gold-standard expression of 26 drug-resistant marker and housekeeping genes among the 7000–88 000 WM989 cells with single-molecule RNA fluorescence in situ hybridization (smRNA FISH). There are 15 genes overlapped between the Drop-seq and gold-standard expression data, and the sequenced cells are different from the two approaches [35, 36]. Thus, we assessed imputation performance by comparing these 15 gene distributions and correlations of imputation tools recovered with the distributions captured by gold-standard smRNA FISH (Figure 3). I-Impute is not compared here as it failed to finish the task in 24 h.

Figure 3

Benchmarking result on melanoma cell line WM989. (A) The imputed gene distribution and FISH estimated gene distribution among 15 genes. (B) The Kullback–Leibler (KL) divergence between imputed gene distribution and FISH estimated gene distribution.

Open in new tab Download slide

We drew the density plot of each gene’s expression across cells (Figure 3A) and used Kullback–Leibler (KL) divergence [37] to measure how gene distributions recovered by imputation tools were different from the reference gene distributions estimated by smRNA FISH (Figure 3B and Supplementary Table S8). Compared with Drop-seq, the expression distributions obtained by SMURF and SAVER were closer to FISH. MAGIC and kNN-smoothing demonstrated a worse imputation ability as MAGIC on the shape of density plot and kNN-smoothing show the largest KL divergence on several genes.

SMURF enables accurate unsupervised clustering on mixture of cell line data

We apply SMURF to eight wet-lab mixture cell line datasets to benchmark the cell subpopulation problem. The first set is p3cl, including human dermal fibroblasts-skin, breast cancer luminal epithelial cell line MCF-7 and breast cancer basal-like epithelial cell line MDA-MB-468 in a 1:3:6 ratio [38, 39]. We evaluate SMURF with CellBench [40], where five human lung adenocarcinoma cell lines were mixed and sequenced with 10x Genomics, CEL-seq2 and Drop-seq, yielding seven datasets (mixture of HCC827, H1975 and H228: sc_10x, sc_celseq2, sc_dropseq; mixture of HCC827, H1975, A549, H838 and H2228: sc_10x_5cl, sc_celseq2_5cl_p1, sc_celseq2_5cl_p2, sc_celseq2_5cl_p3).

Likewise, we evaluated the consistency of imputed values with the raw ones using RMSE and Pearson correlation. kNN-smoothing exhibited high RMSE errors, while SMURF and other tools had relatively low RMSE (Figure 4A and Supplementary Table S9). MAGIC had the highest Pearson correlation between raw and recovered matrix entries (Figure 4A and Supplementary Table S9). The performance of I-Impute and SAVER performance is similar to that of dropout data. The SMURF models have the second place in five out of eight sets (sc_celseq2_5cl_p1, sc_celseq2_5cl_p2, sc_celseq2_5cl_p3, sc_celseq2 and sc_dropseq). The investigation of cell type marker genes heatmaps using raw, dropout and recovered expression levels (Supplementary Figure S7) demonstrates that the SMURF’s marker gene matrices are closest to the raw data on all eight sets. All other tools failed to produce distinct cell type patterns for all datasets.

Figure 4

Benchmarking result on eight mixture of cell line scRNA data. (A) The imputation consistency measured by RMSE and Pearson correlation (Cor.). (B) UMAP plot. Cells from the same group share the same annotation color. From left to right are datasets p3cl, sc_10x_5cl, sc_10x, sc_celseq2_5cl_p1, sc_celseq2_5cl_p2, sc_celseq2_5cl_p3, sc_celseq2 and sc_dropseq with the dropout rate of 90%, 80%, 80%, 65%, 65%, 75%, 65% and 75%, respectively. From top and bottom are tools: Raw, Dropout, I-Impute, SAVER, MAGIC, kNN-smoothing, SMURF_CV, SMURF_F and SMURF_V. (C) The clustering accuracy measured by adjusted rand index (ARI), adjusted mutual information (AMI), normalized mutual information (NMI), homogeneity score (Homogeneity), completeness score (Completeness) and V-measure score (V-measure) of HVG600. SMURF_CV, SMURF_V and SMURF_F stand for SMURF with variance mode constant coefficient of variance (CV), constant Fano (F) and constant variance (V), respectively. SMURF models with suffix ‘H’ mean the clustering was conducted with the cell embedding yielded by SMURF.

Open in new tab Download slide

We then evaluated whether SMURF keeps the cell type structures well while retrieving the missing entries. Figure 4B illustrates the UMAP embedding plots of the raw, dropout and recovered matrices annotated with the cell line labels. The UMAP plot of expression data recovered by I-Impute, SAVER and SMURF drew distinct cell line groups. However, MAGIC and kNN-smoothing mixed different cell line together in a mess. Furthermore, Figure 4C and Supplementary Table S10 further suggested that MAGIC and kNN-smoothing had the poorest spectral clustering accuracy on 600 of highly variable genes (HVG).

Overall, kNN-smoothing and MAGIC demonstrated the lowest clustering accuracy in all cell line mixtures, even worse than the dropout case. The expression data recovered by I-Impute and SAVER displayed moderate clustering accuracy improvements against the dropout on sc_celseq2_5cl_p2. Notably, the clustering accuracy yielded by SMURF outperformed on all cell mixtures and displayed the most desirable cell subpopulation classification power. Especially, cell embedding |$\boldsymbol{H}$| obtained the most noticeable clustering accuracy enhancements against all corresponding full matrices in the majority of datasets.

In all, the cell line mixture assessment indicated that SMURF models paraded the most robust gene expression recovery power. Moreover, SMURF’s cell embedding |$\boldsymbol{H}$| successfully preserved the underlying cell-subpopulation structures.

SMURF facilitates pseudo-time prediction for cell-cycle data

We leveraged six cell-cycle datasets to show SMURF facilitates pseudo-time prediction. In H1-hESC datasets, 247 cells were identified as being in G0/G1, S or G2/M phases using fluorescent ubiquitination-based cell-cycle indicators [41]. With G0/G1, S and G2/M phases labeled by Hoechst staining, mESC-Quartz and mESC-SMARTer has 23 and 288 mESCs sequenced by Quartz-seq and SMARTer, respectively [42, 43]. Then, as the 3Line-qPCR dataset provides G0/G1, S and G2/M cell stages marked by Hoechst staining as well, we split it into three datasets 3Line-qPCR_H9 (227 cells), 3Line-qPCR_MB (342 cells) and 3Line-qPCR_PC3 (361 cells) as the pseudo-time prediction benchmark sets [44].

Likewise, we first adopted I-Impute, kNN-smoothing, MAGIC, SAVER and SMURF to impute the six datasets. Then, we used SMURF to embed the raw and imputed full-dimensional expression profile and |$\boldsymbol{H}$| into 1D-oval latent space (Figure 5A). In the 1D-oval plot of raw H1-hESC, the cells from the G2/M phase are split into two subgroups, and the cells are not ordered according to the correct time course. However, in the 1D-oval plot of SMURF’s cell embedding |$\boldsymbol{H}$|⁠, cells formed a G0/G1-S-G2/M cycle alongside the oval. We further adopted change index (CI) and Pearson correlation coefficient (PCC) [5] to quantitatively measure the accuracy of 1D-oval pseudo-time order against known cell cycle phases labels. A close to 1 value of CI and PCC signifies near-perfect time-series order, while a value of 0 means the predict psueo-time order is randomly matched with the ground-truth cell-cycle order. In Figure 5B and Supplementary Table S11, the SMURF cell embeddings (SMURF_F_H) have higher CI and PCC than SMURF and other imputation tools at most cases. This signifies SMURF can reduce the cell-cycle noise utilizing matrix factorization, thus facilitating the cell-cycle pseudo-time prediction.

Figure 5

Cell-cycle visualization for hESCs. (A) 1D oval plot. Cells from the same cell cycle phase share the same annotation color. From left to right are tools: Raw, I-Impute, kNN-smoothing, MAGIC, SAVER, SMURF_F and SMURF_F_H. From top and bottom are datasets: H1-hESC, mESC-Quartz and mESC-SMARTer. (B) The time series scoring accuracy measured by the change index (CI) and Pearson correlation coefficient (PCC).

Open in new tab Download slide

Discussion

In this study, we presented SMURF to impute scRNA data leveraging a mixture of Poisson-Gamma distribution while preserving self-consistency and embedding cells and genes data into their latent space with matrix factorization. We have demonstrated that SMURF could aid a better performance of scRNA downstream analyses in three perspectives: (i) With the aid of Poisson-Gamma divergent and self-consistency initialization, the final gene-cell expression profile |$\boldsymbol{U}$| successfully recovers the gene distribution compared with state-of-the-art imputation tools I-Impute, SAVER, MAGIC and kNN-smoothing. In synthetic data assessment, SMURF models paraded the most robust gene expression recovery power with low RMSE and high Pearson correlation (Figure 2A, Supplementary Figures S4–S5, and Supplementary Tables S1, S4, S6). Furthermore, SMURF successfully recovered the gene distribution from melanoma cell line Drop-seq data (Figure 3 and Supplementary Table S8). (ii) By incorporating matrix factorization, SMURF generates cell embedding |$\boldsymbol{H}$| in the latent space that facilitates cell type identification and visualization. We demonstrated that SMURF’s cell embedding |$\boldsymbol{H}$| successfully preserved the underlying cell-subpopulation structures for different sparsity against the full-dimensional imputed gene-cell expression data on synthetic datasets (Figure 1B–C, Supplementary Figure S4 and Supplementary Tables S2, S5) and eight cell line mixtures datasets (Figure 4B–C and Supplementary Table S10). (iii) SMURF could further embed cycling cells into a 1D-oval latent space and best preserve the time course of the cell cycle; we verify it with human and mouse embryonic stem cell data (Figure 5 and Supplementary Table S11).

Both SMURF and SAVER adopt Poisson-Gamma distribution to the zero-inflated fact of single-cell data. SMURF, however, has an additional setup step. We use the gene–gene similarity relationship to repeatedly revise the expression profile so that it maintains self-consistency, thus achieving a better gene expression recovery power than SAVER. Moreover, the advantage of SMURF against SAVER and other imputation tools is the support of dimensional reduction utilization matrix factorization. We model the distribution mean |$\boldsymbol{U}$| with the dot product of two |$K$|-dimensional embeddings |$\boldsymbol{G}$| and |$\boldsymbol{H}$|⁠, which signifies the low-dimensional representation of gene and cell in the latent space, respectively. The gene and cell embeddings preserve the underlying gene–gene interactions and cell–cell structures and overcome the curse of dimensionality. We have demonstrated that the cell embedding |$\boldsymbol{H}$| manifested better cell type detection power against the full-dimensional matrix. Furthermore, in contrast to other imputation tools, SMURF provides 1D-oval manifold embedding dedicated to cell-cycle visualization.

However, several concerns need to be addressed. First, selecting the reduced dimension |$K$| is still an open question. Generally, the value of |$K$| first depends on the size of the zero-inflated matrix. The number of cells or genes should be at least several times larger than |$K$|⁠; otherwise, the model may be over-fitting and ineffective. The dropout rate in the matrix also affects the value of |$K$|⁠. For a matrix with a small dropout rate, the value of |$K$| can be increased appropriately, while for a very sparse matrix, a smaller |$K$| should be tried.

Second, SMURF assumes the underlying expression follows a mixture of Poisson-Gamma distribution and regards the estimated mean U from the mixture distribution as the final imputed gene expression. Moreover, SMURF uses the self-consistency principle to refine the initial |$\boldsymbol{U}$| with the gene–gene similarity relationship iteratively; that is, SMURF will adjust the noises for the genes with unusually high or low abundances in addition to imputing any potential missing entries with fake zeros.

Third, SMURF optimizes the Poisson-Gamma divergent under the assumption of three gene variance models (constant variance, constant Fano and constant coefficient of variance). Theoretically, we advise users to select the variance model that yields the smallest negative log-likelihood loss and employ the accompanying imputation and embedding to carry out the downstream analysis. In experiments, we observed that the constant Fano model performed with the greatest stability, thus we made it the default in the SMURF python package.

Fourth, SMURF does not always outperform in all situations. Still, we accept this because no model could ever be guaranteed to perform better than others on any given collection of data.

Poisson-Gamma divergence with matrix factorization is powerful to impute and embed sparse biological data simultaneously. In this study, we have shown how well the model handles sparse scRNA-seq data. To aid in the identification of topologically associated domains, we also used the model to infer the missing RNA–chromatin interaction and RNA–RNA interaction frequencies [45]. We are interested in seeing if the Poisson-Gamma divergence with matrix factorization model can accommodate alternative data modalities suffering from sparseness. For instance, single-cell assay for transposase-accessible chromatin data possesses a greater sparsity than scRNA-seq data [46]. Spatial transcriptomics is a promising technology for deciphering cellular heterogeneity by incorporating spatial context [47]. However, the widely used 10x protocol has resolution issues in addition to sparsity. That is, the 10x protocol generally gives us the mixture expression profile from two to eight cells in one spot over the tissue, which may have different cell types [48]. As a result, it is crucial to impute the spatial expression with the single-cell resolution [49]. The same applies to bulk RNA data. Given that communities have amassed substantial amounts of bulk RNA data in tumor immune micro-environment [50], computationally imputing single-cell level expression profiles from bulk RNA data could considerably encourage the effectiveness of treatment and prognosis for cancer. We intend to check if the single-cell level expression can be recovered from spatial or bulk data using the Poisson-Gamma divergence with matrix factorization model.

Key Points

We propose SMURF to extract the low-dimensional embeddings from cells and genes utilizing matrix factorization with a mixture of Poisson-Gamma divergent as objective while preserving self-consistency.
SMURF exhibits feasible cell subpopulation discovery efficacy with obtained cell embeddings on scRNA datasets with ground truth cell types.
SMURF can reduce the cell embedding into a 1D-oval space to recover the time course of cell cycle.
SMURF can serve as an imputation tool, the in silico data assessment shows that SMURF parades the most robust gene expression recovery power with low RMSE and high Pearson correlation.

Acknowledgments

Not applicable.

Funding

This project was supported by National Key Research and Development Program of China (2021YFB2104800), Hong Kong CityU Strategic Interdisciplinary Research Grant (CityU SIRG 7020005), National Natural Science Foundation of China (62177002, 62072433), and Fundamental Research Funds for the Central Universities (DUT21RC(3)102).

Data availability

The p3cl was obtained from [38, 39]. The 3Line-qPCR_PC3 was obtained from [44]. The CellBench data were obtained from [40]. The H1-hESC data were obtained from [41]. The mESC-Quartz was obtained from [42]. The mESC-SMARTer was obtained from [43]. The WM989 Drop-seq was obtained with accession code GSE99330 [35]. The WM989 smRNA FISH data were obtained from the link https://www.dropbox.com/s/ia9x0iom6dwueix/fishSubset.txt?dl=0 [36].

Software availability

The source code of SMURF is available at https://github.com/deepomicslab/SMURF.

Author Biographies

Juhua Pu, PhD, is an associate professor at the School of Computer Science, Beihang University, China. Her research interest include machine learning and smart cities.

Xingwu Liu received PhD degree from Chinese Academy of Sciences in 2005. Currently he is a professor at the School of Mathematical Sciences, Dalian University of Technology, China. His research interest includes combinatorial optimization and deep learning.

Lingxi Chen, PhD, is a PostDoc in the Department of Computer Science, City University of Hong Kong. She is good at bioinformatics, deep learning, and algorithm development.

Shuai Cheng Li, PhD, is an associate professor at the Department of Computer Science, City University of Hong Kong. He is good at bioinformatics, deep learning, and algorithm design.

Bingchen Wang received the B.S. degree in Mathematics and Applied Mathematics from Beihang University, City, China in 2019 and the M.S. degree in System Engineering from Beihang University, City, China, in 2022. His research interests include machine learning and smart cities.

References

1

Saliba

A-E

,

Westermann

AJ

,

Gorski

SA

, et al.

Single-cell rna-seq: advances and future challenges

.

Nucleic Acids Res

2014

;

42

(

14

):

8845

–

60

.

2

Kolodziejczyk

AA

,

Kim

JK

,

Svensson

V

, et al.

The technology and biology of single-cell rna sequencing

.

Mol Cell

2015

;

58

(

4

):

610

–

20

.

3

Liu

S

,

Trapnell

C

.

Single-cell transcriptome sequencing: recent advances and remaining challenges

.

F1000Research

2016

;

5

:

182

.

Google Scholar

Crossref

WorldCat

4

Trapnell

C

,

Cacchiarelli

D

,

Grimsby

J

, et al.

The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells

.

Nat Biotechnol

2014

;

32

(

4

):

381

–

6

.

5

Liu

Z

,

Lou

H

,

Xie

K

, et al.

Reconstructing cell cycle pseudo time-series via single-cell transcriptome data

.

Nat Commun

2017

;

8

(

1

):

22

.

6

Horning

AM

,

Wang

Y

,

Lin

C-K

, et al.

Single-cell rna-seq reveals a subpopulation of prostate cancer cells with enhanced cell-cycle–related transcription and attenuated androgen response

.

Cancer Res

2018

;

78

(

4

):

853

–

64

.

7

Baruch

K

,

Deczkowska

A

,

Rosenzweig

N

, et al.

Pd-1 immune checkpoint blockade reduces pathology and improves memory in mouse models of alzheimer’s disease

.

Nat Med

2016

;

22

(

2

):

135

.

8

Segerstolpe

Å

,

Palasantza

A

,

Eliasson

P

, et al.

Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes

.

Cell Metab

2016

;

24

(

4

):

593

–

607

.

9

Lawlor

N

,

George

J

,

Bolisetty

M

, et al.

Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes

.

Genome Res

2017

;

27

(

2

):

208

–

22

.

10

Chung

W

,

Eum

HH

,

Lee

H-O

, et al.

Single-cell rna-seq enables comprehensive tumour and immune cell profiling in primary breast cancer

.

Nat Commun

2017

;

8

:

15081

.

11

Karaayvaz

M

,

Cristea

S

,

Gillespie

SM

, et al.

Unravelling subclonal heterogeneity and aggressive disease states in tnbc through single-cell rna-seq

.

Nat Commun

2018

;

9

(

1

):

3588

.

12

Guo

X

,

Zhang

Y

,

Zheng

L

, et al.

Global characterization of t cells in non-small-cell lung cancer by single-cell sequencing

.

Nat Med

2018

;

24

(

7

):

978

.

13

Kim

C

,

Gao

R

,

Sei

E

, et al.

Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing

.

Cell

2018

;

173

(

4

):

879

–

93

.

14

Bartoschek

M

,

Oskolkov

N

,

Bocci

M

, et al.

Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell rna sequencing

.

Nat Commun

2018

;

9

(

1

):

5150

.

15

Kharchenko

PV

,

Silberstein

L

,

Scadden

DT

.

Bayesian approach to single-cell differential expression analysis

.

Nat Methods

2014

;

11

(

7

):

740

.

16

Li

WV

,

Li

JJ

.

An accurate and robust imputation method scimpute for single-cell rna-seq data

.

Nat Commun

2018

;

9

(

1

):

997

.

17

Chen

L

,

Li

SC

.

Incorporating cell hierarchy to decipher the functional diversity of single cells

.

Nucleic Acids Res

2022

;

11

:

gkac1044

.

Google Scholar

OpenURL Placeholder Text

WorldCat

18

Becht

E

,

McInnes

L

,

Healy

J

, et al.

Dimensionality reduction for visualizing single-cell data using umap

.

Nat Biotechnol

2019

;

37

(

1

):

38

–

44

.

Google Scholar

Crossref

WorldCat

19

Hao

Y

,

Hao

S

,

Andersen-Nissen

E

, et al.

Integrated analysis of multimodal single-cell data

.

Cell

2021

;

184

:

3573

–

3587.e29

.

20

Qiu

X

,

Mao

Q

,

Tang

Y

, et al.

Reversed graph embedding resolves complex single-cell trajectories

.

Nat Methods

2017

;

14

(

10

):

979

–

82

.

21

Van Dijk

D

,

Sharma

R

,

Nainys

J

, et al.

Recovering gene interactions from single-cell data using data diffusion

.

Cell

2018

;

174

(

3

):

716

–

29

.

22

Wagner

F

,

Yan

Y

,

Yanai

I

.

K-nearest neighbor smoothing for high-throughput single-cell rna-seq data

.

BioRxiv

2017

;

217737

.

Google Scholar

OpenURL Placeholder Text

WorldCat

23

Huang

M

,

Wang

J

,

Torre

E

, et al.

Saver: gene expression recovery for single-cell rna sequencing

.

Nat Methods

2018

;

15

(

7

):

539

–

42

.

24

Hou

W

,

Ji

Z

,

Ji

H

, et al.

A systematic evaluation of single-cell rna-sequencing imputation methods

.

Genome Biol

2020

;

21

(

1

):

1

–

30

.

Google Scholar

Crossref

WorldCat

25

Feng

X

,

Chen

L

,

Wang

Z

, et al.

I-impute: a self-consistent method to impute single cell rna sequencing data

.

BMC Genom

2020

;

21

(

10

):

1

–

9

.

Google Scholar

OpenURL Placeholder Text

WorldCat

26

Andrews

TS

,

Kiselev

VY

,

McCarthy

D

, et al.

Tutorial: guidelines for the computational analysis of single-cell rna sequencing data

.

Nat Protoc

2021

;

16

(

1

):

1

–

9

.

27

Nayak

R

,

Hasija

Y

.

A hitchhiker’s guide to single-cell transcriptomics and data analysis pipelines

.

Genomics

2021

;

113

:

606

–

19

.

28

Chen

L

,

Jiao

X

,

Li

SC

.

Deepmf: deciphering the latent patterns in omics profiles with a deep learning method

.

BMC Bioinform

2019

;

20

(

23

):

1

–

14

.

Google Scholar

OpenURL Placeholder Text

WorldCat

29

Lê

S

,

Josse

J

,

Husson

F

, et al.

Factominer: an r package for multivariate analysis

.

J Stat Softw

2008

;

25

(

1

):

1

–

18

.

Google Scholar

Crossref

WorldCat

30

Zappia

L

,

Phipson

B

,

Oshlack

A

.

Splatter: simulation of single-cell rna sequencing data

.

Genome Biol

2017

;

18

(

1

):

174

.

31

Rand

WM

.

Objective criteria for the evaluation of clustering methods

.

J Am Stat Assoc

1971

;

66

(

336

):

846

–

50

.

Google Scholar

Crossref

WorldCat

32

Vinh

NX

,

Epps

J

,

Bailey

J

.

Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance.

.

J Mach Learn Res

2010

;

11

:

2837

–

54

.

Google Scholar

OpenURL Placeholder Text

WorldCat

33

Cover

TM

,

Thomas

JA

.

Elements of information theory

.

John Wiley & Sons

,

2012

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

34

Andrew

Rosenberg

and

Julia

Hirschberg

.

V-measure: A conditional entropy-based external cluster evaluation measure

. In

Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)

, pages

410

–

20

,

2007

.

35

Torre

E

,

Dueck

H

,

Shaffer

S

, et al.

Rare cell detection by single-cell rna sequencing as guided by single-molecule rna fish

.

Cell Syst

2018

;

6

(

2

):

171

–

9

.

36

Shaffer

SM

,

Dunagin

MC

,

Torborg

SR

, et al.

Rare cell variability and drug-induced reprogramming as a mode of cancer drug resistance

.

Nature

2017

;

546

(

7658

):

431

–

5

.

37

Kullback

S

,

Leibler

RA

.

On information and sufficiency

.

Ann Math Stat

1951

;

22

(

1

):

79

–

86

.

Google Scholar

Crossref

WorldCat

38

Dong

M

,

Thennavan

A

,

Urrutia

E

, et al.

Scdc: bulk gene expression deconvolution by multiple single-cell rna sequencing references

.

Brief Bioinform

2021

;

22

(

1

):

416

–

27

.

39

Liu

S

,

Thennavan

A

,

Garay

JP

, et al.

Multik: an automated tool to determine optimal cluster numbers in single-cell rna sequencing data

.

Genome Biol

2021

;

22

(

1

):

1

–

21

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

40

Tian

L

,

Dong

X

,

Freytag

S

, et al.

Benchmarking single cell rna-sequencing analysis pipelines using mixture control experiments

.

Nat Methods

2019

;

16

(

6

):

479

–

87

.

41

Leng

N

,

Chu

L-F

,

Barry

C

, et al.

Oscope identifies oscillatory genes in unsynchronized single-cell rna-seq experiments

.

Nat Methods

2015

;

12

(

10

):

947

.

42

Sasagawa

Y

,

Nikaido

I

,

Hayashi

T

, et al.

Quartz-seq: a highly reproducible and sensitive single-cell rna sequencing method, reveals non-genetic gene-expression heterogeneity

.

Genome Biol

2013

;

14

(

4

):

1

–

17

.

Google Scholar

Crossref

WorldCat

43

Buettner

F

,

Natarajan

KN

,

Paolo Casale

F

, et al.

Computational analysis of cell-to-cell heterogeneity in single-cell rna-sequencing data reveals hidden subpopulations of cells

.

Nat Biotechnol

2015

;

33

(

2

):

155

–

60

.

44

McDavid

A

,

Dennis

L

,

Danaher

P

, et al.

Modeling bi-modality improves characterization of cell cycle on gene expression in single cells

.

PLoS Comput Biol

2014

;

10

(

7

):

e1003696

.

45

Zhang

YW

,

Chen

L

,

Li

SC

.

Detecting tad-like domains from rna-associated interactions

.

Nucleic Acids Res

2022

;

50

:

e88

.

46

Cusanovich

DA

,

Daza

R

,

Adey

A

, et al.

Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing

.

Science

2015

;

348

(

6237

):

910

–

4

.

47

Ståhl

PL

,

Salmén

F

,

Vickovic

S

, et al.

Visualization and analysis of gene expression in tissue sections by spatial transcriptomics

.

Science

2016

;

353

(

6294

):

78

–

82

.

48

Walker

BL

,

Cang

Z

,

Ren

H

, et al.

Deciphering tissue structure and function using spatial transcriptomics

.

Commun Biol

2022

;

5

(

1

):

1

–

10

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

49

Wang

J

,

Li

S

,

Chen

L

, et al.

Sprout: spectral sparsification helps restore the spatial structure at single-cell resolution

.

NAR Genom Bioinform

2022

;

4

(

3

):

lqac069

.

50

Wang

X

,

Chen

L

,

Liu

W

, et al.

TIMEDB: tumor immune micro-environment cell composition database with automatic analysis and interactive visualization

.

Nucleic Acids Res

2022

;

51

:

D1417

–

24

.

Google Scholar

Crossref

WorldCat

Author notes

The authors wish it to be known that, in their opinion, the Juhua Pu and Bingchen Wang authors should be regarded as joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Download all slides

Month:	Total Views:
January 2023	38
February 2023	143
March 2023	64
April 2023	62
May 2023	31
June 2023	25
July 2023	27
August 2023	16
September 2023	12
October 2023	31
November 2023	29
December 2023	43
January 2024	64
February 2024	64
March 2024	81
April 2024	44
May 2024	48
June 2024	30
July 2024	36
August 2024	54
September 2024	61
October 2024	52
November 2024	41
December 2024	46
January 2025	41
February 2025	51
March 2025	44
April 2025	51
May 2025	15

Article Contents

SMURF: embedding single-cell RNA-seq data with matrix factorization preserving self-consistency

Abstract

Introduction

Materials and methods

Algorithm of SMURF

Problem formulation

Model initialization based on self-consistency

Objectives and training

1D-oval cell-cycle visualization

Simulation and benchmark settings

Results

Overview of SMURF

SMURF recovers the dropout gene expression in in silico trials

SMURF recovers the distribution of genes from WM989 Drop-seq data

SMURF enables accurate unsupervised clustering on mixture of cell line data

SMURF facilitates pseudo-time prediction for cell-cycle data

Discussion

Acknowledgments

Funding

Data availability

Software availability

Author Biographies

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

SMURF: embedding single-cell RNA-seq data with matrix factorization preserving self-consistency

Abstract

Introduction

Materials and methods

Algorithm of SMURF

Problem formulation

Model initialization based on self-consistency

Objectives and training

1D-oval cell-cycle visualization

Simulation and benchmark settings

Results

Overview of SMURF

SMURF recovers the dropout gene expression in in silico trials

SMURF recovers the distribution of genes from WM989 Drop-seq data

SMURF enables accurate unsupervised clustering on mixture of cell line data

SMURF facilitates pseudo-time prediction for cell-cycle data

Discussion

Acknowledgments

Funding

Data availability

Software availability

Author Biographies

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only