Abstract

Summary

Spatial transcriptomics (ST) allows gene expression profiling within intact tissue samples but lacks single-cell resolution. This necessitates computational deconvolution methods to estimate the contributions of distinct cell types. This article introduces NLSDeconv, a novel cell-type deconvolution method based on non-negative least squares, along with an accompanying Python package. Benchmarking against 18 existing deconvolution methods on various ST datasets demonstrates NLSDeconv’s competitive statistical performance and superior computational efficiency.

Availability and implementation

NLSDeconv is freely available at https://github.com/tinachentc/NLSDeconv as a Python package.

1 Introduction

Spatial transcriptomics (ST), crowned the Method of the Year in 2020 (Marx 2021), has revolutionized biomedical research by allowing quantification of the mRNA expression of a large number of genes simultaneously in the spatial context of the tissue. The sequencing-based ST technology, while achieving its popularity for spatial profiling of gene expression across larger tissue areas with higher throughput than the image-based ST methods, lacks single-cell resolution. Consequently, the gene expression measured within a spot or a section of the tissue only reflects the average expression of a mixture of cell populations. Understanding the precise cell type composition within each spatial location is crucial for studying developmental biology and cancer biology. Thus, deconvoluting the cell type composition within each spot has become an imperative task in ST research.

Cell-type deconvolution in ST data often utilizes a set of reference gene expression profiles for known cell types, frequently obtained from scRNA-seq studies (Moses and Pachter 2022). Many methods have been developed in recent years, including probabilistic-based methods (Andersson et al. 2020, Cable et al. 2022, Danaher et al. 2022, Kleshchevnikov et al. 2022, Lopez et al. 2022, Ma and Zhou 2022); non-negative matrix factorization-based methods (Rodriques et al. 2019, Dong and Yuan 2021, Elosua-Bayes et al. 2021); and deep learning-based methods (Biancalani et al. 2021), etc. Some of these methods utilize the spatial information [such as Ma and Zhou (2022)], while many do not (Dong and Yuan 2021, Cable et al. 2022, Kleshchevnikov et al. 2022). Therefore, in principle many methodologies developed for cell-type deconvolution for bulk RNA-seq data such as Newman et al. (2015), Hunt et al. (2019), and Li and Wu (2019) can be applied to ST deconvolution problem as well. For excellent reviews of the deconvolution topic in ST data, we refer to Chen et al. (2022) and Li et al. (2023).

In this article, we propose a novel method named NLSDeconv and a Python package based on the non-negative least squares (NNLS) method for cell-type deconvolution for ST data. We demonstrate the competitive performance of NLSDeconv relative to 18 existing methods by utilizing comprehensive benchmark datasets from both image-based and sequencing-based ST technologies, as well as the compelling efficiency of NLSDeconv.

2 Materials and methods

Like many ST cell-type deconvolution methods, NLSDeconv takes two input datasets.

  • An ST dataset containing RNA-seq read count for each gene at each spot within the tissue, denoted by Y of size ns×ng where ns and ng stand for the number of spots (spatial locations) and number of genes, respectively.

  • A single-cell RNA-seq (scRNA-seq) reference dataset, denoted X, with dimensions nc×ng, contains the genome-wide expression in read counts of ng genes collected from nc cells. Each cell is known to belong to one of K distinct cell types (typically K<<nc). Each cell type contains multiple biological profiles (i.e. multiple rows of X), and their gene expression profiles are defined as prototypes of that cell type in the reference datasets. For example, in the seqFISH3000 dataset, the cell type “iNeuron” is represented by 164 prototypes, meaning 164 rows of X correspond to the same cell type “iNeuron,” with each of the 164 rows reflecting a unique gene expression profile. The deconvolution algorithm estimates the proportion of each prototype in each spot using the observed spatial transcriptomic data. The proportion of a cell type is then calculated as the sum of the proportions of all its prototypes. For instance, the overall proportion of the “iNeuron” cell type is the sum of the 164 prototype proportions.

  • Here, we assume X and Y share the same gene dimension ng, but note that NLSDeconv can adjust for different dimensions through gene selection preprocessing.

The output of NLSDeconv is a cell-type deconvolution matrix P^ with each entry estimating the proportion of a cell type (not prototype) within a spot in the tissue.

The fundamental idea behind NLSDeconv is modeling the ST data’s gene read counts at a spot as a weighted sum of contributions from each cell prototype present. This is captured by a weight matrix M of size ns×nc, where each entry indicates the weight of cell prototypes at a spot. The relationship is formulated by a linear model:
(1)
where E is a noise matrix of dimensions ns×ng. This linear model captures how various cell prototypes contribute to the observed gene read counts across different spots, with Mij=0 indicating no contribution from the jth cell prototype at the ith spot.

In the field of ST cell-deconvolution, many existing linear models, such as those documented by Rodriques et al. (2019), Dong and Yuan (2021), and Elosua-Bayes et al. (2021), typically regress Y onto the cell type. In contrast, our approach regresses Y onto the cell prototype X. Our method (see below) essentially builds on top of least squares objectives to solve the linear model, which differs significantly from the method in Biancalani et al. (2021). That method employs Kullback–Leibler divergence and cosine distance to learn the matrix M, even though it similarly regresses Y on the cell prototypes X.

To estimate the weight matrix M in our method, we employ NNLS (Hastie et al. 2009). Our NNLS estimate M^NNLS of M is defined by:
(2)
The Frobenius norm ||·||F measures model fidelity, λ0 is a ridge regularization parameter, and the constraint M0 requires all entries of the matrix M to be nonnegative. The NNLS objective is convex, allowing projected gradient descent to converge to its global minimum (Boyd and Vandenberghe 2004). To select the regularization parameter λ, we use cross-validation. For each λ0, we split the data by genes, fit the model using the training data, and validate its performance on the test data. More precisely, consider splitting the data matrices X and Y by the genes, i.e. by columns (recalling that each column represents a gene):
Here, Ytr has dimensions ns×ng,tr and Ytest has dimensions ns×ng,test with ng=ng,tr+ng,test. The matrix X is split in the same way, with matching column dimensions to their Y counterparts. We fit the model on the training data:
The performance is then validated on the test data by:

We select the λ that minimizes the MMSE(λ) on the test data. This procedure, here described as a single split, can be directly extended to cross-validation by repeating across multiple folds. In practice, 5-fold cross-validation is often used. To reduce computational burdens, we also suggest using a default λ=0.1 in our package, as empirically it has shown a good performance across a collection of datasets (see Section 4).

The method then calculates the cell-type deconvolution matrix P^NLS using the weight matrix estimator M^NLS. The proportion of cell type k{1,,K} at spot i is calculated as follows:
(3)
where the numerator is the sum of weights of cell type k at spot i, and the denominator is the total sum of weights across all cell types at that spot. This results in P^NNLS, an estimator for the cell-type deconvolution matrix.
In practice, computing M^NNLS, consequentially P^NNLS for large datasets can be challenging due to the complexities of solving the NNLS minimization as described in Equation (2). To enhance computational efficiency, we propose a variant of the NNLS method, called soft-thresholding least squares (SLS). This approach starts with the ordinary least squares (OLS) estimator which has the following closed form without the constraint M0 in (2):
(4)
However, M^OLS does not guarantee nonnegative entries, so we apply the soft thresholding operator, defined by (z)+=max{z,0} for zR, to ensure nonnegativity (Hastie et al. 2009). Specifically, we estimate the proportion of cell type k at a spot i by [cf Equation (3)],
(5)

The soft thresholding ensures all entries of the cell-type deconvolution matrix P^SLS to be nonnegative.

In conclusion, we have developed two distinct cell-type deconvolution matrix estimators, P^NNLS and P^SLS. These two methods offer complementary approaches that are tailored to different computational budgets. In section below, we report our findings on the computational and statistical performance of these two estimators on benchmark ST datasets.

3 Software

We have developed a Python package NLSDeconv, which takes input data of all acceptable formats through scanpy.read function, e.g. h5ad, csv, text, h5, etc. The input data contains (i) the scRNA-seq reference data file, including the read count and metadata with a key indicating cell type, and (ii) the ST data file, including the read count and metadata with spatial location x and y coordinates as columns.

3.1 Preprocess

NLSDeconv provides a data preprocessing function step for scRNA-seq data. The class Preprocessing() requires scRNA-seq read count, ST read count, and the cell-type key for scRNA-seq data. It performs three ordered steps: (i) normalize each cell by total counts over all genes (cellcount_norm) (default option is True), and (ii) remove cell types and their corresponding cells if the number of cells in the cell type is less than a certain number (cellcount_min) (default number is 2), and (iii) perform differential expression analysis between each cell type VS. the rest, and a certain number (gene_top, default number is 200) of top differentially expressed genes are selected from each comparison, the union of which will form the marker gene set (Biancalani et al. 2021). The outputs are the preprocessed ST data and scRNA-seq data.

3.2 Deconvolution

Two functions of cell-type deconvolution are provided. The class Deconv() requires scRNA-seq read count, ST read count, and the cell-type key for scRNA-seq data.

  • .SLS() is the command for performing SLS cell-type deconvolution. The outputs are the estimated cell-type composition matrix P^SLS, algorithm running time, and list of cell types corresponding to the column name of the resulting matrix.

  • .NLS() is the command for performing NNLS cell-type deconvolution. The required argument is learning rate (lr) (default is 0.1). Optional arguments are: ridge regularization parameter or list of parameters (reg), cross-validation fold number for list of parameters (n_fold), whether to use least square estimator as a warm start (warm_start), number of epochs (num_epochs), device for running the algorithm (device). The outputs are the estimated cell-type composition matrix P^NNLS, algorithm running time, and list of cell types corresponding to the column name of the resulting matrix.

3.3 Visualization

We provide two functions for the visualization of deconvolution results. The required inputs are ST data and the two outputs of the previous deconvolution step: cell-type composition matrix P^SLS or P^NNLS and the list of cell types corresponding to the column name of the resulting matrix. There are optional arguments relating to display, allowing users to adjust their figures flexibly. See details in our document and tutorials.

  • overall_plt() is the command for a spatial scatter pie plot displaying inferred cell-type composition on each location.

  • separate_plt() is the command for spatial proportion plots of given cell types.

In Fig. 1, we show example visualization results of SLS on the seqFISH+(10 000) dataset.

Example visualization results of SLS on the seqFISH+ (10 000) dataset. (a) A spatial scatter pie plot displays inferred cell-type composition on spatial location (produced by the command overall_plt()). (b) The proportion of each cell type is displayed on spatial location (produced by the command separate_plt()).
Figure 1.

Example visualization results of SLS on the seqFISH+ (10 000) dataset. (a) A spatial scatter pie plot displays inferred cell-type composition on spatial location (produced by the command overall_plt()). (b) The proportion of each cell type is displayed on spatial location (produced by the command separate_plt()).

4 Results

We assess the performance of proposed methods by comparing them with 18 existing methods on diverse benchmarking datasets reported in the recent review paper (Li et al. 2023). The comprehensive datasets represent two image-based platforms, seqFISH+, MERFISH, and four sequencing-based platforms: ST, 10X Visium (Visium), Slide-seqV2, and stereo-seq. For image-based datasets, as the gene expression profile is at the single-cell level and cell type labels are known, we follow the exact approach by Li et al. (2023) to bin the neighboring cells into spots with different resolutions and use root-mean-square error (RMSE), Jensen–Shannon divergence (JSD) to gauge the accuracy of deconvoluted cell type proportion. For the sequencing-based datasets where the true label is unknown, we also follow Li et al. (2023) to use the Pearson correlation coefficient (PCC) between the deconvoluted cell type proportions within each spot and the expression of the corresponding marker genes of each cell type as the performance metric. Performance of other methods are ported from the Source Data file of Li et al. (2023).

For SLS, it is tuning-free. For NNLS, we provide the result of two ways for selecting λ: NNLS(fixed) for using default λ=0.1, and NNLS(CV) for using cross-validation method to select the best λ from the sequence of λ’s, (0,0.02,0.04,,0.2). We set the learning rate to be 0.01, the number of epochs to be 103, and use a warm start. We apply the codes provided by Li et al. (2023) for performance measures to avoid coding bias. Tables 1 and 2 show the performance of our methods in comparison to other models on the image-based datasets with true labels. In general, our NNLS has the lowest RMSE and JSD, followed by our SLS method, which can be viewed as a fast approximation version of NNLS. For the sequencing-based datasets without true labels, to our surprise, the PCC measure suggests that SLS performs even better than NNLS, ranking first overall out of all methods (Table 3).

Table 1.

RMSE of SLS and NNLS methods compared with 18 existing methods on image-based datasets seqFISH+ and MERFISH: a smaller RMSE indicates a more accurate method.a

seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1090.1130.126*0.1330.1810.2500.152
NNLS(fixed)0.1040.1030.126*0.1090.1650.2430.142
NNLS(CV)0.1050.1000.126*0.1080.1610.2340.139
dtangle0.2010.2090.2020.3310.3670.4290.290
Berglund0.2720.2670.2800.1770.1770.2860.243
CARD0.1260.1290.1410.1880.1880.2640.172
Cell2location0.2420.2430.2400.1090.1090.2470.198
DestVI0.2720.2820.3000.1170.1170.2190.218
DSTG0.2270.2600.2920.2310.2310.3110.259
NMFreg0.2960.2910.2910.1800.1800.3100.258
RCTD0.1250.1300.1310.1840.1840.2780.172
SD20.1790.2060.2090.2100.2100.3450.227
SpatialDecon0.2020.2040.2010.2170.2170.3060.225
SpatialDWLS0.1120.1090.1210.2130.2130.5250.215
SpiceMix0.1890.2490.4570.3900.3900.3910.345
SPOTlight0.2620.2570.2560.1690.1690.3180.239
STdeconvolve0.3430.3250.3260.2200.2200.2470.280
stereoscope0.2230.2260.2280.1930.1930.2740.223
STRIDE0.2600.2730.2320.1240.1240.3720.231
Tangram0.2510.2510.2590.1770.1770.2780.232
SpaOTsc0.2260.2230.2260.1040.1150.1510.174
novoSpaRc0.2270.2270.2250.0960.1240.1620.177
seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1090.1130.126*0.1330.1810.2500.152
NNLS(fixed)0.1040.1030.126*0.1090.1650.2430.142
NNLS(CV)0.1050.1000.126*0.1080.1610.2340.139
dtangle0.2010.2090.2020.3310.3670.4290.290
Berglund0.2720.2670.2800.1770.1770.2860.243
CARD0.1260.1290.1410.1880.1880.2640.172
Cell2location0.2420.2430.2400.1090.1090.2470.198
DestVI0.2720.2820.3000.1170.1170.2190.218
DSTG0.2270.2600.2920.2310.2310.3110.259
NMFreg0.2960.2910.2910.1800.1800.3100.258
RCTD0.1250.1300.1310.1840.1840.2780.172
SD20.1790.2060.2090.2100.2100.3450.227
SpatialDecon0.2020.2040.2010.2170.2170.3060.225
SpatialDWLS0.1120.1090.1210.2130.2130.5250.215
SpiceMix0.1890.2490.4570.3900.3900.3910.345
SPOTlight0.2620.2570.2560.1690.1690.3180.239
STdeconvolve0.3430.3250.3260.2200.2200.2470.280
stereoscope0.2230.2260.2280.1930.1930.2740.223
STRIDE0.2600.2730.2320.1240.1240.3720.231
Tangram0.2510.2510.2590.1770.1770.2780.232
SpaOTsc0.2260.2230.2260.1040.1150.1510.174
novoSpaRc0.2270.2270.2250.0960.1240.1620.177
a

The numbers in the parenthesis of seqFISH+ indicate the numbers of genes that were randomly chosen in the dataset; those of MERFISH indicate the binning sizes (in μm) for generating true labels. The last column is the mean of RMSE on all datasets. The top two methods are labeled in bold font. * means equal RMSE when rounding to three decimal places.

Table 1.

RMSE of SLS and NNLS methods compared with 18 existing methods on image-based datasets seqFISH+ and MERFISH: a smaller RMSE indicates a more accurate method.a

seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1090.1130.126*0.1330.1810.2500.152
NNLS(fixed)0.1040.1030.126*0.1090.1650.2430.142
NNLS(CV)0.1050.1000.126*0.1080.1610.2340.139
dtangle0.2010.2090.2020.3310.3670.4290.290
Berglund0.2720.2670.2800.1770.1770.2860.243
CARD0.1260.1290.1410.1880.1880.2640.172
Cell2location0.2420.2430.2400.1090.1090.2470.198
DestVI0.2720.2820.3000.1170.1170.2190.218
DSTG0.2270.2600.2920.2310.2310.3110.259
NMFreg0.2960.2910.2910.1800.1800.3100.258
RCTD0.1250.1300.1310.1840.1840.2780.172
SD20.1790.2060.2090.2100.2100.3450.227
SpatialDecon0.2020.2040.2010.2170.2170.3060.225
SpatialDWLS0.1120.1090.1210.2130.2130.5250.215
SpiceMix0.1890.2490.4570.3900.3900.3910.345
SPOTlight0.2620.2570.2560.1690.1690.3180.239
STdeconvolve0.3430.3250.3260.2200.2200.2470.280
stereoscope0.2230.2260.2280.1930.1930.2740.223
STRIDE0.2600.2730.2320.1240.1240.3720.231
Tangram0.2510.2510.2590.1770.1770.2780.232
SpaOTsc0.2260.2230.2260.1040.1150.1510.174
novoSpaRc0.2270.2270.2250.0960.1240.1620.177
seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1090.1130.126*0.1330.1810.2500.152
NNLS(fixed)0.1040.1030.126*0.1090.1650.2430.142
NNLS(CV)0.1050.1000.126*0.1080.1610.2340.139
dtangle0.2010.2090.2020.3310.3670.4290.290
Berglund0.2720.2670.2800.1770.1770.2860.243
CARD0.1260.1290.1410.1880.1880.2640.172
Cell2location0.2420.2430.2400.1090.1090.2470.198
DestVI0.2720.2820.3000.1170.1170.2190.218
DSTG0.2270.2600.2920.2310.2310.3110.259
NMFreg0.2960.2910.2910.1800.1800.3100.258
RCTD0.1250.1300.1310.1840.1840.2780.172
SD20.1790.2060.2090.2100.2100.3450.227
SpatialDecon0.2020.2040.2010.2170.2170.3060.225
SpatialDWLS0.1120.1090.1210.2130.2130.5250.215
SpiceMix0.1890.2490.4570.3900.3900.3910.345
SPOTlight0.2620.2570.2560.1690.1690.3180.239
STdeconvolve0.3430.3250.3260.2200.2200.2470.280
stereoscope0.2230.2260.2280.1930.1930.2740.223
STRIDE0.2600.2730.2320.1240.1240.3720.231
Tangram0.2510.2510.2590.1770.1770.2780.232
SpaOTsc0.2260.2230.2260.1040.1150.1510.174
novoSpaRc0.2270.2270.2250.0960.1240.1620.177
a

The numbers in the parenthesis of seqFISH+ indicate the numbers of genes that were randomly chosen in the dataset; those of MERFISH indicate the binning sizes (in μm) for generating true labels. The last column is the mean of RMSE on all datasets. The top two methods are labeled in bold font. * means equal RMSE when rounding to three decimal places.

Table 2.

JSD of SLS and NNLS methods compared with 18 existing methods on image-based datasets seqFISH+ and MERFISH: a smaller JSD indicates a more accurate method.a

seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1170.0970.118*0.0930.1850.2960.151
NNLS(fixed)0.0980.1070.1290.0780.1720.2700.142
NNLS(CV)0.1000.1020.118*0.0770.1680.2510.136
dtangle0.1480.1560.1420.4750.5080.6550.347
Berglund0.3800.3730.4250.1690.1690.1720.281
CARD0.1420.1400.1570.2250.2250.3190.201
Cell2location0.3240.3310.3190.0820.0820.3000.240
DestVI0.4430.4740.5190.0890.0890.1720.298
DSTG0.3100.3630.4290.2580.2580.3530.329
NMFreg0.4890.4780.4790.1850.1850.4570.379
RCTD0.1430.1450.1400.1560.1560.2420.164
SD20.2160.2610.2770.2210.2210.5280.288
SpatialDecon0.2750.2790.2700.2360.2360.2820.263
SpatialDWLS0.0640.0730.0960.2150.2151.0000.277
SpiceMix0.2410.3520.8270.6870.6870.4730.545
SPOTlight0.3870.3940.3860.1770.1770.4980.336
STdeconvolve0.5440.4770.5160.2670.2670.0650.356
stereoscope0.2700.2600.2630.1890.1890.2130.231
STRIDE0.3630.4010.3150.0880.0880.6590.319
Tangram0.3640.3560.3720.1410.1410.3190.282
SpaOTsc0.6480.6380.6540.1670.2680.4620.473
novoSpaRc0.6470.6260.6330.2330.3520.4910.497
seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1170.0970.118*0.0930.1850.2960.151
NNLS(fixed)0.0980.1070.1290.0780.1720.2700.142
NNLS(CV)0.1000.1020.118*0.0770.1680.2510.136
dtangle0.1480.1560.1420.4750.5080.6550.347
Berglund0.3800.3730.4250.1690.1690.1720.281
CARD0.1420.1400.1570.2250.2250.3190.201
Cell2location0.3240.3310.3190.0820.0820.3000.240
DestVI0.4430.4740.5190.0890.0890.1720.298
DSTG0.3100.3630.4290.2580.2580.3530.329
NMFreg0.4890.4780.4790.1850.1850.4570.379
RCTD0.1430.1450.1400.1560.1560.2420.164
SD20.2160.2610.2770.2210.2210.5280.288
SpatialDecon0.2750.2790.2700.2360.2360.2820.263
SpatialDWLS0.0640.0730.0960.2150.2151.0000.277
SpiceMix0.2410.3520.8270.6870.6870.4730.545
SPOTlight0.3870.3940.3860.1770.1770.4980.336
STdeconvolve0.5440.4770.5160.2670.2670.0650.356
stereoscope0.2700.2600.2630.1890.1890.2130.231
STRIDE0.3630.4010.3150.0880.0880.6590.319
Tangram0.3640.3560.3720.1410.1410.3190.282
SpaOTsc0.6480.6380.6540.1670.2680.4620.473
novoSpaRc0.6470.6260.6330.2330.3520.4910.497
a

The numbers in the parenthesis of seqFISH+ indicate the numbers of genes that were randomly chosen in the dataset; those of MERFISH indicate the binning sizes (in μm) for generating true labels. The last column shows the mean JSD on all datasets. The top two methods are labeled in bold font. * means equal RMSE when rounding to three decimal places.

Table 2.

JSD of SLS and NNLS methods compared with 18 existing methods on image-based datasets seqFISH+ and MERFISH: a smaller JSD indicates a more accurate method.a

seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1170.0970.118*0.0930.1850.2960.151
NNLS(fixed)0.0980.1070.1290.0780.1720.2700.142
NNLS(CV)0.1000.1020.118*0.0770.1680.2510.136
dtangle0.1480.1560.1420.4750.5080.6550.347
Berglund0.3800.3730.4250.1690.1690.1720.281
CARD0.1420.1400.1570.2250.2250.3190.201
Cell2location0.3240.3310.3190.0820.0820.3000.240
DestVI0.4430.4740.5190.0890.0890.1720.298
DSTG0.3100.3630.4290.2580.2580.3530.329
NMFreg0.4890.4780.4790.1850.1850.4570.379
RCTD0.1430.1450.1400.1560.1560.2420.164
SD20.2160.2610.2770.2210.2210.5280.288
SpatialDecon0.2750.2790.2700.2360.2360.2820.263
SpatialDWLS0.0640.0730.0960.2150.2151.0000.277
SpiceMix0.2410.3520.8270.6870.6870.4730.545
SPOTlight0.3870.3940.3860.1770.1770.4980.336
STdeconvolve0.5440.4770.5160.2670.2670.0650.356
stereoscope0.2700.2600.2630.1890.1890.2130.231
STRIDE0.3630.4010.3150.0880.0880.6590.319
Tangram0.3640.3560.3720.1410.1410.3190.282
SpaOTsc0.6480.6380.6540.1670.2680.4620.473
novoSpaRc0.6470.6260.6330.2330.3520.4910.497
seqFISH+
MERFISH
(10 000)(6000)(3000)(100)(50)(20)Mean
SLS0.1170.0970.118*0.0930.1850.2960.151
NNLS(fixed)0.0980.1070.1290.0780.1720.2700.142
NNLS(CV)0.1000.1020.118*0.0770.1680.2510.136
dtangle0.1480.1560.1420.4750.5080.6550.347
Berglund0.3800.3730.4250.1690.1690.1720.281
CARD0.1420.1400.1570.2250.2250.3190.201
Cell2location0.3240.3310.3190.0820.0820.3000.240
DestVI0.4430.4740.5190.0890.0890.1720.298
DSTG0.3100.3630.4290.2580.2580.3530.329
NMFreg0.4890.4780.4790.1850.1850.4570.379
RCTD0.1430.1450.1400.1560.1560.2420.164
SD20.2160.2610.2770.2210.2210.5280.288
SpatialDecon0.2750.2790.2700.2360.2360.2820.263
SpatialDWLS0.0640.0730.0960.2150.2151.0000.277
SpiceMix0.2410.3520.8270.6870.6870.4730.545
SPOTlight0.3870.3940.3860.1770.1770.4980.336
STdeconvolve0.5440.4770.5160.2670.2670.0650.356
stereoscope0.2700.2600.2630.1890.1890.2130.231
STRIDE0.3630.4010.3150.0880.0880.6590.319
Tangram0.3640.3560.3720.1410.1410.3190.282
SpaOTsc0.6480.6380.6540.1670.2680.4620.473
novoSpaRc0.6470.6260.6330.2330.3520.4910.497
a

The numbers in the parenthesis of seqFISH+ indicate the numbers of genes that were randomly chosen in the dataset; those of MERFISH indicate the binning sizes (in μm) for generating true labels. The last column shows the mean JSD on all datasets. The top two methods are labeled in bold font. * means equal RMSE when rounding to three decimal places.

Table 3.

Mean PCC of pairs of deconvoluted proportion of cell types and their corresponding marker genes from SLS and NNLS methods in comparison with 18 existing methods on sequencing-based datasets ST, Visium, Slide-seqV2, and stereo-seq: a larger PCC indicates a more accurate method.a

STVisiumSlide-seqV2
stereo-seq
(11)(17)(gene)(spot)Mean
SLS0.2410.7320.3450.3330.3920.3030.391
NNLS(fixed)0.1360.5030.2260.1900.2440.4700.295
NNLS(CV)0.1640.5140.3940.2860.3220.4720.359
Berglund0.0630.5530.1950.1950.5200.0260.259
CARD0.1170.7300.2500.2000.2330.4710.333
Cell2location0.2090.7930.3050.3250.5930.0180.374
DestVI0.0810.642−0.055−0.0700.5370.4490.264
DSTG0.0380.6610.1100.1030.3800.2360.255
NMFreg−0.1030.4960.1900.2430.5330.4000.293
novoSpaRc−0.2450.0690.1800.2850.1000.3350.121
RCTD0.1600.0110.3050.3200.6130.4620.312
SD20.0330.5850.0750.1000.4100.1490.225
SpaOTsc−0.0090.5900.1500.0980.2330.2630.221
SpatialDecon0.2950.0180.3200.3250.4500.5060.319
SpatialDWLS−0.103−0.008−0.060−0.025−0.070−0.021−0.048
SpiceMix−0.0550.540−0.065−0.0400.5730.2730.204
SPOTlight0.1250.7500.2700.2330.3600.4330.362
STdeconvolve−0.0300.7600.3000.2130.6030.1330.330
stereoscope0.1460.6680.2600.2550.5270.4360.382
STRIDE−0.0620.6420.2350.2050.5430.4340.333
Tangram0.0710.7430.2900.3650.5100.3600.390
STVisiumSlide-seqV2
stereo-seq
(11)(17)(gene)(spot)Mean
SLS0.2410.7320.3450.3330.3920.3030.391
NNLS(fixed)0.1360.5030.2260.1900.2440.4700.295
NNLS(CV)0.1640.5140.3940.2860.3220.4720.359
Berglund0.0630.5530.1950.1950.5200.0260.259
CARD0.1170.7300.2500.2000.2330.4710.333
Cell2location0.2090.7930.3050.3250.5930.0180.374
DestVI0.0810.642−0.055−0.0700.5370.4490.264
DSTG0.0380.6610.1100.1030.3800.2360.255
NMFreg−0.1030.4960.1900.2430.5330.4000.293
novoSpaRc−0.2450.0690.1800.2850.1000.3350.121
RCTD0.1600.0110.3050.3200.6130.4620.312
SD20.0330.5850.0750.1000.4100.1490.225
SpaOTsc−0.0090.5900.1500.0980.2330.2630.221
SpatialDecon0.2950.0180.3200.3250.4500.5060.319
SpatialDWLS−0.103−0.008−0.060−0.025−0.070−0.021−0.048
SpiceMix−0.0550.540−0.065−0.0400.5730.2730.204
SPOTlight0.1250.7500.2700.2330.3600.4330.362
STdeconvolve−0.0300.7600.3000.2130.6030.1330.330
stereoscope0.1460.6680.2600.2550.5270.4360.382
STRIDE−0.0620.6420.2350.2050.5430.4340.333
Tangram0.0710.7430.2900.3650.5100.3600.390
a

The numbers in the parenthesis of Slide-seqV2 indicate the cell-type numbers of the dataset after the sub-cell types are integrated and the original dataset; the names in the parenthesis of stereo-seq indicate the original dataset and integrated subcellular-resolution dataset. The last column is the overall mean of PCC on all datasets. The top two methods are labeled in bold font.

Table 3.

Mean PCC of pairs of deconvoluted proportion of cell types and their corresponding marker genes from SLS and NNLS methods in comparison with 18 existing methods on sequencing-based datasets ST, Visium, Slide-seqV2, and stereo-seq: a larger PCC indicates a more accurate method.a

STVisiumSlide-seqV2
stereo-seq
(11)(17)(gene)(spot)Mean
SLS0.2410.7320.3450.3330.3920.3030.391
NNLS(fixed)0.1360.5030.2260.1900.2440.4700.295
NNLS(CV)0.1640.5140.3940.2860.3220.4720.359
Berglund0.0630.5530.1950.1950.5200.0260.259
CARD0.1170.7300.2500.2000.2330.4710.333
Cell2location0.2090.7930.3050.3250.5930.0180.374
DestVI0.0810.642−0.055−0.0700.5370.4490.264
DSTG0.0380.6610.1100.1030.3800.2360.255
NMFreg−0.1030.4960.1900.2430.5330.4000.293
novoSpaRc−0.2450.0690.1800.2850.1000.3350.121
RCTD0.1600.0110.3050.3200.6130.4620.312
SD20.0330.5850.0750.1000.4100.1490.225
SpaOTsc−0.0090.5900.1500.0980.2330.2630.221
SpatialDecon0.2950.0180.3200.3250.4500.5060.319
SpatialDWLS−0.103−0.008−0.060−0.025−0.070−0.021−0.048
SpiceMix−0.0550.540−0.065−0.0400.5730.2730.204
SPOTlight0.1250.7500.2700.2330.3600.4330.362
STdeconvolve−0.0300.7600.3000.2130.6030.1330.330
stereoscope0.1460.6680.2600.2550.5270.4360.382
STRIDE−0.0620.6420.2350.2050.5430.4340.333
Tangram0.0710.7430.2900.3650.5100.3600.390
STVisiumSlide-seqV2
stereo-seq
(11)(17)(gene)(spot)Mean
SLS0.2410.7320.3450.3330.3920.3030.391
NNLS(fixed)0.1360.5030.2260.1900.2440.4700.295
NNLS(CV)0.1640.5140.3940.2860.3220.4720.359
Berglund0.0630.5530.1950.1950.5200.0260.259
CARD0.1170.7300.2500.2000.2330.4710.333
Cell2location0.2090.7930.3050.3250.5930.0180.374
DestVI0.0810.642−0.055−0.0700.5370.4490.264
DSTG0.0380.6610.1100.1030.3800.2360.255
NMFreg−0.1030.4960.1900.2430.5330.4000.293
novoSpaRc−0.2450.0690.1800.2850.1000.3350.121
RCTD0.1600.0110.3050.3200.6130.4620.312
SD20.0330.5850.0750.1000.4100.1490.225
SpaOTsc−0.0090.5900.1500.0980.2330.2630.221
SpatialDecon0.2950.0180.3200.3250.4500.5060.319
SpatialDWLS−0.103−0.008−0.060−0.025−0.070−0.021−0.048
SpiceMix−0.0550.540−0.065−0.0400.5730.2730.204
SPOTlight0.1250.7500.2700.2330.3600.4330.362
STdeconvolve−0.0300.7600.3000.2130.6030.1330.330
stereoscope0.1460.6680.2600.2550.5270.4360.382
STRIDE−0.0620.6420.2350.2050.5430.4340.333
Tangram0.0710.7430.2900.3650.5100.3600.390
a

The numbers in the parenthesis of Slide-seqV2 indicate the cell-type numbers of the dataset after the sub-cell types are integrated and the original dataset; the names in the parenthesis of stereo-seq indicate the original dataset and integrated subcellular-resolution dataset. The last column is the overall mean of PCC on all datasets. The top two methods are labeled in bold font.

One appealing feature of the SLS method is its computing efficiency. For a comparison, we pick three methods that were shown relatively more efficient in Li et al. (2023), including Tangram (Biancalani et al. 2021), RCTD (Cable et al. 2022), and SpatialDecon (Danaher et al. 2022). We consider benchmarking on two example datasets: seqFISH+ (10 000), which represents datasets with large gene numbers and small spot numbers, and MERFISH (20), which represents datasets with large spot numbers and small gene numbers. We run SLS and the other three methods on a cluster node with Intel(R) Xeon(R) Gold 6338 CPU @ 2.0 GHz and 256 GB DDR4 2666 MHz Memory with allocation of 1 CPU core and 100 GB memory. We run NNLS on Google Colab T4 GPU. The computing time is shown in Table 4. We note that Tangram is also computationally efficient. Nevertheless, its performance is less competitive. In conclusion, we recommend SLS for users with limited computation resources and NNLS for those with high accuracy standards and a demand for code flexibility.

Table 4.

Running time (seconds) on seqFISH+(10000) and MERFISH(20) datasets.a

seqFISH+MERFISH
(10 000)(20)
SLS2.16.5
NNLS(fixed)48.917.2
NNLS(CV)59.82496.7
Tangram4.056.3
RCTD87.01178.5
SpatialDecon10.492.4
seqFISH+MERFISH
(10 000)(20)
SLS2.16.5
NNLS(fixed)48.917.2
NNLS(CV)59.82496.7
Tangram4.056.3
RCTD87.01178.5
SpatialDecon10.492.4
a

We show the top two methods in bold.

Table 4.

Running time (seconds) on seqFISH+(10000) and MERFISH(20) datasets.a

seqFISH+MERFISH
(10 000)(20)
SLS2.16.5
NNLS(fixed)48.917.2
NNLS(CV)59.82496.7
Tangram4.056.3
RCTD87.01178.5
SpatialDecon10.492.4
seqFISH+MERFISH
(10 000)(20)
SLS2.16.5
NNLS(fixed)48.917.2
NNLS(CV)59.82496.7
Tangram4.056.3
RCTD87.01178.5
SpatialDecon10.492.4
a

We show the top two methods in bold.

Acknowledgements

Y.C. thanks Hangyu Lin and Chenhao Zhang for helpful discussion on this project.

Author contributions

Yunlu Chen, Feng Ruan, and Ji-Ping Wang conceived the experiments. Yunlu Chen developed the model and analyzed the results. Yunlu Chen, Feng Ruan, and Ji-Ping Wang wrote the manuscript.

Conflict of interest: None declared.

Funding

None declared.

Data availability

Example data of seqFISH+ available on: https://github.com/tinachentc/NLSDeconv.

All the benchmark datasets are available on: https://zenodo.org/records/7674290.

References

Andersson
A
,
Bergenstråhle
J
,
Asp
M
et al.
Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography
.
Commun Biol
2020
;
3
:
565
.

Biancalani
T
,
Scalia
G
,
Buffoni
L
et al.
Deep learning and alignment of spatially resolved single-cell transcriptomes with tangram
.
Nat Methods
2021
;
18
:
1352
62
.

Boyd
S
,
Vandenberghe
L.
Convex Optimization
. Cambridge, UK:
Cambridge University Press
,
2004
.

Cable
DM
,
Murray
E
,
Zou
LS
et al.
Robust decomposition of cell type mixtures in spatial transcriptomics
.
Nat Biotechnol
2022
;
40
:
517
26
.

Chen
J
,
Liu
W
,
Luo
T
et al.
A comprehensive comparison on cell-type composition inference for spatial transcriptomics data
.
Brief Bioinform
2022
;
23
:
bbac245
.

Danaher
P
,
Kim
Y
,
Nelson
B
et al.
Advances in mixed cell deconvolution enable quantification of cell types in spatial transcriptomic data
.
Nat Commun
2022
;
13
:
385
.

Dong
R
,
Yuan
G-C.
Spatialdwls: accurate deconvolution of spatial transcriptomic data
.
Genome Biol
2021
;
22
:
145
.

Elosua-Bayes
M
,
Nieto
P
,
Mereu
E
et al.
Spotlight: seeded NMF regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes
.
Nucleic Acids Res
2021
;
49
:
e50
.

Hastie
T
,
Tibshirani
R
,
Friedman
JH
et al.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
, Vol. 2. New York, NY:
Springer
,
2009
.

Hunt
GJ
,
Freytag
S
,
Bahlo
M
et al.
dtangle: accurate and robust cell type deconvolution
.
Bioinformatics
2019
;
35
:
2093
9
.

Kleshchevnikov
V
,
Shmatko
A
,
Dann
E
et al.
Cell2location maps fine-grained cell types in spatial transcriptomics
.
Nat Biotechnol
2022
;
40
:
661
71
.

Li
H
,
Zhou
J
,
Li
Z
et al.
A comprehensive benchmarking with practical guidelines for cellular deconvolution of spatial transcriptomics
.
Nat Commun
2023
;
14
:
1548
.

Li
Z
,
Wu
H.
TOAST: improving reference-free cell composition estimation by cross-cell type differential analysis
.
Genome Biol
2019
;
20
:
190
.

Lopez
R
,
Li
B
,
Keren-Shaul
H
et al.
Destvi identifies continuums of cell types in spatial transcriptomics data
.
Nat Biotechnol
2022
;
40
:
1360
9
.

Ma
Y
,
Zhou
X.
Spatially informed cell-type deconvolution for spatial transcriptomics
.
Nat Biotechnol
2022
;
40
:
1349
59
.

Marx
V.
Method of the year: spatially resolved transcriptomics
.
Nat Methods
2021
;
18
:
9
14
.

Moses
L
,
Pachter
L.
Museum of spatial transcriptomics
.
Nat Methods
2022
;
19
:
534
46
.

Newman
AM
,
Liu
CL
,
Green
MR
et al.
Robust enumeration of cell subsets from tissue expression profiles
.
Nat Methods
2015
;
12
:
453
7
.

Rodriques
SG
,
Stickels
RR
,
Goeva
A
et al.
Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution
.
Science
2019
;
363
:
1463
7
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Christina Kendziorski
Christina Kendziorski
Associate Editor
Search for other works by this author on: