DIST: spatial transcriptomics enhancement using deep learning

Zhao, Yanping; Wang, Kui; Hu, Gang

doi:10.1093/bib/bbad013

Abstract

Spatially resolved transcriptomics technologies enable comprehensive measurement of gene expression patterns in the context of intact tissues. However, existing technologies suffer from either low resolution or shallow sequencing depth. Here, we present DIST, a deep learning-based method that imputes the gene expression profiles on unmeasured locations and enhances the gene expression for both original measured spots and imputed spots by self-supervised learning and transfer learning. We evaluate the performance of DIST for imputation, clustering, differential expression analysis and functional enrichment analysis. The results show that DIST can impute gene expression accurately, enhance the gene expression of low-quality data and help detect more biologically meaningful differentially expressed genes and pathways, thereby allowing deeper insights into the biological processes.

spatial transcriptomics, imputation, denoising, self-supervised learning, transfer learning

Introduction

Spatial transcriptomics (ST) is a recent technological innovation that enables the measurement of gene expression with spatial information in tissues. Knowledge of the relative locations of transcripts is critical for understanding biological, physiological, and pathological processes, and therefore, ST technologies have been used in various biological fields, such as tumor microenvironment [1, 2], embryonic development [3, 4] and neurology [5, 6]. ST technologies can be primarily classified into two categories: (1) next-generation sequencing (NGS)-based approaches, such as ST [7] and 10X Genomics Visium [8], encoding positional information onto transcripts before sequencing and (2) imaging-based approaches, comprising in situ sequencing-based methods, represented by FISSEQ [9] and STARmap [10], in which transcripts are amplified and sequenced in the tissue, and in situ hybridization-based methods, represented by MerFISH [11] and seqFISH [12], in which imaging probes are sequentially hybridized in the tissue [13].

There is generally a trade-off between resolution and gene throughput in these technologies. Imaging-based approaches provide increased resolution and sensitivity, but have lower throughput, limiting their potential to explore transcriptome-wide interactions and discover new sequences. NGS-based methods are unbiased, as they can in principle sample the whole transcriptome, but have lower resolution and sensitivity [13–15]. More recently, the resolution of NGS-based methods has rapidly increased, for instance, DBiT-seq [16] reaches ~10 μm resolution and Stereo-seq [17] achieves nanoscale resolution (220 nm). While these NGS-based methods measure the expression level for thousands of genes in captured locations, referred to as spots, they suffer from incomplete spatial coverage, limiting their usefulness in studying detailed expression patterns. ST has 100 μm spot diameter with 200 μm center-to-center distance and Visium has 55 μm spot diameter with 100 μm center-to-center distance. The theoretical center-to-center distance of DBiT-seq can be as small as 2 μm because of the diffusion distance that is ~1 μm. Standard DNA nanoball chips of Stereo-seq have spots with ~220 nm diameter and a center-to-center distance of 500 or 715 nm.

The diameter and density of measured spots both limit the spatial resolution of current NGS-based methods, ranging from multicellular to subcellular [14]. To get high-resolution (HR) spatial expression profiles, new methods have been developed. For example, BayesSpace [18] employs a Bayesian method that uses the information from spatial neighborhoods for resolution enhancement, but BayesSpace improves the resolution by inferring the expression of sub-spots and cannot predict the expression on unmeasured locations; XFuse [15] uses a deep generative model trained by spatial expression data and corresponding histology images to infer super-resolved expression maps; HisToGene [19] is a deep learning method adopting Vision Transformer for gene expression prediction from histology images; SpatialPCA [20] extracts a low-dimensional representation of normalized and scaled ST data leveraging modified probabilistic principal component (PC) analysis, and can impute spatial PCs on unmeasured spots because of the data generative nature.

However, these existing methods either rely on histology images with obvious texture, or use the original expression information inefficiently. Some only produce HR normalized or scaled data and fail to account for the mean-variance relationship that exists in raw counts, leading to a potential loss of power [21]. Here, we mainly focus on NGS-based ST techniques with non-dense spatial coverage. Considering the successful applications of deep learning to ST [22–24] and the natural connection between spatially resolved gene expression and images, and inspired by learning-based image processing [25, 26], we propose DIST, a method that imputes gene expression profiles on new and unmeasured spots using only spatial gene expression data to obtain a refined spatial map with a higher resolution. Unlike XFuse and HisToGene, DIST enhances the ST data using deep learning and does not need any additional information such as the histology image. In addition, despite improvements in NGS-based ST technologies, various technical factors including amplification bias and especially low RNA capture rate lead to substantial noise in sequencing [27]. The challenge is even greater for technologies with relatively high resolution because of relatively shallow sequencing. Noise caused by amplification and dropout events may obstruct analyses and corrupt the underlying biological signals. To address this problem, DIST can denoise imputed spatial gene expression by learning from high-quality data. Overall, the name DIST stands for denoising and imputing spatial transcriptomics. We illustrate the benefits of DIST through simulations based on a STARmap data set and a Stereo-seq data set and by applying it to two real-world data sets obtained with the ST and 10X Visium platforms, respectively. We show that DIST accomplishes the imputation task more accurately than conventional interpolation methods while revealing more biological signals.

Methods

Overview of DIST

DIST mainly focuses on array-based ST techniques. The measured spots of these techniques are arranged in regular patterns, which generally fall into two classes: matrix arrangements, as in ST, and honeycomb arrangements, as in Visium. DIST predicts an unmeasured spot between every two adjacent measured spots (Figure 1A).

Figure 1

The architecture of DIST. (A) Sketch of imputation and down-sampling on platforms with matrix (above) and honeycomb (below) arrangements. Impute: predicting a new spot between every two adjacent spots. Down-sample: extracting nonadjacent spots regularly at intervals of one spot. (B) Self-supervised learning framework. Down-sample: down-sampling the original gene expression maps and creating artificial lower resolution gene expression maps to form LR-HR gene expression pairs. Training process: training an interpolation deep neural network, VLN, in a supervised manner using down-sampled LR gene expression as inputs and the original gene expression as labels. Prediction process: feeding the original gene expression maps into the trained model to output higher-resolution gene expression. (C) The network architecture of VLN.

The core idea of DIST is to impute the gene expression of unmeasured spots by self-supervised learning. Since we do not have low-resolution (LR) and HR gene expression pairs for training, we down-sample the original gene expression maps and create artificial LR gene expression maps to train the model. As shown in Figure 1B and C, DIST first trains an interpolation deep neural network, the variation learning network (VLN) [26], in a supervised manner using down-sampled LR gene expression as inputs and the original gene expression as HR labels. The outputs are imputed gene expression maps with the same resolution as the original data. Then, in the prediction phase (or imputation phase), the original gene expression of all spots is fed into the trained model to output HR gene expression. In other words, DIST first learns how to impute unmeasured gene expression from pairs of down-sampled LR and original gene expression maps and then applies the learned rules to impute the unmeasured gene expression on the original data. This architecture allows DIST to enhance ST data in several ways. Besides imputing unmeasured spots by self-supervised learning, DIST can improve the imputation of low-quality ST data by transfer learning from a high-quality ST data set and can denoise the imputed expression by adding synthetic noise to the inputs of the training set.

DIST method

DIST performs imputation using only a gene expression matrix with spatial coordinates and works in any data space, such as gene counts, normalized gene counts and other embeddings. In this paper, we mainly analyzed the gene count matrix.

We consider an ST study that collects gene expression matrix |${X}_{m\times n}$| for |$n$| genes on |$m$| spots. These spots have known spatial coordinates, such as |$\left({x}_i,{y}_i\right)\in{\mathbb{Z}}^2$| for the |$i\mathrm{th}$| spot. All spots’ coordinates are combined into a coordinate matrix |${C}_{m\times 2}$|, whose |$i\mathrm{th}$| row and the |$i\mathrm{th}$| row of |${X}_{m\times n}$| point to the same spot. The |$j\mathrm{th}$| column of |${X}_{m\times n}$|, denoted by |${X}^j$|, is an |$m$|-vector that represents the |$j\mathrm{th}$| gene expression across |$m$| spots. The values of |${X}^j$| can be arranged by their corresponding spatial coordinates and transformed to a matrix |${I}^j$| with shape |$\left(u,v\right)$|. In detail, the value on the |${x}_i\mathrm{th}$| row and |${y}_i\mathrm{th}$| column of |${I}^j$| is the |$i\mathrm{th}$| element of |${X}^j$|. We use an in-tissue matrix |$M$|, having the same shape as |${I}^j$| and containing only zeros and ones, to indicate whether there is a spot. For example, if the value on the |$p\mathrm{th}$| row and |$q\mathrm{th}$| column of |$M$| equals 1, there is a spot with the corresponding coordinate |$\left(p,q\right)$|; if it equals 0, there is not. Here, |${I}^j$| can be considered as an expression map, and we denote the map set by |$I=\left\{{I}^1,\cdots, {I}^n\right\}$|. We use function |$T\left(\cdot \right)$| to represent the transformation from spatial gene expression to a map. Functionally,

$${I}^j,M=T\left({X}^j,C\right),j=1,\cdots, n.$$

Missing values usually exist when gene expression is transformed to an expression map, because of the irregular shape of the tissue section and the filtering of spots with poor quality. These missing values are simply set to 0. We do not consider out-tissue spots where there are no cells at all. Therefore, our method focuses on in-tissue spots and recovers the details inside the tissue section.
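The transformation |$T\left(\cdot \right)$| described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `to_expression_map` is hypothetical, the grid shape is passed in explicitly, and coordinates are assumed to be 0-based (row, column) pairs rather than the 1-based indices used in the text.

```python
from typing import List, Sequence, Tuple

Map = List[List[float]]


def to_expression_map(
    values: Sequence[float],
    coords: Sequence[Tuple[int, int]],
    shape: Tuple[int, int],
) -> Tuple[Map, List[List[int]]]:
    """Place one gene's per-spot values on a (u, v) grid and return (I_j, M).

    Assumption: coords are 0-based (row, column) pairs that fit inside `shape`.
    """
    u, v = shape
    expr_map = [[0.0] * v for _ in range(u)]  # missing values are set to 0
    mask = [[0] * v for _ in range(u)]        # 1 where a measured spot exists
    for value, (row, col) in zip(values, coords):
        expr_map[row][col] = float(value)
        mask[row][col] = 1
    return expr_map, mask


# Three spots of one gene on a 2 x 3 grid.
I_j, M = to_expression_map([5, 2, 7], [(0, 0), (0, 2), (1, 1)], shape=(2, 3))
```

Grid positions without a spot keep the value 0 in the map and 0 in the in-tissue matrix, so the two can later be told apart from measured zeros only through the mask.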

DIST performs spatial expression imputation by learning-based interpolation. The interpolation part of DIST is a VLN that estimates the variation value between the HR value and its nearest LR value [26] (Figure 1C). We train VLN on the data set |${I}_{\mathrm{train}}=\left\{{I}_{\mathrm{train}}^1,\cdots, {I}_{\mathrm{train}}^{n_{\mathrm{train}}}\right\}$|, which is constructed from gene expression matrix |${X}_{\mathrm{train}}$| and coordinate matrix |${C}_{\mathrm{train}}$|, and use |${I}_{\mathrm{test}}=\left\{{I}_{\mathrm{test}}^1,\cdots, {I}_{\mathrm{test}}^{n_{\mathrm{test}}}\right\}$| constructed from gene expression matrix |${X}_{\mathrm{test}}$| and coordinate matrix |${C}_{\mathrm{test}}$| at the prediction or testing phase:

$${I}_{\mathrm{train}}^j,{M}_{\mathrm{train}}=T\left({X}_{\mathrm{train}}^j,{C}_{\mathrm{train}}\right),j=1,\cdots, {n}_{\mathrm{train}},$$

$${I}_{\mathrm{test}}^j,{M}_{\mathrm{test}}=T\left({X}_{\mathrm{test}}^j,{C}_{\mathrm{test}}\right),j=1,\cdots, {n}_{\mathrm{test}},$$

where |${n}_{\mathrm{train}}$| and |${n}_{\mathrm{test}}$| denote the gene numbers of the training and test data set, respectively, |${I}_{\mathrm{train}}^j,{M}_{\mathrm{train}}$| have the same shape |$\left({u}_{\mathrm{train}},{v}_{\mathrm{train}}\right)$| and |${I}_{\mathrm{test}}^j,{M}_{\mathrm{test}}$| have the same shape |$\left({u}_{\mathrm{test}},{v}_{\mathrm{test}}\right)$|.

During the training phase, DIST creates |${n}_{\mathrm{train}}$| pairs of LR and HR expression maps |${\{({L}_{\mathrm{train}}^j,{I}_{\mathrm{train}}^j)\}}_{j=1}^{n_{\mathrm{train}}}$| to train VLN, where |${L}_{\mathrm{train}}^j$| is extracted from |${I}_{\mathrm{train}}^j$| by down-sampling (Figure 1A). This process can be denoted as follows:

$${L}_{\mathrm{train}}^j\left(p,q\right)={I}_{\mathrm{train}}^j\left(2p-1,2q-1\right),$$

$$p=1,2,\cdots, \left[\frac{u_{\mathrm{train}}}{2}\right],q=1,2,\cdots, \left[\frac{v_{\mathrm{train}}}{2}\right],j=1,\cdots, {n}_{\mathrm{train}}.$$

We denote the down-sampling function as |$D\left(\cdot \right)$|⁠, so functionally

$${L}_{\mathrm{train}}^j=D\left({I}_{\mathrm{train}}^j\right),j=1,\cdots, {n}_{\mathrm{train}}.$$

The shape of |${L}_{\mathrm{train}}^j$| is |$\left(\left[{u}_{\mathrm{train}}/2\right],\left[{v}_{\mathrm{train}}/2\right]\right)$|, where |$\left[\cdot \right]$| means the rounding operation.
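As a rough illustration of the down-sampling function |$D\left(\cdot \right)$|, the following Python sketch keeps every second row and column, starting from the first. The function name `downsample` is hypothetical, indices are 0-based, and |$\left[\cdot \right]$| is assumed here to round down; for even dimensions that choice makes no difference.

```python
from typing import List


def downsample(expr_map: List[List[float]]) -> List[List[float]]:
    """Keep the top-left value of every 2 x 2 patch.

    With 0-based indices, L(p, q) = I(2p, 2q), which matches the 1-based
    rule L(p, q) = I(2p - 1, 2q - 1). Assumption: odd trailing rows or
    columns are dropped (floor division).
    """
    u, v = len(expr_map), len(expr_map[0])
    return [
        [expr_map[2 * p][2 * q] for q in range(v // 2)]
        for p in range(u // 2)
    ]


hr = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
lr = downsample(hr)  # 2 x 2 map built from rows 0, 2 and columns 0, 2
```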

Let |$F\left(\cdot \right)$| represent the network to recover the HR |${I}_{\mathrm{train}}^j$| from the input LR |${L}_{\mathrm{train}}^j$|⁠. Let |$\Theta$| denote all the parameters of the network. We adopt the following mean square error to train the VLN:

$$L\left(\Theta \right)=\frac{1}{n_{\mathrm{train}}}\sum_{j=1}^{n_{\mathrm{train}}}\big\Vert \left(F\left({L}_{\mathrm{train}}^j;\Theta \right)-{I}_{\mathrm{train}}^j\right)\cdot{M}_{\mathrm{train}}\big\Vert^2.$$

We use ADAM optimizer to minimize the loss via mini-batch gradient descent.
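The masked loss can be made concrete with a small framework-free sketch. The function `masked_mse` below is a hypothetical helper, not the authors' code; it assumes the predictions and labels are given as nested lists of equal shape and that a single in-tissue matrix is shared by all genes, as in the formula.

```python
from typing import List

Map = List[List[float]]


def masked_mse(preds: List[Map], targets: List[Map], mask: List[List[int]]) -> float:
    """Average over genes of the squared norm of (prediction - label) * M."""
    total = 0.0
    for pred, target in zip(preds, targets):
        for pred_row, target_row, mask_row in zip(pred, target, mask):
            for p, t, m in zip(pred_row, target_row, mask_row):
                total += ((p - t) * m) ** 2  # positions with m == 0 contribute nothing
    return total / len(preds)


mask = [[1, 1], [1, 0]]
preds = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 0.0], [0.0, 0.0]]]
targets = [[[0.0, 2.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
loss = masked_mse(preds, targets, mask)  # (1 + 4) / 2
```

The bottom-right error of the first gene (4 versus 0) is ignored because that position lies outside the tissue, which is the purpose of multiplying by the in-tissue matrix.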

Note that the network can be applied to inputs of different sizes. At the prediction or testing stage, the test set |${I}_{\mathrm{test}}$| is fed into the trained network:

$${H}_{\mathrm{test}}^j=F\left({I}_{\mathrm{test}}^j\right),j=1,\cdots, {n}_{\mathrm{test}}.$$

The shape of |${H}_{\mathrm{test}}^j$| is |$\left(2{u}_{\mathrm{test}},2{v}_{\mathrm{test}}\right)$|.

Next, transform HR |${H}_{\mathrm{test}}^j$| into imputed gene expression matrix:

$${Y}_{\mathrm{test}}^j,{D}_{\mathrm{test}}={T}^{-1}\left({H}_{\mathrm{test}}^j,h\left({M}_{\mathrm{test}}\right)\right),j=1,\cdots, {n}_{\mathrm{test}},$$

where |$h\left(\cdot \right)$| learns the boundary of the tissue from the original spatial information and recovers the boundary on the imputed gene expression. It takes |${M}_{\mathrm{test}}$| as input and returns the in-tissue matrix of the imputed |${X}_{\mathrm{test}}$|: |$h\left({M}_{\mathrm{test}}\right)\left(p,q\right)=1$| if and only if location |$\left(p,q\right)$| points to an original spot or lies between two adjacent original spots. |${Y}_{\mathrm{test}}=\left({Y}_{\mathrm{test}}^1,\cdots, {Y}_{\mathrm{test}}^{n_{\mathrm{test}}}\right)$| is the imputed |${X}_{\mathrm{test}}$| and |${D}_{\mathrm{test}}$| is the coordinate matrix corresponding to |${Y}_{\mathrm{test}}$|. The imputed output |${Y}_{\mathrm{test}}$| may modify the expression on the spots it shares with |${X}_{\mathrm{test}}$|, because increasing the number of spots changes the scale and small modifications can enhance consistency. However, the expression does not change much, and gene-wise Pearson correlation coefficients (PCCs) between the two are mostly close to 1.

If |${I}_{\mathrm{train}}$| and |${I}_{\mathrm{test}}$| come from the same ST data, the learning process is self-supervised learning; if not, then it is transfer learning.

The architecture of VLN

As shown in Figure 1C, VLN takes a recurrent convolutional structure that performs a progressive recovery route in a supervised manner [26]. Suppose that there is an HR expression map |$y$| and it is split into four parts: |${x}_{tl}\ \left(=x\right),{x}_{tr},{x}_{bl}$| and |${x}_{br}$|⁠, the sub-maps consisting of the top-left, top-right, bottom-left and bottom-right values extracted from every 2 × 2 nonoverlapping patches. VLN estimates the mapping |$y=F(x)$| to recover HR |$y$| from LR down-sampled |$x$|⁠. We suppose the shape of |$x$| is |$\left(u,v\right)$|⁠.

First, the network extracts LR features and enhances them by recurrent convolutional layers. Assume that there are |$K$| recurrences and every recurrence contains |$L$| layers. Let |${f}_{\mathrm{in}}^k$| denote the input for the |$k\mathrm{th}$| recurrent sub-network. The output of the sub-network, |${f}_{\mathrm{out}}^k$|⁠, is progressively updated as follows:

$${f}_1^k={W}_{\mathrm{in}}^k\ast{f}_{\mathrm{in}}^k+{b}_{\mathrm{in}}^k,$$

$${f}_2^k=\max \left(0,{f}_1^k\right),$$

$${f}_l^k={W}_{l-1}^k\ast{f}_{l-1}^k+{b}_{l-1}^k,l=3,\cdots, L-1,$$

$${f}_{\mathrm{out}}^k=\max \left(0,{f}_{\mathrm{in}}^k+{f}_{L-1}^k\right),$$

where |${f}_l^k$| is a feature map of the |$l\mathrm{th}$| layer, |${W}_{\mathrm{in}}^k,{W}_{l-1}^k$| and |${b}_{\mathrm{in}}^k,{b}_{l-1}^k$| are the filter parameters and biases of convolutional layers.

Here, we have

$${f}_{\mathrm{in}}^1=x,$$

$${f}_{\mathrm{in}}^k={f}_{\mathrm{out}}^{k-1},k=2,\cdots, K$$

to connect input |$x$| and |$K$| recurrent sub-networks.

Second, the variation map is reconstructed by the last convolution on the enhanced features, calculated as follows:

$$\Delta x=\left[\Delta{x}_{tl},\Delta{x}_{tr},\Delta{x}_{dl},\Delta{x}_{dr}\right]=W\ast{f}_{\mathrm{out}}^K+b,$$

where |$\Delta{x}_{tl},\Delta{x}_{tr},\Delta{x}_{dl}$| and |$\Delta{x}_{dr}$| are the estimated top-left, top-right, down-left and down-right difference values relative to input |$x$| in every 2 × 2 nonoverlapping patch, |$W,b$| denote the filter parameter and bias of the reconstruction layer.

Third, the network adds each value variation to the corresponding top-left value in each nonoverlapping patch, which gives

$$\hat{x}=\left[{\hat{x}}_{tl},{\hat{x}}_{tr},{\hat{x}}_{dl},{\hat{x}}_{dr}\right]=\left[\Delta{x}_{tl}+x,\Delta{x}_{tr}+x,\Delta{x}_{dl}+x,\Delta{x}_{dr}+x\right]$$

with the shape of |$\left(4,u,v\right)$|⁠.

Finally, reconstruct HR expression map by a location-aware up-sampling layer. The output result is denoted by |${\hat{x}}_{\mathrm{up}}$|⁠, calculated as follows:

$${\hat{x}}_{\mathrm{up}}\left(p,q\right)=\hat{x}\left(r,\left\lceil \frac{p}{2}\right\rceil, \left\lceil \frac{q}{2}\right\rceil \right),p=1,2,\cdots, 2u,q=1,2,\cdots, 2v,$$

$$r=\left\{\begin{array}{c}1,\mathrm{if}\ \left(p \operatorname{mod}\ 2,q \operatorname{mod}\ 2\right)=\left(1,1\right)\\{}2,\mathrm{if}\ \left(p \operatorname{mod}\ 2,q \operatorname{mod}\ 2\right)=\left(1,0\right)\\{}3,\mathrm{if}\ \left(p \operatorname{mod}\ 2,q \operatorname{mod}\ 2\right)=\left(0,1\right)\\{}4,\mathrm{if}\ \left(p \operatorname{mod}\ 2,q \operatorname{mod}\ 2\right)=\left(0,0\right)\end{array}\right.,$$

where |$\left\lceil \cdot \right\rceil$| means the ceiling operation, |$p$| is the index of row, |$q$| is the index of column and |$r$| is the index of the component that points to |${\hat{x}}_{tl},{\hat{x}}_{tr},{\hat{x}}_{dl}$| or |${\hat{x}}_{dr}$|. Here, all the parameters are |$\Theta =\left\{W,b\right\}\cup{\left\{{W}_{\mathrm{in}}^k,{W}_2^k,\cdots, {W}_{L-2}^k,{b}_{\mathrm{in}}^k,{b}_2^k,\cdots, {b}_{L-2}^k\right\}}_{k=1}^K$|, where filter parameters are initialized from a normal distribution and biases are initialized to 0. The output is |$F\left(x;\Theta \right)={\hat{x}}_{\mathrm{up}}$|, with shape |$\left(2u,2v\right)$|.
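The last two steps, adding the variation to the input and interleaving the four components, can be illustrated without any deep learning framework. The sketch below is a plain-Python illustration under stated assumptions, not the VLN implementation: `upsample_from_variation` is a hypothetical name, the variation maps are supplied by hand rather than predicted by convolutions, and indices are 0-based, so the component index becomes 2 (p mod 2) + (q mod 2).

```python
from typing import List

Map = List[List[float]]


def upsample_from_variation(x: Map, delta: List[Map]) -> Map:
    """Combine an LR map x of shape (u, v) with four variation maps.

    delta holds the top-left, top-right, down-left and down-right
    variations, in that order. The result has shape (2u, 2v).
    """
    u, v = len(x), len(x[0])
    # x_hat[r] = delta[r] + x for each of the four components
    x_hat = [
        [[delta[r][i][j] + x[i][j] for j in range(v)] for i in range(u)]
        for r in range(4)
    ]
    # Location-aware up-sampling: row parity selects top or down,
    # column parity selects left or right.
    return [
        [x_hat[2 * (p % 2) + (q % 2)][p // 2][q // 2] for q in range(2 * v)]
        for p in range(2 * u)
    ]


x = [[10.0, 20.0]]  # LR map with u = 1, v = 2
delta = [[[float(r)] * 2] for r in range(4)]  # constant variations 0, 1, 2, 3
x_up = upsample_from_variation(x, delta)
```

Each LR value expands into one 2 × 2 patch, so the example produces a 2 × 4 map in which the first patch is built from 10 and the second from 20.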

During the training phase, |$K=2$| and |$L=5$| by default. The learning rate is set to 0.001 for front-end layers and 0.00001 for the reconstruction layer. In most cases, the loss will converge within 200 epochs.

Introduction of synthetic noises

DIST can denoise the imputed expression by introducing synthetic noise to the inputs during the training process. Following SAVER [28], we introduced noise to the gene expression as follows: for gene |$j$| on spot |$i$|, we treated the original count as |${X}_{ij}$| and generated the noised value |${Y}_{ij}$| by drawing from a Poisson distribution, |${Y}_{ij}\sim \mathrm{Poisson}\left(\lambda{X}_{ij}\right)$|, where |$\lambda$| is the efficiency loss. We set |$\lambda =0.6$| when introducing noise to the Visium human invasive ductal carcinoma (IDC) data [29] in the subsequent experiment.
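A minimal sketch of this noise model is given below. It is an illustration only: `poisson_sample` and `add_noise` are hypothetical names, and the sampler uses Knuth's multiplication method from the standard library, which is adequate for the small rates in this toy example but not for large counts. In practice a library routine such as NumPy's `Generator.poisson` would be a more natural choice.

```python
import math
import random
from typing import List


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value with Knuth's method (small lam only)."""
    if lam <= 0:
        return 0  # a zero count stays zero
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def add_noise(counts: List[List[int]], efficiency: float = 0.6, seed: int = 0) -> List[List[int]]:
    """Replace each count X_ij by a draw from Poisson(efficiency * X_ij)."""
    rng = random.Random(seed)
    return [[poisson_sample(efficiency * x, rng) for x in row] for row in counts]


counts = [[0, 3, 10], [5, 0, 1]]  # toy spots x genes count matrix
noisy = add_noise(counts, efficiency=0.6)
```

On average each noised count is 0.6 times the original, while zeros remain zeros, which mimics a lower capture efficiency.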

Data analysis

We explored biological functions of IDC following the conventional analysis processes including quality control, normalization, clustering, differential expression analysis and pathway enrichment [30]. We used log-transformed normalization, selected 2000 highly variable genes and applied Louvain [31] to cluster spots. Since most existing approaches for identifying significant differentially expressed (DE) genes are designed for integer unique molecular identifier (UMI) counts, we rounded the imputed counts before differential expression analysis. We used the t-test to find DE genes and selected significant DE genes by thresholds on the adjusted P-value and log2-foldchange, listed as follows:

(i) Comparison between original and imputed IDC: adjusted P-value < 0.05 and log2-foldchange > 1;
(ii) Comparison between denoised and non-denoised IDC: adjusted P-value < 1×10^-6 and log2-foldchange > 0.6.
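
The two threshold settings can be written as a small filter. The sketch below is hypothetical throughout: the dictionary keys, the function `significant_genes` and the gene names are illustrative, and the input is assumed to map each gene to a pair of adjusted P-value and log2-foldchange obtained from some upstream test.

```python
from typing import Dict, List, Tuple

# (maximum adjusted P-value, minimum log2-foldchange) per comparison
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "original_vs_imputed": (0.05, 1.0),
    "denoised_vs_non_denoised": (1e-6, 0.6),
}


def significant_genes(results: Dict[str, Tuple[float, float]], comparison: str) -> List[str]:
    """Return genes passing both thresholds of the chosen comparison."""
    max_padj, min_lfc = THRESHOLDS[comparison]
    return [
        gene
        for gene, (padj, lfc) in results.items()
        if padj < max_padj and lfc > min_lfc
    ]


# Made-up test results: gene -> (adjusted P-value, log2-foldchange)
results = {
    "geneA": (0.01, 1.5),
    "geneB": (0.01, 0.8),
    "geneC": (1e-8, 0.7),
    "geneD": (0.2, 2.0),
}
hits_i = significant_genes(results, "original_vs_imputed")
hits_ii = significant_genes(results, "denoised_vs_non_denoised")
```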

Enrichment analysis was performed on the Metascape [32] website.

Results

DIST imputes the spatial gene expression accurately

To rigorously evaluate the imputation performance of DIST, we created a simulated ST data set from a STARmap data set, in which RNA molecules from 903 genes had been measured and located on a mouse placenta tissue [33]. We used this molecular-resolution data to simulate single-cell-resolution spot-like data by projecting all RNAs onto pseudo spots (240 × 240 pixels). The total RNA counts within each 240 × 240 pixel area were taken as the gene expression of the corresponding pseudo spot. As a result, there were 7465 pseudo spots with more than 5 RNA counts. The number of pseudo spots approximately equals the number of cells estimated by ClusterMap [33]. We used this simulated single-cell-resolution data set as ground truth and down-sampled it to create a simulated coarse-resolution ST data set. Then we compared DIST with other interpolation methods including nearest neighbor (NN), linear barycentric interpolation (Linear), cubic spline interpolation (Cubic), new edge-directed interpolation (NEDI) [34], fast image interpolation via random forests (FIRF) [35] and SpatialPCA. We did not compare DIST with state-of-the-art methods such as BayesSpace, XFuse and HisToGene for ST imputation because BayesSpace separates a spot into several sub-spots and cannot impute gene expression for unmeasured locations, and XFuse and HisToGene need histology images that cannot be provided by simulation. Since the number of genes in the simulated data is smaller than that of the whole transcriptome by over an order of magnitude, we pretrained DIST with IDC data and then fine-tuned the model on the training set constructed from the simulated data itself.
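The pseudo-spot construction can be sketched as a simple binning step. This is an assumption-laden illustration rather than the pipeline used in the study: `bin_molecules` is a hypothetical name, molecules are assumed to be given as (x pixel, y pixel, gene) triples, and the gene labels are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Molecule = Tuple[float, float, str]


def bin_molecules(
    molecules: Iterable[Molecule], bin_size: int = 240, min_counts: int = 5
) -> Dict[Tuple[int, int], Dict[str, int]]:
    """Sum RNA counts per gene inside bin_size x bin_size pixel pseudo spots.

    Only pseudo spots with more than min_counts molecules in total are kept.
    """
    spots: Dict[Tuple[int, int], Counter] = defaultdict(Counter)
    for x, y, gene in molecules:
        spots[(int(x // bin_size), int(y // bin_size))][gene] += 1
    return {
        spot: dict(genes)
        for spot, genes in spots.items()
        if sum(genes.values()) > min_counts
    }


molecules = [
    (10, 20, "A"), (100, 200, "A"), (239, 0, "B"),
    (5, 5, "B"), (30, 30, "B"), (1, 1, "C"),   # six molecules in bin (0, 0)
    (250, 10, "A"), (300, 100, "B"),           # two molecules in bin (1, 0)
]
pseudo_spots = bin_molecules(molecules)
```

The bin indices double as integer grid coordinates, so the result can be down-sampled at regular intervals to obtain the coarse-resolution data.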

Figure 2A and Supplementary Figure S1 show the ground truth, the simulated coarse-resolution expression and the imputed expression using DIST, Linear, Cubic, NEDI and NN for several genes that have different spatial patterns. There are vacant positions in the simulated expression because of incomplete spatial coverage and quality control. DIST and the mentioned interpolation methods can all fill in the vacant positions to ensure the integrity and continuity of tissue slices. Visually, DIST yields sharper gene expression profiles and finer depiction than the other interpolation methods, although they achieve the same resolution as DIST. NN makes interpolated expression maps discontinuous and prone to sawtooth artifacts. Linear and Cubic both enhance smoothness but result in blurring. These interpolation methods impute expression on unmeasured spots using only the spots at surrounding locations, so useful information from other spots is not utilized. DIST, on the other hand, considers the spatial expression of all spots as well as their internal statistical information. In addition, DIST is trained on the whole gene set, taking advantage of all the genes and spots instead of only one gene and its surrounding spots like other methods.

Figure 2

Simulations prove DIST can refine ST profiles with higher accuracy. (A, B) Simulation created from STARmap mouse placenta. (A) Spatial expression of several genes having different spatial patterns based on coarse-resolution simulated data, single-cell-resolution ground truth and imputed data using DIST, Linear, Cubic, NEDI and NN. Consistent color ranges scaled by gene-wise minimum and maximum of ground truth. (B) Gene-wise Pearson correlation coefficients between ground truth and imputed expression using DIST, Linear, Cubic, NEDI, NN, SpatialPCA and FIRF. Boxes show 25th, 50th and 75th percentiles. Whiskers indicate the extent of all non-outliers defined as observations within 1.5 interquartile ranges from the hinges. (C) Simulation created from Stereo-seq adult mouse hemi-brain. Gene-wise Pearson correlation coefficients between ground truth and imputed expression using DIST_transfer (transfer learning), DIST_self (self-supervised learning), Linear, Cubic, NEDI, NN, XFuse, SpatialPCA and FIRF. Boxes are defined as in (B).

We calculated PCCs between the ground truth and each imputed expression across all spots for the 903 genes and show the results in Figure 2B. DIST attains a higher median gene-wise PCC than the best-performing interpolation method (median: DIST = 0.523, Linear = 0.389). DIST’s 25th percentile of PCCs (0.498) is even higher than the largest 75th percentile among the interpolation methods (Linear = 0.477). We calculated PCCs for SpatialPCA in scaled embedding spaces, since SpatialPCA can only produce imputed scaled data. Note that the purpose of SpatialPCA is dimension reduction, not predicting true expression values on unmeasured spots. FIRF performs poorly, most likely because it is designed for image interpolation and is trained on an image data set.
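The gene-wise PCC used as the accuracy measure can be computed as sketched below. The helper names `pearson` and `gene_wise_pcc` are hypothetical, the matrices are assumed to be spots × genes with matching spot order, and the toy numbers are invented. A gene with constant expression would cause a division by zero and would need to be filtered out beforehand.

```python
import math
import statistics
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def gene_wise_pcc(truth: List[List[float]], imputed: List[List[float]]) -> List[float]:
    """One PCC per gene (column), computed across all spots (rows)."""
    return [pearson(t, i) for t, i in zip(zip(*truth), zip(*imputed))]


truth = [[1, 4], [2, 3], [3, 2], [4, 1]]    # 4 spots x 2 genes
imputed = [[2, 1], [4, 2], [6, 3], [8, 4]]  # gene 0 agrees, gene 1 is reversed
pccs = gene_wise_pcc(truth, imputed)
median_pcc = statistics.median(pccs)
```

Summaries such as the median or the 25th and 75th percentiles reported in the text are then taken over this per-gene list.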

Next, we created another simulated ST data set from Stereo-seq data, which offer subcellular-resolution spatial expression of the adult mouse hemi-brain [17]. Similar to the above process, we calculated PCCs between ground truth and imputed expressions from DIST, Linear, Cubic, NEDI, NN, XFuse, SpatialPCA and FIRF (Figure 2C). For DIST, we evaluated imputed values from self-supervised learning and from transfer learning trained on IDC data, respectively. In addition, we ran XFuse with a nucleic acid staining image of the tissue section. XFuse can predict pixel-wise expression and reach a resolution as high as that of the staining image. For comparison, we merged pixels into spots according to their locations. Figure 2C shows that DIST attains the highest median of gene-wise PCCs and that transfer learning improves prediction accuracy. As with the STARmap simulated data, the PCC of DIST is the highest among all the methods. Because of the shallow sequencing depth, the median of PCCs (0.465) on Stereo-seq simulated data is lower than that on STARmap data for DIST. However, after transfer learning from a high-quality data set, the median of PCCs on Stereo-seq data improves to 0.518, which is quite close to that on STARmap data.

Explore spatial patterns of ST human melanoma data

Next, we tested whether DIST can help find spatial patterns for ST. We applied Sepal, a method for identification of transcript profiles that exhibit spatial patterns by diffusion-based model, to an ST data set from human melanoma [7] before and after imputation using DIST. Following the tutorial of Sepal, first, we ranked the gene expression profiles by the degree of spatial structure from distinct to random patterns; then grouped top-ranked genes that had distinct spatial patterns into pattern families, where members of the same pattern family exhibited similar spatial organization; finally subjected the families to functional enrichment analysis querying against the Gene Ontology: Biological Processes (GO: BP) database [36].

Figure 3A shows some top-ranked transcript profiles in the imputed melanoma data but having lower rankings in original melanoma data. CSPG4 is associated with melanoma tumor formation and poor prognosis in certain melanomas [37]. DLL3 is expressed in many metastatic melanomas and targeting the gene may be a promising therapeutic strategy against inflammation-aggravated melanoma progression [38]. CD37 is expressed almost exclusively in cells of the immune system, especially in mature B cells [39]. As shown in Figure 3A, the expressions of these genes have distinct spatial structure. The lower resolution might have led to the relatively low rankings of these genes in original melanoma data. And the DIST imputed HR data allow Sepal to have more power and improve the rankings of these genes significantly.

Figure 3

Explore spatial patterns of ST human melanoma data. (A) Three disease-related genes having higher rankings in imputed data (below) because of distinct spatial patterns but having lower rankings in original data (above). The header of each transcript profile gives the name of the associated gene and ranking. (B) The pathologist’s annotations on hematoxylin- and eosin-stained tissue image from the original paper, where black: melanoma, red: stroma and yellow: lymphoid tissue. (C) The number of genes (left) and significant (P-value < 0.05) GO: BP terms (right) for each family. Green indicates numbers from original data; orange indicates numbers from imputed data. (D) Top 10 significant GO: BP terms only identified by imputed data in lymphoid (above) and stroma (below) family.

We next sorted the top 150 ranked genes into three main pattern families and performed enrichment analysis. Based on representative motifs (Supplementary Figure S2), distinct marker genes (Supplementary Figures S3 and S4) and biological processes for each pattern family, these families can be clearly linked to the histological annotations (Figure 3B), namely melanoma, lymphoid and stroma. Then we compared the families that had the same spatial patterns on original and imputed ST data. In all, 150 top-ranked genes from imputed ST data were separated into 87 melanoma-related genes, 37 lymphoid-related genes and 25 stroma-related genes, whereas 150 top-ranked genes from original ST data were separated into 108 melanoma-related genes, 31 lymphoid-related genes and 8 stroma-related genes (Figure 3C). For all three families, imputed data were enriched for more GO: BP terms than original data, even though imputed data had fewer melanoma-related genes (Figure 3C). Moreover, the lymphoid family was enriched for 124 new pathways, many of which were related to immune response and lymphocyte activation. The stroma family had more pathways related to growth factors (Figure 3D). We also repeated the above analysis with the 100 or 200 top-ranked genes, and both gave results consistent with those from the 150 top-ranked genes (Supplementary Figure S5). This indicates that top-ranked genes from DIST imputed data are more region specific and biologically significant, particularly in the lymphoid region. The lymphoid and stroma regions are relatively smaller than the melanoma region. Because of the LR and the small region size, lymphoid-related and stroma-related genes have much lower rankings than genes related to a larger region such as melanoma. DIST improves the resolution, and thereby Sepal has more power to find those lymphoid- and stroma-related genes.

Explore biological functions of Visium human IDC data

We further analyzed the human IDC sample prepared on the Visium platform through clustering, differential expression analysis and pathway enrichment. We were interested in what new discoveries DIST could bring compared with the original ST data.

We tried two strategies to train our model. First, we used the basic self-supervised learning scheme, in which the training set and the test set came from the same ST data. The result shows that the newly imputed spots do not merge well with the original spots in the clusters, producing a pattern that resembles a batch effect (Figure 4A). A likely reason is that DIST cannot be trained well on the low-quality IDC data. We therefore tried the transfer learning scheme, training on better-quality data. Here, we used Visium mouse sagittal posterior brain data [40] to construct the training set. The mean number of reads per spot in the mouse sagittal posterior brain data is 87 128 [40], whereas that of the IDC data is only 40 795 [29]. The clustering result after imputation is visually smooth and consistent with that of the original data (Figure 4B). This indicates that DIST can improve the imputation of low-quality data by leveraging information learnt from high-quality data. High-quality data contain less noise, so transfer learning lets DIST avoid learning excessive noise and thereby improves the accuracy of imputation. The following analyses are based on the results from transfer learning.

Figure 4

Compare clustering results of Visium human IDC data. (A, B) Louvain clustering of imputed Visium human IDC data using two strategies. (A) Based on the self-supervised learning strategy. (B) Based on the transfer learning model trained on high-quality data, Visium mouse sagittal posterior brain. (C) Clustering performance of original (orange) and imputed (blue) IDC data in obtaining smooth and continuous spatial domains, measured by LISI. A lower LISI score indicates a more homogeneous and continuous spatial structure. (D) Domains from original expression. (E) Merged version of (B), with domains matched to (D) by color.

First, we compared the clustering results of IDC before and after DIST imputation (Figure 4B and D and Supplementary Figure S6). We clustered the original and imputed expression with the Louvain algorithm using different numbers of PCs and resolutions. The spatial domains detected on imputed IDC are visibly more continuous and smoother (Figure 4B and Supplementary Figure S6), which is confirmed by lower local inverse Simpson's indexes (LISIs) [41] (Figure 4C). The median LISI based on original expression is greater than that based on imputed expression for every parameter combination. The minimum median LISI based on original expression is 1.709, which is greater than the maximum median LISI based on imputed expression, 1.510. Figure 4B and D shows the clustering results using 15 PCs and a resolution of 0.8, which were used for the subsequent differential expression analysis and functional enrichment analysis.
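To make the LISI comparison concrete, the following is a minimal sketch of a simplified score, not the implementation of [41]. It assumes an unweighted inverse Simpson index over the cluster labels of each spot's k nearest spatial neighbors, whereas the original LISI uses distance-based weights. The function name `simple_lisi` and the toy coordinates and labels are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def simple_lisi(coords, labels, k=2):
    """Simplified, unweighted LISI: for each spot, the inverse Simpson index
    of the cluster labels among its k nearest spatial neighbors (self excluded).
    A score of 1 means all neighbors share one label."""
    scores = []
    for i, (xi, yi) in enumerate(coords):
        # Euclidean distance from spot i to every other spot, nearest first.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        n = len(neighbor_labels)
        simpson = sum((c / n) ** 2 for c in Counter(neighbor_labels).values())
        scores.append(1.0 / simpson)
    return scores


# Toy example (made-up data): six spots on a line, two candidate clusterings.
line = [(x, 0) for x in range(6)]
smooth = simple_lisi(line, [0, 0, 0, 1, 1, 1])      # two contiguous domains
fragmented = simple_lisi(line, [0, 1, 0, 0, 1, 1])  # interleaved labels
print(median(smooth), median(fragmented))
```

In this toy case the contiguous labeling gives the lower median score, which is the direction in which Figure 4C is read.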

Differential expression analysis shows that more meaningful markers are found from imputed expression than from the original data. For ease of comparison between the original and imputed IDC, we manually merged domains to form a one-to-one mapping between the two clustering results (Figure 4D and E). In almost all spatial domains, imputed IDC yields more significant DE genes than the original data (Figure 5A). For every domain, the number of overlapping DE genes is close to the number found in the original transcriptomic data, which suggests that DIST can help identify more significant DE genes while preserving the information of the original data. In addition, we analyzed matched single-cell RNA-seq data of IDC [42] as a reference and verified that the larger numbers of significant DE genes were not caused by false positives (Figure 5B). We considered the DE results from single-cell RNA-seq more credible because of its greater sequencing depth and single-cell resolution. Here, we used the Seurat label transfer method to map each cell to a spatial domain [43]. In six comparisons between carcinoma and immune domains, imputed IDC yields more significant DE genes, and its overlap with the single-cell reference is also much larger than that of the original data, indicating that DIST can find more true DE genes.
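The per-domain counts compared here can be sketched as a simple filter followed by a set intersection. The thresholds follow the Figure 5A caption (adjusted P-value < 0.05 and log2 fold change > 1). The function names, the table layout (gene mapped to an adjusted P-value and a log2 fold change) and the toy values are assumptions made for illustration only and are not results from the paper.

```python
def significant_genes(de_table, padj_max=0.05, lfc_min=1.0):
    """Return the genes passing both thresholds.
    de_table maps gene name -> (adjusted P-value, log2 fold change)."""
    return {
        gene
        for gene, (padj, lfc) in de_table.items()
        if padj < padj_max and lfc > lfc_min
    }


def compare_de(original, imputed, **thresholds):
    """Count significant DE genes in each data set and in their overlap."""
    sig_orig = significant_genes(original, **thresholds)
    sig_imp = significant_genes(imputed, **thresholds)
    return {
        "original": len(sig_orig),
        "imputed": len(sig_imp),
        "overlap": len(sig_orig & sig_imp),
    }


# Made-up DE tables for one domain (placeholder gene names and values).
toy_original = {"geneA": (0.01, 1.5), "geneB": (0.20, 2.0), "geneC": (0.03, 0.4)}
toy_imputed = {"geneA": (0.001, 1.8), "geneB": (0.02, 2.1), "geneC": (0.01, 0.9)}
summary = compare_de(toy_original, toy_imputed)
print(summary)
```

An overlap close to the original count, as in this toy case, corresponds to the situation described above in which the imputed data retain the DE genes of the original data while adding new ones.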

Figure 5

Explore biological functions of IDC data. (A) The number of significant (adjusted P-value < 0.05 and log2-foldchange > 1) DE genes for each domain. Green indicates counts from original data, orange indicates counts from imputed data and gray indicates the overlap of both. (B) The number of significant DE genes for compared pairs and overlaps with single-cell reference. (C) The pathologist's annotations on the immunofluorescent tissue image from [18], where red: invasive carcinoma, yellow: carcinoma in situ, green: benign hyperplasia and gray: unclassified tumor. (D) Top 10 significant GO: BP terms only identified by imputed data in domain 0 (above) and domain 10 (below).

According to immunofluorescent imaging and histopathological annotations of the tissue section [18] (Figure 5C), the 12 domains can be categorized into invasive carcinoma (labels 0, 4, 7, 8), carcinoma in situ (labels 3, 9), benign hyperplasia (label 2) and immune response-related domains (labels 1, 5, 6, 10, 11). In disease-related domains, several carcinoma-related genes were found among the DE genes from imputed data but not from original data. For example, EZH2 [44] and SOCS1 [45] in domain 0 and AKT1 [46] in domain 4 are well-known markers of IDC [47]. In immune-related domain 10, newly identified DE genes include ADA, an enzyme that can enhance CD4+ T-cell differentiation and proliferation, and IGHG1 and IGHG4, which are predicted to be involved in the activation of immune response and phagocytosis [47].

To further explore biological functions, these domains were subjected to functional enrichment analysis querying against the GO: BP database. Imputed expression brought some new discoveries. For example, there were pathways associated with chromosome organization and DNA metabolic process enriched in domain 0, which were related to cell growth in carcinoma (Figure 5D). In domain 10, multiple processes including B cell, T cell, leukocyte activation and positive regulation involved in immune response were enriched significantly (Figure 5D).

Figure 6

DIST simultaneously imputes and denoises low-quality IDC. (A) Violin plots of total counts (total number of counts for each spot) (left) and dropout percentages (percentage of spots in which each gene does not appear) (right) based on non-denoised imputed IDC (light green) and denoised imputation (light yellow). (B) Merged clustering result of denoised imputation (Supplementary Figure S7), with domains matched to Figure 4E by color except for domains 1 and 6. (C) The number of significant (P-value < 1×10^-6 and log2-foldchange > 0.6) DE genes for each domain. Green indicates counts from non-denoised data, orange indicates counts from denoised data and gray indicates the overlap of both. (D) Top 10 significant GO: BP terms only identified by denoised data in domain 0 (above) and domain 10 (below).

DIST denoises Visium human IDC data

Transfer learning can improve the accuracy of imputation but cannot reduce the noise in the imputed expression. In this section, we show how to use DIST to denoise the imputed expression by adding one step.

Because only small amounts of RNA are sequenced within individual spots, low-quality ST data are usually noisy and sparse, which masks biological signals. For such data, DIST can not only improve the imputation by transfer learning but also denoise the imputed expression, increasing the depth of UMI counts and decreasing the dropout rate. Denoising is implemented by an additional step that introduces synthetic noise to the inputs during training. The LR inputs then contain more noise than their HR ground truth, so DIST can learn to impute and denoise low-quality data simultaneously.

We assessed the effect of noise reduction on the IDC data discussed above by comparing the denoised imputation with the non-denoised imputation of the previous section. We compared the total number of UMI counts for each spot and, for each gene, the percentage of spots with zero UMI count between denoised and non-denoised expression. The results are shown in Figure 6A. After denoising, the mean total count per spot increases from 7237 to 12 367 and the mean dropout percentage decreases from 6% to 2%. The quality of the denoised expression thus improves substantially in terms of both total UMI counts and dropout percentages.
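The two quality metrics used in this comparison follow directly from their definitions in the Figure 6A caption. Below is a minimal sketch, assuming a count matrix stored as a list of rows with one row per spot and one column per gene. The function names and the toy matrix are hypothetical.

```python
def total_counts_per_spot(counts):
    """Total number of UMI counts for each spot (row sums)."""
    return [sum(row) for row in counts]


def dropout_percent_per_gene(counts):
    """For each gene (column), the percentage of spots with zero UMI count."""
    n_spots = len(counts)
    n_genes = len(counts[0])
    return [
        100.0 * sum(1 for row in counts if row[g] == 0) / n_spots
        for g in range(n_genes)
    ]


# Made-up matrix: 4 spots (rows) x 3 genes (columns).
toy_counts = [
    [0, 3, 1],
    [2, 0, 1],
    [0, 0, 4],
    [5, 1, 2],
]
totals = total_counts_per_spot(toy_counts)
dropouts = dropout_percent_per_gene(toy_counts)
print(sum(totals) / len(totals), sum(dropouts) / len(dropouts))
```

Averaging the two lists gives the kind of summary values reported above (mean total count per spot and mean dropout percentage per gene).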

Next, we analyzed the denoised data with the same procedure as the non-denoised data, including clustering, differential expression analysis and pathway enrichment. The denoised data give a clustering result similar to that of the non-denoised data (Figures 4E and 6B). We found more DE genes from denoised data than from non-denoised data for every matched domain (Figure 6C). In disease-related domains, several well-known markers are included among the DE genes from denoised data but not from non-denoised data. Examples are NCOA4 [48] and SOCS1 [45] in domain 0 and ADGRB1 [49] in domain 4, which are well-known markers of IDC [47], and HOXB13 [50] and KLK10 [51] in domain 3, two markers of ductal carcinoma in situ [47]. Moreover, we found GO terms enriched in the upregulated DE genes that could be detected only in the denoised data. These enriched terms are highly relevant to the biological functions of carcinoma and immune reaction. For example, pathways associated with protein and phosphate were enriched in domain 0 (Figure 6D), and multiple processes related to immune response were significantly enriched in domain 10 (Figure 6D). Overall, denoising reduces the noise of the imputed expression and strengthens the power to identify significant DE genes and biological signals.

In addition, we used different values of the efficiency loss $\lambda$ to investigate how the synthetic noise level affects denoising performance. Efficiency losses $\lambda$ = 0.2, 0.4, 0.6 and 0.8 were selected, and the results are shown in Supplementary Figure S8. As the efficiency loss decreased, more noise was introduced to the inputs, and the denoised imputation had deeper sequencing depth. In the clustering results, a smaller efficiency loss led to smoother domains but a loss of detail. Hence, an efficiency loss larger than 0.5 is appropriate to preserve the original details.
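The role of the efficiency loss can be illustrated with a small sketch. The exact noise model of DIST is defined in the Methods section, which is not reproduced here, so the code below assumes one common choice, binomial thinning, in which each UMI is kept independently with probability $\lambda$. Under this assumption a smaller $\lambda$ removes more counts and therefore injects more noise, matching the trend described above. The function name `thin_counts` and the toy matrix are hypothetical.

```python
import random


def thin_counts(counts, efficiency, seed=0):
    """Assumed noise model (binomial thinning): keep each UMI independently
    with probability `efficiency`, so smaller values remove more counts."""
    rng = random.Random(seed)
    return [
        [sum(1 for _ in range(c) if rng.random() < efficiency) for c in row]
        for row in counts
    ]


# Made-up count matrix: 2 spots (rows) x 3 genes (columns).
toy = [[10, 0, 3], [7, 2, 0]]
for lam in (0.2, 0.4, 0.6, 0.8):
    noisy = thin_counts(toy, lam, seed=1)
    print(lam, sum(map(sum, noisy)))
```

In a training setup of this kind, the thinned matrix would serve as the noisier input and the unthinned counts as the target. Whether DIST applies the noise in exactly this way is not stated in this section.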

Discussion

We developed DIST, a deep learning-based method that enhances ST data by self-supervised learning and transfer learning. Through self-supervised learning, DIST can accurately impute the gene expression levels at unmeasured locations and improve the data quality in terms of total counts and dropout percentages. Moreover, transfer learning enables DIST to improve the imputed gene expression by borrowing information from other high-quality data. Notably, the informative high-quality data do not have to come from the same tissue or species as the target ST data: we enhanced poor-quality human IDC data with a model trained on good-quality mouse brain data. We also used data from different species or tissues as sources for transfer learning on the IDC data. The results (Supplementary Figure S9) show that sources from the same species lead to better performance, but the difference is minor. Even when the source comes from a different species or tissue, DIST can still learn enough from it to improve prediction performance.

In spatially resolved transcriptomics studies, identifying genes that display spatial expression patterns is an essential step toward characterizing the ST landscape in complex tissues [21]. Gene expression profiles that possess distinct spatial patterns are of particular interest when attempting to chart the biological processes and pathways present within a tissue using transcriptome-wide techniques [52]. Because of low statistical power, it is hard to find biologically meaningful genes with spatial patterns in small regions using LR transcriptomics data. By analyzing ST human melanoma data, we illustrated that DIST allowed us to detect more genes with distinct spatial patterns in small regions and gave genes specific to small regions, such as lymphoid- and stroma-related genes, much higher rankings than in the original data. Besides LR, noise such as shallow sequencing depth or dropout events is another reason for the loss of power to explore biological significance. In the Visium IDC data analysis, we showed that DIST could increase the total UMI counts and reduce dropouts by denoising, and that with DIST-enhanced data we could find more DE genes and thereby more significantly enriched pathways.

To quantify the time and memory cost of DIST, we recorded the time and memory usage of NN, DIST, Linear, Cubic, NEDI, SpatialPCA and XFuse on the same ST data. The results are shown in Supplementary Table S1. DIST is the second fastest method and takes <2 min to finish the whole process.

While our evaluations were performed on the ST and Visium platforms, DIST should be applicable to other platforms such as Stereo-seq [17]. We applied DIST to Stereo-seq data from mouse embryo at E9.5, and the results are shown in Supplementary Figure S10. As with the IDC data, DIST significantly improved the total counts and found more significant DE genes than the original data. Furthermore, while our work focused on two-dimensional tissue slices and their gene expression, DIST could be extended to three-dimensional (3D) ST reconstruction, which provides a more comprehensive perspective for biological exploration. To reconstruct 3D expression, existing methods usually align and integrate ST data from multiple adjacent tissue slices [53]. Some imputation approaches cannot easily be applied to 3D expression, since there is no matched histology image between tissue slices. DIST, however, needs only gene expression and spot coordinates, so it could be applied to 3D expression with slight modifications to the network parameters and the dimension of the VLN inputs. In summary, DIST offers comprehensive data enhancement including imputation and denoising and could be a useful tool for ST data analysis.

Key Points

DIST offers comprehensive data enhancement including imputation and denoising and could be a useful tool for spatial transcriptomic (ST) data analysis.
DIST enhances the ST data using self-learning and does not need any additional information such as the histology image.
DIST improves the ST data quality significantly by borrowing information from other high-quality data using transfer learning.
DIST is trained on the whole gene set, taking advantage of all the genes and spots instead of only one gene and surrounding spots.
DIST improves the accuracy of clustering and helps identify more biologically meaningful differentially expressed genes and pathways for ST data, thereby allowing deeper insights into the biological processes.

Data availability

Data sets analyzed in this paper are available from their original publications. The STARmap mouse placenta data are available at Code Ocean https://codeocean.com/capsule/9820099/tree/v1 [33]. The human melanoma ST data (mel1 rep1) can be found at https://www.spatialresearch.org/resources-published-datasets/doi-10-1158-0008-5472-can-18-0747 [7]. The raw count matrix, histology image and spatial data from the IDC sample are accessible on the 10x Genomics website at https://www.10xgenomics.com/resources/datasets/invasive-ductal-carcinoma-stained-with-fluorescent-cd-3-antibody-1-standard-1-2-0 [29]. Relevant information about the mouse sagittal posterior brain sample is accessible at https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-1-sagittal-posterior-1-standard-1-1-0 [40]. Stereo-seq data from the adult mouse hemi-brain and mouse embryo are both from the MOSTA database at https://db.cngb.org/stomics/mosta/.

Code availability

An open-source implementation of the DIST algorithm can be downloaded from https://github.com/zhaoyp1997/DIST.

Funding

This work was supported by the National Natural Science Foundation of China (31970649) to G.H. and K.W.

Yanping Zhao is a graduate student at School of Statistics and Data Science, Nankai University. Her research interests include statistical genomics and multi-omics analysis.

Kui Wang is an associate professor at School of Statistics and Data Science, Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin, Nankai University. His research interests include structural bioinformatics, single-cell RNA sequencing analysis, and the application of deep learning in bioinformatics.

Gang Hu is a professor at School of Statistics and Data Science, Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin, Nankai University. His research interests include structural bioinformatics, statistical genomics, and multi-omics analysis.

References

1. Hildebrandt F, Andersson A, Saarenpää S, et al. Spatial transcriptomics to define transcriptional patterns of zonation and structural components in the mouse liver. Nat Commun 2021;12(1):7046–6.

2. Moncada R, Barkley D, Wagner F, et al. Integrating microarray-based spatial transcriptomics and single-cell RNA-seq reveals tissue architecture in pancreatic ductal adenocarcinomas. Nat Biotechnol 2020;38(3):333–42.

3. Fawkner-Corbett D, Antanaviciute A, Parikh K, et al. Spatiotemporal analysis of human intestinal development at single-cell resolution. Cell 2021;184(3):810–26.e23.

4. Asp M, Giacomello S, Larsson L, et al. A spatiotemporal organ-wide gene expression and cell atlas of the developing human heart. Cell 2019;179(7):1647–60.e19.

5. Hasel P, Rose IVL, Sadick JS, et al. Neuroinflammatory astrocyte subtypes in the mouse brain. Nat Neurosci 2021;24(10):1475–87.

6. Maynard KR, Collado-Torres L, Weber LM, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci 2021;24(3):425–36.

7. Thrane K, Eriksson H, Maaskola J, et al. Spatially resolved Transcriptomics enables dissection of genetic heterogeneity in stage III cutaneous malignant melanoma. Cancer Res 2018;78(20):5970–9.

8. Spatial Gene Expression - 10x Genomics. https://www.10xgenomics.com/products/spatial-gene-expression.

9. Lee JH, Daugharthy ER, Scheiman J, et al. Fluorescent in situ sequencing (FISSEQ) of RNA for gene expression profiling in intact cells and tissues. Nat Protoc 2015;10(3):442–58.

10. Wang X, Allen WE, Wright MA, et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 2018;361(6400):eaat5691.

11. Moffitt JR, Bambah-Mukku D, Eichhorn SW, et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 2018;362(6416):eaat5324.

12. Eng CL, Lawson M, Zhu Q, et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH. Nature 2019;568(7751):235–9.

13. Rao A, Barkley D, França GS, et al. Exploring tissue architecture using spatial transcriptomics. Nature 2021;596(7871):211–20.

14. Liu B, Li Y, Zhang L. Analysis and visualization of spatial transcriptomic data. Front Genet 2021;12:785290.

15. Bergenstrahle L, et al. Super-resolved spatial transcriptomics by deep data fusion. Nat Biotechnol 2022;40(4):476–9.

16. Liu Y, Yang M, Deng Y, et al. High-spatial-resolution multi-omics sequencing via deterministic barcoding in tissue. Cell 2020;183(6):1665–1681.e18.

17. Chen A, Liao S, Cheng M, et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 2022;185(10):1777–1792.e21.

18. Zhao E, Stone MR, Ren X, et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat Biotechnol 2021;39(11):1375–84.

19. Pang M, Su K, Li M. Leveraging information in spatial transcriptomics to predict super-resolution gene expression from histology images in tumors. bioRxiv 2021;2021.11.28.470212.

20. Shang L, Zhou X. Spatially aware dimension reduction for spatial transcriptomics. Nat Commun 2022;13(1):7203.

21. Sun S, Zhu J, Zhou X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat Methods 2020;17(2):193–200.

22. Dong K, Zhang S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat Commun 2022;13(1):1739.

23. Zeng Y, Wei Z, Yu W, et al. Spatial transcriptomics prediction from histology jointly through transformer and graph neural networks. Brief Bioinform 2022;23(5).

24. He B, Bergenstråhle L, Stenbeck L, et al. Integrating spatial gene expression and breast tumour morphology via deep learning. Nat Biomed Eng 2020;4(8):827–34.

25. Shocher A, Cohen N, Irani M. Zero-shot super-resolution using deep internal learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 3118–3126.

26. Wenhan Y, Liu J, Xia S, Guo Z. Variation learning guided convolutional network for image interpolation. 2017 IEEE International Conference on Image Processing (ICIP), 2017, 1652–1656.

27. Eraslan G, Simon LM, Mircea M, et al. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019;10(1):390.

28. Huang M, Wang J, Torre E, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018;15(7):539–42.

29. Invasive Ductal Carcinoma Stained With Fluorescent CD3 Antibody - 10x Genomics. https://www.10xgenomics.com/resources/datasets/invasive-ductal-carcinoma-stained-with-fluorescent-cd-3-antibody-1-standard-1-2-0.

30. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 2018;19(1):15.

31. Blondel VD, Guillaume JL, Lambiotte R, et al. Fast unfolding of communities in large networks. J Stat Mech 2008;2008(10):P10008–12.

32. Zhou Y, Zhou B, Pache L, et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat Commun 2019;10(1):1523.

33. He Y, Tang X, Huang J, et al. ClusterMap for multi-scale clustering analysis of spatial gene expression. Nat Commun 2021;12(1):5909.

34. Xin L, Orchard MT. New edge-directed interpolation. IEEE Transactions on Image Processing 2001;10(10):1521–1527.

35. Huang J-J, Siu W-C, Liu T-R. Fast image interpolation via random forests. IEEE Trans Image Process 2015;24(10):3232–45.

36. Raudvere U, Kolberg L, Kuzmin I, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res 2019;47(W1):W191–8.

37. Price MA, Colvin Wanshura LE, Yang J, et al. CSPG4, a potential therapeutic target, facilitates malignant progression of melanoma. Pigment Cell Melanoma Res 2011;24(6):1148–57.

38. Ding X, Li F, Zhang L. Knockdown of delta-like 3 restricts lipopolysaccharide-induced inflammation, migration and invasion of A2058 melanoma cells via blocking Twist1-mediated epithelial-mesenchymal transition. Life Sci 2019;226:149–55.

39. Bertoni F, Stathis A. Staining the target: CD37 expression in lymphomas. Blood 2016;128(26):3022–3.

40. Mouse Brain Serial Section 1 (Sagittal-Posterior) - 10x Genomics. https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-1-sagittal-posterior-1-standard-1-1-0.

41. Korsunsky I, Millard N, Fan J, et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat Methods 2019;16(12):1289–96.

42. 7.5k Sorted Cells from Human Invasive Ductal Carcinoma, 3′ v3.1 - 10x Genomics. https://www.10xgenomics.com/resources/datasets/7-5-k-sorted-cells-from-human-invasive-ductal-carcinoma-3-v-3-1-3-1-standard-6-0-0.

43. Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell 2021;184(13):3573–3587.e29.

44. Bracken AP, Pasini D, Capra M, et al. EZH2 is downstream of the pRB-E2F pathway, essential for proliferation and amplified in cancer. EMBO J 2003;22(20):5323–35.

45. Raccurt M, Tam SP, Lau P, et al. Suppressor of cytokine signalling gene expression is elevated in breast carcinoma. Br J Cancer 2003;89(3):524–32.

46. Kim MS, Jeong EG, Yoo NJ, et al. Mutational analysis of oncogenic AKT E17K mutation in common solid cancers and acute leukaemias. Br J Cancer 2008;98(9):1533–5.

47. Agapite J, et al. Alliance of genome resources portal: unified model organism research platform. Nucleic Acids Res 2020;48(D1):D650–8.

48. Kollara A, Kahn HJ, Marks A, et al. Loss of androgen receptor associated protein 70 (ARA70) expression in a subset of HER2-positive breast cancers. Breast Cancer Res Treat 2001;67(3):245–53.

49. Meisen WH, Dubin S, Sizemore ST, et al. Changes in BAI1 and nestin expression are prognostic indicators for survival and metastases in breast cancer and provide opportunities for dual targeted therapies. Mol Cancer Ther 2015;14(1):307–14.

50. Tommasi S, Karm DL, Wu X, et al. Methylation of homeobox genes is a frequent and early epigenetic event in breast cancer. Breast Cancer Res 2009;11(1):R14–4.

51. Yunes MJ, Neuschatz AC, Bornstein LE, et al. Loss of expression of the putative tumor suppressor NES1 gene in biopsy-proven ductal carcinoma in situ predicts for invasive carcinoma at definitive surgery. Int J Radiat Oncol Biol Phys 2003;56(3):653–7.

52. Anderson A, Lundeberg J. Sepal: identifying transcript profiles with spatial patterns by diffusion-based Modeling. Bioinformatics 2021;37:2644–50.

53. Zeira R, Land M, Strzalkowski A, et al. Alignment and integration of spatial transcriptomics data. Nat Methods 2022;19(5):567–75.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
