Dian Meng, Yu Feng, Kaishen Yuan, Zitong Yu, Qin Cao, Lixin Cheng, Xubin Zheng, scMMAE: masked cross-attention network for single-cell multimodal omics fusion to enhance unimodal omics, Briefings in Bioinformatics, Volume 26, Issue 1, January 2025, bbaf010, https://doi.org/10.1093/bib/bbaf010
Abstract
Multimodal omics provide deeper insight into the biological processes and cellular functions, especially transcriptomics and proteomics. Computational methods have been proposed for the integration of single-cell multimodal omics of transcriptomics and proteomics. However, existing methods primarily concentrate on the alignment of different omics, overlooking the unique information inherent in each omics type. Moreover, as the majority of single-cell cohorts only encompass one omics, it becomes critical to transfer the knowledge learnt from multimodal omics to enhance unimodal omics analysis. Therefore, we proposed a novel framework that leverages a masked autoencoder with a cross-attention mechanism, called scMMAE (single-cell multimodal masked autoencoder), to fuse multimodal omics and enhance unimodal omics analysis. scMMAE simultaneously captures both the shared features and the distinctive information of two single-cell omics modalities and transfers the knowledge to enhance single-cell transcriptome data. Comparative evaluations against benchmarking methods across various cohorts revealed a notable improvement, with an increase of up to 21% in the adjusted Rand index and up to 12% in normalized mutual information in the context of multimodal fusion. In the realm of unimodal omics, scMMAE demonstrated an overall enhancement of approximately 20% in the adjusted Rand index and nearly 10% in normalized mutual information. Nine other metrics, including the Fowlkes–Mallows index and silhouette coefficient, further underscored the high performance of scMMAE. Significantly, scMMAE exhibits an elevated level of proficiency in distinguishing between different cell types, particularly CD4 and CD8 T cells. Availability and implementation: scMMAE source code at https://github.com/DM0815/scMMAE/.
Introduction
Single-cell multimodal omics (multi-omics) techniques offer a pivotal opportunity to deepen the understanding of biological systems. Integrating different modal data of single cells, which provides multi-faceted insights into cellular processes, allows us to create a more comprehensive and nuanced view of cellular functions and interactions. Technological advancements have made it possible to measure various types of molecules within a single cell. Among the studies of these molecules, transcriptomics and proteomics are two pivotal branches in understanding cellular function and phenotype. Transcriptomics reveals the cellular responses to stimuli, differentiation processes, and the development of specific phenotypes. Proteomics provides a direct assessment of cellular function as proteins are the actual functional molecules in the cell. Techniques such as cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) [1] exemplify these innovations, enabling simultaneous analysis of transcriptomics and proteomics in individual cells. Building on the remarkable progress in sequencing technologies [1–4], a variety of computational approaches have been proposed to integrate the multimodal omics of transcriptomics and proteomics.
Currently, most methods for the integration analysis of transcriptomics and proteomics data are based on probabilistic graphical models, which can be further divided into two subcategories. One subcategory includes models based on variational autoencoders (VAE) [5] and its variants, such as scCTCLust and scMM. ScCTCLust [6] integrates transcriptomic and proteomic data from a single cell utilizing VAE and canonical correlation analysis. ScMM [7] is specifically designed for the analysis of CITE-seq data, emphasizing joint representation and predictions across different modalities. TotalVI [8] presents an end-to-end framework for the joint analysis of CITE-seq data. It probabilistically characterizes the data by integrating both biological and technological factors, such as protein background and batch effects. The other category comprises models that utilize traditional Bayesian methods with Gibbs sampling [9]. For instance, jointDIMMSC [10] and BREMSC [11] are Bayesian Random Effects Mixture Models designed for joint clustering of CITE-Seq data. These models employ Gibbs sampling to sample from the posterior distribution given the observed data. Recently, the SCOIT model [12] was introduced as a probabilistic tensor decomposition framework, specifically designed to extract embeddings from paired single-cell multi-omic data, including CITE-seq data.
On the other hand, the cross-modal features can be beneficial for unimodal analysis. The existing methods [7, 8, 13] were incapable of transferring the knowledge acquired from multimodal omics to aid unimodal omics, yet the majority of single-cell cohorts only encompass one omics, particularly single-cell RNA sequencing (scRNA-seq). Therefore, it is critical to transfer the knowledge learnt from multimodal omics to assist unimodal omics analysis.
In this study, we proposed a cross-attention neural network architecture based on the masked autoencoder called scMMAE (single-cell multimodal masked autoencoder) that can simultaneously extract the common features and preserve the distinctive information of the respective omics (Fig. 1A). Moreover, scMMAE can transfer the knowledge learnt from the fusion of proteomics and transcriptomics to enhance the representation of scRNA-seq (Fig. 1B). ScMMAE contained two encoders for gene and protein expressions, a cross-attention mechanism for the two omics, and a fusion of the distinct information (DI) of each modality with the cross-modal features (Fig. 1C). A masked autoencoder (MAE) was applied for model pretraining and transfer learning was used to transfer knowledge to enhance single-cell transcriptome analysis. We evaluated the performance of scMMAE in transcriptomic and proteomic integration with 10 metrics including adjusted Rand index (ARI), normalized mutual information (NMI), and Fowlkes–Mallows index (FMI) using five CITE-seq cohorts. The performance of scMMAE for transcriptomic representation was also evaluated using four scRNA-seq cohorts. Our model demonstrated exceptional performance both in integrating single-cell transcriptomics and proteomics and in facilitating single-cell transcriptome analysis. It can provide better representation for resolving cell types such as CD4 and CD8 T cells and for downstream analysis such as diagnostic biomarker identification.

Overview of scMMAE. (A) The representation of multimodal omics including the common features fused from proteins and RNAs and the distinct information from different modalities, which provide complementary information. (B) Transferring single-cell multi-omics representation to unimodal omics using scMMAE. (C) ScMMAE consists of three phases. In the first stage, self-supervised learning is applied to pre-train the model with unlabeled CITE-seq data. This involves utilizing masked expression embeddings and learnable gene and protein embeddings as inputs, which are then processed through the encoder blocks. The cross-attention blocks were applied to extract the shared features, while the distinctive features were preserved by involving residuals (|$\bigoplus $|). In the second stage, a small proportion of CITE-seq data with cellular information were fed into scMMAE for multimodal omics fusion. In the third stage, scMMAE transferred the knowledge learnt from multimodal omics fusion to enhance the representation of unimodal omics. FC: fully connected layer
Methods
Overview of ScMMAE
As transcriptomics and proteomics may include complementary information of cells, scMMAE took DI from transcriptomics and proteomics into consideration in multi-modal omics fusion (Fig. 1A). The neural network architecture of scMMAE first applied an autoencoder with a cross-attention mechanism to learn cross-modal information, bridging the gap between the two omics. Both the encoder and decoder used multi-head self-attention layers. The cross-attention mechanism was applied in the latent space, as described in equation 1.
where |$E_{i}$|, |$E_{j}$| are the multi-head self-attention encoders of the two omics, and |$W_{1}$|, |$W_{2}$|, |$W_{3}$| represent three learnable networks. Then, it combined the cross-modal information with DI from encoders of different modalities to capture the intricate relationships and dependencies between genes and proteins, mapping them into a unified latent space for better representation. We denoted DI from two omics as |$I_{i}$| and |$I_{j}$|. The fused representation can be written as equation 2.
Importantly, since most of the samples were profiled by scRNA-seq, scMMAE transferred the knowledge learnt from multi-modalities to unimodality to enhance the representation of scRNA-seq.
The training of scMMAE comprises three stages.
Stage 1: Limited by the amount of annotated CITE-seq data, we applied a self-supervised learning method, MAE, to pretrain the model. We masked part of the genes and proteins and forced scMMAE to reconstruct the missing inputs using unmasked cells from five CITE-seq datasets. We denoted the original input of transcriptomics and proteomics as |$X_{RNA}$| and |$X_{ADT}$|. We masked part of the input, denoted as |$X_{RNA}^{masked}$| and |$X_{ADT}^{masked}$|, and fed it into the model. The model encoded the masked input into a low-dimensional latent space |$X_{RNA}^{masked}, X_{ADT}^{masked}\rightarrow Z_{RNA, ADT}$| and tried to reconstruct the original input through the decoder |$Z_{RNA, ADT}\rightarrow \hat X_{RNA}$|, |$\hat X_{ADT}$|, where |$Z_{RNA, ADT}$| represents the output of the encoder in the latent space and |$\hat X_{RNA}, \hat X_{ADT}$| represent the outputs of the two decoders. The parameters of the network model can be obtained through optimizing the loss function in equation 3.
Basic information and the relationship of genes and proteins were learnt by scMMAE in this stage.
Stage 2: ScMMAE was trained by a small part of the CITE-seq data with cell annotation as labels. It was required to accurately predict cell types and learn cell information based on transcriptomics and proteomics through this process.
Stage 3: ScMMAE transferred the knowledge learnt from multimodal omics to enhance single-cell transcriptome analysis by training with only part of scRNA-seq data. The cross-attention mechanism kept the knowledge of multiomics and was revised into a self-attention for unimodal analysis.
The data structure of the input, network parameters, training processes, and other details were illustrated in the following parts.
CITE-seq and ScRNA-seq datasets preprocessing
ScMMAE adopted the popular pre-training and fine-tuning strategy. We used five CITE-seq datasets (Table 1) in the pre-training and fine-tuning stages and four scRNA-seq datasets (Table 2) in the prediction stage. Of note, because annotated cell types were unavailable for three of the CITE-seq datasets (PBMC5K, PBMC10K, and MALT10K), we employed weighted nearest neighbor methods to conduct multi-omics analysis using Scanpy [14]. The remaining datasets were annotated according to their sources. Since the code for the scCTCLust method is incomplete and unusable, we did not include it in the comparison. We applied totalVI, SCOIT, jointDIMMSC, scMM, and BREMSC as benchmark methods to embed CITE-seq cells in a common latent space in the fine-tuning stage, and applied Seurat [15], Scanpy, SCVI [16], and Pagoda2 [17] in unimodal prediction. See Supplementary Methods S1.1 and S1.2 for procedures and parameterization of multi-omics and unimodal omics, respectively.
Datasets | Cell | RNA | Protein | Source |
---|---|---|---|---|
SPL111 | 16828 | 13553 | 112 | [18] |
SPL206 | 15820 | 13553 | 209 | [18] |
PBMC5k | 3994 | 16581 | 29 | https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3 |
PBMC10k | 6855 | 16727 | 14 | https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3 |
MALT10k | 6838 | 16659 | 14 | https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/malt_10k_protein_v3 |
The first four rows contain information about the scRNA-seq datasets used in the prediction phase.
Datasets | Cell | RNA | Source |
---|---|---|---|
PBMC3k | 2700 | 13714 | https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k |
IFNB | 13999 | 14053 | [19] |
CBMC | 8617 | 20501 | [1] |
BMCITE | 30672 | 17009 | [20] |
We applied distinct normalization strategies tailored to each data type: RPKM normalization for RNA-seq data, and centered log ratio (CLR) normalization [21] for the proteomic data to mitigate compositional effects.
where |$x_{i}$| represents the |$i$|th protein expression value in the cell, with |$n$| denoting the total number of proteins in a cell, and |$j$| iterating over all proteins.
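The CLR step can be illustrated with a short NumPy sketch. It assumes the standard definition |$CLR(x_{i}) = \ln \left (x_{i} / (\prod _{j=1}^{n} x_{j})^{1/n}\right )$|, which matches the symbols defined above but is not copied from the original equation. The function name `clr_normalize` and the pseudocount are illustrative assumptions, added only so that zero counts do not produce an undefined logarithm.

```python
import numpy as np

def clr_normalize(counts, pseudocount=1.0):
    """Centered log-ratio transform applied per cell (row).

    counts: array of shape (cells, proteins) with non-negative protein counts.
    pseudocount: assumed offset to avoid log(0); not taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    log_x = np.log(counts + pseudocount)
    # Subtracting the mean log value is the same as dividing by the
    # geometric mean of the (offset) counts inside the logarithm.
    return log_x - log_x.mean(axis=1, keepdims=True)

# Toy protein matrix: 2 cells x 3 proteins.
adt = np.array([[10, 0, 5], [3, 3, 3]])
clr = clr_normalize(adt)
```

Each row of `clr` sums to zero, and a cell with identical counts for every protein maps to all zeros, which is the compositional centering the transform is meant to provide.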
After normalization, we selected 4000 genes with high variability along with all proteins for model input to capture the most informative features. All preprocessing procedures were executed using Scanpy’s integrated functions.
For the sepsis case studies, we collected scRNA-seq data for sepsis patients and healthy controls from the Broad Institute Single Cell Portal, portal ID SCP548 (subject PBMCs). The collection contains scRNA-seq data for 126 351 cells from 29 septic patients and 36 controls across three cohorts. The key cohorts focused on those who had a urinary tract infection (UTI) early in their illness progression. Subjects with UTIs and either mild or transient organ dysfunction (Int-URO), UTIs with evident or persistent organ dysfunction (Urosepsis, URO), bacteremic patients with sepsis in hospital wards (Bac-SEP), and patients admitted to the medical intensive care unit (ICU) with sepsis (ICU-SEP) are among the septic patients. Subjects with UTI and leukocytosis (blood WBC |$\geq $| 12 000 per |$mm^{3}$|) without organ dysfunction (Leuk-UTI), patients hospitalized in the medical intensive care unit (ICU-NoSEP), and healthy uninfected controls were included in the control samples.
Network architecture of ScMMAE
We proposed a cross-attention-based network called scMMAE that can integrate single-cell transcriptomics and proteomics data and transfer the fused knowledge to enhance scRNA-seq data. scMMAE was constructed based on the MAE framework [22], tailored for handling multimodal tasks, with the incorporation of cross-attention mechanisms for data integration (Fig. 1C). ScMMAE comprised three main stages. In the first stage, referred to as the pre-training stage, scMMAE included two encoders, two decoders, and one cross-attention architecture. Following the pre-training stage, our model discarded one of the decoder architectures, and the cross-attention output was augmented with the residual to serve as the model’s final output. Lastly, in stage 3, we streamlined the architecture, reducing its complexity compared to the model in stage 2, except for the cross-attention architecture, and used one fully connected layer to replace the queries from the other modality.
Transcriptomics and proteomics data reconstruction
Suppose the input CITE-seq dataset is denoted as |$S = \{s_{RNA}^{i}, s_{ADT}^{i}\},\ i \in [1,k]$|, where k is the number of cells. It consists of transcriptomics and proteomics. Our objective is to learn a unified representation |$u\in \mathbb{R}^{k\ast m}$| for each sample s in integrating CITE-seq data, where m is the dimension of the final cell embedding, which we set to 128 in this experiment. Before learning the unified representations, we prioritized the model to grasp gene and protein expression information, then acquired cell type information, and ultimately validated the model. Hence, the initial step is to reconstruct the transcriptomics and proteomics matrices. The input feature |$X_{in}$| in the first stage consists of the features of transcriptomics |$X_{RNA}\in \mathbb{R}^{k\ast 4000}$| and proteomics |$X_{ADT}\in \mathbb{R}^{k\ \ast \ (protein\ number)}$|, where 4000 is the number of highly variable genes (HVGs) we used in this experiment, and the number of proteins was determined by the respective dataset because it differs between datasets. |$X_{RNA}$| contains the gene expression values |$X_{RNA}^{ex}\in \mathbb{R}^{k\ast 4000}$| and the gene symbol embedding |$X_{RNA}^{sym}\in \mathbb{R}^{k\ast 4000}$|, and |$X_{ADT}$| contains the protein expression values |$X_{ADT}^{ex}\in \mathbb{R}^{k\ast \ (protein\ number)}$| and the protein symbol embedding |$X_{ADT}^{sym}\in \mathbb{R}^{k\ \ast \ (protein\ number)}$|. It can be represented as follows:
where |$X_{in}$| represents the input feature, |$X_{RNA}^{ex}$| and |$X_{ADT}^{ex}$| are the gene expression values and protein expression values, and |$X_{RNA}^{sym}$| and |$X_{ADT}^{sym}$| are the gene symbol embedding and protein symbol embedding, which were set up as learnable vectors in this experiment.
To learn the representation |$u$|, we initially randomly masked gene and protein expression values. Subsequently, the unmasked transcriptomics and proteomics data, that is, the unmasked features |$X_{in}^{unm}$|, were encoded separately using a transformer block based on a multi-head attention mechanism. The attention of each head is calculated as follows:
where |$W_{in}^{Q}$|, |$W_{in}^{K}$|, and |$W_{in}^{V}$| are the weight matrices, and |$\mathrm{\Phi }(...)$| is a function that focuses the network’s attention on the most important small part of the data. The full encoding process can be described as:
where |$\eta (...)$| is a function that concatenates the features inside, |$W^{O}$| is the output weight matrix, and h is the number of heads. In this experiment, we use two heads.
After the unmasked gene and protein data enter the encoder, |$E_{RNA}^{unm}$| and |$E_{ADT}^{unm}$| are the output of the gene encoder and protein encoder, respectively. Then |$E_{RNA}^{unm}$| and |$E_{ADT}^{unm}$| will be fed into the cross-attention structure as follows:
where |$\zeta (...)$| is an attention mechanism in the transformer architecture that mixes two different modalities, transcriptomics and proteomics. Cross-attention calculations are performed for transcriptomics data using queries from proteomics, and vice versa for proteomics data. |$C_{RNA}$| and |$C_{ADT}$| represent the cross-attention results of transcriptomics and proteomics, respectively. In this experiment, we employ two cross-attention structures. The function |$Softmax (x)$| normalizes its outputs so that each value lies between 0 and 1 and all values sum to 1.
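The query swap described above can be sketched in NumPy as a single-head scaled dot-product attention in which the queries come from one modality and the keys and values from the other. This is a minimal illustration and not the authors' implementation: the token counts, the hidden size of 8, the random weight matrices, and the names `cross_attention`, `e_rna`, and `e_adt` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_source, context, w_q, w_k, w_v):
    """Single-head attention: queries from one modality, keys/values from the other."""
    q = query_source @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8                               # assumed hidden size
e_rna = rng.normal(size=(6, d))     # 6 encoded gene tokens (toy stand-in)
e_adt = rng.normal(size=(4, d))     # 4 encoded protein tokens (toy stand-in)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Cross-attention for transcriptomics uses queries from proteomics ...
c_rna, attn_rna = cross_attention(e_adt, e_rna, w_q, w_k, w_v)
# ... and vice versa for proteomics.
c_adt, attn_adt = cross_attention(e_rna, e_adt, w_q, w_k, w_v)
```

Every row of `attn_rna` and `attn_adt` sums to 1, which is the Softmax property stated in the text. A real model would use separate learned projections per direction and per head rather than one shared set of random matrices.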
Of note, we used DI (denoted as |$I_{RNA}$| and |$I_{ADT}$|) to preserve the unique information of different omics after computing the cross-attention mechanism for each modality, which is the output of the respective modal encoder, |$E_{RNA}^{unm}$| and |$E_{ADT}^{unm}$|. The input of the decoder consists of two parts:
Subsequently, the calculation process for the decoding steps mirrors that of the encoding part. Last, the outputs of the two decoders (|$D_{RNA}^{out}$| and |$D_{ADT}^{out}$|) are subjected to loss calculations with the initial transcriptomics and proteomics matrices, respectively. The formula is as follows:
where |$\alpha $| and |$\beta $| are the hyperparameters used to determine the loss weights for the two modalities independently, and |$n$| is the cell number in experiments. |$x_{RNA, i}^{ex} $|, |$x_{ADT, i}^{ex}$|, |$ D_{RNA, i}^{out}$|, |$D_{ADT, i}^{out}$| are the elements of |$x_{RNA}^{ex}$|, |$x_{ADT}^{ex}$|, |$ D_{RNA}^{out}$|, and |$ D_{ADT}^{out} $|, which represent gene true expression values, protein true expression values, gene predicted values, and protein predicted values, respectively.
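A weighted two-modality reconstruction loss of this kind can be sketched as follows. The squared-error form and the averaging over all matrix elements are assumptions (the original equation is not reproduced here, and the loss may be restricted to masked positions); the default weights 0.7 and 0.3 are the values reported for transcriptomic and proteomic data in the hyperparameter description. The function name `reconstruction_loss` is hypothetical.

```python
import numpy as np

def reconstruction_loss(x_rna, d_rna, x_adt, d_adt, alpha=0.7, beta=0.3):
    """Weighted sum of per-modality mean squared errors (assumed form)."""
    mse_rna = np.mean((x_rna - d_rna) ** 2)   # gene true vs. predicted values
    mse_adt = np.mean((x_adt - d_adt) ** 2)   # protein true vs. predicted values
    return alpha * mse_rna + beta * mse_adt

# Toy check: RNA reconstruction is off by 1 everywhere, protein is perfect.
x_rna, d_rna = np.ones((2, 3)), np.zeros((2, 3))
x_adt = d_adt = np.full((2, 2), 5.0)
loss = reconstruction_loss(x_rna, d_rna, x_adt, d_adt)
```

With these inputs only the transcriptomic term contributes, so the loss equals the RNA weight.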
Learning representation of transcriptomics and proteomics
In the second stage, where the model learns cell type information, we discard the decoder part, and all gene and protein data are fed into the encoders without masking. Next, the output of cross-attention is concatenated directly with DI from the encoders to form the final result.
Here, |$I_{RNA}$| and |$I_{ADT}$| denote the outputs of gene and protein encoder without masking, respectively. The representation |$u$| at the layer before the softmax layer will be utilized for downstream tasks. For loss calculation, |$u$| undergoes a softmax layer to generate predicted cell types and compare them with the true cell types. This stage’s objective function is defined using cross-entropy loss:
where |$n$| is the cell number of CITE-seq data in the training process, |$y_{i}$| and |$\hat{y}_{i}$| are the true cell type labels and predicted cell type labels, respectively.
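For reference, a softmax followed by cross-entropy over integer cell type labels can be written in a few lines of NumPy. This is a generic sketch of the standard loss, averaged over cells; whether the original objective averages or sums over cells is an assumption, and `cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Four cells, three cell types, uninformative (all-zero) logits.
logits = np.zeros((4, 3))
labels = np.array([0, 1, 2, 0])
loss = cross_entropy(logits, labels)
```

With uniform predictions over three classes the loss equals ln(3), a useful sanity value to compare against at the start of fine-tuning.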
Transfer multi-omic knowledge to transcriptome
In the final stage, only a single modality was used, which halved all structures.
where |$ u_{RNA}$| is the cell representation of transcriptomics. Notably, we employed a fully connected layer connected to itself to substitute the query from the other modality. The loss is calculated as follows:
where |$m$| is the cell number of scRNA-seq data in the training process, |$y_{j}$| and |$\hat{y}_{j}$| are the true cell type labels and predicted cell type labels, respectively.
Model training and hyperparameters
The initial step in scMMAE involves reconstructing RNA and protein matrices for transcriptomic and proteomic data, respectively. We observed distinct behaviors based on dataset size. For datasets containing fewer than 10 000 cells, scMMAE typically converged within 20 epochs. Conversely, larger datasets required approximately 150 epochs to reach convergence. Notably, scMMAE demonstrates improved resource efficiency and faster processing times compared to Bayesian Gibbs sampling-based methods such as BREMSC and jointDIMMSC, which typically take over an hour to run with default parameter settings. In contrast, scMMAE completes one epoch in approximately 15 s. As for the masking strategy, we randomly mask the genes and proteins in a cell each time, and the masking positions of the two omics are not synchronized in the first stage. In the second stage, we will set the masking ratio to 0.
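The unsynchronized masking strategy can be sketched as below: each cell receives its own random set of masked features, and the two modalities are masked by independent draws. This is an illustration only. The helper name `random_mask`, the toy matrix sizes, and the choice to zero out masked entries are assumptions (an MAE-style encoder would typically drop masked positions and encode only the unmasked ones); the 15% ratio is the value reported as optimal in the ablation experiments.

```python
import numpy as np

def random_mask(x, mask_ratio, rng):
    """Mask a random subset of features independently for every cell.

    Returns the masked copy and a boolean mask (True = masked position).
    """
    n_cells, n_features = x.shape
    n_masked = int(round(mask_ratio * n_features))
    # Sorting random noise row-wise yields an independent permutation per cell.
    order = rng.random((n_cells, n_features)).argsort(axis=1)
    mask = np.zeros((n_cells, n_features), dtype=bool)
    np.put_along_axis(mask, order[:, :n_masked], True, axis=1)
    return np.where(mask, 0.0, x), mask

rng = np.random.default_rng(0)
rna = rng.random((5, 20)) + 1.0    # toy gene matrix: 5 cells x 20 genes
adt = rng.random((5, 40)) + 1.0    # toy protein matrix: 5 cells x 40 proteins

# Separate calls give unsynchronized mask positions for the two omics.
rna_masked, rna_mask = random_mask(rna, 0.15, rng)
adt_masked, adt_mask = random_mask(adt, 0.15, rng)
```

The boolean masks can be kept alongside the data so that a reconstruction loss can later be evaluated on whichever positions the training recipe calls for.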
In the second stage, we refined the pre-trained model by incorporating 30% annotated data. After convergence was reached, the model generated global cell embeddings for the entire dataset, as well as local cell embeddings for the respective modalities of data, in preparation for downstream tasks. Finally, to validate our model, we employed scRNA-seq data and followed a training process similar to that of the second stage. It should be noted that we performed the pre-training and fine-tuning experiments on the five CITE-seq datasets separately.
In addition, we conducted ablation studies on five CITE-seq datasets to underscore the effectiveness of the cross-attention architecture. We utilized three distinct evaluation metrics (ARI, NMI, and FMI) to determine the impact of the cross-attention-based model compared with the baseline model, which relies solely on element-wise addition (Fig. 2A). The results showed that the indicators increased substantially on all datasets except NMI on the SPL206 dataset. These findings provide further evidence of the effectiveness of the proposed structural innovation. Additionally, we conducted mask ratio ablation experiments on five CITE-seq datasets, evaluated using ARI, NMI, and FMI, revealing that a 15% mask ratio yields optimal results (Fig. 2B–F).

Ablation experiments for different modules in scMMAE. (A) Performance of scMMAE with cross-attention mechanism and without cross-attention mechanism evaluated by ARI, NMI, and FMI. (B–F) Performance of scMMAE with different mask ratios during training in five CITE-seq datasets evaluated by ARI, NMI, and FMI. ARI: adjusted rand index, NMI: normalized mutual information, FMI: Fowlkes-Mallows index.
When choosing the number of HVGs, we tested values from 1000 to 5000. When the number of HVGs was set to 4000, the model performed consistently well. Therefore, we used 4000 HVGs in our experiments (Fig. S10A–E). DI and omics information were also selectively ablated to assess the individual contributions of cross-modal and modality-specific information to the performance of downstream cell type prediction tasks (Fig. S10F–J).
The training strategy was key to the strong performance of scMMAE. We found that if scMMAE is not pre-trained and directly processes data, the performance drops considerably. This shows that the pre-training strategy is necessary, as it allows the model to better understand the expression information of genes and proteins (Fig. S11).
Throughout the experiment process, we determined the optimal weighting of the loss function, prioritizing transcriptomic data (0.7) over proteomic data (0.3). The finalized model configuration includes six encoder layers and four decoder layers, with all heads set to 2 and a dropout ratio maintained at 0.1.
Evaluation metric of clustering results
All clustering and community detection results are measured using the ARI [23], NMI [24], and FMI [25]. Given two sets of clusterings Y (true labels) and |$\hat{Y}$| (predicted labels) on x samples, Y contains m clusters |$\{U_{1}, U_{2}, \ldots , U_{m}\}$|, and |$\hat{Y}$| contains n clusters |$\{V_{1}, V_{2}, \ldots , V_{n}\}$|. |$n_{\text{ij}}$| denotes the number of samples belonging to |$Y_{\text{i}}$| and |$\hat{Y}_{\text{j}}$|. ARI formula is as follows:
where |$ |Y_{i}| $| and |$ |\hat{Y}_{j}| $| denote the number of samples in |$ Y_{i} $| and |$\hat{Y}_{j} $|, respectively. NMI is a metric for evaluating network segmentation achieved by community-finding techniques, which can be computed as
FMI is used to determine the similarity between two clusterings and is defined as
Other evaluation metrics are illustrated in supplementary materials. In sepsis case studies, we used the area under the receiver operating characteristic (AUROC) curve to evaluate the model’s performance across sepsis datasets.
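To make the pair-counting view of these metrics concrete, the sketch below computes ARI and FMI directly from the contingency counts |$n_{\text{ij}}$| using their standard textbook definitions. It is a minimal illustration and not the evaluation code used in the study; the helper names `contingency` and `ari_fmi` are hypothetical, and degenerate inputs (for example, every sample in its own cluster) are not handled.

```python
import numpy as np
from math import comb

def contingency(true_labels, pred_labels):
    """Table whose entry (i, j) counts samples in true cluster i and predicted cluster j."""
    _, ti = np.unique(true_labels, return_inverse=True)
    _, pi = np.unique(pred_labels, return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=int)
    np.add.at(table, (ti, pi), 1)
    return table

def ari_fmi(true_labels, pred_labels):
    table = contingency(true_labels, pred_labels)
    n = int(table.sum())
    # Numbers of sample pairs grouped together in both, in truth, and in prediction.
    sum_ij = sum(comb(int(v), 2) for v in table.ravel())
    sum_true = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_pred = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_true * sum_pred / comb(n, 2)
    ari = (sum_ij - expected) / (0.5 * (sum_true + sum_pred) - expected)
    fmi = sum_ij / np.sqrt(sum_true * sum_pred)
    return ari, fmi

# Same partition with permuted cluster names: both scores are 1.
ari_same, fmi_same = ari_fmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Prediction that splits every true cluster evenly: ARI is negative, FMI is 0.
ari_split, fmi_split = ari_fmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Both metrics are invariant to how clusters are named, which is why they suit comparisons between Leiden cluster IDs and annotated cell types.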
Downstream analysis for sepsis with ScMMAE
We initially employed scMMAE to cluster data from both sepsis and control groups, followed by annotating the identified clusters using true labels. After this clustering and annotation process, we conducted a binary classification task for sepsis. The performance of the fine-tuned network was evaluated across the aforementioned datasets using the AUROC curve. Our assessment benchmarked this performance against existing biomarkers, including FCMR/PLAC8, SeptiCyte, and sNIP, along with our scMMAE model. After establishing these clusters with data from all participants, we further analyzed the variance in cell state abundances between sepsis and control samples, with particular emphasis on the MS1 and MK subpopulations to explore changes in their proportions.
Visualization, clustering, and annotation
In our study, we employed a |$k\times m$| global cell embedding matrix to represent low-dimensional embeddings of |$k$| cells. These embeddings were subsequently utilized for downstream analyses, which included constructing a cell adjacency matrix and performing cell clustering. The adjacency matrix was derived from the cell embeddings using a K-nearest neighbors algorithm, with the number of neighbors set to the default value of 20. Cell clustering was then executed using the Leiden algorithm [26], which operates on the adjacency matrix. To aid visualization, we applied the UMAP algorithm to reduce the cell embeddings and extracted latent features to a 2D space, enabling the visual discrimination of gene expression levels across different cell clusters. For the UMAP parameters, we set the number of neighbors to 15, the minimum distance to 0.1, and the number of components to 2. These parameters were consistently applied across all benchmark methods for comparison. Following clustering, we identified differential genes within each cluster, serving as distinctive signatures for various cell types. This was achieved using the Wilcoxon rank-sum test, which assesses differences between two populations based on their relative ranks. Finally, we annotated the cell clusters utilizing marker genes identified in the previous step. In the comparison of batch effect removal, Scanpy and TotalVI required further parameters and settings for batch effect. For the fairness of comparison between different methods, we only used the default settings for Scanpy and TotalVI.
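The first step of this pipeline, turning the |$k\times m$| embedding matrix into a cell adjacency matrix, can be sketched with plain NumPy. This is a simplified stand-in for the neighbor graph that a toolkit such as Scanpy would build: the Euclidean metric, the binary edges, and the symmetrization by union are assumptions, and `knn_adjacency` is a hypothetical helper. Only the default of 20 neighbors and the 128-dimensional embedding come from the text.

```python
import numpy as np

def knn_adjacency(embedding, n_neighbors=20):
    """Symmetric binary k-nearest-neighbor adjacency matrix from cell embeddings."""
    sq_norms = (embedding ** 2).sum(axis=1)
    # Squared Euclidean distances between all pairs of cells.
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dist2, np.inf)          # a cell is not its own neighbor
    nearest = np.argsort(dist2, axis=1)[:, :n_neighbors]
    adjacency = np.zeros(dist2.shape, dtype=bool)
    np.put_along_axis(adjacency, nearest, True, axis=1)
    # Keep an edge if either cell lists the other among its nearest neighbors.
    return adjacency | adjacency.T

rng = np.random.default_rng(0)
cells = rng.normal(size=(100, 128))          # toy embedding: 100 cells x 128 dims
adjacency = knn_adjacency(cells, n_neighbors=20)
```

A community detection algorithm such as Leiden would then operate on this graph, and UMAP would be run separately for the 2D visualization.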
Results
Integrative analysis of transcriptomics and proteomics data with ScMMAE
We evaluated the performance of scMMAE in integrative analysis of transcriptomics and proteomics across five CITE-seq datasets including SPL111 [18], SPL206 [18], PBMC5K [27], PBMC10K [28], and MALT10K [29], which contains murine cells, murine cells, peripheral blood mononuclear cells (PBMCs), PBMCs, and (Mucosa-Associated Lymphoid Tissue) MALT cells, respectively. SPL111 and SPL206 datasets include cells collected from murine spleen and lymph nodes. PBMC5K and PBMC10K datasets contain cells collected from PBMC. MALT10K datasets consist of cells from a MALT tumor, a rare kind of malignant lymphoma. We compared scMMAE to existing benchmarking methods, including BREMSC, jointDIMMSC, scMM, SCOIT, and TotalVI, using ARI, NMI, FMI, which are three most commonly used metrics to evaluation the representation of multi-omics fusion. On top of that, we applied seven other metrics to evaluate the performance, including adjusted mutual information (AMI), silhouette coefficient (SC), etc. Among 9 of the 10 evaluated metrics, a higher value denotes superior performance, whereas for the Davies–Bouldin Index (DBI), a lower value signifies better performance.
ScMMAE consistently achieved top-tier average performance in ARI, NMI, and FMI across the five CITE-seq datasets compared with the other five methods (Fig. 3A). Some evaluation metrics, such as DBI, were not applicable to BREMSC and jointDIMMSC, because computing them requires embedding coordinates and these methods do not generate independent visualizations. Overall, scMMAE outperformed existing methods in eight metrics, including the three most important (ARI, NMI, and FMI), and ranked second in the Calinski–Harabaz index (CHI) and Jaccard index (JI), indicating the strong performance of scMMAE in multi-omics fusion.

Evaluation and comparison of scMMAE on five multimodal omics datasets. (A) The overall performance of scMMAE and five existing methods in ARI, NMI, FMI, AMI, ASW, CHI-nor, F-measure, JI, SC, and DBI-nor across five CITE-seq datasets. (B–F) The specific scores of the six models on five datasets: SPL111, SPL206, PBMC5K, PBMC10K, and MALT10K. (G) Comparison of scMMAE and existing methods evaluated by DBI on five datasets. NA indicates that the model is not applicable for DBI. ARI: adjusted Rand index, NMI: normalized mutual information, FMI: Fowlkes–Mallows index, AMI: adjusted mutual information, ASW: average silhouette width, CHI-nor: Calinski–Harabaz index normalization, JI: Jaccard index, SC: silhouette coefficient, DBI-nor: Davies–Bouldin index normalization.
A dataset-by-dataset comparison demonstrated the comprehensive superiority of scMMAE (Fig. 3B–F). ScMMAE achieved better performance than the other methods on the three most important metrics, ARI, NMI, and FMI, in all five datasets. Although scMM and SCOIT surpassed scMMAE in CHI and JI, they did not perform well in other metrics. In DBI, scMMAE also exhibited the best performance across all datasets except PBMC5K, where it was close to the best (Fig. 3G).
We projected the cells in the PBMC10K dataset (CITE-seq) into a 2D space using UMAP (Fig. 4). All methods tested in this section showed very similar capacities in resolving the major cell types (myeloid cells, B cells, T cells, and natural killer cells; Fig. 4A), probably owing to the distinct gene expression patterns of these cell populations (Fig. 4B–E). The expression of the marker genes CD4, CD8A, ITGAM, and JCHAIN is shown for CD4 T cells, CD8 T cells, macrophages, and B cells, respectively (Fig. 4B–E). Compared with the major cell lineages, the transcriptomes of T cell subtypes are very similar to each other, as reflected by their poorly separated UMAP projections [30–32]. This similarity makes T cell subtype identification a challenging problem in single-cell transcriptomic studies. In our parallel comparison of the representation methods, all T cell subtypes formed compact clusters with clear inter-cluster separation in the UMAPs of scMMAE and TotalVI, whereas the separation between T cell subtypes was blurred in the UMAPs of Scanpy, scMM, and SCOIT (Fig. 4A–B). In this regard, scMMAE and scMM also performed better at myeloid subtype identification, since Scanpy, SCOIT, and TotalVI failed to discriminate CXCL8+ macrophages from CXCL8- ones (Fig. 4A and F). Collectively, these results suggest that scMMAE exhibits superior performance in cell subtype identification tasks. Detailed information about the subpopulations and marker genes for all cell types in the PBMC10K dataset is available in Supplementary Table S1. The clustering results for the other four CITE-seq datasets (PBMC5K, MALT10K, SPL111, and SPL206) based on the scMMAE output are shown in Supplementary Figs S1–S4, respectively.

The projections of cells using the different methods based on CITE-seq data. (A) Cell types are distinguished by scMMAE, Scanpy, scMM, SCOIT, and TotalVI. The blue, red, orange, and purple circles highlight B cells, T cells, myeloid cells, and natural killer cells, respectively. (B–C) Expression of marker genes CD4 and CD8A for CD4 and CD8 T cells. (D) Expression of marker gene ITGAM for myeloid cells. (E) Expression of marker gene JCHAIN for B cells. (F) scMMAE can distinguish CXCL8+ and CXCL8- macrophages.
Enhancing transcriptomics representation with scMMAE
Training with multimodal data can yield a better classifier and thus enhance unimodal classification [33, 34]. This approach also works for representation learning in single-cell multiomics (Fig. S8; for a detailed explanation, please refer to the Discussion). ScMMAE can transfer the deep learning model learnt from fused omics to enhance RNA representation. We applied scMMAE alongside four existing methods, Scanpy [14], Seurat [15], Pagoda2 [17], and scVI [16], to four scRNA-seq cohorts, namely IFNB [19], CBMC [1], PBMC 3K [35], and BMCITE [20]. IFNB profiled PBMCs, including a group stimulated with interferon beta. CBMC collected cord blood mononuclear cells from humans. PBMC 3K collected PBMCs from healthy donors. BMCITE was obtained from the bone marrow mononuclear cells of a single human donor. ScMMAE achieved superior performance compared to the other four methods in terms of the mean values of the 10 evaluation metrics on the four datasets (Fig. 5A). Although Seurat performed slightly better than scMMAE on one cohort (PBMC 3K), it did not achieve results comparable to those of scMMAE on the other three cohorts (Fig. 5B). The pre-trained scMMAE, incorporating a second data modality, proved highly beneficial for RNA representation and consistently improved performance in different situations (Fig. 5B).

Comparison of scMMAE to existing methods on single-cell transcriptomics data. (A) The overall performance of scMMAE and four existing methods in ARI, NMI, FMI, AMI, ASW, CHI-nor, F-measure, JI, SC, and DBI-nor across four RNA-seq datasets. (B) The specific scores of the five models on four datasets. ARI: adjusted Rand index, NMI: normalized mutual information, FMI: Fowlkes–Mallows index, AMI: adjusted mutual information, ASW: average silhouette width, CHI-nor: Calinski–Harabaz index normalization, JI: Jaccard index, SC: silhouette coefficient, DBI-nor: Davies–Bouldin index normalization.
To demonstrate the usefulness and superiority of scMMAE for downstream analysis, we visualized the clustering results of the five methods with cell type annotation [19] (IFNB; Fig. 6A). The IFNB dataset is a scRNA-seq dataset commonly used to study the effects of interferon-beta (IFN-|$\beta$|) on cells. This dataset primarily contains gene expression information from cells stimulated with IFN-|$\beta$|, making it useful for analyzing immune responses, cellular signaling pathways, and the regulatory effects of interferon on different cell types. ScMMAE resolved four major cell populations and 12 subtypes with clear boundaries. We displayed the marker genes CD3E, CD8A, ITGAM, CD19, and FCER1A for CD4 T cells, CD8 T cells, macrophages, B cells, and conventional dendritic cells, respectively. Because T cell subtypes share similar transcriptomic characteristics, distinguishing them is challenging. The UMAP projection demonstrated the capability of scMMAE to segregate CD4 and CD8 T cells into separate yet proximate clusters, whereas Scanpy, Seurat, Pagoda2, and scVI aggregated CD4 and CD8 T cells into a single cluster (Fig. 6A–C). Moreover, scMMAE and scVI performed better in myeloid subpopulation identification, as macrophages were split into two groups by Scanpy, Seurat, and Pagoda2. Collectively, the results demonstrated the effectiveness of scMMAE for scRNA-seq enhanced by multimodal knowledge. The projections of cells in the other three RNA-seq datasets (PBMC 3K, BMCITE, and CBMC) are displayed in Supplementary Figs S5–S7, respectively.

The projections of cells using the different methods based on scRNA-seq data. (A) Cell types are distinguished by scMMAE, Scanpy, Seurat, Pagoda2, and scVI. The blue, red, brown, light blue, and purple circles highlight B cells, T cells, myeloid cells, macrophage cells, and natural killer cells. (B-C) Expression of marker genes CD3E and CD8A for CD4 and CD8 T cells. (D) Expression of CD19 as a marker gene for B cells. (E) Expression of FCER1A as a marker gene for conventional dendritic cells.
ScMMAE overcomes batch effects and preserves tissue information
Removing batch effects appropriately while preserving tissue information is a challenge in single-cell analysis. Among the five CITE-seq datasets mentioned above, only SPL206 includes different batches and tissues. SPL111 and SPL206 came from the same source but profiled 111 and 206 proteins, respectively. We therefore took SPL206 as an example to perform batch effect removal and compare scMMAE with other methods. The SPL206 CITE-seq dataset consists of two batches and two tissues (lymph node and spleen). The batches remained distinct in the UMAPs of Scanpy, SCOIT, and TotalVI (Fig. 7A–E). In contrast, with scMMAE and scMM, the two batches of cells were mixed very homogeneously (Fig. 7A and C). However, removing the batch effect may also eliminate tissue information. We therefore annotated cells collected from lymph nodes and spleen (Fig. 7F–J). Cells from different tissues were well distinguished in Scanpy (Fig. 7G) and partially distinguished in scMMAE (Fig. 7F), SCOIT (Fig. 7I), and TotalVI (Fig. 7J), whereas they were uniformly mixed in scMM. These results demonstrate that scMMAE performs better at removing batch effects while preserving tissue information.

Mitigation of batch effects with preservation of tissue-specific information. (A–E) The projection of cells from different batches using scMMAE, Scanpy, scMM, SCOIT, and TotalVI. (F–J) The projection of cells from lymph nodes and spleen using scMMAE, Scanpy, scMM, SCOIT, and TotalVI.
In addition to the qualitative analysis above, we conducted a quantitative evaluation of the model’s performance, using batch average silhouette width (Batch-ASW) and graph connectivity (GC) to assess its effectiveness in batch effect removal [36]. Among the five methods, our model ranked first in GC and second in Batch-ASW, demonstrating its strong capability in addressing batch effects (Fig. S12).
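As an illustration of the GC metric, the sketch below follows the commonly used definition: for each cell type, take the subgraph of the neighbor graph restricted to cells of that type, measure the fraction of those cells that fall in its largest connected component, and average over cell types. This is an assumption about the definition intended by [36], not the evaluation code of this study, and the function name is hypothetical.

```python
from collections import deque


def graph_connectivity(adjacency, cell_types):
    """Average, over cell types, of the largest-component fraction.

    adjacency: symmetric 0/1 matrix (list of lists) of the cell neighbor graph.
    cell_types: one label per cell, in the same order as the matrix rows.
    """
    scores = []
    for label in set(cell_types):
        members = {i for i, t in enumerate(cell_types) if t == label}
        seen = set()
        largest = 0
        for start in members:
            if start in seen:
                continue
            # Breadth-first search restricted to cells of this type.
            queue = deque([start])
            seen.add(start)
            size = 0
            while queue:
                node = queue.popleft()
                size += 1
                for neighbor, connected in enumerate(adjacency[node]):
                    if connected and neighbor in members and neighbor not in seen:
                        seen.add(neighbor)
                        queue.append(neighbor)
            largest = max(largest, size)
        scores.append(largest / len(members))
    return sum(scores) / len(scores)


# Toy graph with two edges: cell 0 - cell 1 and cell 2 - cell 3.
toy_graph = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
well_mixed = graph_connectivity(toy_graph, ["T", "T", "B", "B"])
fragmented = graph_connectivity(toy_graph, ["T", "B", "T", "B"])
```

A score near 1 means that cells of the same type stay connected in the integrated graph regardless of batch, whereas lower scores indicate that a cell type has been split apart.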
Assisting biomarker identification with scMMAE
ScMMAE, as a representation learning method, can benefit downstream analysis of scRNA-seq data. Taking a scRNA-seq profile of sepsis [37, 38] as an example, we show that scMMAE can assist biomarker identification. Sepsis is a life-threatening condition in which the immune system overreacts to an infection [39–42]. This profile includes 106 545 PBMCs collected from 15 sepsis patients in the ICU, 4 patients in hospital wards, 27 patients in the emergency department (ED), and 19 healthy subjects. The ED group consists of 10 patients with urinary tract infection (UTI) with leukocytosis (blood WBC |$\geq $| 12 000 per mm|$^{3}$|) but no organ dysfunction (LeuK-UTI), 7 patients with UTI with mild or transient organ dysfunction (Int-URO), and 10 patients with UTI with clear or persistent organ dysfunction (Urosepsis, URO).
We clustered and annotated cells from sepsis samples and control samples using the fine-tuned scMMAE based on the CITE-seq dataset described above (Fig. 8A–B and Fig. S9). Six cell types (T cells, B cells, natural killer cells, monocytes, dendritic cells, and megakaryocytes) were annotated, and different cell states were identified, including T cell states (TS), B cell states (BS), NK cell states (NS), monocyte states (MS), dendritic cell states (DS), and megakaryocytes (MK). Notably, monocyte state 1 (MS1) and MK showed an increase in sepsis compared to the controls, indicating the potential significance of these subpopulations in sepsis. Comparing the numbers of MS1 and MK cells in sepsis and healthy subjects also revealed a significant increase of MK (P value=2.558e-05; Fig. 8C) and MS1 (P value=1.840e-5; Fig. 8D) in sepsis, indicating the potential of MK and MS1 as biomarkers.

Representation of cells in sepsis patients and normal controls using scMMAE to identify sepsis biomarkers. (A–B) The projection of cells using scMMAE demonstrates a larger proportion of MS1 and MK in sepsis patients (A) compared to normal controls (B). (C) MK has a significantly larger proliferation in sepsis (P-value < 0.0001). (D) MS1 has a significantly larger proliferation in sepsis (P-value < 0.0001). (E) ROC curve of different methods for sepsis diagnosis including MS1 and MK proportion discovered by scMMAE and existing biomarkers (FCMR/PLAC8, SeptiCyte, and sNIP). (F) ROC curve of different methods for sepsis diagnosis in ED including MS1 and MK proportion discovered by scMMAE and existing biomarkers (FCMR/PLAC8, SeptiCyte, and sNIP). MS: monocyte states, MK: megakaryocytes, TS: T cell states, BS: B cell states, NS: NK cell states, DS: dendritic cell states, ROC: receiver operating characteristic.
We assessed the diagnostic capability of MS1 and MK using the AUROC (Fig. 8E). The proportion of MS1 in sepsis and control samples revealed by scMMAE achieved an AUROC of 0.90, which is higher than that of the existing biomarkers FAIM3/PLAC8 (AUROC = 0.75) [43], SeptiCyte (AUROC = 0.56) [44], and sNIP (AUROC = 0.61) [45]. We also examined whether scMMAE could distinguish between septic patients and normal subjects in the ED setting (Fig. 8F). The performance of MS1 (AUROC = 0.77) was better than that of the existing biomarkers FAIM3/PLAC8 (AUROC = 0.66), SeptiCyte (AUROC = 0.48), and sNIP (AUROC = 0.65). The results demonstrated that scMMAE, as a representation of scRNA-seq enhanced by multimodal omics, can assist downstream analysis such as biomarker identification.
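For readers unfamiliar with how a single per-sample quantity such as a cell-state proportion yields an AUROC, the sketch below uses the rank-based interpretation: the AUROC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The proportions in the example are made-up toy values, not data from this study.

```python
def auroc(scores_positive, scores_negative):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in scores_positive:
        for q in scores_negative:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical per-sample proportions of one cell state (toy values only).
toy_sepsis = [0.30, 0.22, 0.18, 0.09]
toy_control = [0.10, 0.08, 0.05]
toy_auc = auroc(toy_sepsis, toy_control)
```

An AUROC of 0.5 corresponds to a marker with no discriminative ability, so values below 0.5 indicate a marker that ranks the two groups in the wrong direction on the cohort at hand.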
Discussion
The rapidly developing area of single-cell multi-omics analysis necessitates the development of methods for the collaborative analysis of multimodal data. In this study, we introduced scMMAE, a deep learning-based method for the fusion of RNAs and proteins at the single-cell level. ScMMAE can retain the DI of each modality and can switch from bimodal training to unimodal prediction. Utilizing self-supervised and transfer learning principles, scMMAE performs well compared to existing multi-omics analysis techniques and state-of-the-art methods in single-modal data analysis on real-world datasets. This represents a significant advancement in terms of interpretative capabilities, surpassing previous methods. Comprehensive ablation studies further underscore the architectural efficiency of scMMAE.
Previous multimodal omics fusion methods, such as Scanpy, Seurat, scVI, and TotalVI, have predominantly focused on aligning data from disparate modalities and integrating the information shared between the two, often overlooking the unique information intrinsic to each omics. However, the unique information from each omics might be crucial, as it can compensate for the deficiencies of the other modalities, thereby enhancing the overall performance of multimodal approaches. In our method, scMMAE aggregates the fused information and DI from transcriptomics and proteomics.
Leveraging the comprehensive insights gained from multimodal omics during the training phase, scMMAE becomes more adept at interpreting and representing single-cell profiles when applied to transcriptomics datasets. From a theoretical perspective grounded in machine learning, the model can leverage distinct modalities to deduce more precise distributions or improved classification boundaries. As illustrated in Supplementary Fig. S8, discrimination of cell types using transcriptomics alone might result in multiple plausible demarcations for classification (Fig. S8A). However, incorporating proteomic information enhances the accuracy of defining the classification boundary (Fig. S8B). Once determined through integrative analysis of multi-omic data, the refined classification boundary can serve as a valuable tool for unimodal analysis, such as transcriptomic analysis.
In this study, we utilized multiple metrics to evaluate the clustering results, aiming to take a comprehensive view of the effectiveness of each method. For example, the scMM method consistently achieves the top rank in the CHI metric. Because CHI scores are calculated from the between-class variance and the within-class variance, this indicates that scMM places greater emphasis on the separation and cohesion of clusters. However, scMM neglects other factors, leading to low scores on various other metrics. In contrast, our method, scMMAE, performs well on every metric, achieving good results across the board.
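For reference, the standard definition of the CHI can be written as the ratio of between-cluster to within-cluster dispersion. The notation is introduced here for illustration only and is not taken from the paper: |$N$| cells with embeddings |$x_i$|, |$C$| clusters, |$n_c$| cells in cluster |$c$| with centroid |$\mu_c$|, and global centroid |$\mu$|.

```latex
\mathrm{CHI}
  = \frac{\displaystyle \sum_{c=1}^{C} n_c \,\lVert \mu_c - \mu \rVert^{2} \,/\, (C - 1)}
         {\displaystyle \sum_{c=1}^{C} \sum_{i \in c} \lVert x_i - \mu_c \rVert^{2} \,/\, (N - C)}
```

The numerator grows when cluster centroids lie far apart and the denominator shrinks when clusters are compact, so a method can score highly on CHI by producing tight, well-separated groups even if those groups agree poorly with the annotated cell types.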
Although scMMAE achieved good performance on single-cell transcriptomics datasets by incorporating protein information and DI, it still has limitations. As with other deep neural networks, it is hard to provide an explanation of the relationships between the RNAs and proteins. Therefore, we will try to improve the interpretability of scMMAE in the near future. In addition, we plan to develop a variant of scMMAE to integrate scATAC-seq and scRNA-seq datasets. This will enable downstream analyses, including gene regulatory network inference and transcription factor identification.
Key Points
We propose scMMAE to fuse single-cell multi-omics using cross-attention with MAE.
scMMAE transforms cross-attention to self-attention to enhance single omics (scRNA-seq) representation with around 10% improvement.
Downstream analysis reveals its capability of distinguishing sub-populations of cells, mitigating batch effects while preserving tissue-specific information, and identifying biomarkers of diseases such as sepsis.
Acknowledgements
The authors would like to thank Jinqiao Duan for his collaboration and significant contributions to this study. The computational resources are supported by SongShan Lake HPC Center (SSL-HPC) in Great Bay University.
Funding
This work was partly supported by the Guangdong Provincial Key Laboratory of Mathematical and Neural Dynamical Systems (2024B1212010004), National Natural Science Foundation of China (32300554), National Natural Science Foundation of China (32370711), National Natural Science Foundation of China (32100515), Shenzhen Science and Technology Program (JCYJ20220530152409020), CUHK direct grant for research under Award Numbers 2021.061 and 2022.080, and Shenzhen Medical Research Funding (A2303033).
Authors’ contributions
L.C. and X.Z. conceived the experiment(s), D.M. and Y.F. conducted the experiment, D.M. and Y.F. analyzed the results. X.Z., Z.Y., Q.C., and K.Y. suggested the model. D.M. and X.Z. wrote the manuscript and all authors reviewed the manuscript.
References
Doersch C.
Minoura K, Abe K, Nam H. et al.
Steier Z, Maslan A, Streets A.
Maozai T.
Huang X, Ma Z, Meng D. et al.
Hao Y, Stuart T, Kowalski MH. et al.
Gayoso A, Lopez R, Xing G. et al.
Barkas N. et al.
He K, Chen X, Xie S. et al.
Vinh NX, Julien E, James B.
Zhang X, Yoon J, Bansal M. et al.
Zheng X, Meng D, Chen D. et al.