Meihong Gao, Shuhui Liu, Yang Qi, Xinpeng Guo, Xuequn Shang, GAE-LGA: integration of multi-omics data with graph autoencoders to identify lncRNA–PCG associations, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac452, https://doi.org/10.1093/bib/bbac452
Abstract
Long non-coding RNAs (lncRNAs) can disrupt the biological functions of protein-coding genes (PCGs) to cause cancer. However, the relationship between lncRNAs and PCGs remains unclear and difficult to predict. Machine learning has achieved satisfactory performance in association prediction but, to our knowledge, has rarely been applied to lncRNA–PCG association prediction. Therefore, we introduce GAE-LGA, a powerful deep learning model built on graph autoencoders, to recognize potential lncRNA–PCG associations. GAE-LGA jointly explored lncRNA–PCG learning and cross-omics correlation learning for effective lncRNA–PCG association identification. The functional similarity and multi-omics similarity of lncRNAs and PCGs were accumulated and encoded by graph autoencoders to extract feature representations of lncRNAs and PCGs, which were subsequently decoded to obtain candidate lncRNA–PCG pairs. Comprehensive evaluation demonstrated that GAE-LGA can successfully capture lncRNA–PCG associations with strong robustness and outperformed other machine learning-based identification methods. Furthermore, multi-omics features were shown to improve the performance of lncRNA–PCG association identification. In conclusion, GAE-LGA can act as an efficient application for lncRNA–PCG association prediction with the following advantages: it fuses multi-omics information into the similarity network, making the feature representation more accurate; it can predict lncRNA–PCG associations for new lncRNAs; and it identifies potential lncRNA–PCG associations with high accuracy.
Introduction
Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that are barely involved in the translation process[1–3]. Currently, the roles of most lncRNAs are not clear, and only 17 754 of them are annotated by GENCODE V35[4]. These annotations suggest that lncRNAs can perturb the expression of protein-coding genes (PCGs) at multiple levels and participate in several important biological processes[1, 5–7]. Given the aforementioned biological significance of lncRNAs, combined with their large number and the diversity of their mechanisms of action, it becomes necessary to explore the relationship between lncRNAs and PCGs.
Some biological experiments have been designed to verify lncRNA–PCG associations[8–11], but their large-scale use is limited by time and financial constraints. Thus, a reliable computational tool for identifying lncRNA–PCG associations based on existing experimental data is needed. To date, various computational approaches have been designed to recognize lncRNA–PCG associations. These approaches fall into three categories: sequence-based, expression-based and machine learning-based. Sequence-based approaches use the free energy of base sequences to predict lncRNA–PCG associations[12–14]. They excel at predicting direct physical interactions between lncRNAs and PCGs but fail to identify potential lncRNA–PCG associations. Expression-based methods calculate the degree of association between lncRNAs and PCGs from their expression levels[15–17]. Because lncRNA and PCG expression is specific to particular samples and stages, these methods can only predict lncRNA–PCG associations in specific samples at specific stages. Machine learning-based approaches, proposed more recently, identify candidate lncRNA–PCG pairs by learning from known lncRNA–PCG pairs[18, 19]. They can identify both direct and potential lncRNA–PCG associations without the expression-specificity limitations above.
Machine learning methods provide an efficient solution for predicting associations between lncRNAs and other objects, including lncRNA–disease association[20–28], lncRNA–miRNA association[29–31] and lncRNA–protein association[32–35], etc. Most of these methods can be grouped into the following three categories: traditional machine learning methods[20, 24, 25, 30, 33], matrix completion methods[21–23, 29, 35] and deep learning methods[26–28, 31, 32, 34]. Inspired by the above studies, machine learning-based approaches for lncRNA–PCG association prediction were proposed. A predictor based on support vector machines (SVMs), logistic regression (LR) and random forest (RF) was first constructed to investigate the relationship between lncRNA and PCG[19]. Subsequently, a deep learning-based method was proposed to screen target PCGs for lncRNA[18]. The results of these two approaches demonstrate the effectiveness of machine learning in identifying lncRNA–PCG associations, but there is still much room for improvement in identification performance. Therefore, it is of great significance for us to design a new machine learning-based model to accurately identify lncRNA–PCG association, and to explore the regulatory mechanism of lncRNA on PCG.
lncRNA–PCG association identification can be viewed as a link prediction problem, which Graph Convolutional Networks (GCNs) have been shown to solve effectively[36]. In addition, lncRNAs can regulate PCG expression at multiple levels[1] and play a multi-omics synergistic regulatory role in organisms. Based on these two points, we combined GCN encoders and multi-omics pan-cancer data to propose GAE-LGA, a new method for efficient identification of candidate lncRNA–PCG associations. We first integrated functional similarity and multi-omics similarity to build similarity networks for lncRNAs and PCGs, respectively. The functional similarity was inferred from lncRNA–PCG association information, and the multi-omics similarity was inferred from multi-omics information of patients. Then, feature representations of lncRNAs and PCGs were learned from the constructed similarity networks using graph autoencoders. Finally, we constructed a decoder to decode the feature representations to identify potential lncRNA–PCG associations. Compared with existing lncRNA–PCG association prediction methods, GAE-LGA has three new features:
it fuses multi-omics information into the similarity network, making the feature representation more accurate;
it can predict lncRNA–PCG associations for a new lncRNA;
it identifies potential lncRNA–PCG associations with high accuracy and without expression specificity.
Materials and methods
We designed GAE-LGA for lncRNA–PCG association prediction. As shown in Figure 1, the prediction process consists of three steps. First, we collected and preprocessed the multi-omics data and the lncRNA–PCG association network to obtain the required association information and multi-omics information (Figure 1A). Second, we used GCN encoders to aggregate and learn feature representations for lncRNAs and PCGs (Figure 1B). Finally, a decoder combined the feature representations to predict candidate lncRNA–PCG associations (Figure 1C).
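The three-step pipeline above can be sketched as a graph autoencoder: a two-layer GCN encoder over a similarity network, followed by an inner-product decoder over the embeddings. This is a minimal NumPy illustration of the encode/decode idea, not the paper's implementation; the weight shapes, activation choices and normalization rule are assumptions.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize an adjacency/similarity matrix with self-loops:
    D^{-1/2} (A + I) D^{-1/2}, the standard GCN propagation rule."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_encode(A, X, weights):
    """Two-layer GCN encoder: ReLU after the hidden layer, linear output."""
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ weights[0], 0.0)  # hidden layer (ReLU)
    return A_norm @ H @ weights[1]                # embedding layer

def decode(Z_lnc, Z_pcg):
    """Inner-product decoder: sigmoid-activated pairwise association scores."""
    return 1.0 / (1.0 + np.exp(-(Z_lnc @ Z_pcg.T)))
```

Given lncRNA and PCG similarity networks with node features, `decode(gcn_encode(...), gcn_encode(...))` yields a score matrix whose entries can be thresholded into candidate lncRNA–PCG pairs.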

Overview of GAE-LGA. LNC represents lncRNA and PCG represents protein coding gene. (A) In the first step, we performed a preprocessing operation to get multi-omics features and association matrix. (B) To generate feature representations of lncRNAs and PCGs, the association information (functional similarity) and multi-omics information were fused for encoding. (C) Finally, we computed the association score and activated it.
Data collection and preprocessing
Multi-omics data
We downloaded multi-omics data from the TCGA database (https://portal.gdc.cancer.gov/). The multi-omics data comprise single nucleotide variation (SNV), copy number variation (CNV), DNA methylation (DNA Methy) and transcription profiling (TP) data, whose relationship to gene expression is shown in Table S1 (see Supplementary material). After preprocessing, each lncRNA/PCG contains 312 multi-omics features: 66 SNV features, 132 CNV features, 48 DNA Methy features and 66 TP features. Their details are shown in Table 1.
| Feature type | NC | Attribute | NM |
|---|---|---|---|
| SNV | 33 | Chromosome, Position | 66 |
| CNV | 33 | Chromosome, Start, End, Aber | 132 |
| DNA Methy | 12 | Chromosome, Start, End, Beta | 48 |
| TP | 33 | Mean, Standard deviation | 66 |
Note: NC and NM represent the number of cancer types and multi-omics features, respectively. SNV, CNV, DNA Methy and TP are the abbreviations for single nucleotide variation, copy number variation, DNA methylation and transcription profiling, respectively. Aber and Beta refer to mutation type and methylation intensity, respectively.
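Table 1's layout implies a fixed 312-dimensional multi-omics feature vector per gene (66 SNV + 132 CNV + 48 DNA Methy + 66 TP). The sketch below shows one way such a vector could be assembled; the block names, ordering and reshape convention are illustrative assumptions, not the paper's exact preprocessing.

```python
import numpy as np

# Per-omics layout from Table 1: (number of cancer types, attributes per cancer).
OMICS_LAYOUT = {
    "SNV":       (33, 2),  # Chromosome, Position           -> 66 features
    "CNV":       (33, 4),  # Chromosome, Start, End, Aber   -> 132 features
    "DNA_Methy": (12, 4),  # Chromosome, Start, End, Beta   -> 48 features
    "TP":        (33, 2),  # Mean, Standard deviation       -> 66 features
}

def build_feature_vector(per_omics):
    """Concatenate the per-omics blocks of one gene into a single
    312-dimensional feature vector, in the fixed order of OMICS_LAYOUT."""
    parts = []
    for name, (n_cancers, n_attrs) in OMICS_LAYOUT.items():
        block = np.asarray(per_omics[name], dtype=float)
        parts.append(block.reshape(n_cancers * n_attrs))
    return np.concatenate(parts)
```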
Human lncRNA–PCG associations
We collected lncRNA–PCG associations from three gold-standard datasets: LncRNA2Target[9], LncTarD[8] and NPInter[10]. Here, only lncRNAs and PCGs with multi-omics information were used for experimental analysis. Their details are shown in Table 2. LncRNA2Target contains 773 lncRNA–PCG associations, involving 263 lncRNAs and 498 PCGs. LncTarD contains 1444 lncRNA–PCG associations, involving 238 lncRNAs and 716 PCGs. As for NPInter, the numbers of lncRNAs, PCGs and lncRNA–PCG associations are 308, 256 and 369, respectively.
| Dataset | LncRNA | PCG | lncRNA–PCG |
|---|---|---|---|
| LncRNA2Target | 263 | 498 | 773 |
| LncTarD | 238 | 716 | 1444 |
| NPInter | 308 | 256 | 369 |
Note: lncRNA–PCG represents the number of lncRNA–PCG associations, and PCG represents protein-coding gene
Training and testing datasets
We used 10-fold cross-validation to obtain training and testing datasets, generating positive and negative samples in one of two ways. In method 1, we took known lncRNA–PCG associations in the dataset as positive samples and all unknown lncRNA–PCG pairs as negative samples. In method 2, we selected positive samples in the same way as in method 1 and then randomly sampled an equal number of negative samples from the unknown lncRNA–PCG pairs.
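The two sampling schemes can be sketched directly from a binary association matrix. This is an illustrative implementation; the paper's exact sampling procedure may differ in detail.

```python
import numpy as np

def sample_pairs(assoc, method=2, seed=0):
    """Build positive/negative lncRNA-PCG samples from a binary association
    matrix `assoc` (rows: lncRNAs, cols: PCGs).
    Method 1: all unknown pairs are negatives.
    Method 2: negatives are subsampled to match the number of positives."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(assoc == 1)
    neg = np.argwhere(assoc == 0)
    if method == 2:
        idx = rng.choice(len(neg), size=len(pos), replace=False)
        neg = neg[idx]
    return pos, neg
```

Method 2 yields balanced classes, which is why it is the usual choice when known associations are sparse relative to all possible pairs.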
Similarity feature
Feature representation
lncRNA–PCG association prediction
lncRNA–PCG association reconstruction
Association information for a new lncRNA
Complexity analysis
Results
Experimental setup
We performed 10-fold cross-validation on three independent datasets, NPInter, LncTarD and LncRNA2Target, to verify the performance of GAE-LGA. The division of positive and negative samples for each dataset is detailed in Table S3 (see Supplementary material). Four important parameters (number of GCN layers, embedding size, number of hidden layer features and learning rate) were tuned during model training and set to 2, 10, 200 and 0.001, respectively. To evaluate model performance, we calculated the area under the ROC curve (AUC), area under the precision-recall curve (AUPR), accuracy, F1-score, precision, recall and Matthews correlation coefficient (MCC) for the prediction results.
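The evaluation metrics listed above can be computed with scikit-learn. In this sketch, AUPR is approximated by average precision, and the 0.5 decision threshold for the thresholded metrics is an assumption.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, matthews_corrcoef)

def evaluate(y_true, y_score, threshold=0.5):
    """Compute ranking metrics (AUC, AUPR) from association scores and
    threshold-dependent metrics (F1, MCC) from binarized predictions."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "AUPR": average_precision_score(y_true, y_score),
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
    }
```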
Besides, we analyzed the relationships between lncRNAs, PCGs and lncRNA–PCG associations among the datasets (Figure 2). It can be found that only 77 lncRNAs (Figure 2A), 68 PCGs (Figure 2B) and 30 lncRNA–PCG associations (Figure 2C) overlap in the three datasets. Furthermore, we compared the performance of the model with and without overlapping associations (See Figure S1, Supplementary material). It can be found that these overlapping associations slightly improved model performance, thus we distributed these associations evenly across the training and test sets in our experiments.

Relationship between lncRNAs, protein coding genes (PCGs), and lncRNA–PCG associations among three datasets. Dataset1, Dataset2 and Dataset3 represent NPInter, LncTarD and LncRNA2Target, respectively. (A) Relationship between lncRNAs. (B) Relationship between PCGs. (C) Relationship between lncRNA–PCG associations.
Comparison experiments
Effectiveness comparison
To verify the effectiveness of the features learned by the GCN, we compared it with several other representation methods. The decoding method used here is the lncRNA–PCG association reconstruction described in the Materials and methods section. As shown in Table S4 (see Supplementary material), GCN, DeepWalk and Node2vec decrease sequentially in their ability to learn lncRNA and PCG features. Although combining GCN with DeepWalk/Node2vec can improve the effectiveness of the feature representation, its performance is still inferior to that of GCN alone. This is because GCN can capture the global information of the lncRNA/PCG-related network and thus represent the node features well.
Subsequently, we compared the experimental results of GAE-LGA with those of DeepLGP[18], Convolutional Neural Network (CNN)[37], Autoencoder[38] and CNN+Autoencoder[37, 38] on three datasets to validate its performance (Table 3). As we can see, GAE-LGA performs best on the NPInter and LncRNA2Target datasets, significantly improving over state-of-the-art baseline and deep learning methods. On the NPInter dataset, GAE-LGA improves AUC, AUPR, F1-score and MCC by 5.70%, 11.92%, 0.18% and 0.05%, respectively. On the LncRNA2Target dataset, GAE-LGA improves AUC, AUPR, F1-score and MCC by 1.03%, 2.56%, 0.63% and 0.54%, respectively. As for the LncTarD dataset, DeepLGP achieves the highest AUPR, CNN+Autoencoder achieves the highest MCC, but GAE-LGA far outperforms the best method on other evaluation metrics. It improves by 1.22% and 0.37% in AUC and F1 score, respectively. These improvements are attributed to the following two aspects. On the one hand, GAE-LGA combines the association information and multi-omics information of lncRNA and PCG, making the theoretical performance of lncRNA–PCG association higher. On the other hand, GCN can learn the feature representation of lncRNA and PCG more accurately than other feature representation methods, which leads to the performance of lncRNA–PCG association prediction closer to the theoretical value.
Comparison results of the proposed GAE-LGA and other deep learning methods on the NPInter, LncTarD and LncRNA2Target datasets under the same experimental setup

| Method | NPInter | | | | LncTarD | | | | LncRNA2Target | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC |
| GAE-LGA | 0.9743 | 0.6730 | 0.8468 | 0.7262 | 0.9308 | 0.6345 | 0.8831 | 0.6123 | 0.8956 | 0.5295 | 0.7787 | 0.7871 |
| DeepLGP | 0.9218 | 0.6013 | 0.7526 | 0.7216 | 0.9196 | 0.6501 | 0.7508 | 0.6102 | 0.8657 | 0.5058 | 0.7582 | 0.7815 |
| CNN | 0.9102 | 0.5961 | 0.8197 | 0.7135 | 0.9103 | 0.6254 | 0.8003 | 0.6025 | 0.8802 | 0.5062 | 0.7713 | 0.7658 |
| Autoencoder | 0.9013 | 0.5936 | 0.7761 | 0.6897 | 0.9042 | 0.6188 | 0.7952 | 0.5901 | 0.8752 | 0.5108 | 0.7652 | 0.7726 |
| Autoencoder+CNN | 0.9205 | 0.6009 | 0.8453 | 0.7258 | 0.9158 | 0.6291 | 0.8527 | 0.6136 | 0.8865 | 0.5163 | 0.7738 | 0.7829 |
Note: The bold value corresponds to the best performance method for each metric.
Following this, we analyzed the prediction results of traditional machine learning methods (SVM-based, LR-based and RF-based models[19]) under the same experimental setup (see Table S5, Supplementary material). On all three datasets, the average performance metrics of the deep learning methods are much higher than those of the traditional machine learning methods; the advantage is especially clear in AUC. These improvements are attributed to three factors: (i) neural networks have strong function approximation ability, and hence strong learning ability; (ii) deep neural networks express functions more compactly and with higher sample efficiency than shallow neural networks; (iii) deep learning generalizes well, that is, a model with small error on the training set also has small error on the test set. In conclusion, our model has advantages in predicting lncRNA–PCG associations, significantly improving prediction performance compared with existing machine learning methods.
Robustness comparison
We randomly retained a fraction r |$\in \{0.8,0.85,0.9,0.95,1\}$| of the known lncRNA–PCG pairs in each dataset to compare how method performance changes (Figure 3). As we can see, the performance of all methods drops significantly as lncRNA–PCG pairs are removed, except for GAE-LGA, which yields the most robust and highest performance across the different sample sizes. Furthermore, we used the standard deviation of AUC and AUPR across the sample-size groups to evaluate the robustness of the models (Figure 4). The standard deviations of AUC of GAE-LGA on NPInter, LncTarD and LncRNA2Target are 0.0091, 0.0081 and 0.0077 (Figure 4A), which are 7.2479, 11.3529 and 17.8180 times smaller than those of the best comparison method (Figure 4C), and 8.0703, 12.5664 and 19.8222 times smaller than those of the worst comparison method (Figure 4C), respectively.
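The reduced-sample settings can be simulated by retaining a fraction r of the known pairs. This is a minimal sketch; the paper's exact subsampling protocol is not specified here.

```python
import numpy as np

def subsample_associations(assoc, r, seed=0):
    """Keep a fraction r of the known pairs in a binary association matrix
    (setting the rest to 0), as in the robustness experiment's settings."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(assoc == 1)
    n_keep = int(round(r * len(pos)))
    keep = rng.choice(len(pos), size=n_keep, replace=False)
    out = np.zeros_like(assoc)
    out[tuple(pos[keep].T)] = 1  # restore only the retained positives
    return out
```

Re-running training on `subsample_associations(assoc, r)` for each r, and taking the standard deviation of the resulting AUC/AUPR values, reproduces the robustness measure used above.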

Comparison of model performance changes. (A) AUC changes on NPInter. (B) AUC changes on LncTarD. (C) AUC changes on LncRNA2Target. (D) AUPR changes on NPInter. (E) AUPR changes on LncTarD. (F) AUPR changes on LncRNA2Target.

Standard deviation analysis of model performance changes. Here, SD, Dataset1, Dataset2 and Dataset3 represent standard deviation, NPInter, LncTarD and LncRNA2Target, respectively. (A) AUC changes. (B) AUPR changes. (C) GAE-LGA improvement over other deep learning methods. AUC-I1 and AUPR-I1 represent the improvement of GAE-LGA in AUC change and AUPR change over the best deep learning method, respectively. AUC-I2 and AUPR-I2 represent the improvement of GAE-LGA in AUC change and AUPR change over the worst deep learning method, respectively.
The standard deviations of AUPR of GAE-LGA on NPInter, LncTarD and LncRNA2Target are 0.0282, 0.0129 and 0.0118 (Figure 4B), which are 5.3156, 9.1521 and 8.5238 times smaller than those of the best comparison method (Figure 4C), and 5.7036, 13.9338 and 12.4349 times smaller than those of the worst comparison method (Figure 4C), respectively. The improvement arises because GCN performs end-to-end learning on the feature information and network structure information of lncRNA and PCG simultaneously, and can thus capture the global information of the related network to represent lncRNA and PCG features well. Therefore, the GCN encoder-based GAE-LGA model is robust, with a lower standard deviation of model performance than the other methods.
Ablation experiment
To evaluate the importance of multi-omics features in lncRNA–PCG association prediction, we conducted comparative experiments on the NPInter, LncTarD and LncRNA2Target datasets based on different features of lncRNAs and PCGs: functional similarity features, multi-omics similarity features and aggregated similarity (functional similarity and multi-omics similarity) features. As we can see in Table 4, the functional similarity-related experimental group has a higher prediction performance than the multi-omics similarity-related experimental group, while the aggregated similarity-related experimental group has the best prediction performance. This suggests that lncRNAs regulate PCG expression at the multi-omics level, and their multi-omics features contribute to improving model prediction performance.
| Method | NPInter | | | | LncTarD | | | | LncRNA2Target | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC |
| GAE+S1 | 0.9633 | 0.6538 | 0.8601 | 0.7198 | 0.9310 | 0.6209 | 0.8729 | 0.6109 | 0.8950 | 0.6012 | 0.7523 | 0.7691 |
| GAE+S2 | 0.9231 | 0.6136 | 0.8022 | 0.7053 | 0.9022 | 0.6374 | 0.8830 | 0.5926 | 0.8901 | 0.5306 | 0.7021 | 0.7528 |
| GAE+S1+S2 | 0.9743 | 0.6730 | 0.8468 | 0.7262 | 0.9308 | 0.6345 | 0.8831 | 0.6123 | 0.8956 | 0.5295 | 0.7787 | 0.7871 |
Note: The bold value corresponds to the best performance method for each metric. GAE, S1 and S2 represent graph autoencoder, functional similarity features and multi-omics similarity features, respectively.
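The ablation contrasts functional similarity (S1), multi-omics similarity (S2) and their aggregate. One plausible sketch of the fusion step, assuming cosine similarity over the multi-omics feature vectors for S2 and a simple convex combination as the aggregation rule; the paper's exact formulas are given in its methods section.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity of row feature vectors (a candidate S2)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return Xn @ Xn.T

def aggregate_similarity(S1, S2, alpha=0.5):
    """Blend functional similarity S1 with multi-omics similarity S2.
    The convex weight alpha is an illustrative assumption."""
    return alpha * S1 + (1.0 - alpha) * S2
```

Feeding `aggregate_similarity(S1, S2)` into the graph autoencoder corresponds to the GAE+S1+S2 row of Table 4, while passing S1 or S2 alone corresponds to the two ablated rows.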
Case studies
We conducted a case study of GAE-LGA prediction results using published literature that includes experimentally validated lncRNA–PCG associations (Table 5). As we can see, 12 candidate lncRNA–PCG pairs are validated by existing studies. Specifically, it has been found that CDKN2B-AS1 inhibits MMP9 expression in renal clear cell carcinoma[39], CRNDE accelerates cervical cancer progression by altering ZEB1 expression[40], H19 increases CDKN1A expression by reducing HUVEC growth[41] and MALAT1 promotes osteosarcoma development by upregulating CCND1 expression[42]. In addition, MALAT1 upregulates c-Myc expression in thymic epithelial tumors[43], upregulation of MEG3 in human osteosarcoma cells correlates with low expression of MMP9[39], MEG3 inhibits macrophage apoptosis by regulating CDKN2A expression[44] and NEAT1 induces cell differentiation in hepatoblastoma by regulating MMP9 expression[45]. Moreover, PVT1 promotes fracture healing and esophageal cancer progression by regulating the expressions of HMGA2 and ZEB1, respectively[46, 47], the correlation between TUG1 and P53 affects the cell life activities of non-small cell lung cancer[48] and UCA1 can antagonize cell cycle arrest by destabilizing EZH2[49].
| LncRNA | PCG | Disease | Evidence |
|---|---|---|---|
| CDKN2B-AS1 | MMP9 | Renal clear cell carcinoma | PMID: 33608495 |
| CRNDE | ZEB1 | Cervical cancer | PMID: 33469312 |
| H19 | CDKN1A | Hypoxia-related diseases | PMID: 27063004 |
| MALAT1 | CCND1 | Osteosarcoma | PMID: 30365098 |
| MALAT1 | MYC | Thymic epithelial tumors | PMID: 34530916 |
| MEG3 | MMP9 | Osteosarcoma | PMID: 29434890 |
| MEG3 | CDKN2A | Atherosclerosis | PMID: 30672051 |
| NEAT1 | MMP9 | Hepatoblastoma | PMID: 35300348 |
| PVT1 | HMGA2 | Fragility fracture | PMID: 34592894 |
| PVT1 | ZEB1 | Esophageal cancer | PMID: 33848670 |
| TUG1 | TP53 | Non-small cell lung cancer | PMID: 24853421 |
| UCA1 | EZH2 | Urothelial cancer | PMID: 32537408 |
Parameter analysis
GCN layer
We compared the AUCs produced by two-, three-, four- and five-layer GCN encoder models to determine the effect of the number of GCN layers on model performance (Figure 5A). The AUC fluctuations between models with different numbers of GCN layers are small, but fewer layers speed up model convergence, while more layers make the model prone to overfitting. In this experiment, we set the number of GCN layers to two.

Parameter analysis. (A) Analysis of GCN layer. (B) Analysis of embedding size. (C) Analysis of the number of hidden layer features. (D) Analysis of learning rate.
Embedding size
Embedding size refers to the dimensionality of the feature vectors of lncRNAs/PCGs extracted by the model. During model training, we set the embedding size in |$\{10,50,100,200\}$| to verify its impact on GAE-LGA prediction performance (Figure 5B). It can be found that the AUC does not fluctuate much between the four groups of models. When the embedding size is set to 10, the final AUC is slightly larger than that of the other groups; thus, we chose 10 as the embedding size in this experiment.
Number of hidden layer features
The number of hidden-layer features in the GCN encoder is an important parameter that affects the performance of GAE-LGA for lncRNA–PCG association prediction. In this comparison, we varied the number of hidden-layer features over |$\{10,50,100,200\}$| (Figure 5C). GAE-LGA performs best when this parameter is set to 200; therefore, we chose 200 as the number of hidden-layer features in this experiment.
Learning rate
The learning rate is a hyperparameter that determines whether and how quickly the objective function converges to a minimum. Here, we varied the learning rate over |$\{0.1,0.01,0.001,0.0001\}$| (Figure 5D). The model performs best when the learning rate is 0.001. When the learning rate is 0.1 or 0.01, training is prone to gradient explosion: the loss fluctuates widely and the model is difficult to converge. When the learning rate is 0.0001, the model falls into a local optimum and cannot achieve optimal performance. In this experiment, we chose 0.001 as the learning rate.
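The grids searched in this parameter analysis, together with the values finally adopted, can be summarized in a small configuration snippet (an illustrative summary of the text above, not the authors' code):

```python
# Hyperparameter grids explored in the parameter analysis, with the
# values finally adopted for GAE-LGA (summary of Figure 5A-D).
search_space = {
    "gcn_layers":      {"grid": [2, 3, 4, 5],              "chosen": 2},
    "embedding_size":  {"grid": [10, 50, 100, 200],        "chosen": 10},
    "hidden_features": {"grid": [10, 50, 100, 200],        "chosen": 200},
    "learning_rate":   {"grid": [0.1, 0.01, 0.001, 0.0001], "chosen": 0.001},
}

for name, cfg in search_space.items():
    assert cfg["chosen"] in cfg["grid"]  # sanity check: chosen value was searched
    print(f"{name}: {cfg['chosen']}")
```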
Discussion
Despite the importance of lncRNA–PCG associations for dissecting lncRNA pathogenic mechanisms, the current understanding of lncRNA–PCG association identification is still limited. Apart from some sequence-based, expression-based and two recently proposed machine learning-based prediction methods, little effort has been made to identify lncRNA–PCG associations at scale. In this study, driven by the recent progress in the multi-omics synergistic regulation of lncRNAs and PCGs, we explored the mechanism of action of lncRNAs and PCGs from a multi-omics perspective and discovered that their multi-omics features are associated with lncRNA–PCG associations. Based on this finding, we designed a new computational model, GAE-LGA, to predict lncRNA–PCG associations using a graph autoencoder algorithm combined with multi-omics information of genes. Performance and robustness comparisons of GAE-LGA with other deep learning methods on three gold-standard datasets demonstrated that GAE-LGA is a highly effective lncRNA–PCG prediction model.
GAE-LGA can provide meaningful insights into future studies on the regulation of PCG expression by lncRNAs. Traditional sequence-based lncRNA–PCG association prediction methods focus on the binding sites between an lncRNA and its target PCGs, but free-energy calculations may have a high error rate, and a base-pairing strategy can identify only direct physical interactions between lncRNAs and PCGs. Expression-based prediction methods, in turn, rely solely on the expression profiles of lncRNAs and PCGs to analyze their relationship. In contrast, GAE-LGA predicts lncRNA–PCG associations using lncRNA–PCG learning and cross-omics correlation learning. Two learned similarities, functional similarity and multi-omics similarity, are accumulated and encoded by graph autoencoders to obtain embeddings of lncRNAs and PCGs. By combining and decoding these embeddings, GAE-LGA can score the association of each lncRNA–PCG pair, and it therefore has broad applications: (i) GAE-LGA fuses multi-omics information into the similarity network for efficient feature representation, providing a new perspective for lncRNA-related analyses; (ii) more importantly, it identifies potential lncRNA–PCG associations with high accuracy and without expression specificity, helping to uncover the regulatory modes of lncRNAs on PCGs and to guide the treatment of diseases.
GAE-LGA can provide initial information for two of our future studies: prediction of lncRNA–protein associations and prediction of the PCG-binding capacity of lncRNAs. On the one hand, lncRNAs have been reported to drive RNA–protein binding by promoting PCG expression[50, 51]. Since GAE-LGA focuses on identifying lncRNA–PCG associations, lncRNA–protein associations can be inferred by analyzing lncRNA–PCG associations together with lncRNA-driven PCG–protein associations. On the other hand, lncRNAs may compete for binding to a PCG to exert regulatory functions[52], and their binding capacities are not always equal and are difficult to predict effectively. GAE-LGA can be used to measure the competitiveness of lncRNAs in regulating a specific PCG: it outputs an association score between each lncRNA and the PCG, and lncRNA–PCG associations with larger scores can be considered more biologically significant, so the lncRNAs in these associations may preferentially regulate the PCG.
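The score-based competitiveness ranking described above can be sketched as follows. This assumes a sigmoid inner-product decoder, a common graph-autoencoder readout; the paper's exact decoder may differ, and the embeddings and names here are toy values:

```python
import numpy as np

def association_scores(Z_lnc, Z_pcg):
    # Sigmoid inner-product decoder over lncRNA/PCG embeddings
    # (illustrative; not necessarily GAE-LGA's exact decoder).
    logits = Z_lnc @ Z_pcg.T
    return 1.0 / (1.0 + np.exp(-logits))

def rank_lncrnas_for_pcg(scores, lnc_names, pcg_index):
    # Rank lncRNAs by predicted association score for one PCG:
    # a higher score suggests stronger competitiveness in regulating it.
    order = np.argsort(scores[:, pcg_index])[::-1]
    return [(lnc_names[i], float(scores[i, pcg_index])) for i in order]

rng = np.random.default_rng(1)
Z_lnc = rng.standard_normal((3, 10))   # toy 10-d lncRNA embeddings
Z_pcg = rng.standard_normal((2, 10))   # toy 10-d PCG embeddings
S = association_scores(Z_lnc, Z_pcg)
print(rank_lncrnas_for_pcg(S, ["lnc-A", "lnc-B", "lnc-C"], pcg_index=0))
```

The top-ranked lncRNA for a given PCG would then be treated as the most likely preferential regulator in downstream analysis.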
Despite the obvious advantages of GAE-LGA, some of its limitations should be acknowledged. GAE-LGA is more robust than other machine learning methods, but it may still suffer from computational biases caused by imbalanced learning samples: well-studied lncRNAs/PCGs tend to receive better predictive performance because they have more connections in the known lncRNA–PCG association network. In addition, the predictive performance of GAE-LGA for new lncRNAs with unknown multi-omics characteristics is lower than that for lncRNAs with known multi-omics characteristics, because these characteristics capture synergistic regulatory roles.
Key Points
This study proposed a graph autoencoder-based deep learning model, GAE-LGA, to identify lncRNA-related PCGs.
GAE-LGA jointly explored lncRNA–PCG learning and cross-omics correlation learning to make feature representations more accurate.
GAE-LGA can successfully capture lncRNA–PCG associations with strong robustness and outperformed other machine learning-based identification methods.
Availability and implementation
The source code of GAE-LGA is available at: https://github.com/meihonggao/GAE-LGA.
Funding
This work was supported by the National Natural Science Foundation of China [Nos 61772426 and U1811262].
Author Biographies
Meihong Gao is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include computational biology and machine learning.
Shuhui Liu is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include computational biology and machine learning.
Yang Qi is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include bioinformatics and machine learning.
Xinpeng Guo is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. His research interests include bioinformatics and machine learning.
Xuequn Shang is a professor in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include data mining and bioinformatics.