Meihong Gao, Shuhui Liu, Yang Qi, Xinpeng Guo, Xuequn Shang, GAE-LGA: integration of multi-omics data with graph autoencoders to identify lncRNA–PCG associations, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac452, https://doi.org/10.1093/bib/bbac452
Abstract
Long non-coding RNAs (lncRNAs) can disrupt the biological functions of protein-coding genes (PCGs) to cause cancer. However, the relationship between lncRNAs and PCGs remains unclear and difficult to predict. Machine learning has achieved satisfactory performance in association prediction but, to our knowledge, has rarely been applied to lncRNA–PCG association prediction. Therefore, we introduce GAE-LGA, a powerful deep learning model built on graph autoencoders, to recognize potential lncRNA–PCG associations. GAE-LGA jointly explored lncRNA–PCG learning and cross-omics correlation learning for effective lncRNA–PCG association identification. The functional similarity and multi-omics similarity of lncRNAs and PCGs were accumulated and encoded by graph autoencoders to extract feature representations of lncRNAs and PCGs, which were subsequently decoded to obtain candidate lncRNA–PCG pairs. Comprehensive evaluation demonstrated that GAE-LGA can successfully capture lncRNA–PCG associations with strong robustness and outperformed other machine learning-based identification methods. Furthermore, multi-omics features were shown to improve the performance of lncRNA–PCG association identification. In conclusion, GAE-LGA can act as an efficient application for lncRNA–PCG association prediction with the following advantages: it fuses multi-omics information into the similarity network, making the feature representation more accurate; it can predict lncRNA–PCG associations for new lncRNAs; and it identifies potential lncRNA–PCG associations with high accuracy.
Introduction
Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that are barely involved in the translation process[1–3]. Currently, the roles of most lncRNAs are not clear, and only 17 754 of them are annotated by GENCODE V35[4]. These annotations suggest that lncRNAs can perturb the expression of protein-coding genes (PCGs) at multiple levels and participate in several important biological processes[1, 5–7]. Given the aforementioned biological significance of lncRNAs, combined with their large number and the diversity of their mechanisms of action, it becomes necessary to explore the relationship between lncRNAs and PCGs.
Some biological experiments have been designed to verify lncRNA–PCG associations[8–11], but their large-scale use is limited by time and financial constraints. Thus, a reliable computational tool for identifying lncRNA–PCG associations based on existing experimental data is needed. To date, various computational approaches have been designed to recognize lncRNA–PCG associations. These approaches fall into three categories: sequence-based, expression-based and machine learning-based. Sequence-based approaches use the free energy of base sequences to predict lncRNA–PCG associations[12–14]. They excel at predicting direct physical interactions between lncRNAs and PCGs but fail to identify potential lncRNA–PCG associations. Expression-based methods calculate the degree of association between lncRNAs and PCGs from their expression levels[15–17]. Because lncRNA and PCG expression is specific to particular samples and stages, these methods can only predict lncRNA–PCG associations in specific samples at specific stages. Machine learning-based approaches, proposed more recently, identify candidate lncRNA–PCG pairs by learning from known lncRNA–PCG pairs[18, 19]. They can identify both direct and potential lncRNA–PCG associations without the expression-specificity limitations above.
Machine learning methods provide an efficient solution for predicting associations between lncRNAs and other objects, including lncRNA–disease association[20–28], lncRNA–miRNA association[29–31] and lncRNA–protein association[32–35], etc. Most of these methods can be grouped into the following three categories: traditional machine learning methods[20, 24, 25, 30, 33], matrix completion methods[21–23, 29, 35] and deep learning methods[26–28, 31, 32, 34]. Inspired by the above studies, machine learning-based approaches for lncRNA–PCG association prediction were proposed. A predictor based on support vector machines (SVMs), logistic regression (LR) and random forest (RF) was first constructed to investigate the relationship between lncRNA and PCG[19]. Subsequently, a deep learning-based method was proposed to screen target PCGs for lncRNA[18]. The results of these two approaches demonstrate the effectiveness of machine learning in identifying lncRNA–PCG associations, but there is still much room for improvement in identification performance. Therefore, it is of great significance for us to design a new machine learning-based model to accurately identify lncRNA–PCG association, and to explore the regulatory mechanism of lncRNA on PCG.
lncRNA–PCG association identification can be viewed as a link prediction problem, which Graph Convolutional Networks (GCNs) have been shown to solve effectively[36]. In addition, lncRNAs can regulate PCG expression at multiple levels[1] and play a multi-omics synergistic regulatory role in organisms. Based on these two points, we combined GCN encoders and multi-omics pan-cancer data to propose GAE-LGA, a new method for efficient identification of candidate lncRNA–PCG associations. We first integrated functional similarity and multi-omics similarity to build similarity networks for lncRNAs and PCGs, respectively. The functional similarity was inferred from lncRNA–PCG association information, and the multi-omics similarity was inferred from multi-omics information of patients. Then, feature representations of lncRNAs and PCGs were learned from the constructed similarity networks using graph autoencoders. Finally, we constructed a decoder to decode the feature representations to identify potential lncRNA–PCG associations. Compared with existing lncRNA–PCG association prediction methods, GAE-LGA has three new features:
it fuses multi-omics information into the similarity network, making the feature representation more accurate;
it can predict lncRNA–PCG associations for a new lncRNA;
it identifies potential lncRNA–PCG associations with high accuracy and without expression specificity.
Materials and methods
We designed GAE-LGA for lncRNA–PCG association prediction. As shown in Figure 1, the prediction process consists of three steps. First, we collected and preprocessed the multi-omics data and the lncRNA–PCG association network to obtain the required association information and multi-omics information (Figure 1A). Second, we used GCN encoders to aggregate and learn feature representations for lncRNAs and PCGs (Figure 1B). Finally, a decoder combined the feature representations to predict candidate lncRNA–PCG associations (Figure 1C).
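The three-step pipeline above can be sketched as a graph autoencoder: a two-layer GCN encoder over a similarity network, followed by an inner-product decoder over the embeddings. This is a minimal NumPy illustration of the encode/decode idea, not the paper's implementation; the weight shapes, activation choices and normalization rule are assumptions.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize an adjacency/similarity matrix with self-loops:
    D^{-1/2} (A + I) D^{-1/2}, the standard GCN propagation rule."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_encode(A, X, weights):
    """Two-layer GCN encoder: ReLU after the hidden layer, linear output."""
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ weights[0], 0.0)  # hidden layer (ReLU)
    return A_norm @ H @ weights[1]                # embedding layer

def decode(Z_lnc, Z_pcg):
    """Inner-product decoder: sigmoid-activated pairwise association scores."""
    return 1.0 / (1.0 + np.exp(-(Z_lnc @ Z_pcg.T)))
```

Given lncRNA and PCG similarity networks with node features, `decode(gcn_encode(...), gcn_encode(...))` yields a score matrix whose entries can be thresholded into candidate lncRNA–PCG pairs.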

Overview of GAE-LGA. LNC represents lncRNA and PCG represents protein coding gene. (A) In the first step, we performed a preprocessing operation to get multi-omics features and association matrix. (B) To generate feature representations of lncRNAs and PCGs, the association information (functional similarity) and multi-omics information were fused for encoding. (C) Finally, we computed the association score and activated it.
Data collection and preprocessing
Multi-omics data
We downloaded multi-omics data from the TCGA database (https://portal.gdc.cancer.gov/). The multi-omics data comprise single nucleotide variation (SNV), copy number variation (CNV), DNA methylation (DNA Methy) and transcription profiling (TP) data, whose relationship to gene expression is shown in Table S1 (see Supplementary material). After preprocessing, each lncRNA/PCG contains 312 multi-omics features: 66 SNV features, 132 CNV features, 48 DNA Methy features and 66 TP features. Their details are shown in Table 1.
| Feature type | NC | Attribute | NM |
|---|---|---|---|
| SNV | 33 | Chromosome, Position | 66 |
| CNV | 33 | Chromosome, Start, End, Aber | 132 |
| DNA Methy | 12 | Chromosome, Start, End, Beta | 48 |
| TP | 33 | Mean, Standard deviation | 66 |
Note: NC and NM represent the number of cancer types and multi-omics features, respectively. SNV, CNV, DNA Methy and TP are the abbreviations for single nucleotide variation, copy number variation, DNA methylation and transcription profiling, respectively. Aber and Beta refer to mutation type and methylation intensity, respectively.
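Table 1's layout implies a fixed 312-dimensional multi-omics feature vector per gene (66 SNV + 132 CNV + 48 DNA Methy + 66 TP). The sketch below shows one way such a vector could be assembled; the block names, ordering and reshape convention are illustrative assumptions, not the paper's exact preprocessing.

```python
import numpy as np

# Per-omics layout from Table 1: (number of cancer types, attributes per cancer).
OMICS_LAYOUT = {
    "SNV":       (33, 2),  # Chromosome, Position           -> 66 features
    "CNV":       (33, 4),  # Chromosome, Start, End, Aber   -> 132 features
    "DNA_Methy": (12, 4),  # Chromosome, Start, End, Beta   -> 48 features
    "TP":        (33, 2),  # Mean, Standard deviation       -> 66 features
}

def build_feature_vector(per_omics):
    """Concatenate the per-omics blocks of one gene into a single
    312-dimensional feature vector, in the fixed order of OMICS_LAYOUT."""
    parts = []
    for name, (n_cancers, n_attrs) in OMICS_LAYOUT.items():
        block = np.asarray(per_omics[name], dtype=float)
        parts.append(block.reshape(n_cancers * n_attrs))
    return np.concatenate(parts)
```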
Human lncRNA–PCG associations
We collected lncRNA–PCG associations from three gold-standard datasets: LncRNA2Target[9], LncTarD[8] and NPInter[10]. Here, only lncRNAs and PCGs with multi-omics information were used for experimental analysis. Their details are shown in Table 2. LncRNA2Target contains 773 lncRNA–PCG associations, involving 263 lncRNAs and 498 PCGs. LncTarD contains 1444 lncRNA–PCG associations, involving 238 lncRNAs and 716 PCGs. As for NPInter, the numbers of lncRNAs, PCGs and lncRNA–PCG associations are 308, 256 and 369, respectively.
| Dataset | LncRNA | PCG | lncRNA–PCG |
|---|---|---|---|
| LncRNA2Target | 263 | 498 | 773 |
| LncTarD | 238 | 716 | 1444 |
| NPInter | 308 | 256 | 369 |
Note: lncRNA–PCG represents the number of lncRNA–PCG associations, and PCG represents protein-coding gene
Training and testing datasets
We used 10-fold cross-validation to obtain training and testing datasets, generating positive and negative samples in one of two ways. In method 1, we took known lncRNA–PCG associations in the dataset as positive samples and all unknown lncRNA–PCG pairs as negative samples. In method 2, we selected positive samples in the same way as in method 1 and then randomly sampled an equal number of negative samples from the unknown lncRNA–PCG pairs.
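The two sampling schemes can be sketched directly from a binary association matrix. This is an illustrative implementation; the paper's exact sampling procedure may differ in detail.

```python
import numpy as np

def sample_pairs(assoc, method=2, seed=0):
    """Build positive/negative lncRNA-PCG samples from a binary association
    matrix `assoc` (rows: lncRNAs, cols: PCGs).
    Method 1: all unknown pairs are negatives.
    Method 2: negatives are subsampled to match the number of positives."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(assoc == 1)
    neg = np.argwhere(assoc == 0)
    if method == 2:
        idx = rng.choice(len(neg), size=len(pos), replace=False)
        neg = neg[idx]
    return pos, neg
```

Method 2 yields balanced classes, which is why it is the usual choice when known associations are sparse relative to all possible pairs.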
Similarity feature
Feature representation
lncRNA–PCG association prediction
lncRNA–PCG association reconstruction
Association information for a new lncRNA
Complexity analysis
Results
Experimental setup
We performed 10-fold cross-validation on three independent datasets, NPInter, LncTarD and LncRNA2Target, to verify the performance of GAE-LGA. The division of positive and negative samples for each dataset is detailed in Table S3 (see Supplementary material). Four important parameters (number of GCN layers, embedding size, number of hidden layer features and learning rate) were tuned during model training and set to 2, 10, 200 and 0.001, respectively. To evaluate model performance, we calculated the area under the ROC curve (AUC), area under the precision-recall curve (AUPR), accuracy, F1-score, precision, recall and Matthews correlation coefficient (MCC) for the prediction results.
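The evaluation metrics listed above can be computed with scikit-learn. In this sketch, AUPR is approximated by average precision, and the 0.5 decision threshold for the thresholded metrics is an assumption.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, matthews_corrcoef)

def evaluate(y_true, y_score, threshold=0.5):
    """Compute ranking metrics (AUC, AUPR) from association scores and
    threshold-dependent metrics (F1, MCC) from binarized predictions."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "AUPR": average_precision_score(y_true, y_score),
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
    }
```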
Besides, we analyzed the relationships between lncRNAs, PCGs and lncRNA–PCG associations among the datasets (Figure 2). It can be found that only 77 lncRNAs (Figure 2A), 68 PCGs (Figure 2B) and 30 lncRNA–PCG associations (Figure 2C) overlap in the three datasets. Furthermore, we compared the performance of the model with and without overlapping associations (See Figure S1, Supplementary material). It can be found that these overlapping associations slightly improved model performance, thus we distributed these associations evenly across the training and test sets in our experiments.

Relationship between lncRNAs, protein coding genes (PCGs), and lncRNA–PCG associations among three datasets. Dataset1, Dataset2 and Dataset3 represent NPInter, LncTarD and LncRNA2Target, respectively. (A) Relationship between lncRNAs. (B) Relationship between PCGs. (C) Relationship between lncRNA–PCG associations.
Comparison experiments
Effectiveness comparison
To verify the effectiveness of the features learned by the GCN, we compared it with several other representation methods. The decoding method used here is the lncRNA–PCG association reconstruction described in the Materials and methods section. As shown in Table S4 (see Supplementary material), GCN, DeepWalk and Node2vec decrease sequentially in their ability to learn lncRNA and PCG features. Although combining GCN with DeepWalk/Node2vec can improve the effectiveness of the feature representation, its performance is still inferior to that of GCN alone. This is because GCN can capture the global information of the lncRNA/PCG-related network and thus represent the node features well.
Subsequently, we compared the experimental results of GAE-LGA with those of DeepLGP[18], Convolutional Neural Network (CNN)[37], Autoencoder[38] and CNN+Autoencoder[37, 38] on three datasets to validate its performance (Table 3). As we can see, GAE-LGA performs best on the NPInter and LncRNA2Target datasets, significantly improving over state-of-the-art baseline and deep learning methods. On the NPInter dataset, GAE-LGA improves AUC, AUPR, F1-score and MCC by 5.70%, 11.92%, 0.18% and 0.05%, respectively. On the LncRNA2Target dataset, GAE-LGA improves AUC, AUPR, F1-score and MCC by 1.03%, 2.56%, 0.63% and 0.54%, respectively. As for the LncTarD dataset, DeepLGP achieves the highest AUPR, CNN+Autoencoder achieves the highest MCC, but GAE-LGA far outperforms the best method on other evaluation metrics. It improves by 1.22% and 0.37% in AUC and F1 score, respectively. These improvements are attributed to the following two aspects. On the one hand, GAE-LGA combines the association information and multi-omics information of lncRNA and PCG, making the theoretical performance of lncRNA–PCG association higher. On the other hand, GCN can learn the feature representation of lncRNA and PCG more accurately than other feature representation methods, which leads to the performance of lncRNA–PCG association prediction closer to the theoretical value.
Comparison results of the proposed GAE-LGA and other deep learning methods on the NPInter, LncTarD and LncRNA2Target datasets under the same experimental setup

| Method | NPInter | | | | LncTarD | | | | LncRNA2Target | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC |
| GAE-LGA | 0.9743 | 0.6730 | 0.8468 | 0.7262 | 0.9308 | 0.6345 | 0.8831 | 0.6123 | 0.8956 | 0.5295 | 0.7787 | 0.7871 |
| DeepLGP | 0.9218 | 0.6013 | 0.7526 | 0.7216 | 0.9196 | 0.6501 | 0.7508 | 0.6102 | 0.8657 | 0.5058 | 0.7582 | 0.7815 |
| CNN | 0.9102 | 0.5961 | 0.8197 | 0.7135 | 0.9103 | 0.6254 | 0.8003 | 0.6025 | 0.8802 | 0.5062 | 0.7713 | 0.7658 |
| Autoencoder | 0.9013 | 0.5936 | 0.7761 | 0.6897 | 0.9042 | 0.6188 | 0.7952 | 0.5901 | 0.8752 | 0.5108 | 0.7652 | 0.7726 |
| Autoencoder+CNN | 0.9205 | 0.6009 | 0.8453 | 0.7258 | 0.9158 | 0.6291 | 0.8527 | 0.6136 | 0.8865 | 0.5163 | 0.7738 | 0.7829 |
Note: The bold value corresponds to the best performance method for each metric.
Following this, we analyzed the prediction results of traditional machine learning methods (SVM-based, LR-based and RF-based models[19]) under the same experimental setup (see Table S5, Supplementary material). On all three datasets, the average performance metrics of the deep learning methods are much higher than those of the traditional machine learning methods; the advantage is especially clear in AUC. These improvements are attributed to three factors: (i) neural networks have strong function approximation ability, and hence strong learning ability; (ii) deep neural networks express functions more compactly and with higher sample efficiency than shallow neural networks; (iii) deep learning generalizes well, that is, a model with small error on the training set also has small error on the test set. In conclusion, our model has advantages in predicting lncRNA–PCG associations, significantly improving prediction performance compared with existing machine learning methods.
Robustness comparison
We randomly retained a fraction r |$\in \{0.8,0.85,0.9,0.95,1\}$| of the known lncRNA–PCG pairs in each dataset to compare how method performance changes (Figure 3). As we can see, the performance of all methods drops significantly as lncRNA–PCG pairs are removed, except for GAE-LGA, which yields the most robust and highest performance across the different sample sizes. Furthermore, we used the standard deviation of AUC and AUPR across the sample-size groups to evaluate the robustness of the models (Figure 4). The standard deviations of AUC of GAE-LGA on NPInter, LncTarD and LncRNA2Target are 0.0091, 0.0081 and 0.0077 (Figure 4A), which are 7.2479, 11.3529 and 17.8180 times smaller than those of the best comparison method (Figure 4C), and 8.0703, 12.5664 and 19.8222 times smaller than those of the worst comparison method (Figure 4C), respectively.
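The reduced-sample settings can be simulated by retaining a fraction r of the known pairs. This is a minimal sketch; the paper's exact subsampling protocol is not specified here.

```python
import numpy as np

def subsample_associations(assoc, r, seed=0):
    """Keep a fraction r of the known pairs in a binary association matrix
    (setting the rest to 0), as in the robustness experiment's settings."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(assoc == 1)
    n_keep = int(round(r * len(pos)))
    keep = rng.choice(len(pos), size=n_keep, replace=False)
    out = np.zeros_like(assoc)
    out[tuple(pos[keep].T)] = 1  # restore only the retained positives
    return out
```

Re-running training on `subsample_associations(assoc, r)` for each r, and taking the standard deviation of the resulting AUC/AUPR values, reproduces the robustness measure used above.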

Comparison of model performance changes. (A) AUC changes on NPInter. (B) AUC changes on LncTarD. (C) AUC changes on LncRNA2Target. (D) AUPR changes on NPInter. (E) AUPR changes on LncTarD. (F) AUPR changes on LncRNA2Target.

Standard deviation analysis of model performance changes. Here, SD, Dataset1, Dataset2 and Dataset3 represent standard deviation, NPInter, LncTarD and LncRNA2Target, respectively. (A) AUC changes. (B) AUPR changes. (C) GAE-LGA improvement over other deep learning methods. AUC-I1 and AUPR-I1 represent the improvement of GAE-LGA in AUC change and AUPR change over the best deep learning method, respectively. AUC-I2 and AUPR-I2 represent the improvement of GAE-LGA in AUC change and AUPR change over the worst deep learning method, respectively.
The standard deviations of AUPR of GAE-LGA on NPInter, LncTarD and LncRNA2Target are 0.0282, 0.0129 and 0.0118 (Figure 4B), which are 5.3156, 9.1521 and 8.5238 times smaller than those of the best comparison method (Figure 4C), and 5.7036, 13.9338 and 12.4349 times smaller than those of the worst comparison method (Figure 4C), respectively. The improvement arises because GCN performs end-to-end learning on the feature information and network structure information of lncRNA and PCG simultaneously, and can thus capture the global information of the related network to represent lncRNA and PCG features well. Therefore, the GCN encoder-based GAE-LGA model is robust, with a lower standard deviation of model performance than the other methods.
Ablation experiment
To evaluate the importance of multi-omics features in lncRNA–PCG association prediction, we conducted comparative experiments on the NPInter, LncTarD and LncRNA2Target datasets based on different features of lncRNAs and PCGs: functional similarity features, multi-omics similarity features and aggregated similarity (functional similarity and multi-omics similarity) features. As we can see in Table 4, the functional similarity-related experimental group has a higher prediction performance than the multi-omics similarity-related experimental group, while the aggregated similarity-related experimental group has the best prediction performance. This suggests that lncRNAs regulate PCG expression at the multi-omics level, and their multi-omics features contribute to improving model prediction performance.
| Method | NPInter | | | | LncTarD | | | | LncRNA2Target | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC | AUC | AUPR | F1-score | MCC |
| GAE+S1 | 0.9633 | 0.6538 | 0.8601 | 0.7198 | 0.9310 | 0.6209 | 0.8729 | 0.6109 | 0.8950 | 0.6012 | 0.7523 | 0.7691 |
| GAE+S2 | 0.9231 | 0.6136 | 0.8022 | 0.7053 | 0.9022 | 0.6374 | 0.8830 | 0.5926 | 0.8901 | 0.5306 | 0.7021 | 0.7528 |
| GAE+S1+S2 | 0.9743 | 0.6730 | 0.8468 | 0.7262 | 0.9308 | 0.6345 | 0.8831 | 0.6123 | 0.8956 | 0.5295 | 0.7787 | 0.7871 |
Note: The bold value corresponds to the best performance method for each metric. GAE, S1 and S2 represent graph autoencoder, functional similarity features and multi-omics similarity features, respectively.
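The ablation contrasts functional similarity (S1), multi-omics similarity (S2) and their aggregate. One plausible sketch of the fusion step, assuming cosine similarity over the multi-omics feature vectors for S2 and a simple convex combination as the aggregation rule; the paper's exact formulas are given in its methods section.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity of row feature vectors (a candidate S2)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return Xn @ Xn.T

def aggregate_similarity(S1, S2, alpha=0.5):
    """Blend functional similarity S1 with multi-omics similarity S2.
    The convex weight alpha is an illustrative assumption."""
    return alpha * S1 + (1.0 - alpha) * S2
```

Feeding `aggregate_similarity(S1, S2)` into the graph autoencoder corresponds to the GAE+S1+S2 row of Table 4, while passing S1 or S2 alone corresponds to the two ablated rows.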
Case studies
We conducted a case study of GAE-LGA prediction results using published literature that includes experimentally validated lncRNA–PCG associations (Table 5). As we can see, 12 candidate lncRNA–PCG pairs are validated by existing studies. Specifically, it has been found that CDKN2B-AS1 inhibits MMP9 expression in renal clear cell carcinoma[39], CRNDE accelerates cervical cancer progression by altering ZEB1 expression[40], H19 increases CDKN1A expression by reducing HUVEC growth[41] and MALAT1 promotes osteosarcoma development by upregulating CCND1 expression[42]. In addition, MALAT1 upregulates c-Myc expression in thymic epithelial tumors[43], upregulation of MEG3 in human osteosarcoma cells correlates with low expression of MMP9[39], MEG3 inhibits macrophage apoptosis by regulating CDKN2A expression[44] and NEAT1 induces cell differentiation in hepatoblastoma by regulating MMP9 expression[45]. Moreover, PVT1 promotes fracture healing and esophageal cancer progression by regulating the expressions of HMGA2 and ZEB1, respectively[46, 47], the correlation between TUG1 and P53 affects the cell life activities of non-small cell lung cancer[48] and UCA1 can antagonize cell cycle arrest by destabilizing EZH2[49].
| LncRNA | PCG | Disease | Evidence |
|---|---|---|---|
| CDKN2B-AS1 | MMP9 | Renal clear cell carcinoma | PMID: 33608495 |
| CRNDE | ZEB1 | Cervical cancer | PMID: 33469312 |
| H19 | CDKN1A | Hypoxia-related diseases | PMID: 27063004 |
| MALAT1 | CCND1 | Osteosarcoma | PMID: 30365098 |
| MALAT1 | MYC | Thymic epithelial tumors | PMID: 34530916 |
| MEG3 | MMP9 | Osteosarcoma | PMID: 29434890 |
| MEG3 | CDKN2A | Atherosclerosis | PMID: 30672051 |
| NEAT1 | MMP9 | Hepatoblastoma | PMID: 35300348 |
| PVT1 | HMGA2 | Fragility fracture | PMID: 34592894 |
| PVT1 | ZEB1 | Esophageal cancer | PMID: 33848670 |
| TUG1 | TP53 | Non-small cell lung cancer | PMID: 24853421 |
| UCA1 | EZH2 | Urothelial cancer | PMID: 32537408 |
Parameter analysis
GCN layer
We compared the AUCs produced by two-, three-, four- and five-layer GCN encoder models to determine the effect of the number of GCN layers on model performance (Figure 5A). The AUC fluctuations between models with different numbers of GCN layers are small, but fewer layers speed up model convergence, while more layers make the model prone to overfitting. In this experiment, we set the number of GCN layers to two.

Parameter analysis. (A) Analysis of GCN layer. (B) Analysis of embedding size. (C) Analysis of the number of hidden layer features. (D) Analysis of learning rate.
Embedding size
Embedding size refers to the dimensionality of the feature vectors of lncRNAs/PCGs extracted by the model. During model training, we set the embedding size in |$\{10,50,100,200\}$| to verify its impact on GAE-LGA prediction performance (Figure 5B). It can be found that the AUC does not fluctuate much between the four groups of models. When the embedding size is set to 10, the final AUC is slightly larger than that of the other groups; thus, we chose 10 as the embedding size in this experiment.
Number of hidden layer features
The number of hidden-layer features in the GCN encoder is an important parameter that affects the performance of GAE-LGA for lncRNA–PCG association prediction. In this comparison, we varied the number of hidden-layer features over |$\{10,50,100,200\}$| (Figure 5C). GAE-LGA performs best when this parameter is set to 200; therefore, we chose 200 as the number of hidden-layer features in this experiment.
Learning rate
The learning rate is a hyperparameter that determines whether and how quickly the objective function converges to a minimum. Here, we varied the learning rate over |$\{0.1,0.01,0.001,0.0001\}$| (Figure 5D). The model performs best when the learning rate is 0.001. When the learning rate is 0.1 or 0.01, training is prone to gradient explosion: the loss fluctuates widely and the model is difficult to converge. When the learning rate is 0.0001, the model falls into a local optimum and cannot achieve optimal performance. In this experiment, we chose 0.001 as the learning rate.
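The grids searched in this parameter analysis, together with the values finally adopted, can be summarized in a small configuration snippet (an illustrative summary of the text above, not the authors' code):

```python
# Hyperparameter grids explored in the parameter analysis, with the
# values finally adopted for GAE-LGA (summary of Figure 5A-D).
search_space = {
    "gcn_layers":      {"grid": [2, 3, 4, 5],              "chosen": 2},
    "embedding_size":  {"grid": [10, 50, 100, 200],        "chosen": 10},
    "hidden_features": {"grid": [10, 50, 100, 200],        "chosen": 200},
    "learning_rate":   {"grid": [0.1, 0.01, 0.001, 0.0001], "chosen": 0.001},
}

for name, cfg in search_space.items():
    assert cfg["chosen"] in cfg["grid"]  # sanity check: chosen value was searched
    print(f"{name}: {cfg['chosen']}")
```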
Discussion
Despite the importance of lncRNA–PCG associations for dissecting lncRNA pathogenic mechanisms, the current understanding of lncRNA–PCG association identification is still limited. Apart from some sequence-based, expression-based and two recently proposed machine learning-based prediction methods, little effort has been made to identify lncRNA–PCG associations at scale. In this study, driven by the recent progress in the multi-omics synergistic regulation of lncRNAs and PCGs, we explored the mechanism of action of lncRNAs and PCGs from a multi-omics perspective and discovered that their multi-omics features are associated with lncRNA–PCG associations. Based on this finding, we designed a new computational model, GAE-LGA, to predict lncRNA–PCG associations using a graph autoencoder algorithm combined with multi-omics information of genes. Performance and robustness comparisons of GAE-LGA with other deep learning methods on three gold-standard datasets demonstrated that GAE-LGA is a highly effective lncRNA–PCG prediction model.
GAE-LGA can provide meaningful insights into future studies on the regulation of PCG expression by lncRNAs. Traditional sequence-based lncRNA–PCG association prediction methods focus on the binding sites between an lncRNA and its target PCGs, but free-energy calculations may have a high error rate, and a base-pairing strategy can identify only direct physical interactions between lncRNAs and PCGs. Expression-based prediction methods, in turn, rely solely on the expression profiles of lncRNAs and PCGs to analyze their relationship. In contrast, GAE-LGA predicts lncRNA–PCG associations using lncRNA–PCG learning and cross-omics correlation learning. Two learned similarities, functional similarity and multi-omics similarity, are accumulated and encoded by graph autoencoders to obtain embeddings of lncRNAs and PCGs. By combining and decoding these embeddings, GAE-LGA can score the association of each lncRNA–PCG pair, and it therefore has broad applications: (i) GAE-LGA fuses multi-omics information into the similarity network for efficient feature representation, providing a new perspective for lncRNA-related analyses; (ii) more importantly, it identifies potential lncRNA–PCG associations with high accuracy and without expression specificity, helping to uncover the regulatory modes of lncRNAs on PCGs and to guide the treatment of diseases.
GAE-LGA can provide initial information for two of our future studies: prediction of lncRNA–protein associations and prediction of the PCG-binding capacity of lncRNAs. On the one hand, lncRNAs have been reported to drive RNA–protein binding by promoting PCG expression[50, 51]. Since GAE-LGA focuses on identifying lncRNA–PCG associations, lncRNA–protein associations can be inferred by analyzing lncRNA–PCG associations together with lncRNA-driven PCG–protein associations. On the other hand, lncRNAs may compete for binding to a PCG to exert regulatory functions[52], and their binding capacities are not always equal and are difficult to predict effectively. GAE-LGA can be used to measure the competitiveness of lncRNAs in regulating a specific PCG: it outputs an association score between each lncRNA and the PCG, and lncRNA–PCG associations with larger scores can be considered more biologically significant, so the lncRNAs in these associations may preferentially regulate the PCG.
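The score-based competitiveness ranking described above can be sketched as follows. This assumes a sigmoid inner-product decoder, a common graph-autoencoder readout; the paper's exact decoder may differ, and the embeddings and names here are toy values:

```python
import numpy as np

def association_scores(Z_lnc, Z_pcg):
    # Sigmoid inner-product decoder over lncRNA/PCG embeddings
    # (illustrative; not necessarily GAE-LGA's exact decoder).
    logits = Z_lnc @ Z_pcg.T
    return 1.0 / (1.0 + np.exp(-logits))

def rank_lncrnas_for_pcg(scores, lnc_names, pcg_index):
    # Rank lncRNAs by predicted association score for one PCG:
    # a higher score suggests stronger competitiveness in regulating it.
    order = np.argsort(scores[:, pcg_index])[::-1]
    return [(lnc_names[i], float(scores[i, pcg_index])) for i in order]

rng = np.random.default_rng(1)
Z_lnc = rng.standard_normal((3, 10))   # toy 10-d lncRNA embeddings
Z_pcg = rng.standard_normal((2, 10))   # toy 10-d PCG embeddings
S = association_scores(Z_lnc, Z_pcg)
print(rank_lncrnas_for_pcg(S, ["lnc-A", "lnc-B", "lnc-C"], pcg_index=0))
```

The top-ranked lncRNA for a given PCG would then be treated as the most likely preferential regulator in downstream analysis.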
Despite the obvious advantages of GAE-LGA, some of its limitations should be acknowledged. GAE-LGA is more robust than other machine learning methods, but it may still suffer from computational biases caused by imbalanced learning samples: well-studied lncRNAs/PCGs tend to receive better predictive performance because they have more connections in the known lncRNA–PCG association network. In addition, the predictive performance of GAE-LGA for new lncRNAs with unknown multi-omics characteristics is lower than that for lncRNAs with known multi-omics characteristics, because these characteristics capture synergistic regulatory roles.
Key Points
This study proposed a graph autoencoder-based deep learning model, GAE-LGA, to identify lncRNA-related PCGs.
GAE-LGA jointly explored lncRNA–PCG learning and cross-omics correlation learning to make feature representations more accurate.
GAE-LGA can successfully capture lncRNA–PCG associations with strong robustness and outperformed other machine learning-based identification methods.
Availability and implementation
The source code of GAE-LGA is available at: https://github.com/meihonggao/GAE-LGA.
Funding
This work was supported by the National Natural Science Foundation of China [Nos 61772426 and U1811262].
Author Biographies
Meihong Gao is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include computational biology and machine learning.
Shuhui Liu is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include computational biology and machine learning.
Yang Qi is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include bioinformatics and machine learning.
Xinpeng Guo is a PhD student in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. His research interests include bioinformatics and machine learning.
Xuequn Shang is a professor in the School of Computer Science and Engineering at Northwestern Polytechnical University, Xi’an, China. Her research interests include data mining and bioinformatics.