Abstract

PIWI proteins and Piwi-interacting RNAs (piRNAs) are commonly detected in human cancers, especially in germline and somatic tissues, and correlate with poorer clinical outcomes, suggesting that they play a functional role in cancer. As the combinatorial explosion of possible ncRNA-disease pairs becomes increasingly apparent, new bioinformatics methods for large-scale identification and prioritization of potential associations are of interest. However, real-world molecular interaction networks are intricate and noisy, which poses a problem for efficient graph mining. Line graphs can extend many heterogeneous networks to replace dichotomous networks. In this study, we present a new graph neural network framework, line graph attention networks (LGAT), and apply it to predict piRNA-disease associations (GAPDA). In 5-fold cross-validation, GAPDA achieves an AUC of 0.9038. It also outperforms methods based on collaborative filtering and attribute features. The experimental results show that GAPDA demonstrates the promise of graph neural networks for such problems and can be a useful supplement for future biomedical research.

Introduction

Piwi-interacting RNA (piRNA) is a small non-coding RNA, typically 24–32 nucleotides in length, that clusters at transposon loci in the genome [1–5]. piRNA interacts exclusively with PIWI proteins, which belong to the germline-specific subclade of the Argonaute family [6–10]. Because piRNA acts together with PIWI, the depletion of PIWI results in a sharp increase in transposon messenger RNA expression. Thus, its best-known role is to repress transposons and maintain germline genome integrity through DNA methylation [11, 12]. Compared with microRNA (miRNA) and small interfering RNA (siRNA), which are also small RNAs, piRNA has the following characteristics: (i) it is longer than miRNA or siRNA; (ii) it exists only in animals; (iii) its sequences are more diverse and it constitutes the largest class of noncoding RNA; and (iv) it is testes-specific [6, 12–15].

Emerging evidence suggests that piRNA and PIWI proteins are abnormally expressed in various cancers [16–21]. The function and potential mechanism of piRNA in cancer have therefore become an important research direction in tumor diagnosis and treatment. For example, in 2015 Fu et al. found that abnormal expression of piR-021285 promoted methylation of ARHGAP11A at the 5’-UTR/first exon CpG site, thereby promoting mRNA apoptosis and inhibiting apoptosis of breast cancer cells [22]. Subsequently, Tan et al. [23] found that downregulation of piRNA-36712 in breast cancer increases SLUG levels while reducing P21 and E-cadherin levels, thereby promoting the malignant phenotype of cancer. In addition, Liu et al. discovered in 2017 that piR-30188 binds to OIP5-AS1 to inhibit glioma cell progression, while low expression of OIP5-AS1 reduces CEBPA levels and promotes the malignant phenotype of glioma cells [24]. Also in glioblastoma, Jacobs et al. [25] found that piR-8041 can inhibit the expression of the tumor cell marker ALCAM/CD166, which suggests a clinical role in targeted therapy. In addition, piRNA is directly or indirectly involved in the formation of liver cancer. In 2016, Rizzo et al. [26] found that hsa_piR_013306 accumulates only in hepatocellular carcinomas.

piRNAs are gaining enormous attention; tens of thousands have been identified in mammals, and the number is growing rapidly. To accelerate research in this field and provide access to piRNA data and annotations, multiple databases such as piRNABank [27], piRBase [28] and piRNAQuest [29] have been successively established. Subsequently, the roles of piRNA and PIWI proteins in the epigenetics of cancer have been steadily uncovered, and some of them can serve as novel biomarkers and therapeutic targets. Taking this as an opportunity, an experimentally supported piRNA-disease association database called piRDisease [30] was proposed, which made it possible to predict potential associations on a large scale. Although many disease-related ncRNA prediction models have been proposed and gradually developed, predictors for disease-related piRNA are relatively unexplored [31–34].

In this paper, we propose a piRNA-disease association predictor based on a line graph attention network, called GAPDA. This study has two main contributions: (i) a new graph neural network framework, the line graph attention network, is proposed that can extend many heterogeneous networks to replace dichotomous networks; (ii) unlike traditional collaborative filtering and attribute-based methods, the proposed method integrates disease semantic information and piRNA sequence information, which improves prediction accuracy and coverage. On the association dataset piRDisease, GAPDA achieves an area under the receiver operating characteristic (ROC) curve (AUC) of 0.9038 with an accuracy of 85.69%. Overall, this method can provide new ideas for cancer mechanism research and shows that attention-based graph neural networks have good application prospects for this kind of problem. Moreover, we hope that this work will promote more research on association prediction based on graph neural networks. Our source code and data can be downloaded from GitHub (https://github.com/kaizheng-academic/GAPDA/tree/main).

Methods

Dataset

With the gradual emergence of the role of piRNA in disease diagnosis and prognosis, piRNAs have become a research hotspot. How to effectively use the data obtained from piRNA-related experiments is an urgent problem. The piRDisease database [30], proposed by Azhar et al. in 2019, collects experimentally supported piRNA-disease associations from over 2500 articles. After removing piRNAs that could not be verified on piRBase, we obtained the positive sample set $T_p$ consisting of 1212 associations (covering 501 piRNAs and 22 diseases). We refer to the processed dataset as GPRD. The training dataset $T$ can then be defined as follows:
$$T = T_p \cup T_n \tag{1}$$
where $T_n$ is a negative subset containing 1212 associations randomly extracted from all 9810 unconfirmed associations between piRNAs and diseases. Known associations represent 10.9% of all associations.
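
The construction of $T$ can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_training_set`, the integer encoding of piRNAs and diseases, the random toy positives and the fixed seeds are all assumptions; only the counts (501 piRNAs, 22 diseases, 1212 positives) come from the text.

```python
import random


def build_training_set(positives, n_pirna, n_disease, seed=0):
    """Return (T_p, T_n, number of unlabeled pairs).

    T_n is drawn uniformly, without replacement, from the pairs that are
    not known associations, and has the same size as T_p.
    """
    positive_set = set(positives)
    # Every (piRNA, disease) pair that is not a known association is unlabeled.
    unlabeled = [(p, d) for p in range(n_pirna) for d in range(n_disease)
                 if (p, d) not in positive_set]
    rng = random.Random(seed)
    negatives = rng.sample(unlabeled, len(positive_set))
    return sorted(positive_set), negatives, len(unlabeled)


# Hypothetical stand-in for GPRD: 1212 arbitrary positive pairs.
toy_rng = random.Random(1)
all_pairs = [(p, d) for p in range(501) for d in range(22)]
toy_positives = toy_rng.sample(all_pairs, 1212)

t_p, t_n, n_unlabeled = build_training_set(toy_positives, 501, 22)
print(len(t_p), len(t_n), n_unlabeled)  # 1212 1212 9810
```

The unlabeled pool has 501 × 22 − 1212 = 9810 pairs, which matches the number of unconfirmed associations stated above.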

The construction of new piRNA-disease association networks

The construction of the line graph

So far, experimentally validated piRNA-disease associations are limited, and the large number of piRNAs results in a sparse piRNA-disease association network. Due to the complexity of biological data, the network representation computed from the sparse network does not cover all behavior information in the real world. Therefore, we construct the network by treating associations as nodes, thereby enriching the hidden representations that can be learned from the sparse network [35]. The transformation into the line graph is shown in Figure 3A. Specifically, we transform the edges of the original graph into nodes of the line graph. In this study, the nodes of the original graph are piRNAs and diseases, and there is an edge between a piRNA node and a disease node if the piRNA is associated with the disease. The nodes of the line graph are piRNA-disease associations, and its edges are relationships between associations: if two associations have the same piRNA or the same disease, there is an edge between the two nodes. This transforms the problem under study from link prediction to node prediction. In detail, given $n$ associations, the new association adjacency matrix is defined as follows:
$$A = \begin{bmatrix} a_{1,1} & \cdots & a_{1,n} \\ \vdots & \ddots & \vdots \\ a_{n,1} & \cdots & a_{n,n} \end{bmatrix} \tag{2}$$
where $n$ stands for the number of associations. The element $a_{i,j}$ is set to 1 if node $i$ (the $i$th association) and node $j$ (the $j$th association) in the network are related; otherwise it is set to 0. In particular, the relationships between associations are various. For example, associations of the same piRNA can form a bipartite network; associations of the same disease can form a bipartite network; associations of similar piRNAs can form a weighted network; associations of similar diseases can form a weighted network. In this paper, we utilize piRNAs and diseases as link vectors, respectively, and define them as follows:
$$\alpha^{R}_{i,j} = \begin{cases} 1, & \text{if associations } i \text{ and } j \text{ share the same piRNA} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
$$\alpha^{D}_{i,j} = \begin{cases} 1, & \text{if associations } i \text{ and } j \text{ share the same disease} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
It is worth noting that both the original graph and the line graph are undirected graphs with edge weights of 0 or 1. $A^{R}$ denotes the line graph with links between piRNA-identical associations, whose elements are $\alpha^{R}_{i,j}$. Similarly, the line graph with links between disease-identical associations and the line graph with links between piRNA- or disease-identical associations are denoted as $A^{D}$ and $A^{RD}$, respectively.
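
The three line-graph strategies can be illustrated with a short sketch. It is not taken from the GAPDA repository: the function name `line_graph_adjacency`, the `mode` argument, the toy associations and the choice to leave the diagonal at 0 (no self-loops) are assumptions made for illustration.

```python
def line_graph_adjacency(associations, mode="R"):
    """Build a line-graph adjacency matrix over (piRNA, disease) pairs.

    mode "R":  link associations that share a piRNA
    mode "D":  link associations that share a disease
    mode "RD": link associations that share either
    """
    n = len(associations)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_pirna = associations[i][0] == associations[j][0]
            same_disease = associations[i][1] == associations[j][1]
            linked = {"R": same_pirna,
                      "D": same_disease,
                      "RD": same_pirna or same_disease}[mode]
            if linked:
                adj[i][j] = adj[j][i] = 1  # undirected, weights 0 or 1
    return adj


# Hypothetical associations; each one becomes a node of the line graph.
toy = [("piR-1", "breast cancer"), ("piR-1", "gastric cancer"),
       ("piR-2", "gastric cancer"), ("piR-3", "glioma")]
a_r = line_graph_adjacency(toy, "R")
a_d = line_graph_adjacency(toy, "D")
a_rd = line_graph_adjacency(toy, "RD")
```

In this toy example, associations 0 and 1 are linked in `a_r` (same piRNA), associations 1 and 2 are linked in `a_d` (same disease), and `a_rd` contains both links.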

The attribute features of nodes

The attribute features are obtained by fusing piRNA sequence features and disease semantic features. In this study, sequences carrying genetic information were selected as the data source for describing piRNAs, and the final descriptor of each piRNA is calculated by the k-mer algorithm [36]. The piRNA attribute feature descriptor Feature($p_a$) can therefore be defined as follows:
(5)
where $F_{ps}(p_a)[i]$ is the probability that the $i$th k-mer appears in the sequence, $p_a$ is the $a$th piRNA, kmer count is the number of times the $i$th k-mer appears in the sequence and length($p_a$) is the sequence length of piRNA $p_a$. The process is shown in Figure 1.

Figure 1

The flowchart for calculating piRNA sequence features.
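
A k-mer descriptor of this kind can be sketched as below. The value k = 3, the RNA alphabet, the example sequence and the normalisation by the number of sliding windows (sequence length minus k plus 1) are assumptions for illustration; the exact normalisation used by GAPDA may differ.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Return the frequency of every possible k-mer, in alphabet order."""
    windows = len(sequence) - k + 1  # number of sliding windows
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for start in range(windows):
        window = sequence[start:start + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    # Assumed normaliser: the number of windows, so frequencies sum to 1
    # when every window is a valid k-mer.
    return [counts[kmer] / windows for kmer in kmers]


# Made-up 28-nt sequence, used only to exercise the function.
features = kmer_frequencies("UGACAUGAACACAGGUGCUCAGAUAGCU", k=3)
print(len(features))  # 64 possible 3-mers
```

With k = 3 every piRNA is mapped to a fixed-length vector of 4^3 = 64 entries, regardless of its sequence length.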

Describing disease attribute features is a vital and difficult task. Approaches that build directed acyclic graphs (DAGs) guided by Medical Subject Headings (MeSH) have been widely employed to quantify the link between diseases [37, 38]. MeSH is the authoritative standard vocabulary produced by the National Library of Medicine. Because of its strict classification of diseases, it can be used to deconstruct the semantic relationships between diseases. Taking Lip Neoplasms (LN) as an example, its MeSH ID is ‘C04.588.443.591.550; C07.465.409.640; C07.465.565.550’, and the corresponding parent nodes are Mouth Neoplasms and Lip Disease, whose MeSH IDs are ‘C04.588.443.591; C07.465.565’ and ‘C07.465.409’, as shown in Figure 2. Similarly, Mouth Neoplasms and Lip Disease also have their parent nodes, Mouth Disease and Head and Neck Neoplasms. According to the aforementioned analysis, Lip Neoplasms and its related diseases can be expressed as $DAG_{LN}=(LN, T_{LN}, E_{LN})$, where $T_{LN}$ is the collection of nodes in $DAG_{LN}$, including LN itself and its ancestors such as ‘Head and Neck Neoplasms’ and ‘Mouth Disease’. Furthermore, $E_{LN}$ is the collection of edges between different nodes, such as the edge between ‘Stomatognathic Disease’ and ‘Mouth Disease’. Based on previous research [34], the semantic contribution $C$ of disease $w$ to disease $d$ is calculated as follows:
(6)
where the semantic decay factor is set to 0.5 in this case and $w'$ is the parent node of $w$. A greater contribution of disease $w$ to disease $d$ indicates a shorter distance in the DAG between disease $w$ and disease $d$. According to the above formula, the semantic value $V$ of disease $d$ can be described as follows:
(7)

Figure 2

The directed acyclic graph (DAG) of Lip Neoplasms.

Based on the hypothesis that two diseases are more related if they share more common parent nodes, the semantic similarity score SS of disease a and disease b can be defined as follows:
(8)
Although the above calculation method quantifies the correlation to some extent, its performance is limited. If there exists a disease that shares parent nodes with all diseases, intuitively this disease has no specificity for other diseases and its weight should be low. Therefore, a second semantic similarity, which introduces specificity, is defined as follows:
(9)
(10)
Although the second measure takes specificity into account, it comes at the expense of certain data. Therefore, the comprehensive similarity $S$ is calculated to take advantage of both methods:
(11)
The degree to which disease $d_b$ is related to other diseases is used here as the disease description Feature($d_b$).
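
The DAG-based scoring can be illustrated with a small sketch. It assumes the widely used formulation in which a disease contributes 1 to itself, each step up the DAG multiplies the contribution by the decay factor (keeping the maximum over paths), the semantic value is the sum of contributions, and the similarity of two diseases is the summed contribution of their shared nodes divided by the sum of their semantic values. The function names and the toy edge list, which only loosely follows the Lip Neoplasms example, are assumptions rather than the paper's exact definitions.

```python
def semantic_contributions(disease, parents, decay=0.5):
    """Contribution of every node in DAG(disease) to `disease`.

    The disease contributes 1 to itself; an ancestor whose shortest chain
    of parent links has length s contributes decay ** s.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:  # breadth-first walk towards the roots
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                value = decay * contrib[node]
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib


def semantic_similarity(a, b, parents, decay=0.5):
    """Shared contribution of two diseases over the sum of their semantic values."""
    c_a = semantic_contributions(a, parents, decay)
    c_b = semantic_contributions(b, parents, decay)
    shared = set(c_a) & set(c_b)
    numerator = sum(c_a[w] + c_b[w] for w in shared)
    return numerator / (sum(c_a.values()) + sum(c_b.values()))


# Illustrative child -> parents map (assumed, simplified from Figure 2).
toy_parents = {
    "Lip Neoplasms": ["Mouth Neoplasms", "Lip Disease"],
    "Mouth Neoplasms": ["Head and Neck Neoplasms", "Mouth Disease"],
    "Lip Disease": ["Mouth Disease"],
    "Mouth Disease": ["Stomatognathic Disease"],
}
value_ln = sum(semantic_contributions("Lip Neoplasms", toy_parents).values())
sim = semantic_similarity("Lip Neoplasms", "Mouth Neoplasms", toy_parents)
print(value_ln)  # 2.625
```

On this toy DAG the semantic value of Lip Neoplasms is 1 + 0.5 + 0.5 + 0.25 + 0.25 + 0.125 = 2.625, and its similarity to Mouth Neoplasms is 3.375 / 4.875.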

The network features of nodes

The Gaussian interaction profile kernel similarity (GIP) is a collaborative filtering algorithm that is widely used for association prediction [39–45]. The similarity between pa and pb based on the interaction information can be calculated as follows:
(12)
where $V(p_a)$ is the $a$th row of the association matrix $A$, which is the interaction profile of piRNA $p_a$ with all diseases, and $\psi_p$ is the kernel width coefficient, which can be defined as follows:
(13)
where $num_p$ denotes the total number of piRNAs. The similarity between diseases can also be calculated by the GIP algorithm in the same way:
(14)
(15)
where $num_d$ denotes the total number of diseases. The Gaussian interaction profile kernel similarity of disease $d_b$ to other diseases is used as the disease GIP description Feature($d_b$) [46–50]. The Gaussian interaction profile kernel similarity of piRNA $p_a$ to other piRNAs is used as the piRNA GIP description Feature($p_a$).
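
A GIP kernel can be sketched as follows, assuming the standard formulation: the similarity is the exponential of minus the kernel width times the squared Euclidean distance between two interaction profiles, and the kernel width is the reciprocal of the mean squared norm of all profiles. The function name `gip_similarity` and the toy association matrix are hypothetical.

```python
import math


def gip_similarity(profiles):
    """Pairwise Gaussian interaction profile kernel for a list of binary rows."""
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    width = 1.0 / mean_sq_norm  # kernel width coefficient (assumed normalisation)
    sim = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            dist_sq = sum((x - y) ** 2
                          for x, y in zip(profiles[a], profiles[b]))
            sim[a][b] = math.exp(-width * dist_sq)
    return sim


# Toy association matrix: rows are piRNAs, columns are diseases.
toy_matrix = [[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]]
pirna_gip = gip_similarity(toy_matrix)
# Disease similarity uses the columns, i.e. the transposed matrix.
disease_gip = gip_similarity([list(col) for col in zip(*toy_matrix)])
```

Row a of `pirna_gip` then plays the role of the GIP description of piRNA a, and row b of `disease_gip` the GIP description of disease b.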

Graph attention layer

The attention mechanism is widely used in many sequence-based tasks. In contrast to Graph Convolutional Network (GCN), which treats all the node’s neighbors equally, Graph Attention Network (GAT) incorporates the self-attention mechanism into the propagation process, in which each node’s hidden state is calculated by itself and its neighbors [51]. The GAT network is composed of basic graph attention layers, and the following formula is used to determine the attention coefficients:
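$$a_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_j\right]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_k\right]\right)\right)} \tag{16}$$
where $a_{i,j}$ is the attention factor of node $i$ to node $j$, $N_i$ denotes the neighboring nodes of node $i$ and $\|$ denotes concatenation. The input features of the nodes are $h=[h_1,h_2,\ldots,h_{NN}]$ with $h_i \in \mathbb{R}^{1\times NF}$, where $NN$ denotes the number of nodes and $NF$ denotes the feature dimension. The output features of the nodes are $h'=[h'_1,h'_2,\ldots,h'_{NN}]$ with $h'_i \in \mathbb{R}^{1\times NF'}$. $W \in \mathbb{R}^{NF'\times NF}$ is the linear transformation weight matrix applied to each node, and $\mathbf{a} \in \mathbb{R}^{2\times NF'}$ is the weight vector. Finally, softmax is used for normalization and LeakyReLU is introduced for nonlinear transformation.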
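
The attention coefficients of a single head can be computed as in the following sketch, which follows the standard graph attention formulation [51]. The LeakyReLU slope of 0.2, the inclusion of the node itself among its neighbors and all toy values are assumptions for illustration.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(h, W, a, neighbors, i):
    """Softmax-normalised attention of node i over its neighborhood.

    h: list of input feature vectors (length NF each)
    W: weight matrix given as NF' rows of length NF
    a: weight vector of length 2 * NF'
    neighbors: dict mapping a node index to the list of its neighbors
    """
    def transform(vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in W]

    wh_i = transform(h[i])
    scores = {}
    for j in neighbors[i]:
        concat = wh_i + transform(h[j])  # [W h_i || W h_j]
        scores[j] = leaky_relu(sum(p * q for p, q in zip(a, concat)))
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {j: math.exp(s - peak) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


# Toy example: three nodes, identity transform, node 0 attends to 0, 1 and 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [1.0, 0.0, 0.0, 1.0]
alpha = attention_coefficients(h, W, a, {0: [0, 1, 2]}, 0)
```

Here the raw scores are 1, 2 and 2, so nodes 1 and 2 receive equal attention and node 0 receives less; the coefficients sum to 1.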

Line graph attention networks

In this paper, we propose a line graph attention network, called GAPDA, to predict biologically significant but unmapped piRNA-disease associations. The main idea of GAPDA is to aggregate the properties of neighboring nodes using graph attention networks to compute the hidden state of each association. As seen in Figure 3B and C, the prediction process of the proposed computational model is separated into three major steps: preprocessing the data, training the model and scoring the potential associations.

Figure 3

The flowchart of GAPDA for predicting piRNA-disease association.

First, the preprocessed data include the new association network, the piRNA sequence features $F_{ps}$, the piRNA Gaussian interaction profile kernel similarity $F_{pg}$, the disease semantic features $F_{ds}$ and the disease Gaussian interaction profile kernel similarity $F_{dg}$. As a result, we develop the following final descriptor defining a piRNA-disease association:
$$F = F_{ps} \oplus F_{pg} \oplus F_{ds} \oplus F_{dg} \tag{17}$$
where $\oplus$ represents vector concatenation.

Then, the model is trained with the calculated final descriptors. The details are as follows: (i) the node (association) feature F is linearly mapped by the shared weight matrix W to obtain an enhanced high-dimensional feature; (ii) the high-dimensional features of the node and of each neighboring node (association) are concatenated; (iii) a single-layer feedforward neural network maps the concatenated features to a real number that quantifies the correlation between associations; (iv) softmax is used to calculate the attention coefficients; (v) the features of the association and its neighboring associations are fused into a new association representation based on the calculated attention coefficients; and (vi) the new association representation is input into the fully connected layer. In particular, the number of nodes mapping the feature F to an enhanced feature is 2424.
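
Step (v) amounts to an attention-weighted sum of the transformed neighbor features. The sketch below shows only this fusion for a single head and omits the nonlinearity and the fully connected layer; the function name `aggregate` and the toy values are assumptions.

```python
def aggregate(transformed, attention):
    """Fuse transformed neighbor features using attention coefficients.

    transformed: dict node -> transformed feature vector
    attention:   dict node -> attention coefficient (should sum to 1)
    """
    dim = len(next(iter(transformed.values())))
    fused = [0.0] * dim
    for node, coefficient in attention.items():
        for d in range(dim):
            fused[d] += coefficient * transformed[node][d]
    return fused


# Toy example: the association itself (node 0) and one neighbor (node 1).
toy_transformed = {0: [1.0, 0.0], 1: [0.0, 1.0]}
toy_attention = {0: 0.25, 1: 0.75}
new_representation = aggregate(toy_transformed, toy_attention)
print(new_representation)  # [0.25, 0.75]
```

With several attention heads, one such fused vector is produced per head and the results are combined before the fully connected layer.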

Finally, the data required for the potential association is fed into the trained model to obtain the prediction score.

The hyperparameter settings in this study are as follows: the output size of the first Graph Attention layer is 8; the number of attention heads in the first GAT layer is 8; the dropout rate is 0.02; the factor for L2 regularization is 1.25e−4 and the learning rate is 5e−2. In addition, the visualization of the neural network architecture can be seen in Figure 3B. It is worth noting that the line graphs that need to be constructed are different due to the different predicted associations. Therefore, predicting specific potential associations requires retraining the model (~5800 parameters, 100 epochs, ~20 min).

Experimental results

The performance of GAPDA on the benchmark dataset

In this part, we choose $\alpha^{R}_{i,j}$ as the element of the abstract network topology. To evaluate the performance of the proposed method, we apply it to the benchmark dataset GPRD. GAPDA achieves an average AUC of 0.9038 (Figure 4). In detail, the AUCs of GAPDA in the five folds are 0.9115, 0.8943, 0.9109, 0.9167 and 0.8859. In addition, Table 1 lists the detailed evaluation criteria: the average accuracy (Acc.) is 0.8569, the precision (Pre.) is 0.8550, the recall (Rec.) is 0.8638 and the F1-score is 0.8577. Their standard deviations are 0.92, 3.56, 4.16 and 0.92%, respectively. The lowest accuracy in the five experiments is 0.8395, and the highest is 0.8642. In addition, we calculated the specificity, which was 0.8933, 0.8873, 0.8548, 0.8139 and 0.8723, with a mean value of 0.8619. Note that this experiment relies on the network structure to make predictions, so the prediction results obtained with different attribute networks may differ. Overall, our approach yields convincing results, suggesting that GAPDA can provide strong candidates for piRNA biomarkers and has the potential to support disease diagnosis and the identification of disease mechanisms.

Figure 4

(A) ROC curves of GAPDA on the benchmark dataset. (B) PR curves of GAPDA on the benchmark dataset.

Table 1

The 5-fold cross-validation results performed by GAPDA on benchmark dataset

| Testing set | Accuracy | Sensitivity | Precision | Specificity | F1-score |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.8642 | 0.8391 | 0.9012 | 0.8933 | 0.8690 |
| 2 | 0.8395 | 0.8022 | 0.9012 | 0.8873 | 0.8488 |
| 3 | 0.8636 | 0.8729 | 0.8512 | 0.8548 | 0.8619 |
| 4 | 0.8554 | 0.9095 | 0.7893 | 0.8139 | 0.8451 |
| 5 | 0.8616 | 0.8514 | 0.8760 | 0.8723 | 0.8635 |
| Average | 0.8569 ± 0.0092 | 0.8550 ± 0.0356 | 0.8638 ± 0.0416 | 0.8619 ± 0.0318 | 0.8577 ± 0.0091 |

In addition, we collected an additional test set independent of the cross-validation process to demonstrate the applicability and performance of our model in ‘real’ use cases. We manually collected 22 experimentally validated piRNA-disease associations as an independent test set. Specifically, we used all data from the cross-validation process as the training set and tested the trained model on the independent test set. The proposed model obtained an accuracy of 86.36% on the independent test set (Table 2). The results indicate that GAPDA has practical application value.

Table 2

The performance of GAPDA on the independent test set

| No. | piRNA | Disease | References | Predictions |
| --- | --- | --- | --- | --- |
| 1 | DQ577854 | Breast cancer | 27177224 | ✓ |
| 2 | DQ588919 | Dysplastic liver nodules and hepatocellular carcinoma | 27429044 | ✓ |
| 3 | DQ580140 | Gastric cancer | 25779424 | ✓ |
| 4 | DQ571174 | Gastric cancer | 25779424 | ✓ |
| 5 | DQ585247 | Cardiovascular diseases (CDC,CF,CCS) cardiac regeneration | 28289238 | ✓ |
| 6 | DQ590749 | Gastric cancer | 25779424 | ✓ |
| 7 | DQ571873 | Dysplastic liver nodules and hepatocellular carcinoma | 29789629 | ✓ |
| 8 | DQ591522 | Gastric cancer | 25779424 | × |
| 9 | DQ593431 | Alzheimer disease | 28127595 | × |
| 10 | DQ595023 | Alzheimer disease | 28127595 | ✓ |
| 11 | DQ596276 | Cardiovascular diseases (CDC,CF,CCS) cardiac regeneration | 28289238 | ✓ |
| 12 | DQ596465 | Cardiovascular diseases (CDC,CF,CCS) cardiac regeneration | 28289238 | ✓ |
| 13 | DQ596465 | Dysplastic liver nodules and hepatocellular carcinoma | 29789629 | ✓ |
| 14 | DQ597397 | Renal cell carcinoma | 25998508 | ✓ |
| 15 | DQ597945 | Gastric cancer | 25779424 | ✓ |
| 16 | DQ597997 | Dysplastic liver nodules and hepatocellular carcinoma | 29789629 | ✓ |
| 17 | DQ600116 | Gastric cancer | 25779424 | ✓ |
| 18 | DQ600269 | Gastric cancer | 25779424 | ✓ |
| 19 | DQ600689 | Gastric cancer | 25779424 | ✓ |
| 20 | DQ574391 | Gastric cancer | 25779424 | ✓ |
| 21 | DQ576200 | Alzheimer disease | 28127595 | × |
| 22 | DQ570485 | Gastric cancer | 25779424 | ✓ |
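
The reported accuracy on the independent test set can be checked directly from Table 2, assuming that the rows marked with a cross (rows 8, 9 and 21) are the only associations that were not recovered. The variable names are illustrative.

```python
# Row numbers in Table 2 marked with a cross (association not recovered).
missed_rows = {8, 9, 21}
total_rows = 22

correct = total_rows - len(missed_rows)
accuracy = correct / total_rows
print(f"{correct}/{total_rows} = {accuracy:.4f}")  # 19/22 = 0.8636
```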

Ablation experiment

To better evaluate the performance of the proposed method, we compare it with two methods that use only attribute information or only network information. The results are shown in Table 3. The evaluation indicators of GAPDA are higher than those of the two traditional methods, especially the accuracy. Therefore, the attention-based approach performs better than the traditional attribute-based and collaborative filtering-based approaches. In addition, other evaluation parameters are higher than the average performance. There are several reasons for the superior performance of GAPDA. First, the two traditional methods consider only attribute information or only network information and do not combine the two sources of heterogeneous knowledge, whereas the proposed method combines four kinds of information into an attribute network, which can quantify the characteristics of an association well. Second, the introduction of the attention mechanism allows the hidden representation of a node to be computed from the behavior of its neighbors, which effectively improves the performance of the model. Third, the new abstract network topology we built also helps improve performance. In the real world, networks are often heterogeneous. This method abstracts existing networks into adjacency matrices of uniform size, which is conducive to the fusion of heterogeneous networks.

Table 3

Comparison of different types of prediction method on benchmark dataset

| Method | AUC | AUPR | Accuracy | Precision | Specificity | Recall | F1-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Att-based | 0.8725 | 0.8465 | 0.8200 | 0.8247 | 0.8230 | 0.8143 | 0.8189 |
| CF-based | 0.9032 | 0.8822 | 0.8280 | 0.8329 | 0.8312 | 0.8260 | 0.8272 |
| GAPDA | 0.9038 | 0.8944 | 0.8569 | 0.8550 | 0.8555 | 0.8638 | 0.8577 |

The sensitivity analysis of line graphs

In the section ‘The construction of the line graph’, we proposed an abstract network topology method for reconstructing the association network and designed three strategies to generate the abstract network topology. The results for $\alpha^{R}_{i,j}$ were described in the section ‘The performance of GAPDA on the benchmark dataset’. Therefore, in this section, we evaluate the other two strategies to assess the abstract network topology approach. For comparison, we integrate the heterogeneous network by fusing the adjacency matrix $A^{R}$ and the adjacency matrix $A^{D}$ into $A^{RD}$. Thus, the element $\alpha^{RD}_{i,j}$ of the adjacency matrix $A^{RD}$ is represented as follows:
$$\alpha^{RD}_{i,j} = \begin{cases} 1, & \text{if } \alpha^{R}_{i,j} = 1 \text{ or } \alpha^{D}_{i,j} = 1 \\ 0, & \text{otherwise} \end{cases} \tag{18}$$

As shown in Tables 4 and 5 and Figure 5: (i) with any abstract network topology, the performance of the proposed method is higher than the average of the traditional methods. This shows that the attribute network constructed with an abstract network topology can combine multiple knowledge sources to restore the true state of the network, which can improve model performance. (ii) Most evaluation criteria of the $A^{D}$ and $A^{RD}$ strategies are inferior to those of $A^{R}$, most obviously for the $A^{D}$ strategy. The reason is that the elements with value 1 in the adjacency matrix $A^{D}$ are too dense, which makes its abstract network topology insufficiently specific; $A^{RD}$ is similar. These two observations show that different abstract network topologies affect the performance of the model to varying degrees, so giving them different weights may improve effectiveness.

Table 4

The 5-fold cross-validation results performed by GAPDA ($A^{D}$) on benchmark dataset

| Testing set | Accuracy | Sensitivity | Precision | Specificity | F1-score |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.8230 | 0.7754 | 0.9095 | 0.8906 | 0.8371 |
| 2 | 0.7798 | 0.7208 | 0.9136 | 0.8820 | 0.8058 |
| 3 | 0.8657 | 0.8865 | 0.8388 | 0.8470 | 0.8620 |
| 4 | 0.8512 | 0.8102 | 0.9174 | 0.9048 | 0.8605 |
| 5 | 0.7831 | 0.8827 | 0.6529 | 0.7246 | 0.7506 |
| Average | 0.8206 ± 0.0348 | 0.8151 ± 0.0635 | 0.8464 ± 0.1010 | 0.8171 ± 0.0731 | 0.8232 ± 0.0416 |

Table 5

The 5-fold cross-validation results performed by GAPDA ($A^{RD}$) on benchmark dataset

| Testing set | Accuracy | Sensitivity | Precision | Specificity | F1-score |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.8807 | 0.9004 | 0.8560 | 0.8628 | 0.8776 |
| 2 | 0.8395 | 0.8022 | 0.9012 | 0.8873 | 0.8488 |
| 3 | 0.8368 | 0.8147 | 0.8719 | 0.8622 | 0.8423 |
| 4 | 0.8533 | 0.8132 | 0.9174 | 0.9053 | 0.8621 |
| 5 | 0.8182 | 0.7831 | 0.8802 | 0.8632 | 0.8288 |
| Average | 0.8457 ± 0.0208 | 0.8227 ± 0.0404 | 0.8853 ± 0.0217 | 0.8754 ± 0.0194 | 0.8519 ± 0.0167 |

Table 6

Comparison of existing piRNA-disease prediction methods

Five-fold cross-validationMethodsGAPDAiPiDi-PULiPiDA-sHNpiRDA
AUC0.90380.85410.8859-
Independent test setMethodsGAPDAiPiDi-PULiPiDA-sHNpiRDA
Acc.0.86360.68180.7727
FDR0.17390.31810.1904
Figure 5

(A) ROC curves of GAPDA (${A}^D$) on the benchmark dataset. (B) PR curves of GAPDA (${A}^D$) on the benchmark dataset. (C) ROC curves of GAPDA (${A}^{RD}$) on the benchmark dataset. (D) PR curves of GAPDA (${A}^{RD}$) on the benchmark dataset.

Comparison with other existing methods

Several computational models for piRNA-disease association prediction have been proposed. We compare GAPDA with three of them, iPiDA-sHN [52], iPiDi-PUL [53] and piRDA [54], which use attribute information and network information as features. In five-fold cross-validation, the proposed method outperforms the existing methods (Table 6). To test the robustness of the model, we add to the independent test set an equal number of negative samples, randomly selected from unlabeled association pairs, and calculate the false discovery rate (FDR). Because the compared models use similar original information, we attribute the advantage of the proposed method to its pattern recognition ability. We believe that line graphs have a positive impact on predictive performance, which is consistent with the conclusions of previous work [35]. We also compare the methods on the independent test set. Because piRDA [54] does not provide code and was evaluated with ten-fold cross-validation, we compare with it only on the independent test set, choosing an optimal set of parameters to obtain the predictions of the piRDA online model on that set. As shown in Table 6, the proposed method outperforms iPiDi-PUL and piRDA on the independent test set (accuracy 0.8636 versus 0.6818 and 0.7727), which indicates that GAPDA performs better than the existing methods in a 'real' case. Since no code is available for iPiDA-sHN, only its published five-fold cross-validation results are available, and its results on the independent test set are not calculated.
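The robustness check described above can be sketched as follows. This is a minimal illustration, not the evaluation script used in the study: it assumes the standard definition FDR = FP / (FP + TP), and the pair identifiers, the predictions and the helper names (`sample_negatives`, `accuracy`, `false_discovery_rate`) are hypothetical.

```python
import random


def sample_negatives(unlabeled_pairs, n, seed=0):
    """Randomly pick n unlabeled pairs to act as presumed negatives."""
    rng = random.Random(seed)
    return rng.sample(unlabeled_pairs, n)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def false_discovery_rate(y_true, y_pred):
    """FP / (FP + TP): share of predicted associations that are wrong."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fp / (tp + fp) if (tp + fp) > 0 else 0.0


# Toy data with hypothetical identifiers (not taken from the study).
positives = [("piR-a", "D1"), ("piR-b", "D2"), ("piR-c", "D1"), ("piR-d", "D3")]
unlabeled = [("piR-a", "D2"), ("piR-a", "D3"), ("piR-b", "D1"),
             ("piR-b", "D3"), ("piR-c", "D2"), ("piR-d", "D1")]

# Add as many presumed negatives as there are positives.
negatives = sample_negatives(unlabeled, len(positives))
y_true = [1] * len(positives) + [0] * len(negatives)

# Hypothetical model output: 3 of 4 positives found, 1 of 4 negatives flagged.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

acc = accuracy(y_true, y_pred)
fdr = false_discovery_rate(y_true, y_pred)
print(f"accuracy={acc:.4f}, FDR={fdr:.4f}")
```

Because the sampled negatives are only unlabeled pairs, some of them may be true associations, so an FDR computed this way is best read as an upper-bound style estimate rather than an exact error rate.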

Conclusion

Because the network of interactions between molecules in the real world is enormously intricate and noisy, efficient graph mining has become a research hotspot. In this study, we propose a piRNA-disease association prediction framework based on the line graph attention network, which captures graph features and calculates the hidden representations of associations in the network from their neighbor nodes. Supported by the line graph, GAPDA shows encouraging results in predicting piRNA-disease associations: in 5-fold cross-validation, it achieves an AUC of 0.9038, an AUPR of 0.8774 and an accuracy of 0.8569. In addition, we compare GAPDA with two traditional methods and with different strategies for generating abstract network topologies. The experiments suggest that GAPDA can be an excellent complement to future biomedical research and demonstrate the promise of graph neural networks for such problems. We hope that the proposed method can provide powerful candidates for piRNA biomarkers and can be extended to other graph-based tasks. However, GAPDA still has limitations. First, the transformed line graph is larger than the original network, which makes training time-consuming. Second, the model is not applicable to all diseases and piRNAs: it can only predict associations between piRNAs with known RNA sequences and diseases with MeSH IDs, so data must be collected for new piRNAs and diseases. In future work, we will focus on improving the applicability of the model as well as its computational efficiency.
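The growth in size noted among the limitations can be illustrated with a minimal sketch of the standard line graph transformation, in which every association (edge) of the original network becomes a node and two such nodes are linked when the associations share a piRNA or a disease. The toy network and the function name `line_graph` are hypothetical and serve only as illustration; this is not the GAPDA implementation.

```python
from collections import defaultdict
from itertools import combinations


def line_graph(edges):
    """Build the line graph of a simple undirected graph.

    Each original edge becomes a node. Two of these nodes are linked
    when the corresponding original edges share an endpoint.
    """
    incident = defaultdict(list)
    for edge in edges:
        u, v = edge
        incident[u].append(edge)
        incident[v].append(edge)

    lg_edges = set()
    for edges_at_node in incident.values():
        for e1, e2 in combinations(edges_at_node, 2):
            lg_edges.add(frozenset((e1, e2)))
    return list(edges), lg_edges


# Toy piRNA-disease network (hypothetical associations).
associations = [("p1", "d1"), ("p2", "d1"), ("p3", "d1"),
                ("p1", "d2"), ("p2", "d2")]

lg_nodes, lg_edges = line_graph(associations)
print(f"original: {len(associations)} associations")
print(f"line graph: {len(lg_nodes)} nodes, {len(lg_edges)} links")
```

In this toy case five associations yield six links in the line graph, because a node of degree d in the original network contributes d(d-1)/2 links. Highly connected piRNAs or diseases therefore dominate the size of the transformed graph.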

Key Points
  • A new graph neural network framework, the line graph attention network (LGAT), is proposed. It treats each association as a node and can extend many heterogeneous networks to replace dichotomous networks.

  • Applying LGAT to piRNA-disease association prediction yields a new prediction model, GAPDA. This GAT-based approach combines the advantages of representation learning and network-based approaches.

  • Different from traditional collaborative filtering and attribute-based methods, the proposed method integrates disease semantic information and piRNA sequence information, which improves prediction accuracy and has higher coverage.

Data availability

The GAPDA prediction code, together with example datasets and input data files, is available at https://github.com/kaizheng-academic/GAPDA/tree/main. Further code written for and used in this study is available from the corresponding author upon reasonable request.

Acknowledgements

The authors would like to thank all anonymous reviewers for their constructive advice.

Funding

Science and Technology Innovation 2030-‘Brain Science and Brain-like Research’ Major Project (Grant 2021ZD0200403); National Natural Science Foundation of China (Grants 62172355, 61702444); Qingtan scholar talent project of Zaozhuang University; Fundamental Research Funds for the Central Universities of Central South University (2021zzts0206).

Author Biographies

Kai Zheng is a PhD student in the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Hunan, China. Her current research interests include pattern recognition, intelligent information processing and its applications in bioinformatics.

Xin-Lu Zhang, an engineer in the 36th Research Institute of China Electronics Technology Group Corporation, received his PhD in computer application technology from the Chinese Academy of Sciences in 2021. His research interests include pattern recognition, machine learning and neural machine translation.

Lei Wang is a professor at the Guangxi Academy of Science. His research interests include data mining, machine learning, deep learning, computational biology and bioinformatics.

Zhu-Hong You is a professor at the Guangxi Academy of Science. His research interests include neural networks, intelligent information processing, sparse representation and its applications in bioinformatics.

Zhao-Hui Zhan is a PhD candidate at the City University of Hong Kong. Her research interests include machine learning and pattern recognition.

Yao-Yuan Li is a PhD student in the Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Hunan, China. His current research interests include bioinformatics, machine learning and heterogeneous networks.

References

1. Yin H, Lin H. An epigenetic activation role of Piwi and a Piwi-associated piRNA in Drosophila melanogaster. Nature 2007;450:304–8.

2. Iwasaki YW, Siomi MC, Siomi H. PIWI-interacting RNA: its biogenesis and functions. Annu Rev Biochem 2015;84:405–33.

3. Grimson A, Srivastava M, Fahey B, et al. Early origins and evolution of microRNAs and Piwi-interacting RNAs in animals. Nature 2008;455:1193–7.

4. Aravin AA, Hannon GJ, Brennecke J. The Piwi-piRNA pathway provides an adaptive defense in the transposon arms race. Science 2007;318:761–4.

5. Malone CD, Brennecke J, Dus M, et al. Specialized piRNA pathways act in germline and somatic tissues of the Drosophila ovary. Cell 2009;137:522–35.

6. Leslie M. The immune system's compact genomic counterpart. Am Assoc Adv Sci 2013;339:25–7.

7. Pall GS, Codony-Servat C, Byrne J, et al. Carbodiimide-mediated cross-linking of RNA to nylon membranes improves the detection of siRNA, miRNA and piRNA by northern blot. Nucleic Acids Res 2007;35:e60.

8. Marcon E, Babak T, Chua G, et al. miRNA and piRNA localization in the male mammalian meiotic nucleus. Chromosome Res 2008;16:243–60.

9. Armisen J, Gilchrist MJ, Wilczynska A, et al. Abundant and dynamically expressed miRNAs, piRNAs, and other small RNAs in the vertebrate Xenopus tropicalis. Genome Res 2009;19:1766–75.

10. Moyano M, Stefani G. piRNA involvement in genome stability and human cancer. J Hematol Oncol 2015;8:38.

11. Brennecke J, Aravin AA, Stark A, et al. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 2007;128:1089–103.

12. Siomi MC, Sato K, Pezic D, et al. PIWI-interacting small RNAs: the vanguard of genome defence. Nat Rev Mol Cell Biol 2011;12:246–58.

13. Rajasethupathy P, Antonov I, Sheridan R, et al. A role for neuronal piRNAs in the epigenetic control of memory-related synaptic plasticity. Cell 2012;149:693–707.

14. Houwing S, Kamminga LM, Berezikov E, et al. A role for Piwi and piRNAs in germ cell maintenance and transposon silencing in Zebrafish. Cell 2007;129:69–82.

15. Moazed D. Small RNAs in transcriptional gene silencing and genome defence. Nature 2009;457:413–20.

16. Zou AE, Zheng H, Saad MA, et al. The non-coding landscape of head and neck squamous cell carcinoma. Oncotarget 2016;7:51211–22.

17. Chu H, Hui G, Yuan L, et al. Identification of novel piRNAs in bladder cancer. Cancer Lett 2015;356:561–7.

18. Cheng J, Guo J-M, Xiao B-X, et al. piRNA, the new non-coding RNA, is aberrantly expressed in human cancer cells. Clin Chim Acta 2011;412:1621–5.

19. Assumpcao CB, Calcagno DQ, Araújo TMT, et al. The role of piRNA and its potential clinical implications in cancer. Epigenomics 2015;7:975–84.

20. Ng KW, Anderson C, Marshall EA, et al. Piwi-interacting RNAs in cancer: emerging functions and clinical utility. Mol Cancer 2016;15:5.

21. Romano G, Veneziano D, Acunzo M, et al. Small non-coding RNA and cancer. Carcinogenesis 2017;38:485–91.

22. Fu A, Jacobs DI, Hoffman AE, et al. PIWI-interacting RNA 021285 is involved in breast tumorigenesis possibly by remodeling the cancer epigenome. Carcinogenesis 2015;36:1094–102.

23. Tan L, Mai D, Zhang B, et al. PIWI-interacting RNA-36712 restrains breast cancer progression and chemoresistance by interaction with SEPW1 pseudogene SEPW1P RNA. Mol Cancer 2019;18:9.

24. Liu X, Zheng J, Xue Y, et al. PIWIL3/OIP5-AS1/miR-367-3p/CEBPA feedback loop regulates the biological behavior of glioma cells. Theranostics 2018;8:1084–105.

25. Jacobs DI, Qin Q, Fu A, et al. piRNA-8041 is downregulated in human glioblastoma and suppresses tumor growth in vitro and in vivo. Oncotarget 2018;9:37616–26.

26. Rizzo F, Rinaldi A, Marchese G, et al. Specific patterns of PIWI-interacting small noncoding RNA expression in dysplastic liver nodules and hepatocellular carcinoma. Oncotarget 2016;7:54650–61.

27. Sai Lakshmi S, Agrawal S. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Res 2007;36:D173–7.

28. Wang J, Zhang P, Lu Y, et al. piRBase: a comprehensive database of piRNA sequences. Nucleic Acids Res 2018;47:D175–80.

29. Sarkar A, Maji RK, Saha S, et al. piRNAQuest: searching the piRNAome for silencers. BMC Genomics 2014;15:555.

30. Muhammad A, Waheed R, Khan NA, et al. piRDisease v1.0: a manually curated database for piRNA associated diseases. Database (Oxford) 2019;2019:baz052. https://doi.org/10.1093/database/baz052.

31. Wang L, Wang H-F, Liu S-R, et al. Predicting protein-protein interactions from matrix-based protein sequence using convolution neural network and feature-selective rotation forest. Sci Rep 2019;9:9848.

32. Zheng K, You Z-H, Wang L, et al. MISSIM: improved miRNA-disease association prediction model based on chaos game representation and broad learning system. In: Huang DS, Huang ZK, Hussain A (eds). Intelligent Computing Methodologies. ICIC, Lecture Notes in Computer Science. 2019;11645. Springer, Cham. https://doi.org/10.1007/978-3-030-26766-7_36.

33. Zheng K, You Z-H, Wang L, et al. MLMDA: a machine learning approach to predict and validate MicroRNA–disease associations by integrating of heterogenous information sources. J Transl Med 2019;17:1–14.

34. Wang L, You Z-H, Chen X, et al. LMTRDA: Using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities. PLoS Comput Biol 2019;15:e1006865.

35. Cai L, Li J, Wang J, et al. Line graph neural networks for link prediction. IEEE Trans Pattern Anal Mach Intell 2021;44:1–5113.

36. Kirk JM, Kim SO, Inoue K, et al. Functional classification of long non-coding RNAs by k-mer content. Nat Genet 2018;50:1474–82.

37. Xiang Z, Qin T, Qin ZS, et al. A genome-wide MeSH-based literature mining system predicts implicit gene-to-gene relationships and networks. BMC Syst Biol 2013;7:S9.

38. Xuan P, Han K, Guo M, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS One 2013;8:e70204.

39. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics 2011;27:3036–43.

40. Zheng K, You Z-H, Li J-Q, et al. iCDA-CGR: Identification of circRNA-disease associations based on Chaos Game Representation. PLoS Comput Biol 2020;16:e1007872.

41. Wang Y-B, You Z-H, Yang S, et al. A deep learning-based method for drug-target interaction prediction based on long short-term memory neural network. BMC Med Inform Decis Mak 2020;20:1–9.

42. Wang M-N, You Z-H, Wang L, et al. LDGRNMF: LncRNA-disease associations prediction based on graph regularized non-negative matrix factorization. Neurocomputing 2020;424:236–45.

43. Zheng K, You Z-H, Wang L, et al. DBMDA: a unified embedding for sequence-based miRNA similarity measure with applications to predict and validate miRNA-disease associations. Mol Ther Nucleic Acids 2020;19:602–11.

44. Fan C, Lei X, Wu F-X. Prediction of CircRNA-disease associations using KATZ model based on heterogeneous networks. Int J Biol Sci 2018;14:1950–9.

45. Zheng K, You Z-H, Wang L, et al. SPRDA: a matrix completion approach based on the structural perturbation to infer disease-associated Piwi-Interacting RNAs. bioRxiv 2020.07.02.185611. 2020. https://doi.org/10.1101/2020.07.02.185611

46. Wang L, You Z-H, Li L-P, et al. Predicting circRNA-disease associations using deep generative adversarial network based on multi-source fusion information. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). San Diego, CA, USA, IEEE, 2019, 145–52.

47. Zheng K, You Z-H, Wang L, et al. MISSIM: an incremental learning-based model with applications to the prediction of miRNA-disease association. IEEE/ACM Trans Comput Biol Bioinform 2020;18(5):1733–42.

48. You Z-H, Li X, Chan KC. An improved sequence-based prediction protocol for protein-protein interactions using amino acids substitution matrix and rotation forest ensemble classifiers. Neurocomputing 2017;228:277–82.

49. Wang Y-B, You Z-H, Li X, et al. Predicting protein–protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol Biosyst 2017;13:1336–44.

50. Zheng K, You Z-H, Wang L, et al. iMDA-BN: identification of miRNA-disease associations based on the biological network and graph embedding algorithm. Comput Struct Biotechnol J 2020;18:2391–400.

51. Velickovic P, Cucurull G, Casanova A, et al. Graph attention networks. Stat 2017;1050:20.

52. Wei H, Ding Y, Liu B. iPiDA-sHN: Identification of Piwi-interacting RNA-disease associations by selecting high quality negative samples. Comput Biol Chem 2020;88:107361.

53. Wei H, Xu Y, Liu B. iPiDi-PUL: identifying Piwi-interacting RNA-disease associations based on positive unlabeled learning. Brief Bioinform 2021;22(3):bbaa058.

54. Ali SD, Tayara H, Chong KT. Identification of piRNA disease associations using deep learning. Comput Struct Biotechnol J 2022;20:1208–17.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)