Abstract

Identifying disease-associated microRNAs (miRNAs) helps uncover the underlying mechanisms of diseases, which in turn promotes the development of new medicines. Recently, network-based approaches have been widely proposed for inferring potential associations between miRNAs and diseases. However, these approaches ignore the importance of different relations in meta-paths when learning the embeddings of miRNAs and diseases. Besides, they pay little attention to screening out reliable negative samples, which is crucial for improving prediction accuracy. In this study, we propose a novel approach named MGCNSS with multi-layer graph convolution and a high-quality negative sample selection strategy. Specifically, MGCNSS first constructs a comprehensive heterogeneous network by integrating the miRNA and disease similarity networks coupled with their known association relationships. Then, we employ multi-layer graph convolution to automatically capture the meta-path relations with different lengths in the heterogeneous network and learn discriminative representations of miRNAs and diseases. After that, MGCNSS establishes a highly reliable negative sample set from the unlabeled sample set with a distance-based negative sample selection strategy. Finally, we train MGCNSS in an unsupervised manner and predict the potential associations between miRNAs and diseases. The experimental results fully demonstrate that MGCNSS outperforms all baseline methods on both balanced and imbalanced datasets. More importantly, we conduct case studies on colon neoplasms and esophageal neoplasms, further confirming the ability of MGCNSS to detect potential candidate miRNAs. The source code is publicly available at https://github.com/15136943622/MGCNSS/tree/master

INTRODUCTION

MicroRNA (miRNA) is a type of non-coding single-stranded RNA, usually about 22 nucleotides in length [1]. Research has demonstrated that many miRNAs participate in the post-transcriptional regulation of gene expression in both animals and plants [2]. They can also serve as potential diagnostic markers and therapeutic targets [3]. miRNAs regulate about one-third of human genes, many of which are highly associated with complex human diseases. For example, miR-17, miR-92a and miR-31 have been validated as biomarkers for colorectal cancer, which is crucial for its diagnosis and treatment [4]. Thus, identifying the associations between miRNAs and diseases could promote research on disease mechanisms and support disease treatment.

Identifying miRNA–disease associations through wet-lab experiments is time-consuming, labor-intensive and inefficient [5]. More recently, the development of biotechnology has promoted the emergence of computational methods, which can greatly improve prediction efficiency. These computational approaches can be divided into three categories: similarity-based methods, machine learning-based methods and graph-based methods [6].

Similarity-based methods usually focus on calculating the similarities between miRNAs and between diseases. To this end, miRNAs and diseases are represented as feature vectors built from multiple types of biological data, such as miRNA sequences, miRNA annotations and gene annotations. Similarity-based methods generally assume that functionally similar miRNAs tend to be associated with diseases that share similar phenotypes [7]. Up to now, various similarity calculation models, such as disease semantic similarity [8, 9], miRNA functional similarity [10] and Gaussian Interaction Profile (GIP) kernel similarity [11], have been proposed to measure their similarity effectively. For example, Jiang [12] put forward a novel kernel-based fusion strategy to integrate multiple similarities of miRNAs and diseases and then predict their potential association relationships. GATMDA [13] treated lncRNAs as mediators and put forward a lncRNA–miRNA–disease regulatory mechanism to enhance the similarity calculation for miRNAs and diseases. Besides, Yu et al. [14] employed miRNA target genes with GO annotations to systematically measure the similarity between miRNAs. To make full use of global network similarity measures, Chen et al. [15] employed the Random Walk with Restart (RWR) algorithm on the miRNA–miRNA functional similarity network to predict miRNA–disease associations. MFSP [16] inferred the functional similarity of miRNAs from pathway-based miRNA–miRNA relations and measured miRNA similarities between their corresponding associated disease sets. Mørk et al. [17] constructed a miRNA–target–disease heterogeneous network and treated proteins as intermediaries to measure the similarities between miRNAs and diseases. Although these approaches can measure the similarity between miRNAs or diseases, their calculation models are relatively simple. More importantly, they pay little attention to utilizing the complex relationships of miRNAs and diseases in the heterogeneous network, which limits their accuracy in similarity calculation.

Machine learning-based approaches usually employ efficient models such as regularized learning, random walks, Support Vector Machines (SVM) [18] or decision trees to discover potential miRNA–disease associations. These methods usually have two steps: a training step and an inference step [19]. For example, LRLSHMDA [20] was a semi-supervised model that employed a Laplacian regularized least squares classifier and effectively utilized the implicit information of vertices and edges to infer microbe–disease associations. AMVML [21] was an adaptive learning-based approach that learned novel similarity graphs for miRNAs and diseases from different views to predict miRNA–disease associations. EDTMDA [22] put forward a computational framework that adopted a dimensionality reduction strategy to remove redundant features and employed multiple decision trees to infer association relationships. MTLMDA [23] employed a multi-task learning technique that exploits both miRNA–disease and gene–disease networks simultaneously to complete the association prediction task. EGBMMDA [24] trained a regression tree in a gradient boosting framework, which was the first decision tree-based model used to predict miRNA–disease associations. RFMDA [25] selected robust features with a filter-based feature selection strategy and adopted the Random Forest (RF) model as a classifier to infer association relationships between miRNAs and diseases. Meanwhile, DNRLMF-MDA [11] calculated association probabilities through logistic matrix factorization and further improved predictive performance through dynamic neighborhood regularization. DTINET [26] adopted the RWR algorithm to obtain initial features, compressed them with diffusion component analysis (DCA), and applied a matrix completion approach to calculate projections between different nodes.
CCL-DTI [27] treated multimodal knowledge as input and investigated the effectiveness of different contrastive losses on the prediction model. At the same time, DeepCompoundNet [28] fully considered multi-source information on chemicals and proteins, such as protein features and drug properties, to predict drug–target interactions. However, these models mainly have two drawbacks: (1) they usually simply concatenate the embeddings of miRNAs and diseases learned from different sources, and (2) they fail to extract discriminative embeddings of miRNAs and diseases from networks with rich semantic information, which limits their representation learning ability.

Meanwhile, graph-based methods have attracted increasing attention due to their outstanding performance. These approaches typically construct a heterogeneous graph based on the known miRNA–disease association relationships as well as the similarities among miRNAs and diseases [29]. The heterogeneous graph has a powerful ability to depict the complex relationships between miRNAs and diseases and has been widely employed in miRNA–disease association prediction tasks. For example, GCN and GAT models are widely adopted in graph learning due to their strong performance [30]. Recently, MMGCN [31] utilized GCNs as encoders to obtain features of miRNAs and diseases under different similarity views and then enhanced the representation learning process through multi-channel attention mechanisms. GCNDTI [32] constructed a drug–protein pair network and treated node pairs as independent nodes, which transformed the link prediction problem into a node pair classification problem. MAGCN [33] introduced lncRNA–miRNA interactions and miRNA–disease associations to represent miRNAs and diseases, then predicted miRNA–disease associations via graph convolutional networks with a multi-channel attention mechanism and a convolutional neural network combiner. MKGAT [34] applied multi-layer GATs to update miRNA and disease features and then fused them through an attention mechanism. AMHMDA [35] first constructed a heterogeneous hyper-graph and applied an attention-aware multi-view similarity strategy to learn the embeddings of miRNAs and diseases in the constructed hyper-graph. DTIHNC [36] utilized GAT and RWR models to learn direct-neighbor and multi-hop-neighbor information, respectively, and then applied a multi-level CNN to fuse node features. HGIMDA [37] adopted a graph neural network-based encoder to aggregate node neighborhood information and obtain low-dimensional node embeddings for association prediction.
SFAGE [6] optimized the original features through random walks and added a reinforcement layer to the hidden layer of the graph convolutional network to preserve similarity information in the feature space. AutoEdge-CCP [38] modeled circRNA–cancer–drug interactions with a multi-source heterogeneous network, where each molecule combines intrinsic attribute information. However, these approaches pay little attention to the multiple meta-path-based relationships between miRNAs and diseases, although such paths usually contain rich and meaningful semantics that contribute crucially to learning miRNA and disease representations. Besides, they only randomly select negative samples from the unlabeled sample set, which affects their training performance.

Generally, heterogeneous networks pose certain difficulties for node representation learning due to their complicated structure [39]. Hence, it remains a great challenge to design an automated learning framework that fully explores the complex meta-path-based relationships over heterogeneous graphs for embedding learning [40]. In addition, it is difficult to obtain high-quality negative samples for the miRNA–disease association prediction task. Consequently, many current approaches randomly select samples from the unlabeled sample set as negative samples, but this set often contains numerous false negatives, which affects the accuracy of prediction models. Therefore, selecting highly reliable, high-quality negative training samples is of great significance for miRNA–disease association prediction.

To address these issues, we propose a novel prediction model with multi-layer graph convolution and a negative sample selection strategy (named MGCNSS). Firstly, we collect multiple types of similarity data and obtain the integrated miRNA and disease similarity networks to fully capture the similarity among miRNAs and among diseases. Then, we adopt the multi-layer graph convolution module to automatically capture the rich semantics of meta-paths with different lengths between miRNAs and diseases and learn their embeddings from different layers. After that, we adopt a distance-based negative sample selection strategy to screen out high-quality negative training samples. Finally, MGCNSS predicts the miRNA–disease associations with the learned embeddings of miRNAs and diseases. The workflow is displayed in Figure 1. We summarize the contributions of this study as follows:

  • MGCNSS can automatically capture the rich semantics of meta-paths with different lengths between miRNAs and diseases for learning their discriminative embeddings.

  • A negative sample selection strategy is put forward to screen out highly reliable negative samples for training, which enhances the performance of the prediction model.

  • We conduct comprehensive experiments on balanced and imbalanced datasets, and the results demonstrate that MGCNSS outperforms all baseline methods on the evaluation metrics.

The overall architecture of MGCNSS, which mainly has four steps. (A) MGCNSS establishes the integrated miRNA and disease similarity matrix by fusing their different types of similarities. Then, the RWR algorithm is applied to learn the initial feature matrix ${H^{(0)}}$ for miRNAs and diseases. (B) MGCNSS adopts a multi-layer graph convolution module to capture the rich semantics of meta-paths with different lengths and learns the discriminative embeddings of miRNAs and diseases. (C) MGCNSS employs cosine-distance and Euclidean-distance-based strategies to select high-quality negative samples. (D) MGCNSS predicts the miRNA–disease associations with their learned embeddings.
Figure 1


MATERIALS AND METHODS

Dataset

In this study, we collected the experimental data from HMDD v2.0 [41], which contains 495 miRNAs, 383 diseases and 5430 experimentally verified miRNA–disease associations. From these data, MGCNSS constructs the miRNA–disease association network and the corresponding association matrix denoted as $A = [A_{ij}] \in \mathbb{R}^{N_m \times N_d}$, where $N_m$ and $N_d$ represent the number of miRNAs and diseases, respectively. For matrix $A$, $A_{ij} = 1$ indicates an association between miRNA $m_i$ and disease $d_j$, while $A_{ij} = 0$ indicates that no association is known between $m_i$ and $d_j$.
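
As a minimal illustration of this setup, the association matrix $A$ can be assembled from a list of known (miRNA, disease) index pairs. The sizes and pairs below are toy values, not the HMDD data.

```python
import numpy as np

def build_association_matrix(pairs, n_mirna, n_disease):
    """Build the binary miRNA-disease association matrix A.

    pairs: iterable of (i, j) index tuples for which A[i, j] = 1.
    """
    A = np.zeros((n_mirna, n_disease))
    for i, j in pairs:
        A[i, j] = 1.0
    return A

# Toy example: 4 miRNAs, 3 diseases, 3 known associations.
A = build_association_matrix([(0, 1), (2, 0), (3, 2)], 4, 3)
```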

MiRNA functional similarity

We downloaded the miRNA functional similarity scores from HMDD [42]. In this dataset, the functional similarities between all miRNAs are represented in one matrix $FM \in \mathbb{R}^{N_m \times N_m}$, where $FM(m_i, m_j)$ denotes the similarity value between miRNA $m_i$ and miRNA $m_j$. It is worth noting that a higher score indicates a higher level of similarity between the two miRNAs.

Disease semantic similarity

The relationships between diseases have been well described based on the Medical Subject Headings (MeSH) descriptors [43], which could be constructed as one directed acyclic graph (DAG). MGCNSS measures the disease semantic similarity based on this DAG. Specifically, MGCNSS calculates the semantic contribution of disease d to D according to the following formula:

$D1_{D}(d) = \begin{cases} 1, & \text{if } d = D \\ \max\left\{\Delta \cdot D1_{D}(d') \mid d' \in \text{children of } d\right\}, & \text{if } d \neq D \end{cases}$ (1)

where d is the ancestor node of D, and Δ represents the semantic contribution decay factor, which is usually 0.5. According to equation (1), the semantic value of D can be obtained by

$DSM1(D) = \sum_{d \in T(D)} D1_{D}(d)$ (2)

where T(D) denotes the ancestor node set of D including D itself. Therefore, the semantic similarity between disease di and disease dj can be calculated as follows:

$FD1(d_i, d_j) = \dfrac{\sum_{d \in T(d_i) \cap T(d_j)} \left( D1_{d_i}(d) + D1_{d_j}(d) \right)}{DSM1(d_i) + DSM1(d_j)}$ (3)

Besides, if disease d occurs only in the DAG of disease D but not in the DAGs of other diseases, it is necessary to increase the contribution score of disease d to D [44]. Therefore, MGCNSS measures a second type of semantic contribution of disease d to D, which is formulated as follows:

$D2_{D}(d) = -\log \left( \dfrac{\text{the number of DAGs including } d}{\text{the number of diseases}} \right)$ (4)

Similar to the calculation strategy of $FD1$, we can formulate the equations for $DSM2$ and $FD2$:

$DSM2(D) = \sum_{d \in T(D)} D2_{D}(d)$ (5)
$FD2(d_i, d_j) = \dfrac{\sum_{d \in T(d_i) \cap T(d_j)} \left( D2_{d_i}(d) + D2_{d_j}(d) \right)}{DSM2(d_i) + DSM2(d_j)}$ (6)

Finally, these two types of calculation approaches are combined:

$FD = \dfrac{FD1 + FD2}{2}$ (7)

where FD is the disease semantic similarity matrix.
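
The recursion above can be sketched in a few lines of Python. The toy MeSH-style DAG below (diseases `D`, `d1`, `d2`, `d3` and their parent links) is purely hypothetical, and the decay factor is 0.5 as in the text; only the first contribution type (Equations 1–3) is shown.

```python
DELTA = 0.5  # semantic contribution decay factor

# Hypothetical toy DAG: disease -> list of parent diseases.
PARENTS = {"D": [], "d1": ["D"], "d2": ["D"], "d3": ["d1", "d2"]}

def contributions(parents, D):
    """Return {d: D1_D(d)} for every ancestor d in T(D), per Eq. (1)."""
    contrib = {D: 1.0}
    frontier = [D]
    while frontier:
        node = frontier.pop()
        for p in parents[node]:
            cand = DELTA * contrib[node]    # contribution decays by DELTA per hop
            if cand > contrib.get(p, 0.0):  # keep the max over all children
                contrib[p] = cand
                frontier.append(p)
    return contrib

def fd1(parents, di, dj):
    """Semantic similarity FD1(di, dj), per Eqs. (2)-(3)."""
    ci, cj = contributions(parents, di), contributions(parents, dj)
    shared = set(ci) & set(cj)              # T(di) ∩ T(dj)
    return sum(ci[d] + cj[d] for d in shared) / (sum(ci.values()) + sum(cj.values()))
```

For instance, `fd1(PARENTS, "d1", "d3")` shares ancestors `d1` and `D`, giving (1 + 0.5 + 0.5 + 0.25) / (1.5 + 2.25) = 0.6.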

GIP kernel similarity

Similar to the previous research [45], MGCNSS measures the GIP kernel similarity [11] for miRNAs and diseases based on the miRNA–disease adjacency matrix A.

Take building the GIP kernel similarity matrix for miRNAs ($GM$) as an example. Firstly, the ith and jth rows of the adjacency matrix $A$ are treated as the disease interaction profiles of miRNAs $m_i$ and $m_j$, denoted as $R_i$ and $R_j$ [10]. Then, we measure the GIP kernel similarity between miRNA $m_i$ and miRNA $m_j$ based on $R_i$ and $R_j$, which is constructed by

$GM(m_i, m_j) = \exp \left( -\alpha_m \left\| R_i - R_j \right\|^2 \right)$ (8)

where $\alpha_m$ controls the bandwidth of the kernel and is calculated as follows:

$\alpha_m = \alpha'_m \Big/ \left( \dfrac{1}{N_m} \sum_{i=1}^{N_m} \left\| R_i \right\|^2 \right)$ (9)

where $N_m$ represents the number of miRNAs, and $\alpha'_m$ is usually set to 1 following previous studies.

In a similar manner, MGCNSS establishes the similarity matrix $GD$ for diseases. Specifically, MGCNSS treats the columns of matrix $A$ as the miRNA interaction profiles of the corresponding diseases. Without loss of generality, the profiles of $d_i$ and $d_j$ are denoted as $C_i$ and $C_j$, and their similarity is formulated as follows:

$GD(d_i, d_j) = \exp \left( -\beta_m \left\| C_i - C_j \right\|^2 \right)$ (10)
$\beta_m = \beta'_m \Big/ \left( \dfrac{1}{N_d} \sum_{i=1}^{N_d} \left\| C_i \right\|^2 \right)$ (11)

where $N_d$ represents the number of diseases, and $\beta'_m$ is also set to 1.
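
Equations (8)–(11) reduce to a few NumPy operations. The routine below is a sketch; the 3×3 association matrix is a toy stand-in for $A$, and `bandwidth` plays the role of the parameter the paper fixes to 1.

```python
import numpy as np

def gip_kernel(profiles, bandwidth=1.0):
    """GIP kernel similarity between the rows of an association matrix.

    profiles: (N, M) array whose ith row is the interaction profile R_i.
    """
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = bandwidth / sq_norms.mean()               # Eq. (9) / (11)
    # ||R_i - R_j||^2 for all pairs via the squared-distance expansion
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * d2)                        # Eq. (8) / (10)

A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
GM = gip_kernel(A)      # miRNA similarities from the rows of A
GD = gip_kernel(A.T)    # disease similarities from the columns of A
```

Each profile has similarity 1 with itself, and similarity decays exponentially with the squared distance between profiles.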

lncRNA-based similarity

Many studies have shown that lncRNAs participate in various biological processes, including DNA methylation, post-transcriptional regulation of RNA and the regulation of protein translation. As a result, lncRNAs have association relationships with both miRNAs and diseases. MGCNSS adopts these association relationships to measure the similarity between miRNAs and between diseases separately.

The raw data are downloaded from the starBase v2.0 database [46], and the miRNA–lncRNA association matrix and disease–lncRNA association matrix are obtained from GATMDA [13]. Finally, we adopt an edit-distance algorithm [47] to obtain the lncRNA-based similarity matrices for miRNAs and diseases, denoted $LM \in \mathbb{R}^{N_m \times N_m}$ and $LD \in \mathbb{R}^{N_d \times N_d}$, respectively.

MGCNSS model

The overall architecture of MGCNSS is shown in Figure 1. The model can be summarized into four main parts: multi-source data fusion, multi-layer graph convolution, negative sample selection and model prediction.

Multi-source data fusion

So far, we have obtained the miRNA functional similarity matrix $FM$, the miRNA Gaussian kernel-based similarity matrix $GM$ and the lncRNA-based miRNA similarity matrix $LM$. Next, MGCNSS builds an integrated miRNA similarity matrix named $IM$ based on $FM$, $GM$ and $LM$, which is formulated as follows:

$IM = \alpha_1 FM + \alpha_2 GM + \alpha_3 LM$ (12)

where α1, α2 and α3 are hyper-parameters.

Similarly, MGCNSS establishes the disease semantic similarity matrix $FD$, the Gaussian kernel-based disease similarity matrix $GD$ and the lncRNA-based disease similarity matrix $LD$. The integrated disease similarity matrix $ID$ is defined as follows:

$ID = \beta_1 FD + \beta_2 GD + \beta_3 LD$ (13)

where $\beta_1$, $\beta_2$ and $\beta_3$ are hyper-parameters. These hyper-parameters are investigated in the Results section.

MGCNSS then constructs the heterogeneous miRNA–disease association network $N_{hete}$, which is used by the multi-layer graph convolution module. The matrix representation of $N_{hete}$ is denoted as $M$ and formulated as follows:

$M = \begin{bmatrix} IM & A \\ A^{T} & ID \end{bmatrix}$ (14)

where $IM \in \mathbb{R}^{N_m \times N_m}$, $ID \in \mathbb{R}^{N_d \times N_d}$, $A \in \mathbb{R}^{N_m \times N_d}$ and $M \in \mathbb{R}^{(N_d+N_m) \times (N_d+N_m)}$.
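
Under these definitions, Eq. (14) amounts to a block-matrix assembly. In the sketch below, the random matrices are placeholders for the integrated similarities and the toy sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Nm, Nd = 5, 4    # toy sizes (the real dataset has 495 miRNAs, 383 diseases)

IM = rng.random((Nm, Nm)); IM = (IM + IM.T) / 2   # stand-in integrated miRNA similarities
ID = rng.random((Nd, Nd)); ID = (ID + ID.T) / 2   # stand-in integrated disease similarities
A = rng.integers(0, 2, (Nm, Nd)).astype(float)    # stand-in association matrix

# Eq. (14): the (Nm+Nd) x (Nm+Nd) heterogeneous network matrix.
M = np.block([[IM, A],
              [A.T, ID]])
```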

Finally, MGCNSS initializes the features of miRNAs and diseases with the RWR algorithm [36] based on the adjacency matrix $A$. Specifically, the RWR algorithm is formulated as follows:

$D_t^{(k+1)} = (1-r) A_N D_t^{(k)} + r D_t^{(0)}$ (15)

where $r$ represents the restart probability and $k$ indexes the iteration. $D_t^{(0)} \in \mathbb{R}^{(N_d+N_m) \times 1}$ is the initial vector of the tth node, and $A_N$ is the normalized adjacency matrix derived from $A$. The algorithm runs until $\left\| D^{(k+1)} - D^{(k)} \right\|_F \le 10^{-6}$, where $D_t^{(k)}$ is the output feature of the tth node at the kth iteration. In this way, the initial features of all miRNAs and diseases are obtained, forming the feature matrix denoted as $D^{(k)}$.
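
A minimal sketch of the RWR iteration in Eq. (15), run from every node at once (each column of `D0` is one node's restart vector). The restart probability and the two-node toy graph are illustrative assumptions.

```python
import numpy as np

def rwr_features(A_norm, restart=0.5, tol=1e-6, max_iter=1000):
    """Iterate D <- (1 - r) * A_N @ D + r * D0 until the Frobenius norm of
    the update falls below tol; column t is the feature vector of node t."""
    n = A_norm.shape[0]
    D0 = np.eye(n)          # one restart vector per node
    D = D0.copy()
    for _ in range(max_iter):
        D_next = (1.0 - restart) * A_norm @ D + restart * D0
        if np.linalg.norm(D_next - D) <= tol:   # Frobenius-norm convergence test
            return D_next
        D = D_next
    return D

# Toy two-node graph: each node's probability mass flows to the other.
H0 = rwr_features(np.array([[0.0, 1.0], [1.0, 0.0]]))
```

With a column-stochastic `A_norm`, each column of the result remains a probability distribution over the nodes.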

Multilayer graph convolution

Generally, meta-paths have a powerful ability to capture multiple relationships of nodes in the heterogeneous network. It is essential to comprehensively explore the rich complex semantics to learn the embeddings of miRNAs and diseases. Here we employ multi-layer graph convolution to learn their embeddings.

As shown in Figure 1(B), the graph convolutional module consists of multiple graph convolutional layers, which could fully capture the meaning of meta-paths with different lengths. Specifically, the first layer of convolution can be represented as follows:

$H^{(1)} = \sigma \left( M H^{(0)} W^{(1)} \right)$ (16)

where $H^{(1)} \in \mathbb{R}^{(N_m+N_d) \times d}$ is the feature matrix for meta-paths with length 1, $H^{(0)} = D^{(k)}$ is the input feature matrix obtained from RWR, and $W^{(1)} \in \mathbb{R}^{(N_m+N_d) \times d}$ is a learnable weight matrix, where $d$ is the embedding size of the output features of miRNAs and diseases.

The two-layer convolution is formulated as follows:

$H^{(2)} = \sigma \left( M H^{(1)} W^{(2)} \right)$ (17)

where $H^{(2)} \in \mathbb{R}^{(N_m+N_d) \times d}$ and $W^{(2)} \in \mathbb{R}^{d \times d}$.

Similarly, the l-layer convolution is denoted as follows:

$H^{(l)} = \sigma \left( M H^{(l-1)} W^{(l)} \right)$ (18)

Finally, MGCNSS establishes $l$ feature matrices, corresponding to meta-paths with $l$ different lengths. Notably, lower-order convolutional layers tend to focus on the neighbors of miRNAs or diseases, while higher-order convolutional layers can capture long-distance relationships. Then, we apply attention coefficients to combine the $l$ feature matrices with different weights:

$H = \sum_{i=1}^{l} a_i H^{(i)}$ (19)

where $a_i$ is the attention coefficient of the ith layer, and $H$ is the final output of the multi-layer graph convolution module, i.e. the ultimate feature matrix of miRNAs and diseases.
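
Equations (16)–(19) can be sketched as follows. ReLU as the nonlinearity, a softmax over per-layer attention logits, and all sizes are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multilayer_gcn(M, H0, weights, attn_logits):
    """Stack len(weights) convolution layers (Eqs. 16-18) and fuse the
    per-layer outputs with attention coefficients (Eq. 19)."""
    outputs, H = [], H0
    for W in weights:
        H = relu(M @ H @ W)     # each layer extends the meta-paths by one hop
        outputs.append(H)
    a = softmax(attn_logits)    # attention weights over the l layers
    return sum(ai * Hi for ai, Hi in zip(a, outputs))

rng = np.random.default_rng(1)
n, d = 6, 4                               # toy node count and embedding size
M = rng.random((n, n)); M = (M + M.T) / 2 # toy heterogeneous network matrix
H0 = np.eye(n)                            # toy initial features
weights = [0.1 * rng.standard_normal((n, d)),    # W(1): (Nm+Nd) x d
           0.1 * rng.standard_normal((d, d))]    # W(2): d x d
H = multilayer_gcn(M, H0, weights, np.zeros(2))  # equal attention over 2 layers
```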

Figure 2 gives a toy example demonstrating how the multi-layer graph convolution automatically captures the semantics of meta-paths with length 2.

A toy example for learning the importance of meta-paths. (A) There are two node types and two edge types. (B) The miRNA–disease heterogeneous network, where the numbers on yellow edges and blue edges represent the similarities between the different nodes and the association relationships, respectively. (C) For the meta-paths with length equal to 2, we display all the paths from node $m_{2}$ to $d_{1}$. For each path, we first multiply the weights of its two edges to measure the total weight of the corresponding path. Then, MGCNSS could construct the weight matrix $M_{c}^{2}$ in terms of the meta-paths with length 2. (D) The integrated result from $m_{2}$ to $d_{1}$ is shown in the weight matrix $M_{c}^{2}$.
Figure 2


Negative sample selection strategy

In association prediction tasks, the quality of negative samples strongly affects the performance of prediction models [48]. Positive samples can be collected directly, but obtaining ground-truth negative samples is challenging [49]. Therefore, most current approaches treat the known miRNA–disease associations as positive samples, while the remaining pairs are regarded as unlabeled samples. Moreover, previous research usually randomly selects a certain number of unlabeled samples to form the negative sample set [50]. This selection strategy may introduce noisy samples, which can interfere with model training and reduce prediction accuracy.

To solve this problem, we propose a distance-based negative sample selection strategy, combining cosine distance and Euclidean distance, to select reliable negative samples. Firstly, we adapt the k-means clustering algorithm [51] to generate a centroid vector $C_p$ for the positive sample set $P_{set}$ and a centroid vector $C_u$ for the unlabeled sample set $U_{set}$. Specifically, $C_p$ and $C_u$ are calculated by the following formulas:

$C_p = \dfrac{1}{|P_{set}|} \sum_{v_i \in P_{set}} F_{v_i}$ (20)
$C_u = \dfrac{1}{|U_{set}|} \sum_{v_j \in U_{set}} F_{v_j}$ (21)

where $v_i$ denotes the ith miRNA–disease pair and $F_{v_i}$ denotes its vector representation. Specifically, $F_{v_i} \in \mathbb{R}^{1 \times (N_m+N_d)}$ is formed by concatenating the feature vectors of a miRNA–disease pair from the positive sample set. Similarly, $F_{v_j} \in \mathbb{R}^{1 \times (N_m+N_d)}$ is the concatenated vector of a pair from the unlabeled sample set.

Next, we compute the cosine similarity (CS) between each unlabeled sample and the two centroid vectors $C_p$ and $C_u$. For a sample $v_i \in U_{set}$, if $CS_p$ is greater, we put it into the potential positive sample set $P_l$; conversely, if $CS_u$ is greater, it is put into the potential negative sample set $N_l$. The formulas for $CS_p$ and $CS_u$ are defined as follows:

$CS_p(v_i) = \dfrac{F_{v_i} \cdot C_p}{\left\| F_{v_i} \right\| \left\| C_p \right\|}$ (22)
$CS_u(v_i) = \dfrac{F_{v_i} \cdot C_u}{\left\| F_{v_i} \right\| \left\| C_u \right\|}$ (23)

In this manner, we obtain the $P_l$ and $N_l$ sets and recalculate their corresponding centroid vectors $C_p$ and $C_u$ with Equations (20) and (21), respectively. Then we compare the samples $v_i \in U_{set}$ with the new centroid vectors $C_p$ and $C_u$ by the Euclidean similarity (ES) measurement and divide them accordingly. The formulas for calculating ES are as follows:

$ES_p(v_i) = \left\| F_{v_i} - C_p \right\|_2$ (24)
$ES_u(v_i) = \left\| F_{v_i} - C_u \right\|_2$ (25)

MGCNSS divides the samples in $U_{set}$ into $P_l$ and $N_l$ according to their $ES_p$ and $ES_u$ values and then starts the next iteration.

This process is repeated until the centroids converge, i.e. the following conditions are met:

$\left\| C_p^{(t+1)} - C_p^{(t)} \right\|_F \le \varepsilon$ (26)
$\left\| C_u^{(t+1)} - C_u^{(t)} \right\|_F \le \varepsilon$ (27)

where $\left\| \cdot \right\|_F$ denotes the Frobenius norm and $t$ indexes the iteration. The $N_l$ obtained in the last iteration is treated as the final reliable negative sample set. The selection process is shown in Figure 3 and summarized in Algorithm 1.
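
The iterative selection can be sketched as follows. The function name, the convergence threshold `eps` and the 2-D toy features are assumptions for illustration; the routine returns a boolean mask marking the likely negative pairs.

```python
import numpy as np

def select_negatives(F_pos, F_unl, eps=1e-4, max_iter=100):
    """Distance-based negative sample selection (sketch of Eqs. 20-27).

    F_pos: (P, d) features of positive pairs; F_unl: (U, d) features of
    unlabeled pairs. Returns a boolean mask over F_unl (True = likely negative).
    Assumes both groups stay non-empty across iterations."""
    Cp, Cu = F_pos.mean(axis=0), F_unl.mean(axis=0)        # Eqs. (20)-(21)

    def cos(X, c):
        return (X @ c) / (np.linalg.norm(X, axis=1) * np.linalg.norm(c) + 1e-12)

    neg = cos(F_unl, Cu) > cos(F_unl, Cp)                  # first split, Eqs. (22)-(23)
    for _ in range(max_iter):
        Cp_new = F_unl[~neg].mean(axis=0)                  # centroid of likely positives
        Cu_new = F_unl[neg].mean(axis=0)                   # centroid of likely negatives
        # Euclidean refinement, Eqs. (24)-(25): assign to the closer centroid.
        neg = (np.linalg.norm(F_unl - Cu_new, axis=1)
               < np.linalg.norm(F_unl - Cp_new, axis=1))
        # stop once both centroids have converged, Eqs. (26)-(27)
        if np.linalg.norm(Cp_new - Cp) <= eps and np.linalg.norm(Cu_new - Cu) <= eps:
            break
        Cp, Cu = Cp_new, Cu_new
    return neg
```

On clearly separated toy clusters the mask recovers the group far from the positives; in MGCNSS, the pairs flagged negative in the last iteration form $N_l$.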


The process of negative sample selection strategy. First, MGCNSS generates centroid vectors $C_{p}$ and $C_{u}$ from positive samples and the remaining unlabeled samples, respectively. Then, MGCNSS calculates the CS between each sample in the unlabeled sample set and the centroid vectors $C_{p}$ and $C_{u}$, respectively. Based on the CS, we could divide the unlabeled samples into two groups, Likely Positive Pairs (LP) and Likely Negative Pairs (LN). Next, we update the two centroid vectors using LN and LP. Moreover, MGCNSS adopts ES to repeat these steps until the centroid vectors $C_{p}$ and $C_{u}$ converge. Finally, we regard LN as the reliable negative sample set.
Figure 3


Model training

Finally, we train MGCNSS by minimizing the binary cross-entropy (BCE) loss to optimize the model parameters, which is formulated as follows:

$L = -\sum_{(m,d)} \left[ A_{md} \log \sigma \left( \langle H_m, H_d \rangle \right) + \left( 1 - A_{md} \right) \log \left( 1 - \sigma \left( \langle H_m, H_d \rangle \right) \right) \right]$ (28)

where $H_m$ represents the mth row vector of the feature matrix obtained from the convolutional layers, $\sigma$ denotes the sigmoid function, and $\langle H_m, H_d \rangle$ represents the inner product of $H_m$ and $H_d$.
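
The loss reduces to a standard binary cross-entropy over inner-product scores. The helper below is a hypothetical sketch (it averages rather than sums, and clips probabilities to avoid log(0)).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(H_m, H_d, pairs, labels):
    """Mean BCE over (miRNA, disease) index pairs, scoring each pair by the
    inner product of its two embedding rows (cf. Eq. 28)."""
    scores = np.array([H_m[i] @ H_d[j] for i, j in pairs])
    p = np.clip(sigmoid(scores), 1e-12, 1.0 - 1e-12)  # predicted association probability
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

An uninformative score of 0 yields probability 0.5 and loss ln 2 ≈ 0.693; confident correct predictions drive the loss toward 0.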

Time complexity analysis

Here we analyze the time complexity of MGCNSS. As shown in Figure 1, MGCNSS mainly has three modules: multi-source data fusion, multi-layer graph convolution and negative sample selection. Suppose that there are $m$ miRNAs and $n$ diseases. MGCNSS has to measure the similarities between all miRNAs and between all diseases, whose time complexity is $O(m^2+n^2)$; integrating the different miRNA and disease similarity networks takes $O(3m^2+3n^2)$. The time complexity of the multi-layer graph convolution is $O((m+n)^2 d)$, where $d$ is the embedding dimension of miRNAs and diseases, since the heterogeneous matrix $M$ has size $(m+n) \times (m+n)$. The time complexity of negative sample selection is $O(k \cdot mn)$, where $k$ is the number of iterations over the roughly $mn$ unlabeled pairs. As a result, the total complexity of MGCNSS is $O(m^2+n^2) + O((m+n)^2 d) + O(k \cdot mn)$, which is dominated by $O((m+n)^2 d)$. Since $d$ is a constant, the final time complexity of MGCNSS is $O((m+n)^2)$. In this study, MGCNSS takes about 5 s to complete one five-fold cross-validation round, and running 2000 epochs takes about 146 minutes.

RESULTS

In this section, we first briefly introduce the implementation details and evaluation metrics used in this study. Then the comparison results of MGCNSS as well as the baselines are well presented. After that, the ablation experiments and the parameter sensitivity experiments are demonstrated. Last, the case studies are displayed.

Implementation details and evaluation metrics

For MGCNSS, the embedding size of miRNAs and diseases for prediction is 256, the learning rate is 0.0005 and the weight decay is 0.0005. Besides, the number of training epochs is set to 2000. In the multi-layer graph convolution module, the number of convolution layers $l$ is 2.

Meanwhile, to comprehensively evaluate the performance of the prediction models, we establish two types of experimental datasets according to the ratio between the numbers of positive and negative pairs. Specifically, on the balanced dataset, the ratio of positive to negative samples is 1:1; on the imbalanced datasets, the ratio is 1:5 or 1:10.

Finally, we employ widely used evaluation metrics [52] to evaluate the performance of MGCNSS and the comparison approaches: accuracy (ACC), the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR). Specifically, AUC is commonly employed to evaluate the stability of a prediction model, while ACC measures its predictive accuracy. AUPR is another crucial metric for evaluating the performance of prediction models on imbalanced datasets.

Besides, we employ the 5-fold cross-validation strategy (5-CV) to further evaluate the performance of MGCNSS. Figure 4 demonstrates the process of 5-CV on the balanced dataset. MGCNSS first employs the negative sample selection strategy to choose likely negative samples from the unlabeled samples. Then MGCNSS selects M samples from the likely negative sample set, where M is also the number of positive samples. After that, MGCNSS randomly divides the positive and selected negative samples into five folds and performs the 5-CV experiment, where each fold is used in turn as the test set. Finally, MGCNSS adopts the average value over the five folds as the evaluation result for the corresponding epoch. The description above covers one epoch; as the number of epochs increases, the BCE loss converges and MGCNSS attains its best performance and the corresponding hyperparameters.
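
The balanced 5-CV protocol described above can be sketched as follows. The function name is hypothetical; as many negatives as positives are drawn from a stand-in likely-negative pool, and the combined samples are shuffled into five train/test folds.

```python
import numpy as np

def balanced_five_folds(pos_idx, likely_neg_idx, seed=0):
    """Sample as many negatives as positives, then split the combined
    samples into five (train, test) folds."""
    rng = np.random.default_rng(seed)
    neg = rng.choice(likely_neg_idx, size=len(pos_idx), replace=False)
    samples = rng.permutation(np.concatenate([pos_idx, neg]))
    folds = np.array_split(samples, 5)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(5)]

# Toy indices: 100 positive pairs, 900 likely-negative candidates.
splits = balanced_five_folds(np.arange(100), np.arange(100, 1000))
```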

The 5-fold cross-validation strategy used in this study.
Figure 4


Moreover, the hyperparameters used in this study are selected based on the training dataset. The negative sample selection strategy is applied to both the training and test datasets. Besides, in the parameter sensitivity analysis section, the presented results are derived from the test dataset.

Comparison with other baseline models

Here, we mainly select eleven competitive methods to compare with the proposed model, which are SVM [18], RF [53], XGBoost [54], GCN [55], GAT [56], DTIGAT [57], DTICNN [58], NeoDTI [59], MSGCL [3], AMHMDA [35] and GATMDA [13]:

  • SVM [18]: is a classic supervised learning algorithm; the learned embeddings of miRNAs and diseases are fed into it directly for MDA prediction.

  • RF [53]: is an ensemble learning method that combines multiple decision trees, used here for MDA prediction.

  • XGBoost [54]: is a widely used gradient boosting framework in which the features of miRNAs and diseases are fed directly for MDA prediction.

  • GCN [55]: is a neural network architecture designed for graph-structured data, which is employed to learn the embeddings of miRNAs and diseases for MDA prediction.

  • GAT [56]: is also a neural network architecture that utilizes the attention mechanism in the feature learning process.

  • DTIGAT [57]: is an end-to-end framework that assigns different weights to node neighbors with the self-attention mechanisms for DTI prediction.

  • DTICNN [58]: adopts the random walk with restart (RWR) algorithm to extract features and employs a denoising auto-encoder for dimensionality reduction for MDA prediction.

  • NeoDTI [59]: is an end-to-end model that integrates heterogeneous information and automatically learns topology-preserving representations.

  • MSGCL [3]: adopts the multi-view self-supervised contrastive learning for MDA prediction that could enhance the latent representation by maximizing the consistency between different views.

  • AMHMDA [35]: applies the attention-aware multi-view similarity strategy to learn the embeddings of nodes from the heterogeneous hyper-graph to predict the miRNA–disease associations.

  • GATMDA [13]: fuses linear and non-linear embeddings of miRNAs and diseases and adopts the RF model to complete the prediction task.

We first perform the comparison experiment on the balanced dataset, in which the ratio of positive to negative samples is 1:1. The results are shown in Table 1, where MGCNSS significantly outperforms all baseline methods. Specifically, MGCNSS achieves 0.9874, 0.9453 and 0.9882 on the AUC, ACC and AUPR metrics, respectively. XGBoost ranks second on both AUC and AUPR, with values of 0.9353 and 0.9355, while DTIGAT obtains the second highest ACC of 0.8647.

Table 1

The evaluation results of MGCNSS and baseline methods with 1:1 ratios on AUC, ACC and AUPR metrics.

Model | AUC | ACC | AUPR
SVM [18] | 0.8990±0.0012 | 0.8051±0.0132 | 0.9059±0.0099
RF [53] | 0.9255±0.0021 | 0.8412±0.0023 | 0.9201±0.0027
XGBoost [54] | 0.9353±0.0121 | 0.8637±0.0142 | 0.9355±0.0019
GCN [55] | 0.8340±0.0012 | 0.7840±0.0024 | 0.8715±0.0042
GAT [56] | 0.8734±0.0023 | 0.8012±0.0032 | 0.8703±0.0041
DTIGAT [57] | 0.9153±0.0004 | 0.8647±0.0054 | 0.9109±0.0032
DTICNN [58] | 0.9057±0.0032 | 0.8230±0.0043 | 0.8979±0.0042
NeoDTI [59] | 0.8301±0.0153 | 0.7915±0.0041 | 0.8568±0.0021
MSGCL [3] | 0.8370±0.0042 | 0.7763±0.0012 | 0.9162±0.0031
AMHMDA [35] | 0.9176±0.0054 | 0.8063±0.0165 | 0.9167±0.0147
GATMDA [13] | 0.9279±0.0120 | 0.8471±0.0231 | 0.9269±0.0039
MGCNSS (ours) | 0.9874±0.0078 | 0.9453±0.0089 | 0.9882±0.0013
MGCNSS w/o NSST | 0.9437±0.0057 | 0.8859±0.0112 | 0.9125±0.0188

Note: MGCNSS w/o NSST is the variant of MGCNSS and we only compare it with MGCNSS. The best results are marked in bold and the second best is underlined.


Moreover, we vary the ratio of positive to negative samples to 1:5 and 1:10; the corresponding results are listed in Table 2. The proposed method achieves the best performance on all evaluation metrics. Specifically, with the 1:5 ratio MGCNSS scores 0.9861, 0.9586 and 0.9758 on the AUC, ACC and AUPR metrics, while with the 1:10 ratio it scores 0.9871, 0.9786 and 0.9385 on the corresponding metrics. XGBoost ranks second on AUC and AUPR under both ratios and on ACC under the 1:10 ratio, while DTIGAT ranks second on ACC under the 1:5 ratio.

Table 2

The evaluation results of MGCNSS and baseline methods on 1:5 and 1:10 ratios

Model | AUC (1:5) | AUC (1:10) | ACC (1:5) | ACC (1:10) | AUPR (1:5) | AUPR (1:10)
SVM [18] | 0.9105±0.0012 | 0.9010±0.0019 | 0.8963±0.0031 | 0.9129±0.0025 | 0.7440±0.0018 | 0.6093±0.0076
RF [53] | 0.9279±0.0041 | 0.9291±0.0067 | 0.9078±0.0121 | 0.9402±0.0067 | 0.7721±0.0032 | 0.6861±0.0038
XGBoost [54] | 0.9404±0.0042 | 0.9579±0.0022 | 0.9179±0.0087 | 0.9566±0.0045 | 0.8511±0.0031 | 0.8099±0.0041
GCN [55] | 0.8291±0.0098 | 0.8410±0.0033 | 0.9122±0.0044 | 0.9435±0.0023 | 0.6367±0.0089 | 0.5115±0.0066
GAT [56] | 0.8637±0.0011 | 0.8682±0.0042 | 0.8509±0.0043 | 0.9040±0.0024 | 0.6841±0.0032 | 0.4669±0.0078
DTIGAT [57] | 0.9103±0.0043 | 0.9085±0.0113 | 0.9198±0.0098 | 0.9381±0.0120 | 0.8009±0.0321 | 0.6987±0.0132
DTICNN [58] | 0.9066±0.0042 | 0.9087±0.0313 | 0.8903±0.0023 | 0.9281±0.0038 | 0.7119±0.0138 | 0.5915±0.0201
NeoDTI [59] | 0.8430±0.0029 | 0.8287±0.0031 | 0.8144±0.0065 | 0.8597±0.0087 | 0.6531±0.0043 | 0.5219±0.0098
MSGCL [3] | 0.8371±0.0077 | 0.8385±0.0098 | 0.8042±0.0099 | 0.8584±0.0043 | 0.7391±0.0187 | 0.6271±0.0032
AMHMDA [35] | 0.9178±0.0163 | 0.9144±0.0067 | 0.8927±0.0181 | 0.9320±0.0039 | 0.7431±0.0229 | 0.6395±0.0312
GATMDA [13] | 0.9247±0.0077 | 0.9274±0.0392 | 0.9170±0.0124 | 0.9465±0.0191 | 0.8048±0.0521 | 0.7253±0.0391
MGCNSS (ours) | 0.9861±0.0016 | 0.9871±0.0025 | 0.9586±0.0178 | 0.9786±0.0092 | 0.9758±0.0077 | 0.9385±0.0112
MGCNSS w/o NSST | 0.9439±0.0066 | 0.9442±0.0188 | 0.9082±0.0076 | 0.9311±0.0102 | 0.8615±0.0542 | 0.8053±0.0201

Note: Since MGCNSS w/o NSST is the variant of MGCNSS, we only compare it with MGCNSS. The best results are marked in bold and the second best is underlined.


Meanwhile, to investigate the predictive performance of MGCNSS when the negative sample selection strategy uses information only from the training set and not from the test set, we perform an ablation experiment named MGCNSS w/o NSST (MGCNSS without the Negative Sample Selection strategy on the Test set). Specifically, the negative sample selection strategy is applied only to the training samples during training and is not applied to the test samples during testing. We run the corresponding experiments under different ratios (1:1, 1:5 and 1:10) and display the results in Tables 1 and 2.

The results demonstrate that the performance of MGCNSS w/o NSST is inferior to that of MGCNSS. For example, in Table 1, MGCNSS w/o NSST obtains 0.9437, 0.8859 and 0.9125 on AUC, ACC and AUPR, which are lower than the values of MGCNSS by 3.1, 6.3 and 7.6%, respectively. The metric values of MGCNSS w/o NSST in Table 2 are likewise inferior to those of MGCNSS. In conclusion, the results in Tables 1 and 2 illustrate that the negative sample selection strategy is essential for the performance of MGCNSS.

The ablation experiments

Multi-source data fusion, multi-layer graph convolution and negative sample selection

In MGCNSS, there are three essential modules: the multiple similarities integration module (denoted as MI; see Figure 1A), the meta-path-based multi-layer graph convolution module (denoted as MP; see Figure 1B) and the negative sample selection module (denoted as SS; see Figure 1C). Here, we conduct ablation experiments to investigate the effectiveness of these modules, comparing MGCNSS without MI (MGCNSS w/o MI), MGCNSS without MP (MGCNSS w/o MP), MGCNSS without SS (MGCNSS w/o SS) and the full MGCNSS.

The results show that MI, MP and SS are all essential for MGCNSS. Specifically, MGCNSS outperforms MGCNSS w/o MI, MGCNSS w/o MP and MGCNSS w/o SS by 3.81, 1.11 and 4.42% on the AUC metric, by 4.15, 3.00 and 5.02% on the ACC metric, and by 3.22, 0.70 and 5.76% on the AUPR metric, respectively, achieving the best performance on all three metrics. The corresponding results are shown in Figure 5.

Figure 5. The ablation experimental results of MGCNSS w/o MI, MGCNSS w/o MP, MGCNSS w/o SS and MGCNSS with 1:1 ratio.

Besides, we conduct the same validation on the imbalanced dataset with the 1:5 ratio; the results are shown in Figure 6. On the imbalanced dataset, the SS module significantly improves model performance: MGCNSS achieves 0.9871, 0.9671 and 0.9609 on the AUC, ACC and AUPR metrics, respectively, outperforming all of its variants. These results further show that the SS, MP and MI modules are all essential for improving prediction accuracy, and that the meta-path-based multi-layer graph convolution plays an important role in the performance of MGCNSS.

Figure 6. The ablation experimental results of MGCNSS w/o MI, MGCNSS w/o MP, MGCNSS w/o SS and MGCNSS with 1:5 ratio.

Different similarities of miRNAs and diseases in multi-source data fusion

In the multi-source data fusion module (Figure 1A), MGCNSS integrates three types of similarities: miRNA functional similarity and disease semantic similarity (denoted as FSM), GIP kernel similarity of miRNAs and diseases (denoted as GSM) and lncRNA-based similarity of miRNAs and diseases (denoted as LSM). The ablation experiment compares MGCNSS w/o FSM, MGCNSS w/o GSM, MGCNSS w/o LSM and the full MGCNSS; the corresponding results on the balanced dataset are shown in Figure 7.

Figure 7. The ablation experimental results of MGCNSS w/o FSM, MGCNSS w/o GSM, MGCNSS w/o LSM and MGCNSS.

The results in Figure 7 show that MGCNSS achieves the highest scores on the AUC, ACC and AUPR metrics, namely 0.9882, 0.9475 and 0.9888. The ACC scores of MGCNSS w/o GSM, MGCNSS w/o FSM and MGCNSS w/o LSM are 2.4, 2.1 and 0.4% lower than that of MGCNSS, respectively. These results illustrate that all three types of similarities are essential for MGCNSS.

The performance of MGCNSS based on meta-paths with different lengths

To evaluate the effect of meta-paths with different lengths on MGCNSS, we divide the meta-paths into the combinations {1}, {2}, {3}, {1,2,3} and {1,2}; the corresponding results for each combination are presented in Table 3.

Table 3

The performance of MGCNSS based on meta-paths with different lengths

Meta-path length | AUC | ACC | AUPR
{1} | 0.9819 | 0.9355 | 0.9835
{2} | 0.9160 | 0.8573 | 0.9192
{3} | 0.9157 | 0.8600 | 0.9340
{1,2,3} | 0.9643 | 0.9088 | 0.9722
{1,2} | 0.9874 | 0.9453 | 0.9882

MGCNSS with meta-path lengths {1,2} achieves the best performance, with AUC, ACC and AUPR values of 0.9874, 0.9453 and 0.9882, respectively. MGCNSS with length {1} ranks second, scoring 0.9819, 0.9355 and 0.9835 on the corresponding metrics. The performance with lengths {1,2,3} is competitive but lower than that with {1,2}, which may be because embeddings learned from length-3 meta-paths introduce noise. Overall, meta-paths with lengths {1,2} contribute most to the performance of MGCNSS.
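The connection between meta-path length and neighborhood aggregation can be illustrated on a plain adjacency matrix: the (i, j) entry of the l-th power of A counts the length-l paths between nodes i and j. This toy example is our own illustration and does not reproduce the exact propagation rule of MGCNSS.

```python
import numpy as np

# Path graph 0-1-2-3 as a toy stand-in for the heterogeneous network.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A2 = A @ A   # entry (i, j) counts length-2 meta-paths from i to j
A3 = A2 @ A  # entry (i, j) counts length-3 meta-paths from i to j

print(A2[0, 2], A3[0, 3])  # 1.0 1.0: node 0 reaches node 2 in two hops, node 3 in three
```

In this picture, restricting MGCNSS to lengths {1,2} amounts to aggregating over A and A², while longer powers mix in increasingly diffuse (and potentially noisy) neighborhoods.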

Parameter sensitivity analysis

In this section, we first conduct parameter sensitivity analysis experiments on the learning rate, the embedding size and the number of graph convolution layers. Then, we introduce the procedure for selecting the hyperparameters in Equations (12) and (13).

The first parameter is the learning rate, a hyperparameter that controls how much the model changes in response to the estimated error [60]; choosing a proper learning rate is crucial for MGCNSS. We evaluate learning rates of 0.0001, 0.0005, 0.001, 0.01 and 0.1, with the corresponding results shown in Figure 8(A). The performance of MGCNSS improves as the learning rate increases from 0.0001 to 0.0005 and degrades as it increases further from 0.0005 to 0.1, so MGCNSS achieves its best performance on all metrics at 0.0005, which we adopt as the learning rate.

Figure 8. The parameter sensitivity analysis with different learning rates, embedding sizes and numbers of convolution layers.

The second parameter is the embedding size of miRNAs and diseases, which is essential for the performance of MGCNSS. We vary the embedding size over 64, 128, 256, 512 and 1024, with the corresponding results shown in Figure 8(B). MGCNSS performs best when the embedding size is 256, which we therefore adopt.

The third parameter is the number of graph convolution layers. We vary it over 1, 2, 3 and 4 and record the AUC, ACC and AUPR values. The results in Figure 8(C) show that the proposed model achieves the highest scores with two graph convolution layers, and its performance begins to decline with more than two layers. A likely explanation is that length-1 and length-2 meta-paths already capture the semantics between nodes, while longer paths do not help the embedding learning of miRNAs and diseases. Therefore, the number of graph convolution layers is set to 2.

Besides, for the hyperparameters in Equations (12) and (13), MGCNSS uses the BCE loss to select proper values. As shown in Figure 9, the BCE loss tends to converge as the number of epochs increases. Specifically, when the epoch number exceeds 1500, the BCE loss has almost converged and the corresponding AUC, ACC and AUPR values change only within a small range.

Figure 9. The change of AUC, ACC and AUPR values accompanied by the BCE loss under different epochs.

Meanwhile, Figure 10 presents the values of the hyperparameters in Equations (12) and (13) as the number of epochs increases. MGCNSS takes the values of α1, α2, α3 and β1, β2, β3 at epoch 2000, which are 0.77, 0.19, 0.04 and 0.16, 0.82, 0.02, respectively. This indicates that the miRNA functional similarity network (α1) and the Gaussian kernel-based disease similarity network (β2) have relatively higher weights.

Figure 10. The change of different hyperparameters under different epochs.

The performance of MGCNSS under different negative sample selection strategies

The negative sample selection strategy is crucial to the performance of MGCNSS, and several alternative strategies exist [5, 60]. Here, we choose the k-means clustering strategy from [5] and compare it with our distance-based (DB) selection strategy. The corresponding results are displayed in Figure 11.

Figure 11. The results of MGCNSS under different negative sample selection strategies.

We evaluate MGCNSS with the different negative sample selection strategies using the AUC, ACC and AUPR metrics. MGCNSS with our DB strategy achieves the higher scores of 0.9874, 0.9453 and 0.9882 on AUC, ACC and AUPR, while MGCNSS with the k-means strategy obtains 0.9366, 0.8729 and 0.9301, respectively. These results further illustrate the effectiveness of the proposed negative sample selection strategy.
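The snippet below sketches one plausible form of a distance-based screen: unlabeled pairs that lie far from the centroid of the known positives in embedding space are kept as likely negatives. The centroid rule and the median threshold are our simplifications for illustration, not necessarily the exact criterion used by MGCNSS.

```python
import numpy as np

rng = np.random.default_rng(1)
pos_emb = rng.normal(0.0, 0.3, size=(50, 8))           # embeddings of known positive pairs
unlabeled = np.vstack([rng.normal(0.0, 0.3, (30, 8)),   # cluster near the positives
                       rng.normal(3.0, 0.3, (30, 8))])  # cluster far from the positives

centroid = pos_emb.mean(axis=0)
dist = np.linalg.norm(unlabeled - centroid, axis=1)
threshold = np.median(dist)          # simplistic cut-off for illustration
likely_negative = dist > threshold   # distant pairs are treated as likely negatives

print(likely_negative.sum())  # 30: only the far cluster survives the screen
```

Pairs failing the screen (the near cluster) would be treated as likely positives and excluded from the candidate negative set, mirroring the behavior discussed for local area A in Figure 13.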

The results of MGCNSS and other approaches under statistical significance test

The statistical significance test is another common way to verify the stability of each prediction model's results. Here, we employ the paired t-test [61] to perform the significance analysis: the results of MGCNSS and each baseline are paired on the AUC, ACC and AUPR metrics, with the significance level set to 0.05. The null hypothesis (H0) is that MGCNSS does not perform significantly better than the baseline model on the given metric, while the alternative hypothesis (H1) is that it does. If the P-value is less than 0.05, we reject H0 and accept H1. The results displayed in Table 4 indicate that the performance of MGCNSS is superior to that of all the baseline approaches.
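The paired t-test itself reduces to a few lines. The sketch below computes the t statistic by hand on hypothetical per-fold AUC values (the numbers are illustrative, not taken from Table 4) and compares it against the two-tailed critical value for four degrees of freedom.

```python
import math

# Hypothetical per-fold AUC values for MGCNSS and one baseline (illustrative only).
mgcnss_auc = [0.987, 0.985, 0.988, 0.986, 0.989]
baseline_auc = [0.935, 0.931, 0.938, 0.934, 0.936]

diffs = [a - b for a, b in zip(mgcnss_auc, baseline_auc)]
n = len(diffs)
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of the differences
t_stat = mean / math.sqrt(var / n)                   # paired t statistic

T_CRIT = 2.776  # two-tailed 0.05 critical value of Student's t with df = n - 1 = 4
print(t_stat > T_CRIT)  # True: reject H0, the improvement is significant
```

In practice the same computation (including the exact P-value) is available as `scipy.stats.ttest_rel`.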

Table 4

The statistical significance analysis for MGCNSS and baseline approaches

Model comparison | AUC | ACC | AUPR
MGCNSS vs SVM [18] | 3.3E-10 | 8.2E-09 | 2.8E-10
MGCNSS vs RF [53] | 1.7E-09 | 2.4E-08 | 7.2E-10
MGCNSS vs XGBoost [54] | 4.9E-09 | 5.7E-08 | 2.1E-09
MGCNSS vs GCN [55] | 5.7E-11 | 4.5E-09 | 8.3E-11
MGCNSS vs GAT [56] | 2.1E-10 | 6.7E-09 | 9.0E-11
MGCNSS vs DTIGAT [57] | 1.0E-09 | 6.0E-08 | 4.8E-10
MGCNSS vs DTICNN [58] | 5.7E-10 | 1.3E-08 | 2.4E-10
MGCNSS vs NeoDTI [59] | 5.3E-11 | 5.0E-09 | 5.2E-11
MGCNSS vs MSGCL [3] | 6.2E-11 | 3.6E-09 | 5.7E-10
MGCNSS vs AMHMDA [35] | 1.3E-09 | 7.8E-09 | 6.0E-10
MGCNSS vs GATMDA [13] | 1.3E-09 | 3.1E-08 | 1.1E-09

Visualization for the embeddings of miRNA–disease pairs learned by MGCNSS

To better demonstrate the effectiveness of MGCNSS, we visualize the embedding learning process of miRNA–disease pairs. Similar to the SCSMDA method [60], positive pairs (red points) and negative pairs (blue points) are pre-selected and their embeddings are visualized with the t-SNE tool [62] at different epochs. The results are shown in Figure 12.

Figure 12. Visualization for the embeddings of miRNA–disease pairs learned by MGCNSS under different epochs.

At epoch 0, the boundary between positive and negative pairs is chaotic. As the number of epochs increases, the embeddings of positive and negative pairs gradually become separable, and by epoch 2000 they are almost separated by a distinct boundary. It is worth noting that even at epoch 2000 some positive and negative pairs still overlap, indicating that the prediction of miRNA–disease associations remains challenging. Overall, the results confirm the strong ability of MGCNSS to learn discriminative embeddings of miRNAs and diseases.
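This visualization step can be reproduced with scikit-learn's t-SNE; the embeddings below are synthetic stand-ins for the pair embeddings learned by MGCNSS.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
pos = rng.normal(2.0, 1.0, size=(100, 64))   # stand-ins for positive-pair embeddings
neg = rng.normal(-2.0, 1.0, size=(100, 64))  # stand-ins for negative-pair embeddings
emb = np.vstack([pos, neg])

# Project to 2D; in the figures each point would then be colored by its label.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)

print(proj.shape)  # (200, 2)
```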

Besides, to demonstrate the positions of positive and negative samples after executing our negative sample selection strategy, we visualize their 2D projections in Figure 13. Without the negative sample selection strategy, MGCNSS divides the training samples into two categories, shown in Figure 13(A) as positive samples (red points) and negative samples (blue points). After executing the proposed distance-based negative sample selection strategy, MGCNSS divides the training samples into three categories (see Figure 13B): positive samples (red points), likely negative samples (green points) and likely positive samples (blue points).

Figure 13. Comparison results of MGCNSS without and with negative sample screening strategy.

Specifically, the local area A in Figure 13(A) contains both positive and negative pairs. Without a negative sample selection strategy, all the negative samples in area A could become candidate negative samples. However, since area A gathers many ground-truth positive samples, negative samples close to these positives may not be ground-truth negatives; they may in fact be likely positive samples and should not be treated as candidate negative samples for training and testing. Correspondingly, in Figure 13(B), the negative sample selection strategy leads MGCNSS to mark the negative samples in area A as likely positive samples (blue points) rather than as negatives for training and testing. In this way, MGCNSS selects negative samples from the likely negative samples instead of the likely positive samples, establishing a high-quality negative sample set that improves its performance. The extensive results in the previous sections further verify the effectiveness of the negative sample selection strategy.

Case study

We conduct a case study to further evaluate the performance of MGCNSS, following a procedure similar to our previous research [29]. First, we train MGCNSS with all the miRNA–disease associations in the HMDD v2.0 dataset. Then, we predict the potential associated miRNAs for the selected diseases and screen out the top 50 predicted miRNAs according to their scores. To validate these predicted associations, we search for evidence in the HMDD v4.0 [63] and dbDEMC [64] databases. HMDD v4.0 currently contains 53 553 miRNA–disease association entries covering 1817 human miRNA genes, 79 virus-derived miRNAs and 2379 diseases, curated from 37 022 papers. dbDEMC is designed to store and display differentially expressed miRNAs in cancers detected by high- and low-throughput strategies; its current version contains 2584 miRNAs and 40 human cancer types.
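The top-50 screening step is a simple ranking over predicted scores. The snippet below illustrates it with random scores standing in for the model's outputs; the candidate count of 495 is an arbitrary placeholder, not a figure from the dataset.

```python
import numpy as np

rng = np.random.default_rng(7)
scores = rng.random(495)               # one predicted score per candidate miRNA
top50 = np.argsort(scores)[::-1][:50]  # indices of the 50 highest-scoring candidates

print(len(top50))  # 50
```

Each index in `top50` would then be mapped back to a miRNA name and checked against HMDD v4.0 and dbDEMC for supporting evidence.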

Specifically, colon neoplasms and esophageal neoplasms are selected for validation. Colon neoplasms are common malignant tumors of the colon in the digestive tract [65]; they rank third in incidence among gastrointestinal tumors, with more than one million cases and 500 000 deaths annually, and the incidence is increasing year by year. We first investigate the miRNAs associated with colon neoplasms; the results in Table 5 show that all of the top 50 predicted miRNAs have been confirmed by HMDD or dbDEMC. As another high-incidence disease, esophageal neoplasm is a malignant tumor of the esophageal tissue, ranking 8th in morbidity and 6th in mortality among all cancer types [66]; early diagnosis improves the survival rate of patients. The corresponding results (Table 6) show that 49 of the top 50 predicted miRNAs can be verified by HMDD or dbDEMC. The case study results fully illustrate the ability of MGCNSS to detect novel associations between miRNAs and diseases.

Table 5

Top 50 colon neoplasms-related miRNAs predicted by MGCNSS

Rank | miRNA name | Evidence | Rank | miRNA name | Evidence
1 | hsa-mir-21 | confirmed | 26 | hsa-mir-199a | confirmed
2 | hsa-mir-155 | confirmed | 27 | hsa-mir-19b | confirmed
3 | hsa-mir-146a | confirmed | 28 | hsa-mir-9 | confirmed
4 | hsa-mir-221 | confirmed | 29 | hsa-mir-34c | confirmed
5 | hsa-mir-29a | confirmed | 30 | hsa-mir-141 | confirmed
6 | hsa-mir-143 | confirmed | 31 | hsa-let-7i | confirmed
7 | hsa-mir-125b | confirmed | 32 | hsa-mir-200a | confirmed
8 | hsa-mir-20a | confirmed | 33 | hsa-mir-15a | confirmed
9 | hsa-mir-34a | confirmed | 34 | hsa-mir-101 | confirmed
10 | hsa-let-7a | confirmed | 35 | hsa-mir-28 | confirmed
11 | hsa-mir-133a | confirmed | 36 | hsa-mir-34b | confirmed
12 | hsa-mir-15b | confirmed | 37 | hsa-mir-142 | confirmed
13 | hsa-mir-29b | confirmed | 38 | hsa-mir-196a | confirmed
14 | hsa-mir-1 | confirmed | 39 | hsa-mir-192 | confirmed
15 | hsa-mir-132 | confirmed | 40 | hsa-mir-133b | confirmed
16 | hsa-mir-30b | confirmed | 41 | hsa-mir-206 | confirmed
17 | hsa-let-23b | confirmed | 42 | hsa-mir-31 | confirmed
18 | hsa-let-7c | confirmed | 43 | hsa-let-7d | confirmed
19 | hsa-mir-223 | confirmed | 44 | hsa-mir-20b | confirmed
20 | hsa-mir-375 | confirmed | 45 | hsa-mir-137 | confirmed
21 | hsa-let-7f | confirmed | 46 | hsa-mir-29c | confirmed
22 | hsa-mir-19a | confirmed | 47 | hsa-mir-181b | confirmed
23 | hsa-mir-16 | confirmed | 48 | hsa-mir-7 | confirmed
24 | hsa-mir-107 | confirmed | 49 | hsa-mir-148a | confirmed
25 | hsa-mir-222 | confirmed | 50 | hsa-mir-30a | confirmed
Table 6

Top 50 esophageal neoplasms-related miRNAs predicted by MGCNSS

Rank | miRNA name | Evidence | Rank | miRNA name | Evidence
1 | hsa-mir-125b | confirmed | 26 | hsa-mir-10b | confirmed
2 | hsa-mir-16 | confirmed | 27 | hsa-let-7f | confirmed
3 | hsa-mir-17 | confirmed | 28 | hsa-mir-193b | confirmed
4 | hsa-mir-9 | confirmed | 29 | hsa-mir-132 | confirmed
5 | hsa-mir-221 | confirmed | 30 | hsa-let-7d | confirmed
6 | hsa-mir-195 | confirmed | 31 | hsa-mir-124 | confirmed
7 | hsa-mir-222 | confirmed | 32 | hsa-mir-30a | confirmed
8 | hsa-mir-15b | confirmed | 33 | hsa-mir-302b | confirmed
9 | hsa-mir-18a | confirmed | 34 | hsa-mir-103a | unconfirmed
10 | hsa-mir-20b | confirmed | 35 | hsa-mir-194 | confirmed
11 | hsa-mir-7 | confirmed | 36 | hsa-mir-137 | confirmed
12 | hsa-mir-29a | confirmed | 37 | hsa-mir-206 | confirmed
13 | hsa-mir-23b | confirmed | 38 | hsa-mir-181a | confirmed
14 | hsa-mir-200b | confirmed | 39 | hsa-mir-95 | confirmed
15 | hsa-mir-181b | confirmed | 40 | hsa-mir-24 | confirmed
16 | hsa-mir-142 | confirmed | 41 | hsa-mir-218 | confirmed
17 | hsa-let-7i | confirmed | 42 | hsa-mir-30c | confirmed
18 | hsa-mir-107 | confirmed | 43 | hsa-mir-32 | confirmed
19 | hsa-mir-106a | confirmed | 44 | hsa-mir-429 | confirmed
20 | hsa-mir-1 | confirmed | 45 | hsa-mir-29b | confirmed
21 | hsa-mir-302c | confirmed | 46 | hsa-mir-302d | confirmed
22 | hsa-mir-19b | confirmed | 47 | hsa-mir-372 | confirmed
23 | hsa-mir-133b | confirmed | 48 | hsa-mir-144 | confirmed
24 | hsa-mir-106b | confirmed | 49 | hsa-mir-128 | confirmed
25 | hsa-mir-146b | confirmed | 50 | hsa-mir-191 | confirmed

Besides, to verify the ability of MGCNSS to find novel miRNA–disease associations, we select colon neoplasms and predict its top-10 associated miRNAs. Meanwhile, we also select part of the baseline approaches in Table 1, predict their top-10 associated miRNAs for colon neoplasms, and rank the predictions according to their scores. The corresponding results are displayed in Table 7. The results demonstrate that all of the top-10 miRNAs predicted by MGCNSS can be confirmed by the HMDDv4.0 [63] and dbDEMC [64] databases. In particular, hsa-let-7a is predicted by MGCNSS (Rank 10), while the remaining approaches fail to infer an association between this miRNA and colon neoplasms among their top-10 predictions. The association of hsa-let-7a has been confirmed by PMID-31434447 in the HMDD database and SourceID-GSE2564 in dbDEMC. These results illustrate the stable prediction ability of the proposed model.

Table 7

Top 10 colon neoplasms-related miRNAs predicted by MGCNSS and other baseline approaches

Methods | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 | Rank 9 | Rank 10
MGCNSS | hsa-mir-21 | hsa-mir-155 | hsa-mir-146a | hsa-mir-221 | hsa-mir-29a | hsa-mir-143 | hsa-mir-125b | hsa-mir-20a | hsa-mir-34a | hsa-let-7a
SVM [18] | hsa-mir-21 | hsa-mir-181a | hsa-mir-183 | hsa-mir-210 | hsa-mir-208a | hsa-mir-146a | hsa-mir-34a | hsa-mir-31 | hsa-mir-137 | hsa-mir-30
RF [53] | hsa-mir-223 | hsa-mir-126 | hsa-mir-21 | hsa-mir-146a | hsa-mir-145 | hsa-mir-107 | hsa-mir-210 | hsa-mir-103 | hsa-mir-486 | hsa-mir-99a
XGBoost [54] | hsa-mir-21 | hsa-mir-155 | hsa-mir-34c | hsa-mir-181a | hsa-mir-221 | hsa-mir-16 | hsa-mir-200c | hsa-mir-133b | hsa-mir-182 | hsa-mir-449b
GCN [55] | hsa-mir-132 | hsa-mir-34c | hsa-mir-20a | hsa-mir-223 | hsa-mir-19a | hsa-mir-192 | hsa-mir-199a | hsa-mir-214 | hsa-let-7f | hsa-mir-155
GAT [56] | hsa-mir-143 | hsa-mir-29a | hsa-mir-16 | hsa-mir-133b | hsa-mir-19b | hsa-mir-15b | hsa-mir-107 | hsa-mir-141 | hsa-mir-30a | hsa-mir-31
DTIGAT [57] | hsa-mir-133a | hsa-mir-20a | hsa-mir-29b | hsa-mir-125b | hsa-mir-155 | hsa-mir-181a | hsa-mir-106b | hsa-mir-15b | hsa-mir-127 | hsa-mir-137
DTICNN [58] | hsa-mir-15b | hsa-mir-125b | hsa-mir-19b | hsa-mir-34a | hsa-mir-155 | hsa-mir-107 | hsa-mir-7 | hsa-mir-21 | hsa-mir-200a | hsa-mir-34c
NeoDTI [59] | hsa-mir-155 | hsa-mir-221 | hsa-mir-15b | hsa-mir-30b | hsa-mir-223 | hsa-mir-28 | hsa-let-7i | hsa-mir-1 | hsa-mir-21 | hsa-mir-143
AMHMDA [35] | hsa-mir-17 | hsa-mir-221 | hsa-mir-20a | hsa-mir-34a | hsa-mir-155 | hsa-mir-19a | hsa-mir-148a | hsa-mir-222 | hsa-mir-21 | hsa-mir-101

Moreover, to comprehensively investigate and compare the ability of MGCNSS to find novel associations, we conduct a further experiment. Specifically, we choose colon neoplasms as the target disease and collect the associated miRNAs predicted by each comparison model together with their predicted scores. The miRNAs predicted by each model form its predicted miRNA set. Without loss of generality, we denote the predicted miRNA set of MGCNSS as A and that of each comparison approach as B. Then, we evaluate MGCNSS against each baseline one by one under three metrics: |A∩B|/|A∪B|, |A|/|A∪B| and |B|/|A∪B|. The results are displayed in Table 8. Notably, the values in column |A|/|A∪B| are always larger than those in column |B|/|A∪B|. Taking MGCNSS and GATMDA as an example, the value of |A|/|A∪B| is 0.9146, while the value of |B|/|A∪B| is 0.6104. From these results, we can conclude that MGCNSS outperforms the other baselines in finding novel miRNA–disease associations.
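
The three overlap metrics above are standard set ratios: |A∩B|/|A∪B| is the Jaccard index of the two predicted sets, while |A|/|A∪B| and |B|/|A∪B| measure how much of the union each model's set covers. A minimal sketch with small hypothetical miRNA sets (illustrative values only):

```python
def overlap_metrics(a, b):
    """Return (|A∩B|/|A∪B|, |A|/|A∪B|, |B|/|A∪B|) for two prediction sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union, len(a) / union, len(b) / union

# Hypothetical predicted miRNA sets for one disease.
A = {"hsa-mir-21", "hsa-mir-155", "hsa-mir-143", "hsa-mir-221"}
B = {"hsa-mir-21", "hsa-mir-181a", "hsa-mir-143"}
print(overlap_metrics(A, B))   # (0.4, 0.8, 0.6)
```

A larger |A|/|A∪B| than |B|/|A∪B| means set A contributes more of the combined predictions, which is how the comparison in Table 8 is read.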

Table 8

The comparison results of MGCNSS and other baselines in finding novel associations

Combinations | |A∩B|/|A∪B| | |A|/|A∪B| | |B|/|A∪B|
MGCNSS vs SVM [18] | 0.5864 | 0.8951 | 0.6914
MGCNSS vs RF [53] | 0.6387 | 0.9355 | 0.7032
MGCNSS vs XGBoost [54] | 0.7338 | 0.9416 | 0.7922
MGCNSS vs GCN [55] | 0.3636 | 0.8788 | 0.4848
MGCNSS vs GAT [56] | 0.3000 | 0.8529 | 0.4471
MGCNSS vs DTIGAT [57] | 0.3016 | 0.7672 | 0.5344
MGCNSS vs DTICNN [58] | 0.4304 | 0.9177 | 0.5127
MGCNSS vs NeoDTI [59] | 0.3436 | 0.8896 | 0.4540
MGCNSS vs MSGCL [3] | 0.2970 | 0.8788 | 0.4182
MGCNSS vs AMHMDA [35] | 0.3980 | 0.8011 | 0.5967
MGCNSS vs GATMDA [13] | 0.5519 | 0.9146 | 0.6104

DISCUSSIONS

MGCNSS achieves the best performance among all 11 miRNA–disease association prediction approaches, and the extensive results fully demonstrate its effectiveness and stability under different conditions. In this section, we analyze its advantages and drawbacks as follows.

The meta-path based multi-layer graph convolution

In the miRNA–disease association network, there are two types of nodes and two types of edges. Meta-paths with different lengths carry rich semantic meaning between nodes in this heterogeneous network. For example, the meta-path miRNA1-miRNA2-disease1 denotes that if miRNA1 and miRNA2 have a high similarity value and miRNA2 has an association with disease1, then miRNA1 is also likely to be associated with disease1. This assumption has been widely accepted and utilized [8].
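
The semantics of such a length-2 meta-path can be expressed as a matrix product: multiplying the miRNA similarity matrix by the known association matrix scores each miRNA–disease pair by how strongly similar miRNAs are linked to the disease. A minimal sketch with hypothetical toy matrices (illustrative values, not real data):

```python
import numpy as np

# Toy data: 3 miRNAs, 2 diseases (hypothetical values for illustration).
S_m = np.array([[1.0, 0.9, 0.1],   # miRNA-miRNA similarity
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
A = np.array([[0, 1],              # known miRNA-disease associations
              [1, 0],
              [0, 0]])

# Length-2 meta-path miRNA -> miRNA -> disease: similar miRNAs "vote"
# for the diseases their neighbours are associated with.
meta2 = S_m @ A
print(meta2)
```

Here miRNA1 receives a score of 0.9 for disease1 purely through its highly similar neighbour miRNA2, which is exactly the example described above.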

MGCNSS adopts multi-layer graph convolution to capture meta-paths with different lengths, enabling it to comprehensively learn the embeddings of miRNAs and diseases and thereby improve the prediction performance. The results of the ablation experiments demonstrate that the meta-path-based multi-layer graph convolution is essential for MGCNSS (see Figures 5 and 6). Besides, the performance of MGCNSS with meta-paths of different lengths is also investigated (see Table 3). The results illustrate that meta-paths of lengths 1 and 2 yield the best performance, whereas meta-paths of length 3 may introduce noise, which lowers the indicators of MGCNSS in the experiment. In future work, we would like to analyze the effect of network quality on meta-path-based embedding learning.
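
A k-layer graph convolution aggregates information along all paths of length up to k, which is how stacked layers cover meta-paths of different lengths. The following is a minimal, hypothetical sketch of this propagation (simple symmetric normalization and layer averaging, not the authors' exact architecture):

```python
import numpy as np

def gcn_layers(adj, features, num_layers=2):
    """Simple propagation: each layer mixes features over one more hop,
    so the layer-k output depends on meta-paths of length up to k."""
    # Symmetric normalisation: D^{-1/2} (A + I) D^{-1/2}
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    layer_outputs = [features]
    h = features
    for _ in range(num_layers):
        h = norm @ h                  # one hop of neighbourhood aggregation
        layer_outputs.append(h)
    # Combine embeddings from all layers (here: a simple average)
    return np.mean(layer_outputs, axis=0)

# Tiny hypothetical heterogeneous adjacency (3 nodes) with identity features.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
emb = gcn_layers(adj, np.eye(3))
print(emb.shape)                      # (3, 3)
```

Averaging the per-layer outputs is one common way to let embeddings reflect meta-paths of several lengths at once; the observation in Table 3 corresponds to truncating `num_layers` at 2.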

Negative sample selection strategy

According to the ablation study, the negative sample selection strategy has a great impact on the performance of MGCNSS: the proposed strategy is of great help in improving the prediction performance. To visually demonstrate its role, we depict two sub-figures in Figure 12, which indicate that as the number of epochs increases, the boundary between the positive and negative pairs gradually becomes clear. Besides, Figure 13 shows that MGCNSS avoids selecting negative samples that fall within the area where positive samples gather (see Figure 13B). Our intuition is that samples in the positive sample cluster are likely positive samples rather than likely negative samples. In this way, MGCNSS can select higher-quality negative samples and achieve better prediction results.
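
The distance-based idea, treating unlabeled pairs that lie far from the positive region as reliable negatives, can be sketched as follows. This is a simplified illustration with hypothetical random embeddings, not the paper's exact strategy:

```python
import numpy as np

def select_reliable_negatives(unlabeled_emb, positive_emb, num_select):
    """Rank unlabeled samples by distance to the positive centroid and
    keep the farthest ones, avoiding the area where positives gather."""
    centroid = positive_emb.mean(axis=0)
    dists = np.linalg.norm(unlabeled_emb - centroid, axis=1)
    farthest = np.argsort(dists)[::-1]          # most distant first
    return farthest[:num_select]

# Hypothetical embeddings: positives clustered tightly, unlabeled spread out.
rng = np.random.default_rng(1)
positives = rng.normal(loc=0.0, scale=0.3, size=(20, 8))
unlabeled = rng.normal(loc=0.0, scale=1.0, size=(100, 8))
neg_idx = select_reliable_negatives(unlabeled, positives, num_select=20)
print(len(neg_idx))                             # 20
```

Unlabeled samples sitting inside the positive cluster are never selected, mirroring the behavior shown in Figure 13B.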

The result of case study

It is crucial for every prediction model to be able to discover novel association relationships between miRNAs and diseases. To verify this ability of MGCNSS, we conduct the case study. In the first group of experiments, MGCNSS is employed to infer associated miRNAs for colon neoplasms and esophageal neoplasms, respectively. The results demonstrate that the proposed approach performs reliably: all of the top-50 predicted miRNAs for colon neoplasms and 49 of the top 50 for esophageal neoplasms can be verified by HMDD or dbDEMC.

Besides, we also present the top-10 miRNAs predicted for colon neoplasms by MGCNSS as well as by the other baseline approaches. The results demonstrate that all of the top-10 miRNAs predicted by MGCNSS are identified in HMDD or dbDEMC. More importantly, MGCNSS infers that hsa-let-7a is associated with colon neoplasms, whereas none of the baseline approaches finds this association. Therefore, MGCNSS has a more powerful ability to find novel associations.

CONCLUSION

In this study, we proposed MGCNSS for miRNA–disease association prediction based on multi-layer graph convolution and a negative sample selection strategy. Specifically, MGCNSS employs multi-layer graph convolution to automatically capture the meta-path relations with different lengths in the heterogeneous network and learn discriminative representations of miRNAs and diseases. Besides, MGCNSS establishes a high-quality negative sample set by choosing likely negative samples from the unlabeled sample set with the distance-based sample selection strategy. The extensive results fully demonstrate that MGCNSS outperforms all baseline methods on the experimental dataset under different scenarios, and the case study further demonstrates its effectiveness in miRNA–disease association prediction.

We will pursue future work in the following three directions. Firstly, other biological entity association information, such as miRNA–lncRNA associations, could be employed to measure the similarities of miRNAs and diseases from more comprehensive perspectives. In this way, a higher-quality miRNA–disease heterogeneous network could be established, enabling the learning of more discriminative embeddings of miRNAs and diseases. Secondly, we could construct a miRNA–disease-related biological knowledge graph and predict the underlying miRNA–disease associations by employing knowledge graph embedding techniques. Thirdly, since association prediction between different entities is one of the fundamental tasks in bioinformatics, we plan to apply our proposed model to other link prediction problems, such as disease–gene association and microbe–drug association prediction.

Key Points
  • MGCNSS could automatically capture the rich semantics of meta-paths with different lengths between miRNAs and diseases for learning their embeddings.

  • A negative sample selection strategy is proposed to screen out high-quality negative samples to enhance the performance of the prediction model.

  • The results demonstrate that MGCNSS outperforms all baseline methods on the evaluation metrics.

ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their valuable suggestions.

FUNDING

This work is supported by funds from the National Key R&D Program of China (2023YFC2206400), the National Natural Science Foundation of China (Nos 62371423, 61801432 and 62271132), the Key Scientific Research Projects of Colleges and Universities in Henan Province (No. 22A520010) and the Key Scientific and Technological Project of Henan Province (No. 232102211027).

AUTHOR CONTRIBUTIONS STATEMENT

Zhen Tian and Chenguang Han developed the codes, conceived the experiment and drafted the whole manuscript together. Zhen Tian, Lewen Xu and Wei Song set up the general idea of this study. Wei Song and Zhixia Teng revised this manuscript. All authors have read and approved the manuscript.

Author Biographies

Zhen Tian, PhD (Harbin Institute of Technology), is an associate professor at the School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China. His current research interests include computational biology, complex network analysis and data mining.

Chenguang Han is currently studying for a Master’s Degree of Computer Science and Technology at Zhengzhou University, Zhengzhou, China. His research interests include text mining, bioinformatics and deep learning.

Lewen Xu is studying for her Master’s degree of Software Engineering at Zhengzhou University, Zhengzhou, China. Her current research interests include computational biology, data mining and deep learning.

Zhixia Teng received a BS degree in information management and information systems from Northeast Forestry University in 2005, the MS and PhD degrees in computer science and technology from Northeast Forestry University in 2008, and from Harbin Institute of Technology in 2016, respectively. She is currently working at the College of Information and Computer Engineering, Northeast Forestry University. Her main research interests include machine learning and bioinformatics.

Wei Song received his PhD degree from Zhengzhou University, Zhengzhou China. He is an associate professor at the School of Computer and Artificial Intelligence, Zhengzhou University. His research interests include data science, data mining and machine learning.

References

1.

Huang
H-Y
,
Lin
Y-C-D
,
Cui
S
, et al.
Mirtarbase update 2022: an informative resource for experimentally validated miRNA–target interactions
.
Nucleic Acids Res
2022
;
50
(
D1
):
D222
30
.

2.

Liang
Y
,
Zheng
Y
,
Bingyi
J
, et al.
Research progress of miRNA–disease association prediction and comparison of related algorithms
.
Brief Bioinform
2022
;
23
(
3
):
bbac066
.

3.

Ruan
X
,
Jiang
C
,
Lin
P
, et al.
MSGCL: inferring miRNA–disease associations based on multi-view self-supervised graph structure contrastive learning
.
Brief Bioinform
2023
;
24
(
2
):
bbac623
.

4.

Rotelli
MT
,
Di Lena
M
,
Cavallini
A
, et al.
Fecal microRNA profile in patients with colorectal carcinoma before and after curative surgery
.
Int J Colorectal Dis
2015
;
30
:
891
8
.

5.

Zhao
Y
,
Chen
X
,
Yin
J
.
Adaptive boosting-based computational model for predicting potential miRNA-disease associations
.
Bioinformatics
2019
;
35
(
22
):
4730
8
.

6.

Ma
M
,
Na
S
,
Zhang
X
, et al.
SFGAE: a self-feature-based graph auto encoder model for miRNA–disease associations prediction
.
Brief Bioinform
2022
;
23
(
5
):
bbac340
.

7.

Juan
X
,
Li
C-X
,
Li
Y-S
, et al.
miRNA–miRNA synergistic network: construction via co-regulating functional modules and disease miRNA topological features
.
Nucleic Acids Res
2011
;
39
(
3
):
825
36
.

8.

Zeng
X
,
Zhang
X
,
Zou
Q
.
Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks
.
Brief Bioinform
2016
;
17
(
2
):
193
203
.

9.

Zou
Q
,
Li
J
,
Song
L
, et al.
Similarity computation strategies in the microRNA-disease network: a survey
.
Brief Funct Genomics
2016
;
15
(
1
):
55
64
.

10.

Chu
Y
,
Wang
X
,
Dai
Q
, et al.
MDA-GCNFTG: identifying miRNA-disease associations based on graph convolutional networks via graph sampling through the feature and topology graph
.
Brief Bioinform
2021
;
22
(
6
):
bbab165
.

11.

Yan
C
,
Wang
J
,
Ni
P
, et al.
DNRLMF-MDA: predicting microRNA-disease associations based on similarities of microRNAs and diseases
.
IEEE/ACM Trans Comput Biol Bioinform
2017
;
16
(
1
):
233
43
.

12.

Jiang
L
,
Ding
Y
,
Tang
J
,
Guo
F
.
MDA-SKF: similarity kernel fusion for accurately discovering miRNA-disease association
.
Front Genet
2018
;
9
:
618
.

13.

Li
G
,
Fang
T
,
Zhang
Y
, et al.
Predicting miRNA-disease associations based on graph attention network with multi-source information
.
BMC Bioinform
2022
;
23
(
1
):
244
.

14.

Guangchuang
Y
,
Xiao
C-L
,
Bo
X
, et al.
A new method for measuring functional similarity of microRNAs
.
J Integr Omics
2011
;
1
(
1
):
49
54
.

15.

Chen
X
,
Yan
G-Y
.
Semi-supervised learning for potential human microRNA-disease associations inference
.
Sci Rep
2014
;
4
(
1
):
5501
.

16.

Ding
P
,
Luo
J
,
Xiao
Q
,
Chen
X
.
A path-based measurement for human miRNA functional similarities using miRNA-disease associations
.
Sci Rep
2016
;
6
(
1
):
32533
.

17.

Mørk
S
,
Pletscher-Frankild
S
,
Caro
AP
, et al.
Protein-driven inference of miRNA–disease associations
.
Bioinformatics
2014
;
30
(
3
):
392
7
.

18.

Yang
Y
,
Li
J
,
Yang
Y
. The research of the fast SVM classifier method.
2015 12th international computer conference on wavelet active media technology and information processing (ICCWAMTIP)
. IEEE,
2015
, 121–4.

19.

Razzaghi
P
,
Abbasi
K
,
Ghasemi
JB
.
Multivariate pattern recognition by machine learning methods[M].
Machine Learning and Pattern Recognition Methods in Chemistry from Multivariate and Data Driven Modeling
.
Elsevier
,
2023
, 47–72.

20.

Wang
F
,
Huang
Z-A
,
Chen
X
, et al.
LRLSHMDA: Laplacian regularized least squares for human microbe–disease association prediction
.
Sci Rep
2017
;
7
(
1
):
7601
.

21.

Liang
C
,
Shengpeng
Y
,
Luo
J
.
Adaptive multi-view multi-label learning for identifying disease-associated candidate miRNAs
.
PLoS Comput Biol
2019
;
15
(
4
):
e1006931
.

22.

Chen
X
,
Zhu
C-C
,
Yin
J
.
Ensemble of decision tree reveals potential miRNA-disease associations
.
PLoS Comput Biol
2019
;
15
(
7
):
e1007209
.

23.

He
Q
,
Qiao
W
,
Fang
H
,
Bao
Y
.
Improving the identification of miRNA-disease associations with multi-task learning on gene-disease networks
.
Brief Bioinform
2023
;
24
:
06
.

24.

Chen
X
,
Huang
L
,
Xie
D
,
Zhao
Q
.
EGBMMDA: extreme gradient boosting machine for miRNA-disease association prediction
.
Cell Death Dis
2018
;
9
(
1
):
3
.

25.

Chen
X
,
Wang
C-C
,
Yin
J
,
You
Z-H
.
Novel human miRNA-disease association inference based on Random Forest
.
Mol Ther-Nucleic Acids
2018
;
13
:
568
79
.

26.

Luo
Y
,
Zhao
X
,
Zhou
J
, et al.
A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information
.
Nat Commun
2017
;
8
(
1
):
573
.

27.

Dehghan
A
,
Abbasi
K
,
Razzaghi
P
, et al.
CCL-DTI: contributing the contrastive loss in drug–target interaction prediction
.
BMC Bioinform
2024
;
25
(
1
):
48
.

28.

Palhamkhani
F
,
Alipour
M
,
Dehnad
A
, et al.
DeepCompoundNet: enhancing compound–protein interaction prediction with multimodal convolutional neural networks
.
J Biomol Struct Dyn
2023
,
1–10
.

29.

Lou
Z
,
Cheng
Z
,
Li
H
, et al.
Predicting miRNA–disease associations via learning multimodal networks and fusing mixed neighborhood information
.
Brief Bioinform
2022
;
23
(
5
):
bbac159
.

30.

Tian
Z
,
Fang
H
,
Teng
Z
,
Ye
Y
.
GOGCN: graph convolutional network on gene ontology for functional similarity analysis of genes
.
IEEE/ACM Trans Comput Biol Bioinform
2022
;
20
(
2
):
1053
64
.

31.

Wei
Y
,
Wang
X
,
Nie
L
, et al. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video.
Proceedings of the 27th ACM international conference on multimedia
.
2019
,
1437–45
.

32. Zhao T, Yang H, Valsdottir LR, et al. Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief Bioinform 2021;22(2):2141–50.

33. Wang W, Chen H. Predicting miRNA-disease associations based on lncRNA–miRNA interactions and graph convolution networks. Brief Bioinform 2023;24(1):bbac495.

34. Wang W, Chen H. Predicting miRNA-disease associations based on graph attention networks and dual Laplacian regularized least squares. Brief Bioinform 2022;23(5):bbac292.

35. Ning Q, Zhao Y, Gao J, et al. AMHMDA: attention aware multi-view similarity networks and hypergraph learning for miRNA–disease associations identification. Brief Bioinform 2023;24(2):bbad094.

36. Jiang L, Sun J, Wang Y, et al. Identifying drug–target interactions via heterogeneous graph attention networks combined with cross-modal similarities. Brief Bioinform 2022;23(2):bbac016.

37. Chen X, Yan CC, Zhang X, et al. HGIMDA: heterogeneous graph inference for miRNA-disease association prediction. Oncotarget 2016;7(40):65257.

38. Chen Y, Wang J, Wang C, Zou Q. AutoEdge-CCP: a novel approach for predicting cancer-associated circRNAs and drugs based on automated edge embedding. PLoS Comput Biol 2024;20(1):1–20.

39. Wang X, Liu N, Han H, et al. Self-supervised heterogeneous graph neural network with co-contrastive learning. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021, 1726–36.

40. Cao C, Wang C, Yang S, Zou Q. CircSI-SSL: circRNA-binding site identification based on self-supervised learning. Bioinformatics 2024;40(1):btae004.

41. Li Y, Qiu C, Tu J, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res 2014;42(D1):D1070–4.

42. Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 2010;26(13):1644–50.

43. Lipscomb CE. Medical subject headings (MeSH). Bull Med Libr Assoc 2000;88(3):265.

44. Xuan P, Han K, Guo M, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PloS One 2013;8(8):e70204.

45. Li Z, Li J, Nie R, You Z-H, Bao W. A graph auto-encoder model for miRNA-disease associations prediction. Brief Bioinform 2021;22(4):bbaa240.

46. Yang J-H, Li J-H, Shao P, et al. starBase: a database for exploring microRNA–mRNA interaction maps from Argonaute CLIP-Seq and Degradome-Seq data. Nucleic Acids Res 2011;39(suppl_1):D202–9.

47. Gao X, Xiao B, Tao D, Li X. A survey of graph edit distance. Pattern Anal Appl 2010;13:113–29.

48. Li Y, Sun H, Fang W, et al. SURE: screening unlabeled samples for reliable negative samples based on reinforcement learning. Inform Sci 2023;629:299–312.

49. Guo R, Chen H, Wang W, et al. Predicting potential miRNA-disease associations based on more reliable negative sample selection. BMC Bioinform 2022;23(1):1–12.

50. Li F, Dong S, Leier A, et al. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2022;23(1):bbab461.

51. Krishna K, Narasimha Murty M. Genetic K-means algorithm. IEEE Trans Syst Man Cybern B Cybern 1999;29(3):433–9.

52. Wang W, Zhang L, Sun J, et al. Predicting the potential human lncRNA–miRNA interactions based on graph convolution network with conditional random field. Brief Bioinform 2022;23(6):bbac463.

53. Wu Q-W, Xia J-F, Ni J-C, Zheng C-H. GAERF: predicting lncRNA-disease associations by graph auto-encoder and Random Forest. Brief Bioinform 2021;22(5):bbaa391.

54. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 785–94.

55. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. 2016.

56. Veličković P, Cucurull G, Casanova A, et al. Graph attention networks. arXiv preprint arXiv:1710.10903. 2017.

57. Wang H, Zhou G, Liu S, et al. Drug-target interaction prediction with graph attention networks. arXiv preprint arXiv:2107.06099. 2021.

58. Peng J, Li J, Shang X. A learning-based method for drug-target interaction prediction based on feature representation learning and deep neural network. BMC Bioinform 2020;21(13):1–13.

59. Wan F, Hong L, Xiao A, et al. NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug–target interactions. Bioinformatics 2019;35(1):104–11.

60. Tian Z, Yue Y, Fang H, et al. Predicting microbe–drug associations with structure-enhanced contrastive learning and self-paced negative sampling strategy. Brief Bioinform 2023;24(2):bbac634.

61. Mowery BD. The paired t-test. Pediatr Nurs 2011;37(6):320–22.

62. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9(11).

63. Jiang Q, Wang Y, Hao Y, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 2009;37(suppl_1):D98–D104.

64. Yang Z, Wu L, Wang A, et al. dbDEMC 2.0: updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res 2017;45(D1):D812–8.

65. Sternberg's Diagnostic Surgical Pathology. Lippincott Williams & Wilkins, 2004.

66. Zhang Y. Epidemiology of esophageal cancer. World J Gastroenterol: WJG 2013;19(34):5598–606.

Author notes

Zhen Tian and Chenguang Han contributed equally to the paper.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.