Abstract

In recent years, a growing number of biological experiments and scientific studies have demonstrated that microRNA (miRNA) plays an important role in the development of complex human diseases. Discovering miRNA–disease associations can therefore contribute to accurate diagnosis and effective treatment of diseases. Identifying miRNA–disease associations through computational methods based on biological data has proven to be low-cost and efficient. In this study, we proposed a computational model named Stacked Autoencoder for potential MiRNA–Disease Association prediction (SAEMDA). In SAEMDA, all the miRNA–disease samples were used to pretrain a Stacked Autoencoder (SAE) in an unsupervised manner. Then, the positive samples and the same number of selected negative samples were used to fine-tune the SAE in a supervised manner after adding an output layer with a softmax classifier to the SAE. SAEMDA can make full use of the feature information of all unlabeled miRNA–disease pairs and is therefore suitable for our dataset, which contains few labeled samples and many unlabeled samples. SAEMDA achieved AUCs of 0.9210 and 0.8343 in global and local leave-one-out cross validation, respectively. In addition, SAEMDA obtained an average AUC and standard deviation of 0.9102 ± 0.0029 over 100 repetitions of 5-fold cross validation. These results were better than those of most previous models. Moreover, we carried out three case studies to further demonstrate the predictive accuracy of SAEMDA. As a result, 82% (breast neoplasms), 100% (lung neoplasms) and 90% (esophageal neoplasms) of the top 50 predicted miRNAs were verified by databases. Thus, SAEMDA could be a useful and reliable model to predict potential miRNA–disease associations.

Introduction

MicroRNAs (miRNAs) are a class of endogenous small noncoding RNAs with the length of about 20–24 nucleotides and they play important regulatory roles in cells [1]. Studies have indicated the involvement of miRNAs in a number of important life processes including cell growth [2], cell differentiation [3], cell proliferation [4] and cell death [5]. In addition, miRNAs have become a research hotspot in the field of biomedicine due to their vital roles in the occurrence and development of human diseases [6, 7]. A series of studies have confirmed that miRNAs are associated with various diseases [8–12]. For example, He et al. [13] demonstrated that abnormal expression of the mir-17-92 cluster can induce B cell lymphoma. In addition, miR-200c can inhibit the clonal expansion of breast cancer cells in vitro and suppress the tumor formation driven by human breast cancer stem cells in vivo [14]. Furthermore, the expression levels of miR-17-3p and miR-92 were significantly elevated in plasma of patients with colorectal cancer and the levels were significantly decreased after surgical removal of the primary tumors, indicating that these miRNAs can be used as biomarkers for the diagnosis of colorectal cancer [15]. Moreover, Hu et al. [16] found that the expression levels of miR-30d, miR-499, miR-486 and miR-1 from the serum were significantly related with overall survival of patients with non–small-cell lung cancer. In addition to miRNAs mentioned above, many other miRNAs are associated with the diagnosis, treatment and prognosis of human complex diseases [17–19]. Therefore, it is very important to identify associations between miRNAs and human complex diseases. The traditional biological experiment for identifying miRNA–disease associations was adopted in the early days, but the experimental period is long and the cost is high. 
Nowadays, with the increasing amount of available biological data, computational methods for miRNA–disease association prediction have emerged as auxiliary tools for traditional experiments. Performing experimental verification only on the highly probable associations predicted by computational models can effectively reduce the time and cost of these experiments.

Over the past few years, based on the assumption that miRNAs with similar functions tend to be associated with similar diseases [20], researchers have developed a variety of miRNA–disease association prediction models, which can be divided into three categories [21]. The first type of prediction models is the score function–based models that used probability distributions or statistical analysis to establish score functions. For example, Chen et al. [22] developed a computational model called Within and Between Score for MiRNA–Disease Association prediction (WBSMDA). They defined two different types of functions to calculate Within-score and Between-score of miRNA–disease pair and integrated these two scores to obtain the final association score. Mørk et al. [23] further proposed a model called miRNA–Protein–Disease Association prediction (miRPD) to infer potential miRNA–disease associations. They defined miRNA–disease association scoring function based on miRNA–protein and protein–disease association scores. Here, protein was introduced as a mediator for the miRNA–disease inference.

The second type of prediction models is network algorithm–based models, which take advantage of miRNA and disease similarity from different perspectives. For example, Chen et al. [24] proposed a novel model named Random Walk with Restart for MiRNA–Disease Association (RWRMDA) by implementing random walk on the miRNA functional similarity network to prioritize candidate miRNAs for the disease of interest. In addition, Shi et al. [25] constructed a protein–protein interaction (PPI) network and further implemented random walk on the PPI network with the disease genes and miRNA targets as seed nodes to obtain two gene rank lists, respectively. Then, the miRNA–disease association can be identified through investigating the functional link between miRNA targets and disease genes based on the gene set enrichment analysis for the sets of miRNA targets and disease genes on above two gene lists, respectively. Later, a new computational model named MiRNAs associated with Diseases Prediction (MIDP) was developed by Xuan et al. [26]. MIDP adopts random walk algorithm in the miRNA similarity network to predict potential associated miRNAs for diseases, which have some known related miRNAs. For diseases without any known related miRNAs, they exploited miRNA similarity network, disease similarity network and known miRNA–disease associations to construct an miRNA–disease bilayer network. Then, they performed random walk on this bilayer network and thus the model could work for diseases without any known related miRNAs. Furthermore, Yu et al. [27] proposed a network information flow model to predict miRNA–disease associations. First, they established a MicroRNAome–phenome network by integrating miRNA–disease association network, disease semantic and phenotypic similarity network as well as miRNA functional similarity network. 
Then, for a given disease, the information flow leaving the candidate miRNA can be computed based on network information flow model and further used as the association score between the miRNA and the given disease. In addition, Chen et al. [28] presented a model called Heterogeneous Graph Inference for MiRNA-Disease Association prediction (HGIMDA). They defined association score of unlabeled miRNA–disease pair by summarizing all paths connecting investigated miRNA and disease with a length of three in miRNA–disease heterogeneous network. You et al. [29] developed the Path-Based MiRNA-Disease Association (PBMDA) predictive method. First, all paths connecting the investigated miRNA and disease with the length less than or equal to three were searched in miRNA–disease heterogeneous network. Then, the association score between investigated miRNA and disease can be computed based on the number of paths and the length of each path. Chen et al. [30] further proposed a model named Triple Layer Heterogeneous Network Based Inference for MiRNA-Disease Association prediction (TLHNMDA). This model built a triple layer heterogeneous network containing miRNA, disease and long noncoding RNA (lncRNA) nodes. Based on this triple layer network, an iterative equation was constructed to obtain the miRNA–disease correlation probability. Besides, Chen et al. [31] developed a model of Matrix Decomposition and Heterogeneous Graph Inference (MDHGI) for miRNA–disease association prediction. Firstly, the Sparse Learning Method (SLM) was used to reconstruct a new miRNA–disease association adjacency matrix. Then, a heterogeneous graph was built based on the reconstructed adjacency matrix, miRNA similarity matrix and disease similarity matrix. Lastly, an iterative equation was formulated to infer the correlation probability of miRNA–disease pairs.
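Several of the network-based models above (for example RWRMDA and MIDP) rely on random walk with restart over a similarity network. The following is a minimal sketch of that general technique, not the implementation of any cited model. The function name, the toy similarity weights and the restart probability of 0.5 are assumptions chosen for illustration only.

```python
def random_walk_with_restart(weights, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node.

    weights: square list of lists with non-negative similarity weights.
    seeds:   set of seed node indices (e.g. miRNAs already linked to the disease).
    restart: probability of jumping back to the seeds at each step (assumed value).
    """
    n = len(weights)
    # Column-normalise the weights so each non-empty column sums to 1.
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    trans = [[weights[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
             for i in range(n)]
    # The walker starts from (and restarts at) the seed nodes with equal probability.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # p_next = (1 - restart) * W * p + restart * p0
        nxt = [(1 - restart) * sum(trans[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy 3-node similarity network (hypothetical values); node 0 is the only seed.
toy_weights = [[0.0, 0.9, 0.1],
               [0.9, 0.0, 0.5],
               [0.1, 0.5, 0.0]]
probabilities = random_walk_with_restart(toy_weights, seeds={0})
```

In this toy network, the node that is more similar to the seed receives the higher steady-state probability, which is how candidate miRNAs would be prioritized for the disease of interest.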

The third type of prediction models is constructed based on machine learning algorithms. For example, Xuan et al. [32] proposed a model named Human-Disease-related MiRNA Prediction (HDMP) based on weighted k most similar neighbors. Firstly, they constructed the miRNA functional similarity matrix. Then, the association score between a given disease and a candidate miRNA can be obtained by summing subscores of the k neighbors of the candidate miRNA. Each neighbor’s subscore can be computed by inspecting two key metrics including the neighbor’s weight and the similarity between the candidate miRNA and its neighbor. In this study, the neighbor would be assigned higher weight if the neighbor and the candidate miRNA belong to the same family or cluster. Besides, Chen et al. [33] developed a model named Regularized Least Squares for MiRNA-Disease-Association (RLSMDA). They constructed the semisupervised classifier in the miRNA and disease space, respectively, under the framework of regularized least squares (RLS), and then combined the optimal classifiers in two different spaces to obtain the probability of miRNA–disease pair. Chen et al. [34] further proposed another model named Restricted Boltzmann Machine for Multiple types of MiRNA-Disease Association prediction (RBMMMDA). RBMMMDA was constructed based on the restricted Boltzmann machine, a two-layer undirected graphical model consisting of layers of visible units and hidden units, respectively. A visible unit represented a disease and a hidden unit stood for an unknown miRNA–disease pair. Innovatively, RBMMMDA could predict not only the probability of potential associations but also the types of associations. In addition, Pasquier and Gardes proposed a novel model named MiRAI by using distributional semantics to reveal information attached to miRNAs and diseases [35]. They firstly represented the distributional information on miRNAs and diseases in a high-dimensional vector space. 
Then, the association probability between miRNAs and diseases can be computed based on their vector similarity. Li et al. [36] used the singular value thresholding (SVT) algorithm to develop a model named Matrix Completion for MiRNA-Disease Association prediction (MCMDA). Matrix completion algorithm was adopted to update the miRNA–disease adjacency matrix to obtain the final miRNA–disease association matrix. Furthermore, Chen et al. [37] developed a model named Ranking-based K-Nearest Neighbors for MiRNA-Disease Association prediction (RKNNMDA). They utilized KNN algorithm to obtain the K nearest neighbors of investigated miRNA and employed support vector machine (SVM) to re-rank the K neighbors. Then, the association score between the investigated miRNA and candidate disease can be calculated by inspecting the association information between the K neighbors and the candidate disease. Similarly, the authors also computed association score from the perspective of disease. Finally, they integrated the association scores from two different perspectives to predict potential miRNA–disease associations. Moreover, Extreme Gradient Boosting Machine for MiRNA-Disease Association (EGBMMDA) was raised by Chen et al. [38]. EGBMMDA constructed three different types of features and connected them to generate composite feature vectors as input. The probability of potential miRNA–disease association was obtained by training a regression tree under the framework of gradient boosting. Recently, Zhu et al. [39] developed the model of Bayesian Ranking for MiRNA-Disease Association prediction (BRMDA). They improved Bayesian Personalized Ranking algorithm and defined a new optimization criterion by incorporating miRNA bias and adding similarity information of miRNA and disease to infer potential miRNA–disease associations. In addition, a neighborhood-based approach was utilized to predict associations for new diseases and miRNAs.

Although the above models show reliable performance to some extent, each still has its own limitations and needs further improvement. Since deep learning can learn better representations of data and has been successfully used in many domains such as genomics and drug discovery in recent years [40], we considered applying it to the prediction of miRNA–disease associations. In addition, only pairs with known labels can be used to train an ordinary multilayer perceptron network, so we pretrained the network using all miRNA–disease pairs to reduce the impact of the small number of known associations on predictive accuracy. Inspired by Bahi et al. [41], we presented a model of Stacked Autoencoder for potential MiRNA-Disease Association prediction (SAEMDA) that took advantage of both deep learning and pretraining. We first pretrained the Stacked Autoencoder (SAE) using all miRNA–disease pairs in an unsupervised manner. Then, positive samples and the same number of randomly selected negative samples were used to fine-tune the SAE in a supervised manner. The predictive performance of our method was evaluated by three kinds of cross validation. SAEMDA obtained an AUC of 0.9210 in global leave-one-out cross validation (LOOCV), an AUC of 0.8343 in local LOOCV and an average AUC and standard deviation of 0.9102 ± 0.0029 over 100 repetitions of 5-fold cross validation. In addition, we carried out three case studies to demonstrate the prediction accuracy of SAEMDA. In the three different types of case studies for breast neoplasms (BN), lung neoplasms (LN) and esophageal neoplasms (EN), 41, 50 and 45 of the top 50 predicted potentially related miRNAs were verified by databases.
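The sample construction for supervised fine-tuning can be illustrated with a short sketch. It assumes that negative samples are drawn uniformly at random from the unlabeled pairs, and it uses integer indices for miRNAs and diseases. The function name, the toy data and the fixed random seed are hypothetical and are not taken from the SAEMDA implementation.

```python
import random


def build_finetune_set(known_pairs, n_mirnas, n_diseases, seed=0):
    """Return (balanced labeled samples, all unlabeled pairs).

    known_pairs: iterable of (miRNA index, disease index) known associations.
    Negatives are sampled uniformly from unlabeled pairs (an assumption).
    """
    rng = random.Random(seed)
    positives = set(known_pairs)
    # Every miRNA-disease pair without a known association is unlabeled.
    unlabeled = [(m, d) for m in range(n_mirnas) for d in range(n_diseases)
                 if (m, d) not in positives]
    # Draw as many negatives as there are positives, without replacement.
    negatives = rng.sample(unlabeled, len(positives))
    samples = [(pair, 1) for pair in sorted(positives)]
    samples += [(pair, 0) for pair in negatives]
    rng.shuffle(samples)
    return samples, unlabeled


# Toy example: 4 miRNAs, 3 diseases, 3 known associations (hypothetical data).
toy_known = [(0, 0), (1, 2), (3, 1)]
toy_samples, toy_unlabeled = build_finetune_set(toy_known, n_mirnas=4, n_diseases=3)
```

With the HMDD v2.0 counts reported in the Results (5430 known associations among 495 × 383 = 189 585 pairs), such a scheme would draw 5430 negatives from 184 155 unlabeled pairs, while unsupervised pretraining would still see all 189 585 pairs.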

Results

Performance evaluation

We first obtained the training data from HMDD v2.0 [42], which contains 5430 known associations between 495 miRNAs and 383 diseases, and then adopted global and local LOOCV to verify the accuracy of SAEMDA. In both global and local LOOCV, each known association was left out in turn as a test sample and the remaining known associations were regarded as training samples. In global LOOCV, all unlabeled miRNA–disease pairs were considered as candidate samples, whereas in local LOOCV the candidate samples were only the unlabeled pairs between miRNAs and the investigated disease. In both cases, we scored the test sample and the candidate samples with SAEMDA and obtained the rank of the test sample by comparing its score with those of the candidate samples. Then, we evaluated the performance of SAEMDA by drawing a receiver operating characteristic (ROC) curve and calculating the area under the ROC curve (AUC). As shown in Figure 1, SAEMDA obtained an AUC of 0.9210 in global LOOCV, which was superior to PBMDA (0.9169), EGBMMDA (0.9123), MDHGI (0.8945), TLHNMDA (0.8795), MCMDA (0.8749), MaxFlow (0.8629), RLSMDA (0.8426), HDMP (0.8366) and WBSMDA (0.8030). In local LOOCV, the AUC of SAEMDA was 0.8343, better than those of all the other models: PBMDA (0.8341), EGBMMDA (0.8221), MDHGI (0.8240), TLHNMDA (0.7756), MCMDA (0.7718), MaxFlow (0.7774), RLSMDA (0.6953), HDMP (0.7702), WBSMDA (0.8031), MiRAI (0.6299) and MIDP (0.8196). It is worth mentioning that MIDP was not suitable for global LOOCV comparison, because it is a local ranking method based on random walk and cannot simultaneously make predictions for all diseases. Global LOOCV could not be applied to MiRAI either: its predicted association scores were positively correlated with the number of miRNAs known to be associated with each disease, so comparing predicted association scores across different diseases was unreasonable.
It can also be seen that the AUC of MiRAI was considerably lower than those of the other methods, because the predictive accuracy of MiRAI is severely affected by data sparsity. Only 83 important diseases with at least 20 associated miRNAs were considered in the original literature [35]. In contrast, our dataset contains far more than 83 diseases, many of which have fewer associated miRNAs.
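The ranking step of LOOCV described above can be sketched as follows. Each fold consists of the score of the left-out test sample and the scores of its candidate samples. Averaging, over folds, the fraction of candidates that score below the test sample is one common rank-based way of obtaining an AUC, and it is an assumption here rather than the exact procedure used for Figure 1. The function names and toy scores are hypothetical.

```python
def rank_of_test_sample(test_score, candidate_scores):
    """1-based rank of the test sample; candidates with equal scores rank ahead of it."""
    return 1 + sum(1 for s in candidate_scores if s >= test_score)


def loocv_auc(folds):
    """Average, over folds, the fraction of candidates scored below the test sample.

    folds: list of (test_score, candidate_scores) tuples, one per left-out association.
    Ties contribute one half, as in the usual rank-based AUC estimate.
    """
    total = 0.0
    for test_score, candidate_scores in folds:
        wins = sum(1.0 for s in candidate_scores if s < test_score)
        ties = sum(0.5 for s in candidate_scores if s == test_score)
        total += (wins + ties) / len(candidate_scores)
    return total / len(folds)


# Two toy folds with hypothetical scores.
toy_folds = [
    (0.9, [0.1, 0.2, 0.95, 0.3]),  # one candidate outranks the test sample
    (0.5, [0.1, 0.2, 0.3, 0.4]),   # the test sample is ranked first
]
toy_auc = loocv_auc(toy_folds)
```

For global LOOCV the candidate list of each fold would hold the scores of all unlabeled pairs, whereas for local LOOCV it would hold only the unlabeled pairs of the investigated disease.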

Figure 1

AUCs of SAEMDA under global and local LOOCV compared with some previous computational models.

In addition, we performed 5-fold cross validation to further evaluate the performance of SAEMDA. All known miRNA–disease associations were randomly divided into five equally sized subsets. Each subset was used as the test set in turn, while the other four subsets were used as the training set. We applied SAEMDA to score all unlabeled miRNA–disease pairs and test samples, and then obtained the rank of each test sample by comparing its score with the scores of all unlabeled pairs. To reduce the bias caused by the random division of known miRNA–disease associations, we repeated 5-fold cross validation 100 times. As a result (Table 1), SAEMDA obtained an average AUC and standard deviation of 0.9102 ± 0.0029, which was higher than those of eight previous models and slightly lower than that of PBMDA. It is worth noting that all prediction models were compared with SAEMDA on the same dataset in LOOCV and 5-fold cross validation.
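The repeated 5-fold protocol can be outlined in code. The sketch below only covers the random division and the aggregation over repetitions. The `evaluate` callback, which would train the model on the training indices and return an AUC for the test indices, is a hypothetical placeholder, and averaging fold AUCs within each repetition is an assumption about how the per-repetition AUC is obtained.

```python
import random
import statistics


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train indices, test indices) for one random n-fold division."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Striding over the shuffled indices gives folds whose sizes differ by at most one.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def repeated_cv(n_samples, evaluate, repeats=100, n_folds=5):
    """Return (mean, standard deviation) of the per-repetition AUCs.

    evaluate(train, test) is a hypothetical callback returning the AUC of one fold.
    """
    aucs = []
    for r in range(repeats):
        # A different seed per repetition gives a different random division.
        fold_aucs = [evaluate(train, test)
                     for train, test in five_fold_splits(n_samples, n_folds, seed=r)]
        aucs.append(statistics.mean(fold_aucs))
    return statistics.mean(aucs), statistics.stdev(aucs)


# One division of the 5430 known associations: each test fold holds 1086 of them.
splits = list(five_fold_splits(5430))
```

Because 5430 is divisible by five, every test fold contains exactly 1086 known associations and every training set contains 4344.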

Table 1

Performance comparison between SAEMDA and nine other models under 5-fold cross validation

Prediction model    AUC       Standard deviation
SAEMDA              0.9102    0.0029
PBMDA               0.9172    0.0007
EGBMMDA             0.9048    0.0012
MDHGI               0.8794    0.0021
TLHNMDA             0.8795    0.0010
MCMDA               0.8767    0.0011
MaxFlow             0.8579    0.001
RLSMDA              0.8569    0.0020
HDMP                0.8342    0.0010
WBSMDA              0.8185    0.0009

Case studies

In our work, we carried out three different types of case studies to further illustrate the predictive power of SAEMDA. In the first case study, we obtained known associations from the HMDD v2.0 database and then verified the predicted results using the dbDEMC [43] and miR2Disease [44] databases. We chose BN, the most common malignant disease in women, as the investigated disease. BN begins as a local disease and can spread to lymph nodes and other organs [45]. Clinical breast examination is one of the main methods to detect BN, and early diagnosis can greatly improve the cure rate of BN [46]. Studies have found that most BN patients have abnormal miRNA expression [47], implying that miRNAs could be potential biomarkers for the diagnosis of BN. For example, Heneghan et al. [48] found that the expression of miR-195 was significantly increased in BN patients. We used SAEMDA to reveal more miRNAs related to BN. As a result, 8 out of the top 10 and 41 out of the top 50 potential miRNAs were confirmed by the dbDEMC and miR2Disease databases (Table 2).

Table 2

Validation of the top 50 miRNAs predicted to be associated with BN by SAEMDA based on the known associations in HMDD v2.0. The first column records the top 1–25 predicted miRNAs and the third column records the 26–50 predicted miRNAs

miRNA           Evidence              miRNA           Evidence
hsa-mir-196a    dbDEMC; miR2Disease   hsa-mir-210     dbDEMC; miR2Disease
hsa-mir-1246    unconfirmed           hsa-mir-101     dbDEMC; miR2Disease
hsa-mir-198     dbDEMC                hsa-mir-125a    dbDEMC; miR2Disease
hsa-mir-29a     dbDEMC                hsa-mir-99b     dbDEMC
hsa-mir-205     dbDEMC; miR2Disease   hsa-let-7f      dbDEMC; miR2Disease
hsa-mir-200b    dbDEMC; miR2Disease   hsa-mir-590     dbDEMC
hsa-mir-200c    dbDEMC; miR2Disease   hsa-mir-7       dbDEMC; miR2Disease
hsa-mir-635     unconfirmed           hsa-mir-144     dbDEMC
hsa-mir-27b     dbDEMC                hsa-mir-499a    unconfirmed
hsa-mir-143     dbDEMC; miR2Disease   hsa-mir-141     dbDEMC; miR2Disease
hsa-mir-103a    unconfirmed           hsa-mir-195     dbDEMC; miR2Disease
hsa-mir-19b     dbDEMC                hsa-mir-191     dbDEMC; miR2Disease
hsa-mir-93      dbDEMC                hsa-mir-204     dbDEMC; miR2Disease
hsa-mir-363     dbDEMC                hsa-mir-200a    dbDEMC; miR2Disease
hsa-mir-133a    dbDEMC                hsa-mir-650     dbDEMC
hsa-let-7a      dbDEMC; miR2Disease   hsa-mir-10b     dbDEMC; miR2Disease
hsa-mir-124     dbDEMC                hsa-mir-125b    miR2Disease
hsa-mir-29b     dbDEMC; miR2Disease   hsa-mir-30e     unconfirmed
hsa-mir-30a     miR2Disease           hsa-mir-449a    unconfirmed
hsa-mir-20a     miR2Disease           hsa-mir-1972    unconfirmed
hsa-mir-1273a   unconfirmed           hsa-mir-23b     dbDEMC
hsa-mir-433     dbDEMC                hsa-mir-34b     dbDEMC
hsa-mir-31      dbDEMC; miR2Disease   hsa-mir-95      dbDEMC
hsa-mir-221     dbDEMC; miR2Disease   hsa-mir-1302    unconfirmed
hsa-mir-223     dbDEMC                hsa-mir-505     dbDEMC

In the second case study, we sought to verify the performance of SAEMDA when applied to a disease without any known associated miRNAs and took LN as the investigated disease. The training data were also collected from the HMDD v2.0 database. We removed all association information for LN from the training data to simulate LN as a new disease. LN is one of the malignant tumors with the fastest increase in morbidity and mortality [49]. About 230 000 new cases of LN were expected to be diagnosed in the United States in 2021 [49]. Although great progress has been made in imaging diagnostic techniques, there is still no satisfactory method to significantly improve the early detection rate of LN, which causes most patients to miss the optimal treatment period [50]. Therefore, it is very important to find an effective method of early screening and diagnosis for LN. Some studies have found that the occurrence of LN is closely related to miRNAs [21]. For example, the miR-17-92 cluster was found to be overexpressed in human LN [51]. Besides, the expression level of miR-224 in non–small cell lung cancer (NSCLC) is higher than that in normal lung tissue, and it can promote tumor progression in NSCLC [52]. We trained SAEMDA to infer potential LN-related miRNAs. The validation results showed that all of the top 50 predicted miRNAs were confirmed by HMDD v2.0, dbDEMC and miR2Disease (Table 3).

Table 3

Validation of the top 50 miRNAs predicted to be associated with LN by SAEMDA based on the known associations in HMDD v2.0. Here, LN was treated as a new disease by removing all association information for LN from HMDD v2.0. The first column records the top 1–25 predicted miRNAs and the third column records the 26–50 predicted miRNAs

miRNA           Evidence                    miRNA           Evidence
hsa-mir-21      dbDEMC; miR2Disease; HMDD   hsa-mir-223     HMDD
hsa-mir-155     dbDEMC; miR2Disease; HMDD   hsa-mir-146b    miR2Disease; HMDD
hsa-mir-92a     HMDD                        hsa-mir-19a     dbDEMC; miR2Disease; HMDD
hsa-mir-30a     miR2Disease; HMDD           hsa-mir-24      miR2Disease; HMDD
hsa-mir-19b     dbDEMC; HMDD                hsa-mir-125b    miR2Disease; HMDD
hsa-mir-195     dbDEMC; miR2Disease         hsa-mir-181a    dbDEMC; HMDD
hsa-mir-17      miR2Disease; HMDD           hsa-mir-125a    dbDEMC; miR2Disease; HMDD
hsa-mir-29c     dbDEMC; miR2Disease; HMDD   hsa-mir-34a     dbDEMC; HMDD
hsa-mir-210     dbDEMC; miR2Disease; HMDD   hsa-mir-145     dbDEMC; miR2Disease; HMDD
hsa-mir-29a     dbDEMC; miR2Disease; HMDD   hsa-let-7c      dbDEMC; miR2Disease; HMDD
hsa-mir-16      dbDEMC; miR2Disease         hsa-mir-27a     dbDEMC; HMDD
hsa-mir-126     dbDEMC; miR2Disease; HMDD   hsa-mir-15b     dbDEMC
hsa-mir-26a     dbDEMC; miR2Disease; HMDD   hsa-mir-1       dbDEMC; miR2Disease; HMDD
hsa-mir-142     HMDD                        hsa-mir-199b    dbDEMC; miR2Disease; HMDD
hsa-mir-29b     dbDEMC; miR2Disease; HMDD   hsa-mir-9       miR2Disease; HMDD
hsa-mir-200c    dbDEMC; miR2Disease; HMDD   hsa-let-7e      miR2Disease; HMDD
hsa-mir-146a    dbDEMC; miR2Disease; HMDD   hsa-mir-22      miR2Disease; HMDD
hsa-mir-150     dbDEMC; miR2Disease; HMDD   hsa-let-7b      miR2Disease; HMDD
hsa-mir-7       miR2Disease; HMDD           hsa-mir-30e     miR2Disease; HMDD
hsa-mir-15a     dbDEMC                      hsa-mir-148a    dbDEMC; HMDD
hsa-mir-106b    dbDEMC                      hsa-let-7d      dbDEMC; miR2Disease; HMDD
hsa-mir-18a     dbDEMC; miR2Disease; HMDD   hsa-mir-221     dbDEMC; HMDD
hsa-mir-23b     dbDEMC                      hsa-mir-192     dbDEMC; miR2Disease; HMDD
hsa-let-7a      dbDEMC; miR2Disease; HMDD   hsa-mir-196a    dbDEMC; HMDD
hsa-mir-20a     dbDEMC; miR2Disease; HMDD   hsa-mir-20b     dbDEMC
miRNAEvidencemiRNAEvidence
hsa-mir-21dbDEMC; miR2Disease; HMDDhsa-mir-223HMDD
hsa-mir-155dbDEMC; miR2Disease; HMDDhsa-mir-146bmiR2Disease; HMDD
hsa-mir-92aHMDDhsa-mir-19adbDEMC; miR2Disease; HMDD
hsa-mir-30amiR2Disease; HMDDhsa-mir-24miR2Disease; HMDD
hsa-mir-19bdbDEMC; HMDDhsa-mir-125bmiR2Disease; HMDD
hsa-mir-195dbDEMC; miR2Diseasehsa-mir-181adbDEMC; HMDD
hsa-mir-17miR2Disease; HMDDhsa-mir-125adbDEMC; miR2Disease; HMDD
hsa-mir-29cdbDEMC; miR2Disease; HMDDhsa-mir-34adbDEMC; HMDD
hsa-mir-210dbDEMC; miR2Disease; HMDDhsa-mir-145dbDEMC; miR2Disease; HMDD
hsa-mir-29adbDEMC; miR2Disease; HMDDhsa-let-7cdbDEMC; miR2Disease; HMDD
hsa-mir-16dbDEMC; miR2Diseasehsa-mir-27adbDEMC; HMDD
hsa-mir-126dbDEMC; miR2Disease; HMDDhsa-mir-15bdbDEMC
hsa-mir-26adbDEMC; miR2Disease; HMDDhsa-mir-1dbDEMC; miR2Disease; HMDD
hsa-mir-142HMDDhsa-mir-199bdbDEMC; miR2Disease; HMDD
hsa-mir-29bdbDEMC; miR2Disease; HMDDhsa-mir-9miR2Disease; HMDD
hsa-mir-200cdbDEMC; miR2Disease; HMDDhsa-let-7emiR2Disease; HMDD
hsa-mir-146adbDEMC; miR2Disease; HMDDhsa-mir-22miR2Disease; HMDD
hsa-mir-150dbDEMC; miR2Disease; HMDDhsa-let-7bmiR2Disease; HMDD
hsa-mir-7miR2Disease; HMDDhsa-mir-30emiR2Disease; HMDD
hsa-mir-15adbDEMChsa-mir-148adbDEMC; HMDD
hsa-mir-106bdbDEMChsa-let-7ddbDEMC; miR2Disease; HMDD
hsa-mir-18adbDEMC; miR2Disease; HMDDhsa-mir-221dbDEMC; HMDD
hsa-mir-23bdbDEMChsa-mir-192dbDEMC; miR2Disease; HMDD
hsa-let-7adbDEMC; miR2Disease; HMDDhsa-mir-196adbDEMC; HMDD
hsa-mir-20adbDEMC; miR2Disease; HMDDhsa-mir-20bdbDEMC
Table 3

Validation of the top 50 miRNAs predicted to be associated with LN by SAEMDA based on the known associations in HMDD v2.0. Especially, LN was considered as new disease by removing association information of LN from HMDD v2.0. The first column records the top 1–25 predicted miRNAs and the third column records the 26–50 predicted miRNAs

| miRNA (top 1–25) | Evidence | miRNA (top 26–50) | Evidence |
| --- | --- | --- | --- |
| hsa-mir-21 | dbDEMC; miR2Disease; HMDD | hsa-mir-223 | HMDD |
| hsa-mir-155 | dbDEMC; miR2Disease; HMDD | hsa-mir-146b | miR2Disease; HMDD |
| hsa-mir-92a | HMDD | hsa-mir-19a | dbDEMC; miR2Disease; HMDD |
| hsa-mir-30a | miR2Disease; HMDD | hsa-mir-24 | miR2Disease; HMDD |
| hsa-mir-19b | dbDEMC; HMDD | hsa-mir-125b | miR2Disease; HMDD |
| hsa-mir-195 | dbDEMC; miR2Disease | hsa-mir-181a | dbDEMC; HMDD |
| hsa-mir-17 | miR2Disease; HMDD | hsa-mir-125a | dbDEMC; miR2Disease; HMDD |
| hsa-mir-29c | dbDEMC; miR2Disease; HMDD | hsa-mir-34a | dbDEMC; HMDD |
| hsa-mir-210 | dbDEMC; miR2Disease; HMDD | hsa-mir-145 | dbDEMC; miR2Disease; HMDD |
| hsa-mir-29a | dbDEMC; miR2Disease; HMDD | hsa-let-7c | dbDEMC; miR2Disease; HMDD |
| hsa-mir-16 | dbDEMC; miR2Disease | hsa-mir-27a | dbDEMC; HMDD |
| hsa-mir-126 | dbDEMC; miR2Disease; HMDD | hsa-mir-15b | dbDEMC |
| hsa-mir-26a | dbDEMC; miR2Disease; HMDD | hsa-mir-1 | dbDEMC; miR2Disease; HMDD |
| hsa-mir-142 | HMDD | hsa-mir-199b | dbDEMC; miR2Disease; HMDD |
| hsa-mir-29b | dbDEMC; miR2Disease; HMDD | hsa-mir-9 | miR2Disease; HMDD |
| hsa-mir-200c | dbDEMC; miR2Disease; HMDD | hsa-let-7e | miR2Disease; HMDD |
| hsa-mir-146a | dbDEMC; miR2Disease; HMDD | hsa-mir-22 | miR2Disease; HMDD |
| hsa-mir-150 | dbDEMC; miR2Disease; HMDD | hsa-let-7b | miR2Disease; HMDD |
| hsa-mir-7 | miR2Disease; HMDD | hsa-mir-30e | miR2Disease; HMDD |
| hsa-mir-15a | dbDEMC | hsa-mir-148a | dbDEMC; HMDD |
| hsa-mir-106b | dbDEMC | hsa-let-7d | dbDEMC; miR2Disease; HMDD |
| hsa-mir-18a | dbDEMC; miR2Disease; HMDD | hsa-mir-221 | dbDEMC; HMDD |
| hsa-mir-23b | dbDEMC | hsa-mir-192 | dbDEMC; miR2Disease; HMDD |
| hsa-let-7a | dbDEMC; miR2Disease; HMDD | hsa-mir-196a | dbDEMC; HMDD |
| hsa-mir-20a | dbDEMC; miR2Disease; HMDD | hsa-mir-20b | dbDEMC |

In the third case study, to demonstrate the generalization ability of SAEMDA on a different dataset, we obtained the training data from HMDD v1.0, which contains 1395 known associations between 271 miRNAs and 137 diseases. EN was selected for the case study. EN is among the highest-risk cancers in the world and its mortality rate ranks sixth among all cancers [53]. In recent years, the incidence of EN in Asia has gradually increased [54]. Although chemotherapy, radiotherapy and other treatments are developing rapidly, they cannot yet provide satisfactory outcomes for patients with advanced EN [54]. Therefore, identifying biomarkers of EN for early diagnosis would significantly improve the prospects for its diagnosis and treatment. Current studies show that the occurrence, development and prognosis of EN are related to the abnormal regulation of miRNAs [55]. For example, miR-377 can suppress the initiation and progression of EN by inhibiting CD133 and VEGF [56]. In addition, miR-296 was overexpressed in esophageal squamous cell cancer tissues, and downregulation of miR-296 can suppress the growth of EN cells [57]. Here, we employed SAEMDA to predict EN-associated miRNAs based on the known associations in HMDD v1.0. As a result, 45 of the top 50 predicted miRNAs were verified by the HMDD v2.0, dbDEMC and miR2Disease databases (Table 4).

Table 4

Validation of the top 50 miRNAs predicted to be associated with EN by SAEMDA based on the known associations in HMDD v1.0. The first column records the top 1–25 predicted miRNAs and the third column records the 26–50 predicted miRNAs

| miRNA (top 1–25) | Evidence | miRNA (top 26–50) | Evidence |
| --- | --- | --- | --- |
| hsa-mir-155 | dbDEMC; HMDD | hsa-mir-208b | unconfirmed |
| hsa-mir-365 | unconfirmed | hsa-mir-92b | dbDEMC |
| hsa-mir-448 | dbDEMC | hsa-mir-200b | dbDEMC |
| hsa-mir-221 | dbDEMC | hsa-let-7d | dbDEMC |
| hsa-mir-146a | dbDEMC; HMDD | hsa-let-7i | dbDEMC |
| hsa-let-7c | dbDEMC; HMDD | hsa-mir-29a | dbDEMC |
| hsa-mir-222 | dbDEMC | hsa-mir-181b | dbDEMC |
| hsa-mir-20a | dbDEMC; HMDD | hsa-mir-181a | dbDEMC |
| hsa-mir-92a | HMDD | hsa-let-7g | dbDEMC |
| hsa-mir-514 | unconfirmed | hsa-mir-125b | dbDEMC |
| hsa-mir-338 | dbDEMC | hsa-mir-210 | dbDEMC; HMDD |
| hsa-mir-137 | dbDEMC | hsa-mir-141 | dbDEMC; HMDD |
| hsa-mir-18a | dbDEMC | hsa-mir-300 | unconfirmed |
| hsa-mir-145 | dbDEMC; HMDD | hsa-mir-383 | dbDEMC |
| hsa-mir-423 | dbDEMC | hsa-mir-515 | unconfirmed |
| hsa-mir-19a | dbDEMC; HMDD | hsa-mir-602 | dbDEMC |
| hsa-mir-29c | dbDEMC; HMDD | hsa-mir-196b | dbDEMC; HMDD |
| hsa-mir-199b | dbDEMC | hsa-mir-135b | dbDEMC; HMDD |
| hsa-mir-23b | dbDEMC | hsa-mir-206 | dbDEMC |
| hsa-let-7b | dbDEMC; HMDD | hsa-mir-127 | dbDEMC |
| hsa-mir-520b | dbDEMC | hsa-mir-98 | dbDEMC; HMDD |
| hsa-mir-335 | dbDEMC | hsa-mir-9 | dbDEMC |
| hsa-mir-330 | dbDEMC | hsa-mir-373 | dbDEMC; miR2Disease |
| hsa-mir-223 | dbDEMC; miR2Disease; HMDD | hsa-mir-132 | dbDEMC |
| hsa-mir-34a | dbDEMC; HMDD | hsa-mir-134 | dbDEMC |

Discussion

Predicting potential miRNA–disease associations enables researchers to better understand the mechanisms of diseases and promotes the diagnosis, treatment and prognosis of complex diseases. In this study, we developed SAEMDA, which can serve as an effective supplement to traditional biological experimental methods. In SAEMDA, all miRNA–disease samples were used to pretrain an SAE. Then, the SAE was fine-tuned with the positive samples and the same number of negative samples. SAEMDA obtained better performance than other models in three types of cross validation. SAEMDA is superior to previous methods mainly because it makes full use of the information in all unlabeled samples during training. In addition, the results of three different kinds of case studies further illustrated the reliable prediction performance of SAEMDA. Beyond miRNA–disease association prediction, there are many important link prediction problems in the field of bioinformatics, such as lncRNA–disease association prediction [58], circular RNA (circRNA)–disease association prediction [59] and protein–protein interaction prediction [60]. Given its good performance in miRNA–disease association prediction, the framework of SAEMDA could also be applied to these link prediction problems.

The reliable performance of SAEMDA can be attributed to the following aspects. Firstly, the data used in our study contain 189 585 miRNA–disease pairs for 495 miRNAs and 383 diseases, of which only 5430 are known associations. SAEMDA is especially suitable for such a dataset, composed of a large amount of unlabeled data and a small amount of labeled data, because it combines unsupervised pretraining with supervised fine-tuning. The pretraining process enabled the model to learn the features of all miRNA–disease pairs, which compensates for the limitation that a traditional supervised learning model can only be trained with labeled samples. In addition, the fine-tuning process enabled the model to learn the label information of the small amount of labeled data for further performance improvement. Secondly, SAEMDA integrated diverse similarity networks so that the features could better capture the information of all miRNA–disease pairs. Finally, we selected the Adam optimizer for training SAEMDA, as it is more efficient than the traditional Stochastic Gradient Descent (SGD) optimizer.

However, SAEMDA still has some limitations. Firstly, the hyperparameters of the neural network (such as the number of hidden layers and the number of neurons per layer) were not well determined. Secondly, SAEMDA obtained a larger standard deviation than the comparison models in 100 runs of 5-fold cross validation. SAEMDA was therefore slightly inferior to the other models in terms of stability, which is a common problem in deep learning. Thirdly, both positive and negative samples are needed for fine-tuning, but randomly selecting unlabeled samples as negative samples may introduce inaccurate information, because some unlabeled pairs may be true associations that have not yet been verified. Finally, there is room for improvement in concatenating the similarities of the disease and the miRNA as the features of a miRNA–disease pair. Therefore, how to construct and extract reliable features of miRNA–disease pairs would be a future research direction in prediction method design. In addition, appropriate methods should be designed to improve negative sample selection; for example, clustering algorithms could be used in this process [61–63]. Introducing other biological information to help predict potential miRNA–disease associations may also be an important direction.

Materials and methods

Materials

First, the human miRNA–disease associations were obtained from the HMDD v2.0 database [42]. Specifically, there were 495 miRNAs, 383 diseases and 5430 experimentally verified miRNA–disease associations. We used nd and nm to represent the number of diseases and miRNAs, respectively. The adjacency matrix A of size nm × nd was used to represent all miRNA–disease pairs: A(i,j) is equal to 1 if miRNA m(i) is related to disease d(j); otherwise, it is 0. The miRNA functional similarity scores were calculated in a previous study [64] and can be downloaded from http://www.cuilab.cn/files/images/cuilab/misim.zip. The matrix FS denotes the miRNA functional similarity matrix. In addition, we described the relationships between diseases through Directed Acyclic Graphs (DAGs) and used two different methods to calculate disease semantic similarity according to a previous study [32]. Based on the assumption that the larger the shared part of the DAGs of two diseases, the greater their semantic similarity, we calculated the first type of disease semantic similarity matrix SS1 through the method in the previous study [32]. Because different disease terms in the same layer of a DAG should contribute differently to the semantic value of the investigated disease, we redefined the semantic contribution of each disease term and calculated the second type of disease semantic similarity matrix SS2 [32]. Furthermore, based on the assumption that similar diseases (miRNAs) have similar patterns of interaction and noninteraction with miRNAs (diseases), we calculated the Gaussian interaction profile kernel similarity matrix KD (KM) of diseases (miRNAs) according to the previous method [65]. It should be noted that in each round of LOOCV and 5-fold cross validation, KD and KM were recalculated based on all known association information except the test samples.
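Finally, we combined the Gaussian interaction profile kernel similarity of miRNAs with the miRNA functional similarity to obtain the integrated miRNA similarity matrix SM according to the method in a previous study [22] as follows:
$$SM(m(i),m(j)) = \begin{cases} FS(m(i),m(j)) & \text{if } m(i) \text{ and } m(j) \text{ have functional similarity} \\ KM(m(i),m(j)) & \text{otherwise} \end{cases} \tag{1}$$
Similarly, we calculated the integrated disease similarity matrix SD by integrating the Gaussian interaction profile kernel similarity of diseases with the two kinds of disease semantic similarity:
$$SD(d(i),d(j)) = \begin{cases} \dfrac{SS1(d(i),d(j)) + SS2(d(i),d(j))}{2} & \text{if } d(i) \text{ and } d(j) \text{ have semantic similarity} \\ KD(d(i),d(j)) & \text{otherwise} \end{cases} \tag{2}$$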
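As an illustration of the similarity calculations described above, the following minimal Python sketch computes a Gaussian interaction profile kernel from a toy adjacency matrix and then fills in missing functional similarities with kernel values. It is not the released SAEMDA code. The function names `gip_kernel` and `integrate`, the bandwidth default `gamma_prime=1.0`, the toy matrices and the convention that a zero entry means "no similarity available" are all assumptions made for illustration; the kernel form follows the general definition in [65].

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    profiles: list of equal-length 0/1 lists, one interaction profile per entity.
    gamma_prime: bandwidth before normalization (1.0 is an assumed default).
    """
    n = len(profiles)
    # Normalize the bandwidth by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm if mean_sq_norm > 0 else gamma_prime
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel

def integrate(primary, fallback):
    """Use the primary similarity where it is non-zero, else the kernel value.

    Treating zero as "not available" is an assumption of this sketch.
    """
    n = len(primary)
    return [[primary[i][j] if primary[i][j] != 0 else fallback[i][j]
             for j in range(n)] for i in range(n)]

# Toy adjacency matrix A: 3 miRNAs (rows) x 2 diseases (columns).
A = [[1, 0],
     [1, 1],
     [0, 1]]
KM = gip_kernel(A)                               # miRNA profiles are the rows of A
KD = gip_kernel([list(c) for c in zip(*A)])      # disease profiles are the columns of A

# Hypothetical functional similarity matrix with one missing (zero) pair.
FS = [[1.0, 0.6, 0.0],
      [0.6, 1.0, 0.3],
      [0.0, 0.3, 1.0]]
SM = integrate(FS, KM)
```

Under the same assumptions, the disease side could reuse `integrate` with the element-wise average of SS1 and SS2 as the primary matrix and `KD` as the fallback. In cross validation, `KM` and `KD` would be recomputed from the adjacency matrix with the test associations set to 0.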
Figure 2

Flowchart of SAEMDA to predict potential miRNA–disease associations.

SAEMDA

In this study, we proposed a new model named SAEMDA to predict potential miRNA–disease associations. The flowchart of SAEMDA is depicted in Figure 2. The first step of SAEMDA is data preparation, which is to denote the miRNA–disease pairs as feature vectors. As presented in previous sections, we constructed the adjacency matrix A of miRNA–disease pairs (nm × nd), the integrated miRNA similarity matrix SM (nm × nm) and the integrated disease similarity matrix SD (nd × nd). From them, nm and nd features were extracted for each miRNA and disease, respectively. Concatenating the feature vectors of the investigated disease and miRNA yielded nm + nd features for each miRNA–disease pair. Among all miRNA–disease pairs, a total of 5430 pairs were known associations and the remaining miRNA–disease pairs were unlabeled.
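The data preparation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the feature counts (nm per miRNA, nd per disease) but not their exact source, so the sketch assumes that the features of miRNA m(i) are row i of SM and the features of disease d(j) are row j of SD. The helper names `pair_features` and `build_dataset` and the toy matrices are hypothetical.

```python
def pair_features(SM, SD, i, j):
    """Feature vector of the pair (miRNA i, disease j): SM row i followed by SD row j."""
    return list(SM[i]) + list(SD[j])

def build_dataset(A, SM, SD):
    """Return the feature vectors of all pairs and the indices of known associations."""
    features, positives = [], []
    for i, row in enumerate(A):
        for j, label in enumerate(row):
            features.append(pair_features(SM, SD, i, j))
            if label == 1:
                positives.append(len(features) - 1)
    return features, positives

# Toy example with nm = 3 miRNAs and nd = 2 diseases.
A = [[1, 0],
     [0, 0],
     [0, 1]]
SM = [[1.0, 0.2, 0.5],
      [0.2, 1.0, 0.1],
      [0.5, 0.1, 1.0]]
SD = [[1.0, 0.4],
      [0.4, 1.0]]
X, pos_idx = build_dataset(A, SM, SD)   # 6 pairs, each with 3 + 2 = 5 features
```

With the full dataset (nm = 495, nd = 383), the same construction would give 495 × 383 = 189 585 feature vectors of length 495 + 383 = 878, of which 5430 correspond to known associations.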

The second step of SAEMDA is the unsupervised pretraining of the SAE based on all miRNA–disease pairs. An SAE is a deep learning model constructed by stacking several autoencoders (AEs) [66]. An AE is composed of an encoder and a decoder. The encoder learns a new representation by mapping the input features from the input layer to the hidden layer, while the decoder reconstructs the original inputs by mapping the hidden layer to the output layer. The input layer and the output layer therefore have the same number of neurons. When the hidden layer has fewer neurons than the input layer, the AE reduces the dimensionality of the original data. After the feature vector X of a training sample was input to the AE, the representation h of the hidden layer was defined as follows:
$$h = \sigma(WX + b) \tag{3}$$
where |$\sigma$|, W and b represented the activation function (tanh in our study), the weight matrix and the bias vector of the encoder, respectively. Then, the output |${X}^{\prime}$| with the same shape as X was reconstructed from the hidden-layer representation h as follows:
$${X}^{\prime} = \sigma({W}^{\prime}h + {b}^{\prime}) \tag{4}$$
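where |${W}^{\prime }$| and |${b}^{\prime }$| denoted the weight matrix and the bias vector of the decoder. Next, the AE was trained with the Adam optimizer to minimize the reconstruction cost |$c(X, {X}^{\prime})$| between the input and its reconstruction:
$$\underset{W,\, b,\, {W}^{\prime},\, {b}^{\prime}}{\arg\min}\; c(X, {X}^{\prime}) \tag{5}$$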

In this study, SAE was constructed by stacking three AEs according to previous research [41]. The unsupervised pretraining of SAE was carried out as follows:

  1. An AE was trained using the feature vectors of all miRNA–disease pairs.

  2. The decoder layer was removed from the AE. Then, a new AE was constructed with the feature vectors generated by the first AE as input.

  3. The new AE was trained, while the weights and biases of the previously trained AE remained unchanged.

  4. Steps 2 and 3 were repeated until three AEs were stacked.
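The four pretraining steps above can be sketched with NumPy as follows. This is a simplified illustration rather than the released implementation. Plain full-batch gradient descent is used in place of the Adam optimizer to keep the code short, the reconstruction cost is assumed to be the mean squared error, a tanh decoder is assumed, and the layer sizes (8, 4, 2) and random toy data stand in for the real feature vectors and the 512, 256 and 128 hidden neurons. The names `train_autoencoder` and `pretrain_stack` are hypothetical.

```python
import numpy as np

def train_autoencoder(X, n_hidden, epochs=200, lr=0.05, seed=0):
    """Train one tanh autoencoder on X; return encoder (W, b) and the loss history."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))       # encoder weights
    b = np.zeros(n_hidden)                            # encoder bias
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_in))    # decoder weights
    b_dec = np.zeros(n_in)                            # decoder bias
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W + b)                        # encode
        X_rec = np.tanh(H @ W_dec + b_dec)            # decode (tanh is an assumption)
        err = X_rec - X
        losses.append(float(np.mean(err ** 2)))       # assumed cost: mean squared error
        # Backpropagate through the decoder and encoder activations.
        g_out = (2.0 * err / err.size) * (1.0 - X_rec ** 2)
        g_hid = (g_out @ W_dec.T) * (1.0 - H ** 2)
        # Plain gradient descent step (stand-in for Adam).
        W_dec -= lr * (H.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W -= lr * (X.T @ g_hid)
        b -= lr * g_hid.sum(axis=0)
    return (W, b), losses

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pretraining: each AE is trained on the codes of the previous one."""
    encoders, histories, inputs = [], [], X
    for size in layer_sizes:
        (W, b), losses = train_autoencoder(inputs, size)
        encoders.append((W, b))                       # step 2: the decoder is discarded
        histories.append(losses)
        inputs = np.tanh(inputs @ W + b)              # step 3: frozen encoder feeds the next AE
    return encoders, histories

rng = np.random.default_rng(1)
X_toy = rng.random((40, 12))                          # 40 toy samples with 12 features
encoders, histories = pretrain_stack(X_toy, [8, 4, 2])
```

Because each AE receives only the output of the already trained encoders, the earlier weights are never updated while a later AE is trained, which is what keeps them unchanged in step 3.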

After the unsupervised pretraining, we obtained the weight matrices W1, W2 and W3 as well as the bias vectors b1, b2 and b3 of the SAE. The third step of SAEMDA is the supervised fine-tuning of the SAE based on positive and negative samples. Here, the 5430 known miRNA–disease associations were taken as positive samples. In addition, 5430 negative samples were randomly selected from the unlabeled miRNA–disease pairs. The fine-tuning process contained the following steps:

  1. An output layer was added into the SAE obtained in the pretraining process. Here, the weight matrix W4 and bias vector b4 between the output layer and previous layer were randomly initialized.

  2. Positive samples and the same number of selected negative samples were used to train the SAE.
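The construction of the balanced fine-tuning set can be sketched as follows. The helper name `balanced_training_set`, the fixed random seed and the toy adjacency matrix are assumptions for illustration; the only behavior taken from the text is that all known associations are positives and an equal number of unlabeled pairs is drawn at random as negatives.

```python
import random

def balanced_training_set(A, seed=0):
    """Return (pairs, labels): all known associations plus an equal number of
    randomly chosen unlabeled pairs that are treated as negatives."""
    positives = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v == 1]
    unlabeled = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v == 0]
    rng = random.Random(seed)
    negatives = rng.sample(unlabeled, len(positives))   # sampling without replacement
    pairs = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return pairs, labels

# Toy adjacency matrix with 3 known associations among 12 pairs.
A = [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0]]
pairs, labels = balanced_training_set(A)
```

On the full dataset this would yield 5430 positives and 5430 sampled negatives. Some sampled negatives may in fact be undiscovered associations, which is the source of label noise acknowledged in the Discussion.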

Finally, the trained SAE can be used to predict potential miRNA–disease associations. It is worth noting that SAEMDA used the tanh activation function in each hidden layer and the softmax classifier in the output layer. Cross entropy was used as the loss function in the fine-tuning process, and the Adam optimizer was used to optimize the SAE. In addition, we set the numbers of hidden-layer neurons of the three AEs to 512, 256 and 128, respectively. After setting the hyperparameters of the model, we trained SAEMDA with a learning rate of 0.0001 to obtain the final miRNA–disease association scores.
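To make the output layer concrete, the short sketch below shows how a softmax over two output neurons turns raw activations into an association score and how the cross-entropy loss is evaluated for one sample. The logit values are hypothetical, and reading the probability of class 1 as the association score is an assumption about how the two classes are ordered.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample with an integer class label."""
    return -math.log(probs[label])

logits = [0.2, 1.3]                # hypothetical output activations: [class 0, class 1]
probs = softmax(logits)
score = probs[1]                   # probability of "associated", used as the score
loss_if_positive = cross_entropy(probs, 1)
loss_if_negative = cross_entropy(probs, 0)
```

For two classes the score reduces to a logistic function of the difference between the two logits, so ranking candidate miRNAs by this score is equivalent to ranking them by that difference.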

Key Points
  • SAEMDA was especially suitable for the dataset composed of a large amount of unlabeled data and a small amount of labeled data.

  • SAEMDA integrated diverse similarity networks. Therefore, the features could better capture the information of all miRNA–disease pairs.

  • We selected Adam optimizer in the training process of SAEMDA, as it is more efficient than traditional Stochastic Gradient Descent (SGD) optimizer.

  • Leave-one-out cross validation and case studies were implemented to evaluate the prediction performance of SAEMDA.

Data availability

The Python code and data for SAEMDA are available at https://github.com/xpnbs/SAEMDA.

Funding

This work was supported by Fundamental Research Funds for the Central Universities (2019ZDPY01).

Chun-Chun Wang is a PhD student of School of Information and Control Engineering, China University of Mining and Technology. His research interests include bioinformatics, complex network algorithm and machine learning.

Tian-Hao Li is a master’s student of School of Information and Control Engineering, China University of Mining and Technology. His research interests include bioinformatics and machine learning.

Li Huang is a PhD student of Academy of Arts and Design, Tsinghua University. His research interests include bioinformatics, complex network algorithm and machine learning.

Xing Chen, PhD, is a professor of China University of Mining and Technology. He is the associate dean of Artificial Intelligence Research Institute, China University of Mining and Technology. He is also the founding director of Institute of Bioinformatics, China University of Mining and Technology and Big Data Research Center, China University of Mining and Technology. His research interests include complex disease-related noncoding RNA biomarker prediction, computational models for drug discovery and early detection of human complex disease based on big data and artificial intelligence algorithms.

References

1. Ambros V. microRNAs: tiny regulators with great potential. Cell 2001;107:823–6.

2. Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 2004;116:281–97.

3. Xiao C, Calado DP, Galler G, et al. MiR-150 controls B cell differentiation by targeting the transcription factor c-Myb. Cell 2007;131:146–59.

4. Johnnidis JB, Harris MH, Wheeler RT, et al. Regulation of progenitor cell proliferation and granulocyte function by microRNA-223. Nature 2008;451:1125–9.

5. Kim Jin H, Woo Hye R, Kim J, et al. Trifurcate feed-forward regulation of age-dependent cell death involving miR164 in Arabidopsis. Science 2009;323:1053–7.

6. Mendell Joshua T, Olson EN. MicroRNAs in stress signaling and human disease. Cell 2012;148:1172–87.

7. Lu J, Getz G, Miska EA, et al. MicroRNA expression profiles classify human cancers. Nature 2005;435:834–8.

8. Esquela-Kerscher A, Slack FJ. Oncomirs - microRNAs with a role in cancer. Nat Rev Cancer 2006;6:259–69.

9. Latronico MV, Catalucci D, Condorelli G. Emerging role of microRNAs in cardiovascular biology. Circ Res 2007;101:1225–36.

10. Krutzfeldt J, Stoffel M. MicroRNAs: a new class of regulatory genes affecting metabolism. Cell Metab 2006;4:9–12.

11. Barwari T, Joshi A, Mayr M. MicroRNAs in cardiovascular disease. J Am Coll Cardiol 2016;68:2577–84.

12. Szabo G, Bala S. MicroRNAs in liver disease. Nat Rev Gastroenterol Hepatol 2013;10:542–52.

13. He L, Thomson JM, Hemann MT, et al. A microRNA polycistron as a potential human oncogene. Nature 2005;435:828–33.

14. Shimono Y, Zabala M, Cho RW, et al. Downregulation of miRNA-200c links breast cancer stem cells with normal stem cells. Cell 2009;138:592–603.

15. Ng EK, Chong WW, Jin H, et al. Differential expression of microRNAs in plasma of patients with colorectal cancer: a potential marker for colorectal cancer screening. Gut 2009;58:1375–81.

16. Hu Z, Chen X, Zhao Y, et al. Serum microRNA signatures identified in a genome-wide serum microRNA expression profiling predict survival of non-small-cell lung cancer. J Clin Oncol 2010;28:1721–6.

17. Calin GA, Croce CM. MicroRNA signatures in human cancers. Nat Rev Cancer 2006;6:857–66.

18. Slack FJ, Weidhaas JB. MicroRNA in cancer prognosis. N Engl J Med 2008;359:2720–2.

19. Bouchie A. First microRNA mimic enters clinic. Nat Biotechnol 2013;31:577–7.

20. Jiang Q, Hao Y, Wang G, et al. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol 2010;4:S2.

21. Chen X, Xie D, Zhao Q, et al. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform 2019;20:515–39.

22. Chen X, Yan CC, Zhang X, et al. WBSMDA: within and between score for MiRNA-disease association prediction. Sci Rep 2016;6:21106.

23. Mork S, Pletscher-Frankild S, Palleja Caro A, et al. Protein-driven inference of miRNA-disease associations. Bioinformatics 2014;30:392–7.

24. Chen X, Liu MX, Yan GY. RWRMDA: predicting novel human microRNA-disease associations. Mol Biosyst 2012;8:2792–8.

25. Shi H, Xu J, Zhang G, et al. Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst Biol 2013;7:101.

26. Xuan P, Han K, Guo Y, et al. Prediction of potential disease-associated microRNAs based on random walk. Bioinformatics 2015;31:1805–15.

27. Yu H, Chen X, Lu L. Large-scale prediction of microRNA-disease associations by combinatorial prioritization algorithm. Sci Rep 2017;7:43792.

28. Chen X, Yan CC, Zhang X, et al. HGIMDA: heterogeneous graph inference for miRNA-disease association prediction. Oncotarget 2016;7:65257–69.

29. You ZH, Huang ZA, Zhu Z, et al. PBMDA: a novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol 2017;13:e1005455.

30. Chen X, Qu J, Yin J. TLHNMDA: triple layer heterogeneous network based inference for MiRNA-disease association prediction. Front Genet 2018;9:234.

31. Chen X, Yin J, Qu J, et al. MDHGI: matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput Biol 2018;14:e1006418.

32. Xuan P, Han K, Guo M, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS One 2013;8:e70204.

33. Chen X, Yan GY. Semi-supervised learning for potential human microRNA-disease associations inference. Sci Rep 2014;4:5501.

34. Chen X, Yan CC, Zhang X, et al. RBMMMDA: predicting multiple types of disease-microRNA associations. Sci Rep 2015;5:13877.

35. Pasquier C, Gardes J. Prediction of miRNA-disease associations with a vector space model. Sci Rep 2016;6:27036.

36. Li JQ, Rong ZH, Chen X, et al. MCMDA: matrix completion for MiRNA-disease association prediction. Oncotarget 2017;8:21187–99.

37. Chen X, Wu QF, Yan GY. RKNNMDA: ranking-based KNN for MiRNA-disease association prediction. RNA Biol 2017;14:952–62.

38. Chen X, Huang L, Xie D, et al. EGBMMDA: extreme gradient boosting machine for MiRNA-disease association prediction. Cell Death Dis 2018;9:3.

39. Zhu C-C, Wang C-C, Zhao Y, et al. Identification of miRNA–disease associations via multiple information integration with Bayesian ranking. Brief Bioinform 2021;22:bbab302.

40. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.

41. Bahi M, Batouche M. Deep semi-supervised learning for DTI prediction using large datasets and H2O-spark platform. In: 2018 International Conference on Intelligent Systems and Computer Vision (ISCV). Fez, Morocco, 2018, p. 1–7. IEEE, New York, NY, USA.

42. Li Y, Qiu C, Tu J, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res 2014;42:D1070–4.

43. Yang Z, Ren F, Liu C, et al. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics 2010;11:S5.

44. Jiang Q, Wang Y, Hao Y, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 2009;37:D98–104.

45. Ma L. Determinants of breast cancer progression. Sci Transl Med 2014;6:243fs225.

46. Elmore JG, Armstrong K, Lehman CD, et al. Screening for breast cancer. JAMA 2005;293:1245–56.

47. Mulrane L, McGee SF, Gallagher WM, et al. miRNA dysregulation in breast cancer. Cancer Res 2013;73:6554–62.

48. Heneghan HM, Miller N, Lowery AJ, et al. Circulating microRNAs as novel minimally invasive biomarkers for breast cancer. Ann Surg 2010;251:499–505.

49. Siegel RL, Miller KD, Fuchs HE, et al. Cancer statistics, 2021. CA Cancer J Clin 2021;71:7–33.

50. Hirsch FR, Scagliotti GV, Mulshine JL, et al. Lung cancer: current therapies and new targeted treatments. Lancet 2017;389:299–311.

51. Hayashita Y, Osada H, Tatematsu Y, et al. A polycistronic microRNA cluster, miR-17-92, is overexpressed in human lung cancers and enhances cell proliferation. Cancer Res 2005;65:9628–32.

52. Cui R, Meng W, Sun HL, et al. MicroRNA-224 promotes tumor progression in nonsmall cell lung cancer. Proc Natl Acad Sci U S A 2015;112:E4288–97.

53. Pennathur A, Gibson MK, Jobe BA, et al. Oesophageal carcinoma. Lancet 2013;381:400–12.

54. El-Serag HB, Sweet S, Winchester CC, et al. Update on the epidemiology of gastro-oesophageal reflux disease: a systematic review. Gut 2014;63:871–80.

55. Sakai NS, Samia-Aly E, Barbera M, et al. A review of the current understanding and clinical utility of miRNAs in esophageal cancer. Semin Cancer Biol 2013;23:512–21.

56. Li B, Xu WW, Han L, et al. MicroRNA-377 suppresses initiation and progression of esophageal cancer by inhibiting CD133 and VEGF. Oncogene 2017;36:3986–4000.

57. Hong L, Han Y, Zhang H, et al. The prognostic and chemotherapeutic value of miR-296 in esophageal squamous cell carcinoma. Ann Surg 2010;251:1056–63.

58. Chen X, Yan CC, Zhang X, et al. Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform 2016;18:558–76.

59. Wang C-C, Han C-D, Zhao Q, et al. Circular RNAs and complex diseases: from experimental results to computational models. Brief Bioinform 2021;22:bbab286.

60. Hu L, Wang X, Huang Y-A, et al. A survey on computational models for predicting protein–protein interactions. Brief Bioinform 2021;22:bbab036.

61. Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics 2019;35:4730–8.

62. Hu L, Zhang J, Pan X, et al. HiSCF: leveraging higher-order structures for clustering analysis in biological networks. Bioinformatics 2020;37:542–50.

63. Hu L, Chan KCC, Yuan X, et al. A variational Bayesian framework for cluster analysis in a complex network. IEEE Trans Knowl Data Eng 2020;32:2115–28.

64. Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 2010;26:1644–50.

65. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 2011;27:3036–43.

66. Bengio Y, Lamblin P, Popovici D, et al. Greedy layer-wise training of deep networks. In: Advances in neural information processing systems, 2007, 153–60.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)