Abstract

Emerging evidence suggests that circular RNA (circRNA) is an important regulator of a variety of pathological processes and serves as a promising biomarker for many complex human diseases. Nevertheless, there are relatively few known circRNA–disease associations, and uncovering new circRNA–disease associations by wet-lab methods is time consuming and costly. Considering the limitations of existing computational methods, we propose a novel approach named MNMDCDA, which combines high-order graph convolutional networks (high-order GCNs) and deep neural networks to infer associations between circRNAs and diseases. Firstly, we computed different biological attribute information of circRNA and disease separately and used them to construct multiple multi-source similarity networks. Then, we used the high-order GCN algorithm to learn feature embedding representations with high-order mixed neighborhood information of circRNA and disease from the constructed multi-source similarity networks, respectively. Finally, the deep neural network classifier was implemented to predict associations of circRNAs with diseases. The MNMDCDA model obtained AUC scores of 95.16%, 94.53%, 89.80% and 91.83% on four benchmark datasets, i.e., CircR2Disease, CircAtlas v2.0, Circ2Disease and CircRNADisease, respectively, using the 5-fold cross-validation approach. Furthermore, 25 of the top 30 circRNA–disease pairs with the best scores of MNMDCDA in the case study were validated by recent literature. Numerous experimental results indicate that MNMDCDA can be used as an effective computational tool to predict circRNA–disease associations and can provide the most promising candidates for biological experiments.

Introduction

Circular RNA (circRNA), a novel member of the noncoding cancer genome, is a single-stranded endogenous non-coding RNA (ncRNA) molecule with a continuous loop structure [1, 2], which is generated by a back-splicing event between a downstream 5′ splice site and an upstream 3′ splice site [3, 4]. CircRNA was first discovered by Sanger and colleagues in the 1970s, who observed circRNA in Sendai viruses and plant-infecting viruses using electron microscopy [5, 6]. Subsequently, in the 1990s, endogenous circRNAs were first identified in human cells, and large and abundant circRNAs were also found in the Sry gene of the mouse [7, 8]. Despite their early discovery, circRNAs were long dismissed as by-products of ‘splicing noise’ or aberrant splicing and attracted little attention from researchers. It was not until 2013, when two research papers on circRNAs were published in Nature, that people began to really understand them [3, 9]. These studies demonstrated that circRNA is not a ‘splicing by-product’ of messenger RNA but a class of RNA molecules with significant biological functions in cells. Since then, the development of computational biology methods and deep sequencing technologies has made circRNA a frontier of research in the RNA field, which is crucial for revealing new functions and potential roles of circRNAs [10, 11].

Recently, computational approaches have emerged as an effective strategy to infer circRNA–disease associations to overcome the inherent shortcomings of wet-lab approaches [12]. Moreover, prioritizing the most promising circRNAs for candidate diseases by computational methods will also help to discover their molecular behavior in the identification of viral circRNAs as well as in carcinogenesis [13]. For instance, Xiao et al. [14] proposed a computational model called iCDA-CMG (identifying circRNA-disease associations-collective matrix completion with graph learning) based on collective matrix completion and graph learning algorithm to identify circRNA–disease associations. Zheng et al. [15] developed an approach with chaos game representation and support vector machine (SVM) to infer unobserved associations between circRNAs and diseases. Yang et al. [16] employed accelerated attribute network embedding and stacked auto-encoder algorithms to obtain feature representations of circRNA and disease and then used XGBoost classifier to obtain prediction results. Lu et al. [17] designed a CDASOR model, which adopted a convolutional neural network coupled with a bidirectional long short-term memory network to discover the underlying circRNAs with target diseases.

Although the above models have their own advantages and have achieved encouraging results, they still have some problems: (1) Existing models are based on incompletely correlated biological information, with redundancy and noise in the data, and they do not sufficiently fuse circRNA similarity and disease similarity. (2) The circRNA–disease association data currently validated by wet-lab experiments are limited. The constructed circRNA–disease association networks are relatively sparse, with many false positive and false negative associations among their descriptors, and thus the prediction models cannot be fully trained. (3) Most of the existing models were proposed based on one dataset only, and the generalization performance of the prediction models was not verified on other circRNA–disease association datasets.

To overcome these problems, we present a novel computational model called MNMDCDA that fuses mixed neighborhood information of circRNAs and diseases from multi-source similarity networks, which combines a high-order graph convolutional network (high-order GCN) and a deep neural network (DNN) to predict circRNA–disease associations. This model can use high-order GCN to overcome the problem that the original GCN cannot obtain high-order neighborhood information of each node from its neighbors at different distances in the network. It can learn the high-order mixed neighborhood embedding representations of circRNAs and diseases in a specific way.

Specifically, in the first step, we calculated the similarities for circRNA pairs and disease pairs and constructed 12 multi-source similarity networks by integrating various biological data information from four gold standard datasets. In the second step, based on each multi-source similarity network, we extracted the low-dimensional feature representations of circRNAs and diseases using the high-order GCN algorithm to learn the higher-order mixed neighborhood information of each node. In the third step, we introduced a DNN as a binary classifier to accurately identify the potential associations between circRNAs and diseases. The framework of MNMDCDA is shown in Figure 1.

Figure 1

The framework of MNMDCDA model for predicting the potential circRNA–disease associations.

In brief, the main contributions of the MNMDCDA model are as follows:

  • (i) We comprehensively utilize more attribute information of circRNAs and diseases and construct 12 multi-source similarity networks. Fusing these attributes from different perspectives better describes the biological information of circRNAs and diseases.

  • (ii) We obtained higher-order neighborhood embedding representations of these attribute information from the network using the higher-order GCN algorithm and extracted advanced features. Thus, the hidden information contained in these networks is mined as much as possible to fully train the MNMDCDA model.

  • (iii) The MNMDCDA model was tested on three other benchmark datasets with the same experiments to further verify the generalization performance of the model. Furthermore, 25 of the top 30 disease-related circRNAs predicted by the MNMDCDA model in case studies have been validated by the latest published literature.

Materials

Gold standard dataset

To construct a high-reliability data source to evaluate the effectiveness of the MNMDCDA model, we employed currently available experimentally validated CircR2Disease dataset [18] as the gold standard dataset to assess the performance of the proposed model, which can be represented as follows:
$$D = D^{+} \cup D^{-} \tag{1}$$
where D+ and D− are the positive and negative sample sets, respectively, and ∪ denotes the union of the two sets. Taking the CircAtlas v2.0 dataset as an example, there are 846 experimentally validated circRNA–disease association pairs involving 776 circRNAs and 117 diseases in the original bipartite graph, which were used as positive samples. The corresponding bipartite graph contains 776 × 117 = 90792 possible connections, of which 776 × 117 − 846 = 89946 are circRNA–disease pairs without experimental validation, and all of these unknown pairs can be used to construct negative samples. To avoid an unbalanced dataset that could bias the experimental results, we randomly selected the same number of unknown circRNA–disease pairs as negative samples to construct a balanced dataset. Although this approach may include truly associated but unconfirmed circRNA–disease pairs among the negative samples, the selected negative samples account for only 846 ÷ (776 × 117) ≈ 0.93% of all circRNA–disease pairs, so the resulting bias is negligible from the machine learning and probability perspectives. Thus, we obtained 1692 samples from the CircAtlas v2.0 dataset, half positive and half negative. In this experiment, we first obtained 725 circRNA–disease associations involving 676 circRNAs and 100 diseases from the CircR2Disease dataset. Second, we constructed the adjacency matrix R2DM of the gold standard dataset. When circRNA c(i) is not associated with disease d(j), the element R2DM(i,j) of R2DM is set to 0; otherwise, it is set to 1. Finally, we constructed the circRNA–disease association network CDN from the adjacency matrix R2DM.
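The sampling and adjacency-matrix steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the toy pairs and the fixed random seed are assumptions introduced here, and only the CircAtlas v2.0 counts at the end are taken from the text.

```python
import random

def build_balanced_samples(positive_pairs, n_circrnas, n_diseases, seed=0):
    """Return (positives, negatives) with the same number of pairs in each set."""
    positives = set(positive_pairs)
    # Every circRNA-disease pair without a validated association is a candidate negative.
    unknown = [(i, j) for i in range(n_circrnas) for j in range(n_diseases)
               if (i, j) not in positives]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    negatives = rng.sample(unknown, len(positives))
    return sorted(positives), sorted(negatives)

def adjacency_matrix(positive_pairs, n_circrnas, n_diseases):
    """R2DM[i][j] = 1 if circRNA i is associated with disease j, otherwise 0."""
    matrix = [[0] * n_diseases for _ in range(n_circrnas)]
    for i, j in positive_pairs:
        matrix[i][j] = 1
    return matrix

# Hypothetical toy data: 4 circRNAs, 3 diseases, 3 validated associations.
toy_pos = [(0, 0), (1, 2), (3, 1)]
pos, neg = build_balanced_samples(toy_pos, n_circrnas=4, n_diseases=3)
r2dm = adjacency_matrix(toy_pos, 4, 3)

# Counts quoted in the text for CircAtlas v2.0.
total_pairs = 776 * 117
unknown_pairs = total_pairs - 846
```

Drawing negatives without replacement from the unknown pairs keeps the positive and negative sets disjoint, which is the property the balanced dataset relies on.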

Disease feature construction

Gaussian interaction profile kernel-based disease similarity

Since similar circRNAs and diseases usually exhibit similar interaction patterns, the hypothesis for this similarity is that similar diseases are more likely to be associated with functionally similar circRNAs [19]. Therefore, we used Gaussian interaction profile kernel (GIPK) similarity to construct the similarity model of diseases based on the known circRNA–disease association adjacency matrix. In this experiment, we defined the binary vector V(d(i)) to represent the interaction profiles of disease d(i). When the disease d(i) is not associated with a particular circRNA, the corresponding position of the circRNA in the binary vector V(d(i)) is set to 0; otherwise, it is set to 1. Thus, the GIPK similarity DGIP(d(i),d(j)) of disease d(i) and disease d(j) can be measured based on the following equations:
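$$\mathrm{DGIP}(d(i), d(j)) = \exp\left(-\theta_d \left\lVert V(d(i)) - V(d(j)) \right\rVert^{2}\right) \tag{2}$$
$$\theta_d = \frac{1}{\frac{1}{m}\sum_{k=1}^{m} \left\lVert V(d(k)) \right\rVert^{2}} \tag{3}$$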
where θd is the regularization parameter controlling the kernel bandwidth, and m is the number of diseases in the circRNA–disease association adjacency matrix R2DM.
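As a rough sketch of how the GIPK similarity can be computed from an adjacency matrix, the NumPy snippet below evaluates the Gaussian kernel for all pairs at once. It assumes the commonly used normalization in which the bandwidth is the reciprocal of the mean squared norm of the interaction profiles; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def gipk_similarity(profiles):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumes at least one profile is non-zero, otherwise the bandwidth is undefined.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Assumed bandwidth: inverse of the mean squared norm of the profiles.
    theta = 1.0 / sq_norms.mean()
    # Squared Euclidean distance between every pair of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-theta * sq_dist)

# Hypothetical toy adjacency matrix: rows are circRNAs, columns are diseases.
r2dm = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 1, 1]])
dgip = gipk_similarity(r2dm.T)  # disease profiles are the columns
cgip = gipk_similarity(r2dm)    # circRNA profiles are the rows
```

The same function serves both entity types because only the orientation of the adjacency matrix changes.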

Medical subject heading-based disease semantic similarity

In this study, we used the Medical Subject Headings (MeSH) database [20, 21] to construct semantic similarity of diseases. MeSH is an authoritative, extensible biomedical subject heading vocabulary that provides a rigorous classification of all diseases, which helps to calculate the semantic similarity of diseases. MeSH is available at https://www.nlm.nih.gov/. According to previous studies, we can use the semantic information of the MeSH database to construct a directed acyclic graph (DAG) that reflects the relationships among various diseases. In the DAG, the nodes represent diseases and the directed edges indicate the relationships between diseases.

For a disease d, its DAG can be denoted as DAGd=(d,Nd,Ed), where Nd indicates the set of diseases associated with d, including the node set of the disease d itself and its ancestor nodes, and Ed is the set of links in the subgraph, indicating the relationship between these diseases. Assuming that there is a disease s in the DAGd, the semantic contribution value Dd(s) of disease s to disease d in this DAG can be calculated by the following equation:
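$$D_d(s) = \begin{cases} 1, & s = d \\ \max\left\{\mu \cdot D_d(s') \mid s' \in \text{children of } s\right\}, & s \neq d \end{cases} \tag{4}$$
where μ=0.5 denotes the semantic contribution decay parameter of edges linking disease s and its child disease s′ in Ed [5]. The semantic value DV(d) of disease d is then obtained by accumulating the semantic contribution values of all nodes in the disease set Nd:
$$\mathrm{DV}(d) = \sum_{s \in N_d} D_d(s) \tag{5}$$
Given two diseases d(i) and d(j), we can calculate their semantic similarity DS(d(i), d(j)) by combining disease terms and MeSH-based hierarchical structure information using the following formula:
$$\mathrm{DS}(d(i), d(j)) = \frac{\sum_{s \in N_{d(i)} \cap N_{d(j)}} \left(D_{d(i)}(s) + D_{d(j)}(s)\right)}{\mathrm{DV}(d(i)) + \mathrm{DV}(d(j))} \tag{6}$$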
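The DAG-based computation can be illustrated with a small pure-Python sketch. It assumes the standard formulation in which a disease contributes 1 to itself and each ancestor receives the largest decayed contribution over all child paths; the hierarchy below is a made-up toy graph, not real MeSH data, and the function names are hypothetical.

```python
def semantic_contributions(disease, parents, mu=0.5):
    """Map every node of DAG_d (d and its ancestors) to its contribution D_d(s)."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = mu * contrib[node]
            # Keep the largest contribution over all child paths.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_value(contrib):
    """DV(d): sum of the contributions of all nodes in DAG_d."""
    return sum(contrib.values())

def semantic_similarity(d1, d2, parents, mu=0.5):
    """DS(d1, d2): shared contributions divided by the two semantic values."""
    c1 = semantic_contributions(d1, parents, mu)
    c2 = semantic_contributions(d2, parents, mu)
    shared = c1.keys() & c2.keys()
    numerator = sum(c1[s] + c2[s] for s in shared)
    return numerator / (semantic_value(c1) + semantic_value(c2))

# Hypothetical toy hierarchy: each key maps a term to its parent terms.
parents = {
    "d1": ["p1", "p2"],
    "d2": ["p2"],
    "p1": ["root"],
    "p2": ["root"],
}
similarity = semantic_similarity("d1", "d2", parents)
```

In this toy graph the two diseases share the ancestors p2 and root, so the similarity is (0.5 + 0.5 + 0.25 + 0.25) / (2.25 + 1.75) = 0.375.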

Disease Ontology-based disease semantic similarity

The Disease Ontology (DO) [22] can be organized as a DAG so that the semantic similarities among diseases can be computed based on their corresponding DO terms. The DO term for each disease is retrieved from http://disease-ontology.org/. We then measure the semantic similarity between two diseases following Wang’s method; the detailed calculation steps are described in the literature [23]. To distinguish it from the MeSH-based similarity, we use DSDO(d(i),d(j)) to represent the DO-based semantic similarity between two diseases d(i) and d(j).

Cosine similarity of disease

The cosine similarity is usually employed to express the distinction or similarity among finite sample sets [22]. Thus, we use cosine similarity to measure the similarity between diseases. Specifically, for diseases, we construct the cosine similarity model DCos(d(i),d(j)) to represent the similarity between disease d(i) and disease d(j), based on their associated circRNA information, which is calculated as follows:
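$$\mathrm{DCos}(d(i), d(j)) = \frac{V(d(i)) \cdot V(d(j))}{\left\lVert V(d(i)) \right\rVert \, \left\lVert V(d(j)) \right\rVert} \tag{7}$$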
Here, V(d(i)) and V(d(j)) denote the i-th and j-th columns of the adjacency matrix R2DM, respectively. ||V(d(i))|| and ||V(d(j))|| indicate the Euclidean norm of the vectors V(d(i)) and V(d(j)), respectively.
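A minimal NumPy sketch of the cosine similarity over interaction profiles is given below. Assigning zero similarity to an entity with no known association is a choice made here to avoid division by zero; it is not stated in the text. The function name and toy matrix are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(profiles):
    """Cosine similarity between every pair of rows in `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1)
    # Assumed convention: an all-zero profile gets similarity 0 with everything.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = profiles / safe_norms[:, None]
    return unit @ unit.T

# Hypothetical toy adjacency matrix: rows are circRNAs, columns are diseases.
r2dm = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 1, 1]])
dcos = cosine_similarity_matrix(r2dm.T)  # diseases: columns of R2DM
ccos = cosine_similarity_matrix(r2dm)    # circRNAs: rows of R2DM
```

Transposing the adjacency matrix switches between disease profiles (columns) and circRNA profiles (rows), so one function covers both cases.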

CircRNA feature construction

GIPK-based circRNA similarity

Similar to the disease GIPK similarity, we can give the binary vector V(c(i)) to represent the interaction profile of circRNA c(i) from the association between circRNA c(i) and diseases in the adjacency matrix R2DM. The GIPK similarity CGIP(c(i),c(j)) of circRNA c(i) and circRNA c(j) can be measured based on the following equations:
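$$\mathrm{CGIP}(c(i), c(j)) = \exp\left(-\theta_c \left\lVert V(c(i)) - V(c(j)) \right\rVert^{2}\right) \tag{8}$$
$$\theta_c = \frac{1}{\frac{1}{n}\sum_{k=1}^{n} \left\lVert V(c(k)) \right\rVert^{2}} \tag{9}$$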
where θc is the regularization parameter controlling the kernel bandwidth, and n is the number of circRNAs in the matrix R2DM.

CircRNA functional similarity

According to the hypothesis that circRNAs sharing semantically similar disease groups are more likely to be functionally similar as well [24], we can measure the functional similarity between two circRNAs. In particular, let d(i) and d(j) be two different diseases, and let D(i) and D(j) denote the disease groups associated with circRNA c(i) and circRNA c(j), respectively. Denoting the functional similarity matrix of circRNAs by CFS, the functional similarity CFS(c(i),c(j)) between circRNA c(i) and circRNA c(j) can be calculated by the following equations:
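$$S(d, D) = \max_{d' \in D} \mathrm{DS}(d, d') \tag{10}$$
$$\mathrm{CFS}(c(i), c(j)) = \frac{\sum_{d \in D(i)} S(d, D(j)) + \sum_{d \in D(j)} S(d, D(i))}{|D(i)| + |D(j)|} \tag{11}$$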
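The functional similarity can be sketched as below, assuming the widely used best-match-average formulation: each disease in one group is matched to its most similar disease in the other group, and the matches are averaged over both groups. Whether this matches the authors' exact equations is an assumption; the function names and the toy similarity matrix are hypothetical.

```python
def group_similarity(disease, group, ds):
    """Best semantic match between one disease and a group of diseases."""
    return max(ds[disease][other] for other in group)

def functional_similarity(group_i, group_j, ds):
    """Best-match average between the disease groups of two circRNAs."""
    if not group_i or not group_j:
        return 0.0  # assumed convention for circRNAs with no known disease
    total = sum(group_similarity(d, group_j, ds) for d in group_i)
    total += sum(group_similarity(d, group_i, ds) for d in group_j)
    return total / (len(group_i) + len(group_j))

# Hypothetical disease semantic similarity matrix for three diseases.
ds = [[1.0, 0.4, 0.2],
      [0.4, 1.0, 0.6],
      [0.2, 0.6, 1.0]]
cfs = functional_similarity([0, 1], [1, 2], ds)
```

Here the best matches are 0.4 and 1.0 for the first group and 1.0 and 0.6 for the second, giving 3.0 / 4 = 0.75.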

Cosine similarity of circRNA

Likewise, for circRNAs, we construct the cosine similarity model CCos(c(i),c(j)) to represent the similarity between circRNA c(i) and circRNA c(j), based on their associated disease information, which is calculated as follows:
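$$\mathrm{CCos}(c(i), c(j)) = \frac{V(c(i)) \cdot V(c(j))}{\left\lVert V(c(i)) \right\rVert \, \left\lVert V(c(j)) \right\rVert} \tag{12}$$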
Here, V(c(i)) and V(c(j)) denote the i-th row and j-th row of the adjacency matrix R2DM, respectively.

Multi-similarity matrix fusion

To fully utilize the information from different sources, we adopted a multi-similarity matrix fusion method to fuse circRNA similarity information and disease similarity information to realize feature complementation. The advantage of the fused information is that it not only reduces the potential shortcomings caused by single features but also absorbs the characteristics of different data sources.

For circRNA, we can construct the fused circRNA similarity information CFus by the following strategy. Specifically, if there is the functional similarity between circRNA c(i) and circRNA c(j), then we use the circRNA functional similarity matrix to construct the fusion similarity descriptor CFus(c(i),c(j)); otherwise, we use circRNA GIPK similarity to represent the similarity between circRNA c(i) and c(j). This construction strategy of fusion similarity for circRNAs can be expressed as follows:
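$$\mathrm{CFus}(c(i), c(j)) = \begin{cases} \mathrm{CFS}(c(i), c(j)), & \text{if } c(i) \text{ and } c(j) \text{ have functional similarity} \\ \mathrm{CGIP}(c(i), c(j)), & \text{otherwise} \end{cases} \tag{13}$$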
Similarly, for diseases, we can construct the fused disease similarity information DFus by the following strategy. If there is the semantic similarity between disease d(i) and disease d(j), the fused disease similarity descriptor DFus(d(i),d(j)) is constructed by employing the disease semantic similarity matrix; otherwise, the disease GIPK similarity is constructed to represent the similarity between diseases d(i) and d(j). This construction strategy of fused disease similarity can be described by the following equation:
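$$\mathrm{DFus}(d(i), d(j)) = \begin{cases} \mathrm{DS}(d(i), d(j)), & \text{if } d(i) \text{ and } d(j) \text{ have semantic similarity} \\ \mathrm{DGIP}(d(i), d(j)), & \text{otherwise} \end{cases} \tag{14}$$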
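The fusion rule for both circRNAs and diseases amounts to an element-wise fallback, which can be sketched in NumPy as follows. Treating a non-zero entry as "similarity exists" is an assumption made for this illustration, and the function name and matrices are hypothetical.

```python
import numpy as np

def fuse_similarity(primary, fallback):
    """Use `primary` where it is non-zero, otherwise fall back to the GIPK value."""
    primary = np.asarray(primary, dtype=float)
    fallback = np.asarray(fallback, dtype=float)
    return np.where(primary != 0, primary, fallback)

# Hypothetical 2x2 example: functional similarity is missing off the diagonal.
cfs = np.array([[1.0, 0.0],
                [0.0, 1.0]])
cgip = np.array([[1.0, 0.3],
                 [0.3, 1.0]])
cfus = fuse_similarity(cfs, cgip)
```

The same call with the disease semantic similarity as `primary` and the disease GIPK similarity as `fallback` yields the fused disease matrix.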

Finally, the fused circRNA similarity network corresponding to the matrix CFus(c(i),c(j)) is denoted CN, while the fused disease similarity network corresponding to the matrix DFus(d(i),d(j)) is denoted DN.

Feature embedding of high-order GCNs

After obtaining the fusion feature descriptors of circRNA and disease, we use the high-order GCN [25] to extract the low-dimensional feature embedding representations of circRNA and disease from the multi-source similarity network, respectively. To clearly understand the high-order GCN, we first introduce the original GCN proposed by Kipf and Welling [26], which can be elegantly summarized by the following expression:
$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right) \tag{15}$$
where $\hat{A} = D^{-1/2}(A+I)D^{-1/2}$ is the symmetrically normalized adjacency matrix of the graph with self-connections. Here, A is the adjacency matrix, I is the identity matrix with the same size as A, and D is the diagonal degree matrix of (A + I). H(l) and H(l + 1) are the input and output activation matrices, which represent the row-wise embedding of the graph vertices in the l-th and (l + 1)-th layers, respectively. W(l) is a trainable weight matrix of the l-th layer, and σ is a nonlinear activation function. Thus, a GCN model with L layers can be expressed as follows:
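$$H^{(L)} = \sigma\left(\hat{A}\,\sigma\left(\cdots\,\sigma\left(\hat{A} H^{(0)} W^{(0)}\right)\cdots\right) W^{(L-1)}\right) \tag{16}$$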
However, the original GCN is susceptible to the over-smoothing problem because it focuses only on the first-order neighborhood information of each node, which limits its ability to capture remote dependencies among nodes from distant but informative nodes. Currently, it has been shown that better node feature representations can be learned by fusing mixed neighborhood information, which usually helps to improve prediction abilities for downstream tasks including link prediction and node classification [27]. High-order GCN mainly considers the neighborhood information of circRNAs or diseases at different distances, thus capturing the high-order features of biological networks and learning the linear mixture of features in multi-distance neighborhoods. The high-order GCN-based algorithm can be defined as follows:
$$H^{(l+1)} = \Big\Vert_{j \in P} \, \sigma\left(\hat{A}^{j} H^{(l)} W_{j}^{(l)}\right) \tag{17}$$
where the hyperparameter P is a set of integer adjacency powers of $\hat{A} = D^{-1/2}(A+I)D^{-1/2}$, P = {0, 1, …, p}, and p is the maximum order of the neighborhood considered by each high-order GCN layer for information propagation. Here, σ is the Rectified Linear Unit (ReLU) activation function, and ‖ denotes the column-wise concatenation of the neighborhood information of different orders from the circRNA or disease embedding representations.
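To make the mixed-neighborhood propagation concrete, the NumPy sketch below implements a single high-order layer: it applies successive powers of the normalized adjacency matrix, gives each power its own weight matrix, applies ReLU and concatenates the results column-wise. The graph, feature sizes and random weights are hypothetical, and no training is shown.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-connections."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def high_order_gcn_layer(a_hat, h, weights):
    """One layer; weights[j] is applied to the j-hop term for j = 0..p."""
    outputs = []
    propagated = h  # corresponds to A_hat^0 @ H
    for w in weights:
        outputs.append(np.maximum(propagated @ w, 0.0))  # ReLU
        propagated = a_hat @ propagated  # move to the next power of A_hat
    return np.concatenate(outputs, axis=1)

# Hypothetical toy similarity graph: a path over four nodes.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
a_hat = normalized_adjacency(adj)
h = np.eye(4)                                          # one-hot input features
weights = [rng.normal(size=(4, 2)) for _ in range(3)]  # p = 2, so j = 0, 1, 2
out = high_order_gcn_layer(a_hat, h, weights)
```

With p = 2 and two output columns per order, each node ends up with a 6-dimensional embedding that mixes its own features with its 1-hop and 2-hop neighborhoods.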
During the high-order GCN training, we use binary cross-entropy loss to optimize the model parameters:
$$\mathcal{L}(c_i, d_j) = -A_{ij}\log p_{ij} - \left(1 - A_{ij}\right)\log\left(1 - p_{ij}\right) \tag{18}$$
where (ci,dj) denotes the training pair of circRNA ci and disease dj, Aij denotes the ground truth association label between these nodes of circRNA and disease, and pij denotes the predicted association probability between circRNA ci and disease dj. Thus, the final loss function considered for all associations between circRNA and disease is as follows:
$$\mathcal{L} = \sum_{(c_i, d_j) \in Tr^{+} \cup Tr^{-}} \mathcal{L}(c_i, d_j) \tag{19}$$
where Tr+ and Tr− represent the positive and negative samples used in the training process, respectively, and ∪ denotes the union of the two sets.
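The binary cross-entropy objective can be illustrated with a few lines of Python. Summing (rather than averaging) the per-pair losses and clipping probabilities for numerical safety are assumptions of this sketch; the function names are hypothetical.

```python
import math

def pair_loss(label, prob, eps=1e-12):
    """Binary cross-entropy for one circRNA-disease pair."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def total_loss(positive_probs, negative_probs):
    """Sum of the pair losses over the positive and negative training sets."""
    loss = sum(pair_loss(1, p) for p in positive_probs)
    loss += sum(pair_loss(0, p) for p in negative_probs)
    return loss

# Hypothetical predicted probabilities for two positive and two negative pairs.
example = total_loss([0.9, 0.6], [0.2, 0.4])
```

A confident correct prediction contributes little to the loss, while a confident wrong one is penalized heavily, which is what drives the embeddings towards separating associated from non-associated pairs.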

Deep neural network

After obtaining representative features of circRNA–disease pairs using high-order GCN, we utilized the DNN supervised learning model to identify potential associations between circRNAs and diseases. We employed three fully connected layers in the neural network in this study. In the hidden layer of the DNN, each neuron in layer i + 1 is connected to all neurons in layer i. Each hidden layer can be computed by the following equation:
$$h^{(i+1)} = f\left(W^{(i)} h^{(i)} + b^{(i)}\right) \tag{20}$$
where h(i) is the output of layer i, W(i) and b(i) are the weight matrix and bias vector connecting layer i to layer i + 1, and f is the activation function.

In the input and hidden layers, we employed the ReLU [28] function (f(x) = max(0,x)) as the activation function of the model. In the output layer, we employed the Sigmoid [29] function (f(x) = 1/(1 + e^(−x))) as the activation function to obtain the probability score of each circRNA–disease pair, which was used to estimate the probability of association between circRNA and disease. The higher the score, the more likely the circRNA is associated with the disease.

We used the binary cross-entropy as the loss function to judge whether the model is good or bad for the prediction results. In addition, to accelerate the training process and avoid overfitting, the Adam algorithm [30] is used to optimize the binary cross-entropy loss, and the Dropout technique [31] is also used in the input and hidden layers to further avoid overfitting of the proposed model.
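A forward pass through such a classifier can be sketched in NumPy as follows. The layer widths (256, 256, 1) follow the settings reported in the experiments, whereas the input dimension, the weight initialization and the inverted-dropout formulation are assumptions of this illustration; Adam training is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(sizes, rng):
    """One (weights, bias) pair per fully connected layer (assumed He-style init)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, params, dropout_rate=0.0, rng=None):
    """ReLU in the first two layers, sigmoid in the output layer."""
    h = x
    for k, (w, b) in enumerate(params):
        if dropout_rate > 0.0 and rng is not None:
            # Inverted dropout on the input of each layer (training mode only).
            mask = rng.random(h.shape) >= dropout_rate
            h = h * mask / (1.0 - dropout_rate)
        z = h @ w + b
        h = sigmoid(z) if k == len(params) - 1 else relu(z)
    return h[:, 0]

rng = np.random.default_rng(0)
params = init_params([64, 256, 256, 1], rng)  # 64 input features is hypothetical
x = rng.normal(size=(5, 64))                  # five circRNA-disease pair vectors
scores = forward(x, params)                   # inference: dropout disabled
train_scores = forward(x, params, dropout_rate=0.5, rng=rng)
```

Each score lies between 0 and 1 and is read as the predicted probability that the corresponding circRNA–disease pair is associated.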

Experimental results

Evaluation indicators

In the experiment, we introduced five evaluation metrics, namely, accuracy (Acc.), sensitivity (Sen.), precision (Pre.), F1-score (F1) and Matthews correlation coefficient (MCC), as evaluation criteria to measure the prediction performance of the proposed MNMDCDA model [32], which are defined as follows:
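$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN} \tag{21}$$
$$\mathrm{Sen.} = \frac{TP}{TP + FN} \tag{22}$$
$$\mathrm{Pre.} = \frac{TP}{TP + FP} \tag{23}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Pre.} \times \mathrm{Sen.}}{\mathrm{Pre.} + \mathrm{Sen.}} \tag{24}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{25}$$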
where TP and FP are true positive and false positive, indicating the number of correctly predicted positive samples and the number of incorrectly predicted positive samples, respectively. TN and FN are true negative and false negative, denoting the number of correctly predicted negative samples and the number of incorrectly predicted negative samples, respectively. Additionally, we plotted the receiver operating characteristic (ROC) curve [33] and calculated the area under the ROC curve (AUC) [34] to clearly visualize the prediction performance of MNMDCDA.
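The five metrics can be computed directly from the confusion-matrix counts, as in the short sketch below. The counts in the example are illustrative values that are not reported in the source; they were back-calculated here so that the resulting percentages agree with the first fold of Table 2, assuming a test fold of 145 positive and 145 negative pairs.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, precision, F1-score and MCC from raw counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    pre = tp / (tp + fp)
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Sen": sen, "Pre": pre, "F1": f1, "MCC": mcc}

# Illustrative counts (assumed, not taken from the paper).
metrics = classification_metrics(tp=139, fp=20, tn=125, fn=6)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

MCC uses all four counts, so it stays informative even when accuracy is inflated by one class being predicted much more often than the other.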

Evaluate model performance

When training the high-order GCN, we set weight decay = 0.001, learning rate = 0.001, activation function = ReLU, number of neighbors = 20 and maximum order p = 4. For prediction, we used a three-layer DNN. The first and second layers each use 256 neurons with ReLU activation and a dropout rate of 0.5; the third layer uses 1 neuron with sigmoid activation. The Adam algorithm is used to optimize the binary cross-entropy loss function. The maximum order p of the high-order GCN determines the farthest distance from which a node can obtain mixed information from its neighbors during network learning, and it therefore greatly affects the performance of the prediction model. To achieve the best prediction performance, we tuned p over the values 0 to 8. The prediction results of the proposed model at different orders are given in Table 1. To visualize the prediction results, a line graph of the prediction performance of the proposed model at different orders is given in Figure 2. From these results, we can see that the proposed model obtains the highest AUC score of 95.16% at p = 4. Consequently, we selected the maximum order p = 4 for the high-order GCN in the experiments of this study.

Table 1

The prediction results of the model at different orders

| p | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AUC (%) | 91.90 | 92.69 | 93.19 | 94.18 | 95.16 | 94.40 | 94.36 | 93.79 | 93.81 |

Figure 2

Line graph of the prediction performance of the model at different orders.

In the experiment, we utilized the 5-fold cross-validation approach to evaluate the prediction performance of the proposed MNMDCDA model on CircR2Disease dataset. The detailed experimental results of the 5-fold cross-validation are summarized in Table 2. From Table 2, we can see that the MNMDCDA model obtained an average accuracy of 88.69%. The average experimental results of MNMDCDA on Sen., Pre., F1, MCC and AUC were 94.07%, 85.00%, 89.28%, 77.87% and 95.16%, respectively, with their corresponding standard deviations of 1.93%, 3.00%, 2.11%, 4.57% and 1.84%, respectively. In addition, we also plotted the ROC curves generated by the MNMDCDA method using 5-fold cross-validation on the CircR2Disease dataset, as shown in Figure 3.
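The 5-fold cross-validation protocol can be sketched in plain Python as below. The sample count of 1450 assumes that the 725 CircR2Disease associations are paired with an equal number of sampled negatives, as in the balanced construction described earlier; the shuffling seed and function name are hypothetical, and no stratification by class is applied in this sketch.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 725 positive + 725 negative samples (assumed balanced construction).
splits = list(k_fold_indices(1450, k=5))
fold_sizes = [len(test) for _, test in splits]
```

Each sample is used for testing exactly once and for training in the other four rounds; the reported averages and standard deviations are then taken over the five test folds.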

Table 2

Experimental results of the MNMDCDA model on CircR2Disease dataset

| Model | Testing set | Acc. (%) | Sen. (%) | Pre. (%) | F1 (%) | MCC (%) | AUC (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Our model | 1 | 91.03 | 95.86 | 87.42 | 91.45 | 82.45 | 97.79 |
| | 2 | 88.62 | 94.48 | 84.57 | 89.25 | 77.78 | 94.89 |
| | 3 | 88.97 | 95.86 | 84.24 | 89.68 | 78.68 | 95.65 |
| | 4 | 84.83 | 91.72 | 80.61 | 85.81 | 70.33 | 92.67 |
| | 5 | 90.00 | 92.41 | 88.16 | 90.24 | 80.09 | 94.79 |
| | Average | 88.69 | 94.07 | 85.00 | 89.28 | 77.87 | 95.16 |
| | Standard deviation | 2.36 | 1.93 | 3.00 | 2.11 | 4.57 | 1.84 |
| Cosine similarity model | 1 | 81.38 | 86.21 | 78.62 | 82.24 | 63.05 | 91.52 |
| | 2 | 85.52 | 88.28 | 83.66 | 85.91 | 71.14 | 94.11 |
| | 3 | 84.48 | 86.21 | 83.33 | 84.75 | 69.01 | 93.28 |
| | 4 | 82.76 | 82.07 | 83.22 | 82.64 | 65.52 | 89.65 |
| | 5 | 86.55 | 86.21 | 86.81 | 86.51 | 73.11 | 92.22 |
| | Average | 84.14 | 85.79 | 83.13 | 84.41 | 68.37 | 92.16 |
| | Standard deviation | 2.08 | 2.27 | 2.92 | 1.91 | 4.09 | 1.72 |
| DO-based disease semantic similarity model | 1 | 87.59 | 95.86 | 82.25 | 88.54 | 76.22 | 93.72 |
| | 2 | 87.93 | 95.86 | 82.74 | 88.82 | 76.83 | 94.09 |
| | 3 | 86.55 | 97.93 | 79.78 | 87.93 | 75.07 | 95.01 |
| | 4 | 83.79 | 87.59 | 81.41 | 84.39 | 67.78 | 91.60 |
| | 5 | 83.79 | 88.28 | 81.01 | 84.49 | 67.86 | 91.45 |
| | Average | 85.93 | 93.10 | 81.44 | 86.83 | 72.75 | 93.17 |
| | Standard deviation | 2.02 | 4.80 | 1.15 | 2.21 | 4.55 | 1.58 |
| DA model | Average | 67.66 | 73.10 | 65.93 | 69.32 | 35.53 | 70.61 |
| | Standard deviation | 1.85 | 2.34 | 1.71 | 1.80 | 3.71 | 3.30 |
| LR model | Average | 69.38 | 74.34 | 67.65 | 70.82 | 38.97 | 71.43 |
| | Standard deviation | 1.49 | 2.09 | 1.54 | 1.44 | 2.99 | 3.44 |
| NB model | Average | 66.07 | 54.34 | 70.73 | 61.39 | 33.01 | 73.37 |
| | Standard deviation | 5.01 | 7.85 | 5.12 | 6.87 | 9.88 | 4.85 |
| KNN model | Average | 82.21 | 92.69 | 76.66 | 83.91 | 65.88 | 92.08 |
| | Standard deviation | 2.78 | 2.16 | 2.73 | 2.39 | 5.43 | 1.63 |
| SVM model | Average | 84.97 | 86.76 | 83.86 | 85.25 | 70.04 | 94.37 |
| | Standard deviation | 1.73 | 1.65 | 3.05 | 1.44 | 3.37 | 1.27 |
| DT model | Average | 85.24 | 86.90 | 84.26 | 85.46 | 70.71 | 91.01 |
| | Standard deviation | 0.66 | 4.20 | 2.52 | 0.95 | 1.41 | 1.25 |
| AdaBoost model | Average | 86.41 | 94.62 | 81.27 | 87.41 | 73.93 | 92.26 |
| | Standard deviation | 2.13 | 4.35 | 1.11 | 2.25 | 4.77 | 1.76 |
| RF model | Average | 87.24 | 91.45 | 84.37 | 87.74 | 74.80 | 94.30 |
| | Standard deviation | 1.40 | 2.87 | 1.28 | 1.46 | 2.95 | 0.84 |

Figure 3

ROC curves of 5-fold cross-validation achieved by MNMDCDA on CircR2Disease dataset.

Comparison with cosine similarity model

In the MNMDCDA model, we used GIPK similarity to denote the correlation between circRNA and disease. Therefore, to verify whether GIPK similarity is beneficial to the prediction performance of the proposed model, we compared it with cosine similarity. To be fair, we only replaced GIPK similarity with cosine similarity, and the other parts of the model remained unchanged. The results are presented in Table 2. As shown in Table 2, the average values of Acc., Sen., Pre., F1, MCC and AUC obtained by the cosine similarity model were 4.55, 8.28, 1.87, 4.87, 9.50 and 3.00 percentage points lower than those of the MNMDCDA model, respectively. Figure 4 shows the ROC curves generated by the cosine similarity model on the CircR2Disease dataset. Figure 5 visualizes the experimental results of the cosine similarity model and the proposed model on the CircR2Disease dataset. From these results, it can be seen that the prediction performance of the MNMDCDA model is superior to that of the cosine similarity-based model on the same dataset.

Figure 4. ROC curves of 5-fold cross-validation achieved by cosine similarity model on CircR2Disease dataset.

Figure 5. Comparison of the proposed different combinatorial models on the CircR2Disease dataset.

Comparison with DO-based disease semantic similarity model

In the experiment, we used MeSH-based disease semantic similarity to represent the correlation between two diseases. To verify whether the MeSH-based disease semantic similarity benefits the prediction performance of the proposed model, we compared it with the DO-based disease semantic similarity. We performed the same 5-fold cross-validation experiment on the CircR2Disease dataset, and the results are shown in Table 2. The average values of Acc., Sen., Pre., F1, MCC and AUC obtained by the DO-based disease semantic similarity model were 2.76%, 0.97%, 3.56%, 2.45%, 5.12% and 1.99% lower than those of the MNMDCDA model, respectively. Figure 5 visualizes the experimental results of the DO-based disease semantic similarity model and the proposed model on the CircR2Disease dataset. Figure 6 shows the ROC curves generated by the DO-based disease semantic similarity model on the CircR2Disease dataset.
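For readers unfamiliar with ontology-based semantic similarity, the sketch below illustrates the contribution-decay scheme of Wang et al. [19], which can be applied to any disease DAG such as MeSH or DO. Whether MNMDCDA uses exactly this variant, and the decay factor of 0.5, are assumptions here. The tiny DAG and the function names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Map each ancestor of `disease` (and itself) to its semantic contribution.

    `parents` maps a term to the list of its direct parents in the DAG.
    A term contributes 1.0 to itself; each step towards the root multiplies
    the contribution by `delta`, keeping the maximum over all paths.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a, b, parents, delta=0.5):
    """Shared ancestor contributions divided by the total semantic values."""
    ca = semantic_contributions(a, parents, delta)
    cb = semantic_contributions(b, parents, delta)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))

# Hypothetical three-term DAG: two cancers sharing one parent term.
parents = {
    "breast neoplasms": ["neoplasms"],
    "stomach neoplasms": ["neoplasms"],
}
sim_breast_stomach = semantic_similarity("breast neoplasms", "stomach neoplasms", parents)
sim_self = semantic_similarity("breast neoplasms", "breast neoplasms", parents)
```

Because the score depends only on the ancestor structure, the same two diseases can receive different similarities under MeSH and DO, which is what the comparison in this section probes.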

Figure 6. ROC curves of 5-fold cross-validation achieved by DO-based disease semantic similarity model on CircR2Disease dataset.

Comparison of various classifier models

To evaluate the impact of the DNN classifier on the overall performance of the MNMDCDA model, we compared it with eight other classifiers: discriminant analysis (DA), logistic regression (LR), naive Bayes (NB), K-nearest neighbor (KNN), SVM, decision tree (DT), AdaBoost and random forest (RF). Table 2 shows the average results of the 5-fold cross-validation obtained by these models on the CircR2Disease dataset. As can be seen from Table 2, the highest average accuracy among the eight models is 87.24% (RF), which is lower than the 88.69% average accuracy of the proposed MNMDCDA model. Figure 7 visualizes the experimental results of the different classifier models on the CircR2Disease dataset. These results further suggest that the DNN classifier in the MNMDCDA model not only determines accurately whether circRNAs are associated with diseases but also contributes to the improvement of prediction performance.
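The metrics compared across classifiers can all be derived from a confusion matrix. The sketch below uses the standard definitions of Acc., Sen., Pre., F1 and MCC, which are assumed to match those used in the paper. The counts are illustrative only and do not come from the authors' experiments.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)   # sensitivity (recall)
    pre = tp / (tp + fp)   # precision
    f1 = 2 * pre * sen / (pre + sen)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"Acc": acc, "Sen": sen, "Pre": pre, "F1": f1, "MCC": mcc}

# Illustrative balanced test fold: 27 positive and 27 negative pairs.
metrics = classification_metrics(tp=20, fp=7, tn=20, fn=7)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Note that MCC is much lower than accuracy for the same predictions, which is why it separates the classifiers in Table 2 more sharply than Acc. does.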

Figure 7. Comparison of various classifier models on the CircR2Disease dataset.

Performance on independent dataset

Although the MNMDCDA model achieved good prediction performance on the CircR2Disease dataset, its predictive ability should also be tested on other datasets. We therefore treated CircAtlas v2.0 [35], Circ2Disease [36] and CircRNADisease [37] as independent datasets and performed 5-fold cross-validation on each of them to examine the generalization performance of the model. The results are summarized in Table 3.

Table 3

Results of 5-fold cross-validation achieved by the proposed model on three other independent datasets

| Independent datasets | Testing set | Acc. (%) | Sen. (%) | Pre. (%) | F1 (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|---|
| CircAtlas v2.0 | 1 | 84.37 | 95.88 | 77.99 | 86.02 | 70.61 | 92.26 |
| | 2 | 87.61 | 94.67 | 82.90 | 88.40 | 76.00 | 93.74 |
| | 3 | 90.83 | 92.90 | 89.20 | 91.01 | 81.73 | 96.92 |
| | 4 | 86.98 | 98.22 | 80.19 | 88.30 | 75.91 | 96.41 |
| | 5 | 87.57 | 95.86 | 82.23 | 88.52 | 76.20 | 93.34 |
| | Average | 87.47 ± 2.30 | 95.51 ± 1.95 | 82.50 ± 4.21 | 88.45 ± 1.77 | 76.09 ± 3.93 | 94.53 ± 2.03 |
| Circ2Disease | 1 | 83.33 | 88.89 | 80.00 | 84.21 | 67.08 | 90.12 |
| | 2 | 77.78 | 85.19 | 74.19 | 79.31 | 56.18 | 88.99 |
| | 3 | 86.11 | 83.33 | 88.24 | 85.71 | 72.33 | 93.52 |
| | 4 | 85.19 | 94.44 | 79.69 | 86.44 | 71.61 | 93.86 |
| | 5 | 74.07 | 74.07 | 74.07 | 74.07 | 48.15 | 82.51 |
| | Average | 81.30 ± 5.17 | 85.19 ± 7.52 | 79.24 ± 5.78 | 81.95 ± 5.21 | 63.07 ± 10.55 | 89.80 ± 4.59 |
| CircRNADisease | 1 | 80.71 | 87.14 | 77.22 | 81.88 | 61.94 | 92.20 |
| | 2 | 87.14 | 97.14 | 80.95 | 88.31 | 75.82 | 90.41 |
| | 3 | 86.43 | 97.14 | 80.00 | 87.74 | 74.59 | 92.47 |
| | 4 | 80.71 | 84.29 | 78.67 | 81.38 | 61.59 | 92.53 |
| | 5 | 85.71 | 98.57 | 78.41 | 87.34 | 73.91 | 91.53 |
| | Average | 84.14 ± 3.17 | 92.86 ± 6.62 | 79.05 ± 1.45 | 85.33 ± 3.40 | 69.57 ± 7.16 | 91.83 ± 0.89 |

From Table 3, the average AUC values of the proposed model on the three independent datasets were 94.53%, 89.80% and 91.83%, respectively. These results suggest that the model could be used to explore organisms for which circRNA–disease association data are not yet available and to guide the discovery of new candidate diseases associated with circRNAs. Figure 8 gives the histogram of the experimental results of the proposed model on the independent datasets.
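The "Average" rows of Table 3 can be checked from the per-fold values. The sketch below recomputes the AUC summaries from the five fold results of each dataset, assuming that the reported spread is the sample standard deviation (n − 1 in the denominator). The variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold AUC values (%) copied from Table 3.
fold_auc = {
    "CircAtlas v2.0": [92.26, 93.74, 96.92, 96.41, 93.34],
    "Circ2Disease": [90.12, 88.99, 93.52, 93.86, 82.51],
    "CircRNADisease": [92.20, 90.41, 92.47, 92.53, 91.53],
}

# stdev() is the sample standard deviation; pstdev() would be the population one.
summary = {name: (mean(values), stdev(values)) for name, values in fold_auc.items()}

for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.2f} ± {sd:.2f}")
# CircAtlas v2.0: 94.53 ± 2.03
# Circ2Disease: 89.80 ± 4.59
# CircRNADisease: 91.83 ± 0.89
```

Under that assumption the recomputed values agree with the AUC column of the table. The larger spread on Circ2Disease is driven mainly by the fifth fold.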

Figure 8. Comparison of experimental results on the benchmark dataset.

Comparison with other existing methods

To further evaluate the prediction performance of the MNMDCDA model, we compared it with six existing methods on the same dataset: MGRCDA [34], NMFCDA [38], SGANRDA [39], iCircDA-MF [40], GCNCDA [41] and PWCDA [42]. For a fair comparison, we used the AUC value, which summarizes performance across all classification thresholds, as the comparison index. Table 4 summarizes the AUC values obtained by these models on CircR2Disease. Figure 9 shows the line graph of the AUC scores obtained on the CircR2Disease dataset by the different computational methods. These comparative results show that the MNMDCDA model, which combines the high-order GCN framework with multi-source similarity networks, achieved the highest AUC among the compared methods and is a promising approach.

Table 4

The 5-fold cross-validation AUC values achieved by the various models

| Methods | MNMDCDA | MGRCDA | NMFCDA | SGANRDA | iCircDA-MF | GCNCDA | PWCDA |
|---|---|---|---|---|---|---|---|
| AUC | 0.9516 | 0.9298 | 0.9278 | 0.9215 | 0.9178 | 0.9090 | 0.8900 |

Figure 9. Comparison of the AUC values of existing computational methods on the CircR2Disease dataset.
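Since AUC is the index used to compare the methods, a short sketch of what it measures may help. The function below computes AUC as the fraction of positive–negative pairs in which the positive sample receives the higher score, with ties counted as one half. It is a generic illustration with made-up labels and scores, not the evaluation code of any of the compared methods.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for six circRNA-disease pairs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 positive-negative pairs are ordered correctly
```

Because this quantity depends only on the ranking of the scores, it does not require choosing a decision threshold, which makes it convenient for comparing methods whose scores are on different scales.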

Case studies

To further investigate the effectiveness of MNMDCDA in screening candidate circRNAs for diseases with unknown associations, we conducted a case study on the CircR2Disease dataset. The prediction results are shown in Table 5, from which we can see that 25 of the top 30 circRNA–disease pairs have been confirmed by recently published literature. Overall, MNMDCDA shows a strong ability to predict potential disease-associated circRNAs, and these top candidates can be prioritized for further biological studies to narrow the search range of wet-lab experiments.
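A case study of this kind amounts to ranking the pairs that are not already known and checking the top of the list against the literature. The sketch below shows one way to do this. It assumes that known associations are excluded before ranking, which the text does not state explicitly, and all names, scores and evidence entries are hypothetical.

```python
import heapq

def top_candidates(scores, known_pairs, k):
    """Return the k highest-scoring (circRNA, disease) pairs not in known_pairs."""
    unknown = [(pair, s) for pair, s in scores.items() if pair not in known_pairs]
    best = heapq.nlargest(k, unknown, key=lambda item: item[1])
    return [pair for pair, _ in best]

# Hypothetical predicted scores for four circRNA-disease pairs.
scores = {
    ("circA", "disease X"): 0.97,
    ("circB", "disease X"): 0.91,
    ("circA", "disease Y"): 0.88,
    ("circC", "disease Z"): 0.42,
}
known_pairs = {("circB", "disease X")}          # already in the training data
literature = {("circA", "disease X")}           # pairs with published evidence

top2 = top_candidates(scores, known_pairs, k=2)
confirmed_fraction = sum(pair in literature for pair in top2) / len(top2)
```

For the top 30 predictions reported in Table 5, the corresponding confirmed fraction is 25/30.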

Table 5

Top 30 circRNA–disease associations predicted by MNMDCDA

| Rank | circRNA | Disease | Evidence (PMID/DOI) | Year |
|---|---|---|---|---|
| 1 | hsa_circ_001569 | Breast cancer | 31104012 | 2019 |
| 2 | circFAT1 | Breast cancer | 34288822 | 2021 |
| 3 | hsa_circ_0000190 | Breast cancer | 10.1093/annonc/mdy428.010 | 2018 |
| 4 | hsa_circ_001763 | Breast cancer | 30509108 | 2019 |
| 5 | ciRS-7 | Breast cancer | 33390857 | 2021 |
| 6 | hsa_circ_0083964 | Osteoarthritis | Unconfirmed | N/A |
| 7 | hsa_circ_001988 | Gastric cancer | 32592202 | 2021 |
| 8 | hsa_circ_0001724 | Gastric cancer | 10.1016/j.genrep.2021.101226 | 2021 |
| 9 | circFAT1 | Gastric cancer | 30419346 | 2019 |
| 10 | circRHOBTB3 | Gastric cancer | 31928527 | 2020 |
| 11 | hsa_circ_0023404 | Rheumatoid arthritis | Unconfirmed | N/A |
| 12 | hsa_circ_0000520 | Breast cancer | 10.21203/rs.3.rs-1023577/v1 | 2021 |
| 13 | circBRAF | Glioma | 33650075 | 2021 |
| 14 | hsa_circ_005239 | Breast cancer | 29037220 | 2017 |
| 15 | Circ_HIPK3 | Glioblastoma | 34198978 | 2021 |
| 16 | Circ_SMARCA5 | Glioblastoma | 30736462 | 2019 |
| 17 | hsa_circ_0001566 | Glioma | Unconfirmed | N/A |
| 18 | circHIPK3 | Pancreatic cancer | 32104074 | 2020 |
| 19 | circRTN4 | Pancreatic cancer | 34983537 | 2022 |
| 20 | circRHOBTB3 | Pancreatic cancer | 34416910 | 2021 |
| 21 | hsa_circ_0089974 | Gastric cancer | Unconfirmed | N/A |
| 22 | circHIPK3 | Lung cancer | 31232177 | 2020 |
| 23 | hsa_circ_0001649 | Pancreatic cancer | 31138014 | 2019 |
| 24 | hsa_circ_0005015 | Diabetic retinopathy | 29288268 | 2017 |
| 25 | hsa_circRNA_100750 | Diabetic retinopathy | 28817829 | 2017 |
| 26 | hsa_circ_0005927 | Colorectal cancer | 33312376 | 2020 |
| 27 | hsa_circ_0081108 | Diabetic retinopathy | 32497630 | 2020 |
| 28 | hsa_circ_0045510 | Osteosarcoma | Unconfirmed | N/A |
| 29 | circHIAT1 | Hepatocellular carcinoma | 31108351 | 2019 |
| 30 | circFAT1 | Hepatocellular carcinoma | 33179443 | 2020 |

Conclusion

Identifying the associations between circRNAs and diseases can not only provide insight into the pathogenesis of complex diseases but also offer effective ideas and solutions for the early prevention, diagnosis and treatment of diseases. In this paper, we propose a novel computational model, MNMDCDA, which combines high-order GCN and DNN to investigate the potential relationships between circRNAs and diseases. To evaluate the model performance, we performed several experiments on four datasets, including comparisons with a cosine similarity model, a DO-based disease semantic similarity model and different classifier models, an evaluation of generalization performance, and a comparison with other existing methods. The experimental results suggest that MNMDCDA outperforms the other existing computational models considered and can effectively identify new disease-associated circRNAs.

There are three main reasons for the excellent performance of MNMDCDA: (1) MNMDCDA integrates multiple kinds of biological attribute information of circRNAs and diseases to form fusion descriptors and to construct multiple multi-source similarity networks. (2) MNMDCDA uses the high-order GCN algorithm to fully learn the high-order mixed neighborhood embedding representations of circRNAs and diseases. (3) MNMDCDA can effectively predict potential disease-related circRNAs from the fused features, and it shows good generalization performance on three independent datasets.

Key Points
  • Integrating the multiple biological attribute information of circRNAs and diseases can comprehensively describe the complex association between circRNAs and diseases from multiple perspectives.

  • The high-order GCN algorithm of deep learning is used to learn the embedding representations with high-order mixed neighborhood information of circRNAs and diseases from multiple multi-source similarity networks, respectively.

  • Experimental results on three other benchmark datasets support the generalization performance of the MNMDCDA model and provide guidance for further wet-lab experiments.

  • Extensive experimental results demonstrate the superior performance of the MNMDCDA model in predicting potential circRNA–disease associations.

Data Availability

The data sets and source code can be freely downloaded from: https://github.com/ly2021010123/MNMDCDA/.

Acknowledgements

The authors would like to thank all anonymous reviewers for their constructive advice.

Funding

This work was supported in part by the National Natural Science Foundation of China (61976077, 62076085, 62172355, 62120106008), in part by the Major special projects of the Ministry of Science and Technology (2021ZD0200403), and in part by the Qingtan scholar talent project of Zaozhuang University.

Author Biographies

Yang Li is a PhD student in the Key Laboratory of Knowledge Engineering with Big Data in Anhui Province, School of Computer Science and Information Engineering at Hefei University of Technology, Hefei, China. His current research interests include machine learning, data mining and its applications in bioinformatics.

Xue-Gang Hu is a professor of Hefei University of Technology. His research interests include data mining and knowledge engineering.

Lei Wang is a professor of Guangxi Academy of Sciences. His research interests include data mining, machine learning, deep learning, computational biology and bioinformatics.

Pei-Pei Li is an associate professor of Hefei University of Technology. Her research interests include data mining and intelligent computing.

Zhu-Hong You is a professor of Northwestern Polytechnical University. His research interests include neural networks, intelligent information processing, sparse representation and its applications in bioinformatics.

References

1. Kristensen L, Hansen T, Venø M, et al. Circular RNAs in cancer: opportunities and challenges in the field. Oncogene 2018;37(5):555–65.
2. Wang L, Wong L, You Z-H, et al. NSECDA: natural semantic enhancement for circRNA-disease association prediction. IEEE J Biomed Health Inform 2022;26:5075–84.
3. Memczak S, Jens M, Elefsinioti A, et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 2013;495(7441):333–8.
4. Zhang X-O, Wang H-B, Zhang Y, et al. Complementary sequence-mediated exon circularization. Cell 2014;159(1):134–47.
5. Sanger HL, Klotz G, Riesner D, et al. Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures. Proc Natl Acad Sci 1976;73(11):3852–6.
6. Kolakofsky D. Isolation and characterization of Sendai virus DI-RNAs. Cell 1976;8(4):547–55.
7. Nigro JM, Cho KR, Fearon ER, et al. Scrambled exons. Cell 1991;64(3):607–13.
8. Capel B, Swain A, Nicolis S, et al. Circular transcripts of the testis-determining gene Sry in adult mouse testis. Cell 1993;73(5):1019–30.
9. Hansen TB, Jensen TI, Clausen BH, et al. Natural RNA circles function as efficient microRNA sponges. Nature 2013;495(7441):384–8.
10. Zeng X, Lin W, Guo M, et al. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput Biol 2017;13(6):e1005420.
11. Wang L, Wong L, Li Z, et al. A machine learning framework based on multi-source feature fusion for circRNA-disease association prediction. Brief Bioinform 2022;23(5):bbac388.
12. Niu M, Zou Q, Wang C. GMNN2CD: identification of circRNA–disease associations based on variational inference and graph Markov neural networks. Bioinformatics 2022;38(8):2246–53.
13. Niu M, Ju Y, Lin C, et al. Characterizing viral circRNAs and their application in identifying circRNAs in viruses. Brief Bioinform 2022;23(1):bbab404.
14. Xiao Q, Zhong J, Tang X, et al. iCDA-CMG: identifying circRNA-disease associations by federating multi-similarity fusion and collective matrix completion. Mol Gen Genomics 2021;296(1):223–33.
15. Zheng K, You Z-H, Li J-Q, et al. iCDA-CGR: identification of circRNA-disease associations based on Chaos game representation. PLoS Comput Biol 2020;16(5):e1007872.
16. Yang J, Lei X. Predicting circRNA-disease associations based on autoencoder and graph embedding. Inf Sci 2021;571:323–36.
17. Lu C, Zeng M, Wu F-X, et al. Improving circRNA–disease association prediction by sequence and ontology representations with convolutional and recurrent neural networks. Bioinformatics 2021;36(24):5656–64.
18. Fan C, Lei X, Fang Z, et al. CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases. Database 2018;2018:1–6.
19. Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 2010;26(13):1644–50.
20. Xiang Z, Qin T, Qin ZS, et al. A genome-wide MeSH-based literature mining system predicts implicit gene-to-gene relationships and networks. BMC Syst Biol 2013;7(3):1–15.
21. Wang L, You Z-H, Chen X, et al. LMTRDA: using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities. PLoS Comput Biol 2019;15(3):e1006865.
22. Jeon M, Park D, Lee J, et al. ReSimNet: drug response similarity prediction using Siamese neural networks. Bioinformatics 2019;35(24):5249–56.
23. Yu G, Wang L-G, Yan G-R, et al. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 2015;31(4):608–9.
24. Chen X, Yan CC, Zhang X, et al. WBSMDA: within and between score for MiRNA-disease association prediction. Sci Rep 2016;6(1):1–9.
25. Abu-El-Haija S, Perozzi B, Kapoor A, et al. Mixhop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In: International Conference on Machine Learning. Long Beach, CA, PMLR, 2019, 21–9.
26. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv Preprint 2016;arXiv:1609.02907.
27. Wang J, Liang J, Cui J, et al. Semi-supervised learning with mixed-order graph convolutional networks. Inf Sci 2021;573:171–81.
28. Hara K, Saito D, Shouno H. Analysis of function of rectified linear unit used in deep learning. In: 2015 International Joint Conference on Neural Networks (IJCNN). Killarney, New York: IEEE, 2015, 1–8.
29. Wanto A, Windarto AP, Hartama D, et al. Use of binary sigmoid function and linear identity in artificial neural networks for forecasting population density. Int J Inf Syst Technol 2017;1(1):43–54.
30. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv Preprint 2014;arXiv:1412.6980.
31. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15(1):1929–58.
32. Su X, You Z-H, Huang D-s, et al. Biomedical knowledge graph embedding with capsule network for multi-label drug-drug interaction prediction. IEEE Trans Knowl Data Eng 2022;1-1.
33. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 1997;30(7):1145–59.
34. Wang L, You Z-H, Huang D-S, et al. MGRCDA: Metagraph Recommendation Method for Predicting CircRNA-Disease Association. IEEE Trans Cybern 2021;1–9. https://doi.org/10.1109/TCYB.2021.3090756.
35. Wu W, Ji P, Zhao F. CircAtlas: an integrated resource of one million highly accurate circular RNAs from 1070 vertebrate transcriptomes. Genome Biol 2020;21(1):1–14.
36. Yao D, Zhang L, Zheng M, et al. Circ2Disease: a manually curated database of experimentally validated circRNAs in human disease. Sci Rep 2018;8(1):1–6.
37. Zhao Z, Wang K, Wu F, et al. circRNA disease: a manually curated database of experimentally supported circRNA-disease associations. Cell Death Dis 2018;9(5):1–2.
38. Wang L, You Z-H, Zhou X, et al. NMFCDA: Combining randomization-based neural network with non-negative matrix factorization for predicting CircRNA-disease association. Appl Soft Comput 2021;110:107629.
39. Wang L, Yan X, You Z-H, et al. SGANRDA: semi-supervised generative adversarial networks for predicting circRNA–disease associations. Brief Bioinform 2021;22(5):bbab028.
40. Wei H, Liu B. iCircDA-MF: identification of circRNA-disease associations based on matrix factorization. Brief Bioinform 2020;21(4):1356–67.
41. Wang L, You Z-H, Li Y-M, et al. GCNCDA: a new method for predicting circRNA-disease associations based on graph convolutional network algorithm. PLoS Comput Biol 2020;16(5):e1007568.
42. Lei X, Fang Z, Chen L, et al. PWCDA: path weighted method for predicting circRNA-disease associations. Int J Mol Sci 2018;19(11):3410.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)