Bo-Wei Zhao, Xiao-Rui Su, Peng-Wei Hu, Yu-Peng Ma, Xi Zhou, Lun Hu, A geometric deep learning framework for drug repositioning over heterogeneous information networks, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac384, https://doi.org/10.1093/bib/bbac384
Abstract
Drug repositioning (DR) is a promising strategy to discover new indications for approved drugs with artificial intelligence techniques, thus improving traditional drug discovery and development. However, most computational DR methods fall short of taking into account the non-Euclidean nature of biomedical network data. To overcome this problem, a deep learning framework, namely DDAGDL, is proposed to predict drug-disease associations (DDAs) by using geometric deep learning (GDL) over a heterogeneous information network (HIN). Incorporating complex biological information into the topological structure of the HIN, DDAGDL effectively learns the smoothed representations of drugs and diseases with an attention mechanism. Experimental results demonstrate the superior performance of DDAGDL on three real-world datasets under 10-fold cross-validation when compared with state-of-the-art DR methods in terms of several evaluation metrics. Our case studies and molecular docking experiments indicate that DDAGDL is a promising DR tool that gains new insights into exploiting geometric prior knowledge for improved efficacy.
Introduction
Due to the high risk of failure, the traditional drug development process is expensive and time-consuming [1]. It has been reported that developing a novel drug from scratch costs around USD 1.24 billion through the traditional pipeline [2]. Moreover, ever-increasing demands on efficacy and safety are the main reasons for the low success rate (<10%) in bringing new drugs to the market [3]. Taking Alzheimer’s disease as an example, it is the most common form of dementia and affects more than 40 million people around the world with an increasing trend, yet no pharmacological treatments have been licensed for use in individuals with mild cognitive impairment [4]. Hence, finding more efficacious and safer drugs still presents a huge challenge to the scientific community.
As artificial intelligence techniques have been undergoing rapid development, drug repositioning (DR), or drug repurposing, has attracted much attention as an alternative yet complementary strategy to discover new indications for approved or experimental drugs, thus offering significant advantages in accelerating the drug development process by saving time and labor [5]. In recent years, a variety of computational DR methods have been developed for discovering novel drug-disease associations (DDAs) from a biomedical data point of view [6]. These methods normally extract desired features from the biomedical data related to drugs and diseases, and then incorporate the features into well-established classifiers for achieving the DR task. In terms of feature extraction, they can be broadly classified into two categories: machine learning (ML)-based and deep learning (DL)-based.
ML-based DR methods apply different ML techniques, such as matrix factorization [7, 8], support vector machines (SVM) [9] and neural networks [10], to predict unknown DDAs with extracted shallow features. For instance, DTINet [8] adopts matrix factorization to decompose the high-dimensional DDA matrix into the product of two low-dimensional matrices, where the feature vectors of drugs and diseases are extracted in a more compact form for accurately discovering unknown DDAs. However, the shallow features used by ML-based methods are limited in their ability to represent drugs and diseases at a highly abstract level. To overcome this issue, several DL-based DR methods [11–17] have been developed by taking advantage of the powerful representation learning ability of DL. We note from [13] that DL-based methods are believed to have clear advantages over traditional ML-based methods in addressing drug repositioning tasks, particularly in processing the different types of input data involved in drug repositioning according to the nature of their frameworks. As a representative work in this field, deepDR [14] first integrates different kinds of drug-related associations by modeling them with matrices, and then applies a multi-modal deep autoencoder to learn the feature representations of drugs and diseases, with which a variational autoencoder is constructed to infer novel indications for approved drugs. SKCNN [15] combines a convolutional neural network with a sigmoid kernel to effectively learn the features of drugs and diseases for DDA prediction. CBPred [16] integrates a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network to predict DDAs, where the CNN learns the original representations of drugs and diseases based on their similarities and DDAs, and the BiLSTM learns the path representations of DDAs. SkipGNN [17] first initializes embeddings with the node2vec algorithm, and then applies modified graph convolutional networks to obtain the final embeddings for molecular interaction prediction. DeepR2Cov [10] constructs multiple meta-paths to automatically learn low-dimensional vectors of drugs with deep neural networks, and then successfully discovers anti-inflammatory agents for COVID-19. Though effective, these methods have only demonstrated their success on curated biomedical data where an underlying Euclidean structure is observed, as they model DDAs as functions of drug and disease points in Euclidean space.
Recently, there has been an increasing interest in applying representation learning of drugs and diseases over heterogeneous information networks (HINs), which exhibit a non-Euclidean geometric property. In particular, the HINs of interest are graph models composed not only of drugs, diseases and related associations, but also of their biological information denoted as the signals of corresponding nodes. Obviously, the structure of HINs is of great significance for revealing certain properties in their non-Euclidean domains, but both ML-based and DL-based DR methods fall short of capturing such information, thereby leading to unsatisfactory performance when applied to HINs [18]. Hence, certain effort has been devoted to applying geometric deep learning (GDL) for better learning the representations of drugs and diseases over HINs [19, 20]. As an emerging technique, GDL attempts to generalize deep neural networks to graph data in non-Euclidean domains. For instance, DRHGCN [20] adopts multiple graph convolutional layers to learn the embedding representations of drugs and diseases by integrating three kinds of networks, including drug-disease, drug similarity and disease similarity networks. However, the over-smoothing issue resulting from the aggregation of neighborhood information within n-hops diminishes the discriminative ability of drug and disease representations learned by GDL [21]. Besides, most GDL-based DR methods take equal contributions for granted when aggregating neighbor representations to compose the final representations of drugs and diseases, but such an operation fails to emphasize the features that are more representative [22, 23]. Consequently, the representation quality is negatively affected.
In this work, we propose a new framework, namely DDAGDL, which addresses these problems with an attention-based GDL network. Toward this end, DDAGDL first integrates three kinds of drug-related networks, including a drug-disease network, a drug-protein network and a protein-disease network, to compose a heterogeneous biomedical network, and a HIN is thus generated by further incorporating the biological knowledge of drugs, diseases and proteins. Second, DDAGDL takes advantage of complicated biological information to learn smoothed feature representations of drugs and diseases with the geometric prior knowledge in the non-Euclidean domain. Moreover, an attention mechanism is adopted by DDAGDL to distinguish the significance of features when it learns the final representations of drugs and diseases. Last, a gradient boosting decision tree (GBDT) classifier, i.e. XGBoost [24], is employed to complete the DR task. Experimental results show that DDAGDL yields a promising performance across all three benchmark datasets under classical 10-fold cross-validation when compared with several state-of-the-art DR methods. To further demonstrate the advantage of DDAGDL in the DR task, we have also conducted comparative case studies on the top-ranked drug candidates predicted by each comparing method for Alzheimer’s Disease and Breast Cancer. Our findings indicate that DDAGDL is able to identify high-quality DDAs that have already been reported by previously published studies, and some of them are not even identified by the other methods. In addition to Alzheimer’s Disease and Breast Cancer, DDAGDL is also a useful DR tool for newly discovered diseases according to the results of molecular docking experiments for COVID-19. In conclusion, the key reason for the success of DDAGDL is its ability to leverage GDL for better handling HIN data, which have an underlying non-Euclidean structure. Hence, our work opens a new avenue in drug repositioning with new insights gained from GDL.
Methods
Datasets
To evaluate the performance of DDAGDL, three benchmark datasets, i.e. the B-dataset, C-dataset and F-dataset, are adopted to construct different HINs. Each dataset contains three kinds of biological networks, i.e. a DDA network, a drug-protein association network and a protein-disease association network. The B-dataset and F-dataset are collected from previous studies [25–27], while the C-dataset is constructed by following the procedure of Luo et al. [28]. Drug-protein associations and protein-disease associations are downloaded from the DrugBank database [29] and the DisGeNET database [30], respectively. The details of these three datasets are presented in Table 1.
Table 1. Details of the three benchmark datasets

| Dataset | DDAs | Drug-protein associations | Protein-disease associations | Drugs | Diseases | Proteins | Density |
|---|---|---|---|---|---|---|---|
| B-dataset | 18,416 | 3110 | 5898 | 269 | 598 | 1021 | 0.1144 |
| C-dataset | 2532 | 3773 | 10,734 | 663 | 409 | 993 | 0.0093 |
| F-dataset | 1933 | 3243 | 54,265 | 593 | 313 | 2741 | 0.0104 |
Construction of HIN
A HIN is denoted as a three-element tuple |$\textrm{HIN}\big(\boldsymbol{V},\boldsymbol{C},\boldsymbol{E}\big)$|, where |$\boldsymbol{V}=\big\{{V}^{dr},{V}^{pr},{V}^{di}\big\}$| is the set of |$\mid \boldsymbol{V}\mid$| biomolecules including drugs (|${V}^{dr}$|), proteins (|${V}^{pr}$|) and diseases |$\big({V}^{di}\big)$|, |$\boldsymbol{C}={\big[{\boldsymbol{C}}^{dr};{\boldsymbol{C}}^{di};{\boldsymbol{C}}^{pr}\big]}^T\in{\mathbb{R}}^{\mid \boldsymbol{V}\mid \times d}$| is a matrix representing the biological knowledge of all nodes in |$\boldsymbol{V}$|, and |$\boldsymbol{E}=\big\{{E}^{dd},{E}^{dp},{E}^{pd}\big\}$| represents all DDAs |$\big({E}^{dd}\big)$|, drug-protein associations |$\big({E}^{dp}\big)$| and protein-disease associations |$\big({E}^{pd}\big)$|. Moreover, |$N$|, |$K$| and |$M$| are used to denote the respective numbers of drugs, proteins and diseases, and the adjacency matrix of the HIN is defined as |$\boldsymbol{A}\in{\mathbb{R}}^{\mid \boldsymbol{V}\mid \times \mid \boldsymbol{V}\mid }$|.
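For concreteness, the sketch below shows one way the adjacency matrix |$\boldsymbol{A}$| could be assembled from the three edge sets. It is a minimal sketch, assuming edges are given as 0-based (row, column) index pairs within their own node types; the node ordering (drugs, then diseases, then proteins) is an illustrative convention chosen to match |$\boldsymbol{C}$|, not necessarily the authors' implementation.

```python
import numpy as np

def build_hin_adjacency(dd_edges, dp_edges, pd_edges, n_drugs, n_diseases, n_proteins):
    """Assemble the |V| x |V| HIN adjacency matrix A, ordering nodes as
    [drugs | diseases | proteins] to match C = [C^dr; C^di; C^pr]."""
    n = n_drugs + n_diseases + n_proteins
    A = np.zeros((n, n), dtype=np.float32)
    for i, j in dd_edges:                        # drug-disease associations E^dd
        A[i, n_drugs + j] = 1.0
    for i, j in dp_edges:                        # drug-protein associations E^dp
        A[i, n_drugs + n_diseases + j] = 1.0
    for i, j in pd_edges:                        # protein-disease associations E^pd
        A[n_drugs + n_diseases + i, n_drugs + j] = 1.0
    return np.maximum(A, A.T)                    # symmetrize: the HIN is undirected
```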
Representation learning from biological knowledge
Regarding the biological knowledge of drugs, we take advantage of the Simplified Molecular Input Line Entry System (SMILES) [31], a line notation for representing molecular structures, to construct |${\boldsymbol{C}}^{dr}$|. Specifically, we first collect the SMILES data of each drug from the DrugBank database [29], and then process it with the RDKit tool [32] to obtain the feature vector |${c}_i^{dr}$| of drug |${v}_i\in{V}^{dr}$|. Since |${c}_i^{dr}$| is high-dimensional, an autoencoder model [33] is applied to obtain a more compact form by reducing its dimension to |$d$|, whose value is set to 64 in our work. At last, with all |${c}_i^{dr}$|, we are able to obtain |${\boldsymbol{C}}^{dr}={\big[{c}_1^{dr};{c}_2^{dr};\cdots; {c}_N^{dr}\big]}^T$|.
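The text does not state which RDKit featurization is used, so the sketch below shows a Morgan fingerprint purely as an illustration of the SMILES-to-vector step; the resulting high-dimensional vector would then be compressed to |$d=64$| by the autoencoder.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def smiles_to_vector(smiles, n_bits=2048):
    """Convert a SMILES string into a fixed-length fingerprint with RDKit.
    The fingerprint choice (Morgan, radius 2) is an assumption for illustration."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                              # unparsable SMILES -> zero vector
        return np.zeros(n_bits, dtype=np.float32)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros(n_bits, dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
```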
Similarly, we extract the biological knowledge of diseases in light of medical subject descriptors collected from the Medical Subject Headings (MeSH) thesaurus, and then construct |${\boldsymbol{C}}^{di}$| by calculating the semantic similarity between diseases following the procedure of Guo et al. [34]. After that, an autoencoder model [33] is also used to reduce |${\boldsymbol{C}}^{di}$| to an |$M\times d$| matrix. Since the biological knowledge of drugs and diseases in the C-dataset is not available in related databases, we explicitly use the processed matrices provided by Van et al. [35] and Luo et al. [28] as |${\boldsymbol{C}}^{di}$| and |${\boldsymbol{C}}^{dr}$|, respectively, and an autoencoder model is again applied to reduce the size of the vectors in |${\boldsymbol{C}}^{di}$| and |${\boldsymbol{C}}^{dr}$| to |$d$|.
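Since the same autoencoder-based dimension reduction is applied to drug and disease knowledge alike, a minimal sketch of this step is given below; the layer sizes, number of epochs and learning rate are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    """Minimal autoencoder compressing biological knowledge vectors to d = 64."""
    def __init__(self, in_dim, d=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, d))
        self.decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def compress(features, d=64, epochs=200, lr=1e-3):
    """Train on reconstruction loss and return the d-dimensional codes."""
    x = torch.as_tensor(features, dtype=torch.float32)
    model = FeatureAutoencoder(x.shape[1], d)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = model(x)
        loss = nn.functional.mse_loss(recon, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(x)[1].numpy()
```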
Regarding proteins in |${V}^{pr}$|, we utilize their sequence information to construct |${\boldsymbol{C}}^{pr}$|. In particular, the amino acids are first divided into four classes, i.e. (Ala, Val, Leu, Ile, Met, Phe, Trp, Pro), (Gly, Ser, Thr, Cys, Asn, Gln, Tyr), (Arg, Lys, His) and (Asp, Glu), according to the nature of their side chains. A 3-mer algorithm [36] is then applied to obtain |${c}_i^{pr}$| for each protein. Given all |${c}_i^{pr}$|, we are able to obtain |${\boldsymbol{C}}^{pr}={\big[{c}_1^{pr};{c}_2^{pr};\cdots; {c}_K^{pr}\big]}^T$|.
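With four residue classes, a 3-mer over the class sequence yields a 4^3 = 64-dimensional count vector. A minimal sketch of this encoding is shown below; the frequency normalization and the handling of non-standard residues are assumptions for illustration.

```python
import numpy as np

# side-chain classes from the text; residues outside the 20 standard ones are skipped
AA_CLASS = {a: 0 for a in "AVLIMFWP"}   # Ala, Val, Leu, Ile, Met, Phe, Trp, Pro
AA_CLASS.update({a: 1 for a in "GSTCNQY"})  # Gly, Ser, Thr, Cys, Asn, Gln, Tyr
AA_CLASS.update({a: 2 for a in "RKH"})      # Arg, Lys, His
AA_CLASS.update({a: 3 for a in "DE"})       # Asp, Glu

def protein_3mer_vector(sequence):
    """Normalized frequency vector of class-level 3-mers (4^3 = 64 dimensions)."""
    classes = [AA_CLASS[a] for a in sequence.upper() if a in AA_CLASS]
    counts = np.zeros(64)
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[16 * a + 4 * b + c] += 1
    return counts / max(counts.sum(), 1)
```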
So far, we are able to compose |$\boldsymbol{C}$| with |${\boldsymbol{C}}^{dr}$|, |${\boldsymbol{C}}^{di}$| and |${\boldsymbol{C}}^{pr}$|, and a |$\textrm{HIN}$| can thus be constructed. One thing about the biological knowledge should be noted: for a drug |$v$|, all elements of its corresponding vector in |$\boldsymbol{C}$| are set to 0 if its biological knowledge cannot be obtained from relevant databases. The same applies to diseases and proteins without biological knowledge.
Attention-based GDL network
Finally, an |$\big(N+M\big)\times d$| matrix |$\boldsymbol{X}$| can be constructed to denote the feature representations of drugs and diseases.
Algorithm 1. The overall procedure of DDAGDL

Input: graph |$\textrm{HIN}\big(\boldsymbol{V},\boldsymbol{C},\boldsymbol{E}\big)$|; representation size |$d$|; number of regression trees |$T$|
Output: the prediction matrix |$\boldsymbol{R}$|
1: Initialization: |$\boldsymbol{R}$|
2: Obtain the biological knowledge matrix of drugs |${\boldsymbol{C}}^{dr}$|
3: Obtain the biological knowledge matrix of diseases |${\boldsymbol{C}}^{di}$|
4: Obtain the biological knowledge matrix of proteins |${\boldsymbol{C}}^{pr}$|
5: Reduce dimensions with the autoencoder
6: |$\boldsymbol{C}={\big[{\boldsymbol{C}}^{dr};{\boldsymbol{C}}^{di};{\boldsymbol{C}}^{pr}\big]}^T\in{\mathbb{R}}^{\mid \boldsymbol{V}\mid \times d}$|
7: for each |${v}_i\in \boldsymbol{V}$| do
8:  |$\textrm{NDLS}\big({v}_i,\delta \big)=\min \big\{l:{\big\Vert{\boldsymbol{L}}_{v_i}^{(\infty)}-{\boldsymbol{L}}_{v_i}^{(l)}\big\Vert}_2<\delta \big\}$|
9:  |${\boldsymbol{X}}_{v_i}^{(l)}=\frac{\exp \big({\boldsymbol{h}}^T\mathrm{ReLU}\big({\boldsymbol{W}}_2{\boldsymbol{X}}_{v_i}+b\big)\big)}{\sum \exp \big({\boldsymbol{h}}^T\mathrm{ReLU}\big({\boldsymbol{W}}_2{\boldsymbol{X}}_{v_i}+b\big)\big)}$|
10: end for
11: for each |${e}_{ij}=\langle{v}_i,{v}_j\rangle \in{E}^{dd}$| do
12:  compose the concatenated feature set |$H=\big\{\big({H}_i,{y}_i\big)\big\}\ \big(1\le i\le |H|\big)$|
13:  |$\boldsymbol{R}=\textrm{XGBoost}\big(H,T\big)$|
14: end for
15: return |$\boldsymbol{R}$|
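To make line 8 of Algorithm 1 more concrete, the sketch below illustrates the idea of node-dependent local smoothing in plain NumPy. It is only an illustration under simplifying assumptions: the infinite-hop limit is approximated by propagating for a fixed number of hops, and the attention-based combination of line 9 is omitted, so this is not the authors' implementation.

```python
import numpy as np

def ndls_smooth(A, X, delta=1e-3, max_hops=30):
    """Node-dependent local smoothing (NDLS)-style feature propagation sketch.
    Each node keeps the representation from its own smallest hop count l at which
    it lies within `delta` of the (approximated) infinite-hop limit, which avoids
    over-smoothing well-connected nodes."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                         # add self-loops
    d = A_hat.sum(axis=1)
    P = A_hat / np.sqrt(np.outer(d, d))           # D^-1/2 (A + I) D^-1/2

    layers = [X.astype(np.float64)]
    for _ in range(max_hops):
        layers.append(P @ layers[-1])
    X_inf = layers[-1]                            # proxy for the infinite-hop limit

    X_smooth = np.empty_like(X_inf)
    for i in range(n):
        for l, Xl in enumerate(layers):
            if np.linalg.norm(Xl[i] - X_inf[i]) < delta or l == max_hops:
                X_smooth[i] = Xl[i]               # node-specific smoothing depth
                break
    return X_smooth
```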
DDA prediction
Since the task of DDA prediction is normally regarded as a binary classification problem, DDAGDL adopts a well-established ensemble learning classifier, XGBoost [24], to complete the prediction task with |$\boldsymbol{X}$|. In particular, XGBoost performs its classification task via multiple decision trees, each of which contributes to the final prediction result. To train XGBoost, we first compose a set of drug-disease pairs denoted as |$H=\big\{\big({H}_i,{y}_i\big)\big\}\ \big(1\le i\le |H|\big)$|, where |${H}_i$| denotes the concatenated feature vector of the |$i$|th drug-disease pair, |${y}_i$| is its label indicating the existence of an association, and |$\mid H\mid$| is the size of |$H$|. Assuming that |${v}_{dr}\in{V}^{dr}$| and |${v}_{di}\in{V}^{di}$| constitute the |$i$|th drug-disease pair in |$H$|, |${H}_i$| is the concatenation of |${\boldsymbol{X}}_{v_{dr}}$| and |${\boldsymbol{X}}_{v_{di}}$|, which are the respective representation vectors of |${v}_{dr}$| and |${v}_{di}$| obtained with Equation (5), and the value of |${y}_i$| is 1 if |${e}_{dd}\in{E}^{dd}$| and 0 otherwise.
To minimize |${\sum}_{i=1}^{\mid H\mid }L\big({y}_i,F\big({H}_i\big)\big)$|, XGBoost first builds an initial decision tree, and then iteratively constructs new regression trees to reduce the residuals computed with the loss function |$L\big({y}_i,F\big({H}_i\big)\big)$| until convergence. Given a query drug-disease pair, DDAGDL applies the XGBoost classifier trained with the aforementioned steps to predict a probability score indicating the likelihood of an association between the query drug and disease. The complete procedure of DDAGDL is described in Algorithm 1.
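As a rough illustration of this training step, the snippet below builds the concatenated pair features |$H$| and fits XGBoost; the negative pairs and the hyperparameters (e.g. the number of trees) are placeholders, not the values used in the paper.

```python
import numpy as np
from xgboost import XGBClassifier

def train_dda_classifier(X, pos_pairs, neg_pairs, n_drugs, n_trees=500):
    """Fit XGBoost on concatenated drug/disease representations.
    X is the (N + M) x d matrix of smoothed node representations, with drugs in
    rows 0..N-1 and diseases in rows N..N+M-1; pairs are (drug_idx, disease_idx)."""
    def pair_features(pairs):
        return np.array([np.concatenate([X[i], X[n_drugs + j]]) for i, j in pairs])

    H = np.vstack([pair_features(pos_pairs), pair_features(neg_pairs)])
    y = np.concatenate([np.ones(len(pos_pairs)), np.zeros(len(neg_pairs))])

    clf = XGBClassifier(n_estimators=n_trees, eval_metric="logloss")
    clf.fit(H, y)
    return clf
```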
Before training a DDAGDL model, users are required to formulate their own dataset in accordance with the input configurations of DDAGDL. After that, a customized DDAGDL model can be obtained by running the scripts available at https://github.com/stevejobws/DDAGDL. For a query drug-disease pair, a prerequisite for calculating its association score with DDAGDL is that the corresponding drug and disease nodes must exist in the HIN constructed from the training dataset. If that is the case, DDAGDL first obtains the representations of the drug and disease nodes, and then takes them as input to the trained XGBoost classifier, with which the association score of the query drug-disease pair can be calculated.
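Continuing the hypothetical snippet above (the names X, clf and n_drugs come from that sketch, not from the released code), scoring a query pair then amounts to concatenating the two learned representations and reading off the positive-class probability:

```python
# association score for a query (drug_idx, disease_idx) pair present in the training HIN
query = np.concatenate([X[drug_idx], X[n_drugs + disease_idx]]).reshape(1, -1)
score = clf.predict_proba(query)[0, 1]   # probability of an association
```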
Results
Overview of DDAGDL
DDAGDL is composed of three steps, and its overall framework is presented in Figure 1. Given a HIN, DDAGDL first employs an autoencoder to obtain the initial representations of drugs and diseases from their biological knowledge. Second, an attention-based GDL network is developed to avoid the over-smoothing issue by adaptively adjusting the range of neighborhood information during aggregation, and its attention mechanism allows it to extract useful features for high-quality representation learning. With smoothed representations of drugs and diseases, DDAGDL infers new DDAs according to the scores predicted by the XGBoost classifier.
Figure 1. The overall framework of DDAGDL.
Performance comparison with state-of-the-art DR methods
To accurately evaluate the performance of DDAGDL, we adopt a 10-fold cross-validation (CV) scheme by dividing a benchmark dataset into 10 folds, each of which is taken in turn as the testing set while the rest are used as the training set. The performance of DDAGDL on each fold has been evaluated with Accuracy, MCC and F1-score, and the results are presented in the Supplementary Material. Regarding the generation of negative samples, we randomly pair up drugs and diseases whose associations are not found in the benchmark dataset, and the number of negative samples is equal to that of positive ones. As one of the most important GDL operations, neighborhood aggregation makes node representations less distinguishable if more layers are stacked to enlarge receptive fields, leading to the over-smoothing issue. Several recent studies [41, 42] have shown that dense graphs with sufficient connectivity and label information could alleviate this issue by ensuring effective aggregation. It is noted from Table 1 that the density of the B-dataset is considerably larger than that of the F-dataset. In this regard, the performance deterioration caused by the over-smoothing issue is less significant in the B-dataset, and accordingly DDAGDL shows only a slight performance advantage on the B-dataset.
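A sketch of this evaluation protocol is given below, where H and y denote the balanced pair features and labels described above; default XGBoost settings stand in for the actual ones.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def cross_validate(H, y, n_splits=10, seed=0):
    """10-fold CV over balanced positive/negative drug-disease pairs (H, y)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(H, y):
        clf = XGBClassifier(eval_metric="logloss")
        clf.fit(H[train_idx], y[train_idx])
        pred = clf.predict(H[test_idx])
        scores.append((accuracy_score(y[test_idx], pred),
                       matthews_corrcoef(y[test_idx], pred),
                       f1_score(y[test_idx], pred)))
    return np.mean(scores, axis=0)   # mean Accuracy, MCC, F1-score over the folds
```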
We have compared the overall performance of DDAGDL with five state-of-the-art baseline methods, including SKCNN [15], DeepR2Cov [10], deepDR [14], DTINet [8] and DRHGCN [20]. The details of these comparing methods are presented in the section of Introduction. Regarding their parameter settings used for training, we explicitly adopt the default parameter values recommended in their original work for conducting a fair comparison.
The experimental results of 10-fold CV on the three benchmark datasets, including the B-dataset, C-dataset and F-dataset, are presented in Figures 2 and 3. The details of these three datasets are presented in the Methods section. We note that DDAGDL yields the best performance across all the benchmark datasets, as on average it gives 11.39%, 18.22% and 21.54% relative improvement in Accuracy, MCC and F1-score, respectively, over all baseline methods. Another point worth noting is that the two GDL-based methods, i.e. DDAGDL and DRHGCN, considerably outperform SKCNN, DeepR2Cov, deepDR and DTINet, which only handle data with an underlying Euclidean structure. This could be a strong indicator that the consideration of GDL enhances the representation learning ability of computational DR methods over HINs. The further improvement achieved by DDAGDL is mainly due to its capability of avoiding the over-smoothing issue by adaptively adjusting the aggregation depth for each node in the HIN. Hence, we have reason to believe that DDAGDL is preferred as a promising DR tool when applied to discover new indications for existing drugs.

Figure 2. The experimental results of all comparing models on the three benchmark datasets, presented in separate subfigures.

Figure 3. The ROC curves w.r.t. the overall performance of all comparing models on the three benchmark datasets, presented in subfigures (A–C), respectively.
In addition to its superiority in identifying novel DDAs, DDAGDL is also more robust against noisy data without sacrificing accuracy. Taking DRHGCN, the second-best method, as an example, DDAGDL outperforms it by 3.57%, 4.60% and 6.78% when averaging all evaluation scores over the B-dataset, C-dataset and F-dataset, respectively. Moreover, similar behaviors are observed on the evaluation metrics of Recall and Precision for all baseline methods. In particular, we note that the Precision scores obtained by deepDR, DTINet and DRHGCN are much larger than their Recall scores, as they are prone to predicting known DDAs as negative. But for DDAGDL, the incorporation of an attention mechanism alleviates the impact of noisy data by concentrating on the most representative neighborhood information during aggregation. Consequently, its Recall and Precision scores are much closer to each other. Hence, DDAGDL is granted a stronger discriminative ability in distinguishing between known DDAs and those randomly paired up when compared with baseline methods.
Regarding the unsatisfactory performance of SKCNN, DeepR2Cov, deepDR and DTINet in the DR task, their operations conducted in the Euclidean domain have a twofold effect. First, they normally formalize the drug-related network data as matrices such that the statistical properties of the data can be exploited in Euclidean space. However, different kinds of drug-related networks may present contradicting properties in their own Euclidean spaces, thus confusing the computational methods in predicting novel DDAs. Second, they fail to characterize the structure of the HIN constructed in our work, and thereby miss certain geometric prior knowledge for better learning the representations of drugs and diseases in the context of the HIN. A possible reason for the unsatisfactory performance of DeepR2Cov could be ascribed to the fact that manually selected meta-paths may not be able to appropriately capture the characteristics of DDAs for improved prediction accuracy. Although DRHGCN achieves the second-best performance on all three benchmark datasets, it suffers from the over-smoothing disadvantage, as the resulting features tend to follow a uniform distribution, which in turn constrains its predictive ability for the DR task.
In summary, the geometric prior knowledge of HIN is of benefit for DDAGDL to correctly capture its full complexity and structural richness in a non-Euclidean domain, and the proposed attention-based GDL network allows DDAGDL to seamlessly incorporate such knowledge for learning high-quality smoothed representations of drugs and diseases. Consequently, DDAGDL yields a promising performance in identifying novel DDAs.
Ablation study of DDAGDL
To better investigate the influence of GDL on the performance of DDAGDL, we have also conducted an ablation study by developing three variants of DDAGDL, i.e. DDAGDL-A, DDAGDL-N and DDAGDL-G. The main difference among them lies in how the representations of drugs and diseases are obtained. In particular, DDAGDL-A learns the representations of drugs and diseases from their biological knowledge by following an autoencoder scheme, whereas DDAGDL-N adopts the proposed attention-based GDL network for representation learning. When compared with DDAGDL, DDAGDL-N simply uses a one-hot encoding method to initialize the representations of nodes in the HIN. To verify the capability of DDAGDL in avoiding the over-smoothing issue, we develop a third variant, namely DDAGDL-G, which learns the feature representations of drugs and diseases with a traditional graph convolutional network [37] that suffers from the over-smoothing issue during the aggregation of neighborhood information. Although the results of our ablation study indicate that DDAGDL-G achieves the second-best performance on all three benchmark datasets, its performance is still constrained by the over-smoothing issue, as the resulting features tend to follow a uniform distribution and become less distinguishable. When compared with DDAGDL-G, DDAGDL further improves the accuracy of DDA prediction by avoiding the over-smoothing issue with a node-dependent local smoothing (NDLS) strategy. The XGBoost classifiers used by these three variants share the same parameter setting as DDAGDL. Their experimental results of 10-fold CV on the three benchmark datasets are presented in Figures 4 and 5, where several things can be noted.

Figure 4. The performance of DDAGDL-A, DDAGDL-N and DDAGDL-G on the three benchmark datasets in the ablation study, presented in subfigures (A–C), respectively.

Figure 5. The ROC and PR curves obtained by the three variants of DDAGDL over the three benchmark datasets in the ablation study, presented in subfigures (A–C), respectively.
First, DDAGDL-A achieves the worst performance among DDAGDL and its variants. In this regard, relying only on the biological knowledge of drugs and diseases may not be sufficient to achieve the desired DR performance. In particular, for newly discovered diseases, the lack of sufficient biological knowledge makes it difficult for DDAGDL-A to accurately identify their unknown DDAs. Second, DDAGDL-N outperforms DDAGDL-A by a considerable margin. On average, DDAGDL-N performs better by 6.67%, 5.33%, 9.67% and 5.33% than DDAGDL-A in terms of AUC, ACC, MCC and F1-score, respectively, across all the benchmark datasets. Thus, the geometric prior knowledge of the network structure allows DDAGDL-N to better capture the characteristics of drugs and diseases in a non-Euclidean domain. Last, a further improvement is observed for DDAGDL by combining the advantages of DDAGDL-A and DDAGDL-N. Comparing the performance of DDAGDL-A with that of DDAGDL-N, we reason that it is the integration of GDL that contributes the most to the performance of DDAGDL.
Classifier selection of DDAGDL
In particular, we apply several well-established classifiers, i.e. logistic regression (LR), SVM, random forest (RF), GBDT and XGBoost, to perform the task of drug repositioning with the representations of drugs and diseases learned by DDAGDL. Our purpose is to select the classifier with which the best performance of DDAGDL can be achieved. To this end, all classifiers are trained under the same 10-fold CV over all three benchmark datasets, and their experimental results are presented in Figures 6 and 7. We note that DDAGDL yields the best performance when using XGBoost as its classifier, which is mainly due to the robust ensemble learning ability of XGBoost. Hence, we decide to incorporate XGBoost into DDAGDL for DDA prediction.
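A possible way to run this comparison with scikit-learn and XGBoost is sketched below, where H and y again denote the learned pair features and labels from the earlier sketch; the classifier settings are illustrative defaults rather than tuned values.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(n_estimators=500),
    "GBDT": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
# mean AUC over the same 10-fold split for each candidate classifier
auc = {name: cross_val_score(clf, H, y, cv=10, scoring="roc_auc").mean()
       for name, clf in candidates.items()}
```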

Figure 6. The performance of different classifiers on the three benchmark datasets, presented in subfigures (A–C), respectively.

Figure 7. The ROC and PR curves obtained by different classifiers over the three benchmark datasets, presented in subfigures (A–C), respectively.
In addition, there are two points worth further commentary. On the one hand, among all classifiers, LR and SVM obtain the worst performance in terms of Accuracy, MCC, F1-score and AUC, as their limited fitting ability makes them less suitable for the representations learned from heterogeneous information networks in the drug repositioning task. On the other hand, RF and GBDT are outperformed only by XGBoost and perform better than LR and SVM. This could be a strong indicator that nonlinear classifiers are better at characterizing the associations between drugs and diseases.
Generalization ability of DDAGDL
In particular, we have conducted cross-dataset validation by taking the C-dataset and F-dataset as the training and testing datasets, respectively. To this end, all known DDAs in the C-dataset are regarded as the positive samples of the training dataset, while those in the F-dataset are used to compose the testing dataset. The strategy of selecting negative samples for both training and testing datasets is the same as the one adopted for 10-fold CV. The experimental results indicate that the Accuracy, MCC, F1-score and AUC scores obtained by DDAGDL are 90.14%, 81.15%, 90.81% and 96.58%, respectively. In this regard, DDAGDL still yields a promising performance under cross-dataset validation. One should note that a prerequisite for DDAGDL to predict well out-of-sample is that the training and external testing datasets should share many common drugs and diseases. Otherwise, DDAGDL may not be able to accurately predict novel DDAs for drugs and diseases that are not found in the training dataset.
Case studies on Alzheimer’s disease and breast cancer
To demonstrate the capability of DDAGDL in practically discovering potential DDAs, we have conducted additional experiments on the B-dataset. In particular, all DDAs in the B-dataset are used to construct the training dataset, and our purpose is to predict new candidate drugs for two diseases, i.e. Alzheimer’s Disease and Breast Cancer, as case studies. The reasons why we select the B-dataset are twofold. On the one hand, although DDAGDL performs worse on the B-dataset than on the F-dataset, discovering novel DDAs from the B-dataset is more meaningful and provides a stronger indication of the ability of DDAGDL in drug repositioning. On the other hand, more diseases are included in the B-dataset, so the generalization ability of DDAGDL can be verified by applying it to many different diseases. Since Alzheimer’s Disease and Breast Cancer are two diseases that have been extensively studied, the lists of approved drugs for them are much more complete than those of other diseases. It is for this reason that we take Alzheimer’s Disease and Breast Cancer as our case studies in this work. Besides, we also report the drug candidates of other diseases discovered by DDAGDL from the B-dataset in the Supplementary Materials.
In Table 2, we list the top 10 candidates discovered by DDAGDL as the potential drugs of Alzheimer’s Disease, and among them six candidates are verified with evidence collected from relevant literature. Taking chlorpromazine as an example, it is already known that a structural analog of chlorpromazine can treat early cognitive deficit by reducing the levels of amyloid beta (Aβ) [43]. Since pathological proteins of AD mainly contain Aβ [44], we have reason to believe that chlorpromazine has a pharmacological effect on the treatment of Alzheimer’s Disease. We also investigate the prediction results obtained by deepDR and DRHGCN, and find that none of them is able to discover the association between chlorpromazine and Alzheimer’s Disease. Hence, this phenomenon could be a strong indicator for the superior ability of DDAGDL in discovering new potential drugs for diseases.
Table 2. Top 10 drug candidates predicted by DDAGDL for Alzheimer’s disease

| Disease | Drugs | Scores | Evidence (PMID) |
|---|---|---|---|
| Alzheimer’s disease | Phenytoin | 0.89 | 16781825 |
| | Valproic acid | 0.88 | 19748552 |
| | Risperidone | 0.87 | 33176899 |
| | Chlorpromazine | 0.86 | N/A |
| | Carbamazepine | 0.86 | 28193995 |
| | Fluoxetine | 0.84 | 30592045 |
| | Cocaine | 0.82 | N/A |
| | Methotrexate | 0.81 | 32423175 |
| | Diazepam | 0.81 | N/A |
| | Diphenhydramine | 0.80 | N/A |
Regarding the case study of Breast Cancer, the top 10 candidates of potential drugs predicted by DDAGDL are shown in Table 3, and seven of them have been verified to be effective for the treatment of Breast Cancer according to our literature review. Among all unverified drugs, ethinyl estradiol obtains the largest prediction score, so an in-depth analysis is given after a systematic literature review. As has been pointed out by Iwase et al. [45], ethinyl estradiol is of benefit for metastatic breast cancer after prior aromatase inhibitor treatment. In this regard, our findings indicate a possible treatment for breast cancer by ethinyl estradiol from the perspective of artificial intelligence.
Table 3. Top 10 drug candidates predicted by DDAGDL for Breast Cancer

| Disease | Drugs | Scores | Evidence (PMID) |
|---|---|---|---|
| Breast cancer | Methylprednisolone | 0.94 | 12884026 |
| | Valproic acid | 0.92 | 30075223 |
| | Nifedipine | 0.88 | 25436889 |
| | Phenytoin | 0.86 | 22678159 |
| | Simvastatin | 0.86 | 33705623 |
| | Amiodarone | 0.86 | 26515726 |
| | Sirolimus | 0.84 | 32335491 |
| | Ethinyl estradiol | 0.83 | N/A |
| | Betamethasone | 0.83 | N/A |
| | Acetaminophen | 0.83 | N/A |
Besides, we have performed a detailed analysis to explain why DDAGDL successfully discovers verified drug candidates whose associations with their corresponding diseases are unknown in the B-dataset. In particular, the representations of verified drug candidates are compared with those of drugs whose associations with Alzheimer’s Disease and Breast Cancer are known in the B-dataset, and the Pearson coefficients between these two kinds of drugs are calculated to indicate the similarity between them. The results are presented in Figures 8 and 9 for Alzheimer’s Disease and Breast Cancer, respectively. We note that each verified drug candidate is highly similar to at least one of the known drugs according to the distribution of dark-colored blocks. This again demonstrates the rationality of DDAGDL in assigning higher scores to these drug candidates.
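A minimal sketch of this comparison is given below, assuming X holds the learned node representations and the two index lists identify the predicted candidates and the drugs with known associations to the disease.

```python
import numpy as np

def representation_similarity(X, candidate_idx, approved_idx):
    """Pearson correlation between predicted drug candidates (rows) and drugs
    already known to be associated with the disease (columns)."""
    C = np.corrcoef(np.vstack([X[candidate_idx], X[approved_idx]]))
    k = len(candidate_idx)
    return C[:k, k:]            # k x m block: candidate-vs-approved similarities
```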

Figure 8. The similarity of feature representations between predicted drugs and approved drugs for Alzheimer’s Disease. The vertical axis denotes the predicted drugs, whereas the horizontal axis denotes the approved drugs.

Figure 9. The similarity of feature representations between predicted drugs and approved drugs for Breast Cancer. The vertical axis denotes the predicted drugs, whereas the horizontal axis denotes the approved drugs.
Regarding the performance of deepDR and DRHGCN in our case studies, their experimental results are presented in the Supplementary Material. When compared with DDAGDL, both deepDR and DRHGCN yield poor performance in discovering new drugs for Alzheimer’s Disease and Breast Cancer. With deepDR, only two of the top 10 drug candidates predicted for Alzheimer’s Disease have been verified by relevant literature, while the corresponding number for Breast Cancer is only one. DRHGCN performs slightly better than deepDR, as three and two of its top 10 candidates are verified for Alzheimer’s Disease and Breast Cancer, respectively. Moreover, the prediction scores of the top 10 drug candidates yielded by DRHGCN are much lower than those of DDAGDL. In other words, DDAGDL is more confident about its prediction results. The main reason is that the attention-based GDL network used by DDAGDL allows it to focus on significant features for representation learning over the HIN. Hence, DDAGDL could be a useful DR tool due to its promising performance.
Molecular docking experiments for COVID-19
The purpose of the molecular docking experiments is to evaluate the performance of DDAGDL for newly discovered diseases, such as COVID-19, thus further demonstrating the generalization ability of DDAGDL. In particular, docking-based drug repositioning is a kind of structure-based method that aims to simulate the binding process of drugs to their target proteins by predicting the structures of receptor-ligand complexes, thereby identifying new indications for approved drugs [46, 47]. However, docking-based drug repositioning suffers from a high false-positive rate and heavy computational cost. To address these issues, DDAGDL simultaneously considers the underlying non-Euclidean structure of the HIN and the biological knowledge of drugs and diseases to improve the quality of their feature representations from different perspectives. By incorporating these feature representations with XGBoost, DDAGDL is able to predict large-scale DDAs in a reasonable time. Toward this end, we first collect a total of 55 DDAs related to COVID-19 from HDVD [48] by following the procedure of Su et al. [49], and add them into the B-dataset. Following the same training procedure as used for the case studies, we select the top five drug candidates predicted by DDAGDL for COVID-19, and conduct molecular docking experiments to evaluate their binding energies with the SARS-CoV-2 spike protein or human angiotensin-converting enzyme 2 (ACE2) [50], which are important functional receptors for COVID-19. The chemical structures of candidate drugs are downloaded from DrugBank [29], and the coordinate information of binding sites is obtained from RCSB [51]. The molecular docking experiments are performed with the AutoDockTools and AutoDock software [52], where the SARS-CoV-2 spike protein and ACE2 are taken as receptors and each candidate drug is considered as a ligand of interest. In particular, when conducting the molecular docking experiments with the AutoDock software, we explicitly set the search area to the SARS-CoV-2 spike receptor-binding domain bound with ACE2, centered at the coordinates x = −36.884, y = 29.245 and z = −0.005, as reported in [53]. The binding energies of the top five drug candidates are shown in Table 4, and their molecular docking results are presented in Figure 10.
Table 4. Binding energies between the top five drug candidates and SARS-CoV-2 spike protein/ACE2

| Rank | Drug name | DrugBank ID | Score | Binding energy (kcal/mol) |
|---|---|---|---|---|
| 1 | Methotrexate | DB00563 | 0.97 | −6.05 |
| 2 | Clozapine | DB00363 | 0.96 | −6.80 |
| 3 | Olanzapine | DB00334 | 0.96 | −6.85 |
| 4 | Morphine | DB00295 | 0.96 | −7.83 |
| 5 | Clonidine | DB00575 | 0.95 | −6.79 |

Figure 10. Molecular docking results for the top five drug candidates bound with SARS-CoV-2 spike protein/ACE2.
We note that the binding energies of the top five drug candidates are relatively low, as indicated in Table 4, demonstrating that the molecules of these candidates have strong binding affinities with the receptors of COVID-19. To further indicate their eligibility for treating COVID-19, we also perform additional molecular docking experiments on two approved drugs, i.e. Remdesivir and Ribavirin, and find that their binding energies are −7.25 kcal/mol and −6.87 kcal/mol, respectively. It is observed that among the top five drug candidates, the binding energy of Morphine is lower than those of Remdesivir and Ribavirin. Hence, Morphine is likely to have a therapeutic effect against SARS-CoV-2, and it could thereby be considered an alternative treatment for COVID-19. Moreover, we note from Table 4 that Methotrexate obtains the largest prediction score, which means that DDAGDL has the most confidence in predicting the association between Methotrexate and COVID-19. According to a literature review, we find that Methotrexate can affect the SARS-CoV-2 virus by disrupting the specific protein interactions of the targeted hub protein DDX39B [54]. One should note that, due to the high false-positive rate, the results of molecular docking only indicate a therapeutic possibility for the antiviral drug candidates newly discovered by DDAGDL, and in-depth follow-up laboratory experiments are required to verify their practical therapeutic effect for the treatment of related diseases. Overall, we believe that these five drug candidates are promising and may have therapeutic effects against SARS-CoV-2 for the treatment of COVID-19.
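For reference, the comparison above can be reproduced directly from the reported binding energies (values taken from Table 4 and the two approved reference drugs):

```python
# binding energies (kcal/mol); lower (more negative) means stronger predicted affinity
candidates = {"Methotrexate": -6.05, "Clozapine": -6.80, "Olanzapine": -6.85,
              "Morphine": -7.83, "Clonidine": -6.79}
references = {"Remdesivir": -7.25, "Ribavirin": -6.87}

strongest_reference = min(references.values())          # -7.25 (Remdesivir)
stronger_than_both = [d for d, e in candidates.items() if e < strongest_reference]
print(stronger_than_both)                               # ['Morphine']
```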
Discussion and conclusion
Regarding the DR task, although a variety of ML-based and DL-based computational methods have been proposed, few of them consider the non-Euclidean nature of biomedical data modeled with graphs, thereby limiting their accuracy in identifying novel DDAs. To overcome this problem, we leverage the learning ability of GDL to improve the quality of feature representations of drugs and diseases by incorporating the geometric prior knowledge of the HIN, and propose an efficient DR framework, namely DDAGDL, to accomplish the DR task over HINs. Experimental results demonstrate that DDAGDL yields a superior performance across all three benchmark datasets under 10-fold CV when compared with several state-of-the-art baseline methods in terms of Accuracy, MCC, F1-score and AUC. This could be a strong indicator that DDAGDL can effectively learn the smoothed feature representations of drugs and diseases by projecting complicated biological information, characterized by its non-Euclidean nature, onto a latent space with GDL. Furthermore, we have also conducted case studies to show the usefulness of DDAGDL in predicting novel DDAs by validating the top-ranked drug candidates for Alzheimer’s Disease and Breast Cancer. Our findings indicate that most of the drug candidates are of high quality, as they have already been reported by previously published studies, and some of them are not even found in the prediction results of the other compared methods. Moreover, the results of the molecular docking experiments indicate that DDAGDL is also efficacious for newly discovered diseases. Obviously, leveraging GDL provides an alternative view for addressing the DR task by properly handling the non-Euclidean nature of HINs, which has been ignored by most existing DR methods. In conclusion, DDAGDL is a promising DR tool for identifying novel DDAs, and the consideration of GDL gains new insight into the representation learning of drugs and diseases over HINs by fully exploiting their geometric prior knowledge.
There are several reasons contributing to the superior performance of DDAGDL in the DR task. First, we introduce a HIN model composed of not only the biological knowledge of drugs and diseases, but also three kinds of association networks, i.e. a drug-disease network, a drug-protein network and a protein-disease network. This provides an opportunity to address the DR task from an integrated perspective. Second, the non-Euclidean nature of the HIN raises new challenges, as the basic operations of most existing computational DR methods are designed for the Euclidean case. In this regard, an improved GDL network is incorporated into DDAGDL such that the smoothed representations of drugs and diseases are obtained by adaptively adjusting the aggregation depth. Last, DDAGDL adopts an attention mechanism to aggregate the most significant neighborhood information for representation learning. By doing so, its robustness against noisy data in the HIN can be enhanced, as indicated by the experimental results.
The limitations of DDAGDL are discussed from three aspects. First, according to the ablation study, we note that the proposed attention-based GDL network plays a critical role in improving the accuracy of DR. However, for new diseases without any known DDAs, DDAGDL has limited ability to learn their representations through the network structure. Second, the choice of classifier affects the performance of DDAGDL, and currently we can only adopt a trial-and-error approach to determine the best classifier from a set of well-established candidates. Last, the efficiency of DDAGDL is constrained by the extra cost of computing the aggregation depth for each node in the HIN.
In addition to proposing specific solutions to address the above limitations, we would also like to unfold our future work from two other aspects. First, we would like to integrate additional association networks to increase the richness of the HIN, and expect that DDAGDL will be able to learn more expressive representations of drugs and diseases. Second, we are interested in evaluating the generalization ability of DDAGDL by applying it to other kinds of association prediction problems, such as drug–drug association prediction [55] and protein–protein interaction prediction [56]. Understanding under what circumstances and to what extent DDAGDL generalizes across different tasks should be studied as well.
Key Points

We develop a novel attention-based GDL framework, i.e. DDAGDL, to learn smoothed feature representations of drugs and diseases with geometric prior knowledge in the non-Euclidean domain for improved performance of DDA prediction.
We leverage the learning ability of GDL to improve the quality of feature representations of drugs and diseases by adaptively adjusting the aggregation depth, and an attention mechanism is then constructed to distinguish the significance of features when DDAGDL learns the final representations of drugs and diseases.
Experimental results on all benchmark datasets demonstrate that DDAGDL performs better than several state-of-the-art baseline methods under 10-fold cross-validation. Furthermore, we have conducted case studies and molecular docking experiments to indicate that DDAGDL is a promising DR tool that gains new insights into exploiting geometric prior knowledge for improved efficacy.
Data availability
The dataset and source code can be freely downloaded from https://github.com/stevejobws/DDAGDL.
Authors’ contributions
B.-W.Z., L.H. and X.-R.S. contributed to the conception, and design of the study and performed the statistical analysis. P.-W.H and Y.-P. M. organized the database. X.Z., B.-W.Z. and L.H. wrote the first draft of the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
This work has been supported in part by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under grant 2021D01D05, in part by the Pioneer Hundred Talents Program of the Chinese Academy of Sciences, and in part by the Tianshan Youth Project (Outstanding Youth Science and Technology Talents of Xinjiang) under grant 2020Q005.
Author Biographies
Bo-Wei Zhao is a PhD candidate at the University of Chinese Academy of Sciences and the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences.
Xiao-Rui Su is a PhD candidate at the University of Chinese Academy of Sciences and the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences.
Peng-Wei Hu is a professor at the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China. His research interests include machine learning, big data analysis and their applications in bioinformatics.
Yu-Peng Ma is a professor at the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China. His research interests include Internet of Things application technology and big data analysis.
Xi Zhou is a professor at the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China. His research interests include machine learning and big data analysis.
Lun Hu received the B.Eng. degree from the Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2006, and the M.Sc. and Ph.D. degrees from the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, in 2008 and 2015, respectively. He joined the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China, in 2020 as a professor of computer science. His research interests include machine learning, complex network analytics and their applications in bioinformatics.