Zhen Tian, Yue Yu, Haichuan Fang, Weixin Xie, Maozu Guo, Predicting microbe–drug associations with structure-enhanced contrastive learning and self-paced negative sampling strategy, Briefings in Bioinformatics, Volume 24, Issue 2, March 2023, bbac634, https://doi.org/10.1093/bib/bbac634
Abstract
Predicting the associations between human microbes and drugs (MDAs) is a critical step in drug development and precision medicine. Since discovering these associations through wet-lab experiments is time-consuming and labor-intensive, computational methods have become an effective way to tackle this problem. Recently, graph contrastive learning (GCL) approaches have shown great advantages in learning the embeddings of nodes from heterogeneous biological graphs (HBGs). However, most GCL-based approaches do not fully capture the rich structure information in HBGs. Moreover, few MDA prediction methods screen out the most informative negative samples for effectively training the classifier. Therefore, the accuracy of MDA prediction still needs to be improved.
In this study, we propose a novel approach that employs the Structure-enhanced Contrastive learning and Self-paced negative sampling strategy for Microbe-Drug Association predictions (SCSMDA). Firstly, SCSMDA constructs the similarity networks of microbes and drugs, as well as their different meta-path-induced networks. Then SCSMDA employs the representations of microbes and drugs learned from the meta-path-induced networks to enhance their embeddings learned from the similarity networks by the contrastive learning strategy. After that, we adopt the self-paced negative sampling strategy to select the most informative negative samples to train the MLP classifier. Lastly, SCSMDA predicts the potential microbe–drug associations with the trained MLP classifier. Because the embeddings of microbes and drugs learned from the similarity networks are enhanced with the contrastive learning strategy, SCSMDA obtains discriminative representations of both. Extensive results on three public datasets indicate that SCSMDA significantly outperforms other baseline methods on the MDA prediction task. Case studies for two common drugs further demonstrate the effectiveness of SCSMDA in finding novel MDAs.
The source code is publicly available on GitHub https://github.com/Yue-Yuu/SCSMDA-master.
Introduction
Microbes, or microorganisms, are microscopic living organisms that interact closely with their human hosts. Generally, a microbe community mainly contains bacteria, viruses, protozoa and fungi [1]. Recent studies have shown that microbe communities usually play significant roles in human health, such as facilitating metabolism [2], producing essential vitamins [3] and protecting against invasion by pathogens [4]. However, the imbalance or dysbiosis of microbe communities may also cause common diseases such as obesity [5], diabetes [6] and even cancer [7]. Therefore, discovering the relationships between microbes and drugs is an essential problem for precision medicine [8–10].
Since inferring these associations with conventional wet-lab experiments is time-consuming, computational methods have been proposed to tackle this problem. Moreover, with the increasing availability of various data sources related to microbes and drugs, these computational approaches have achieved remarkable success [11]. For example, Zhu [12] proposed the HMDAKATZ method, which predicts potential associations based on the microbe–drug heterogeneous network. Long proposed the GCNMDA model, which first measures the similarity between microbes and drugs and then employs a conditional random field-based framework to learn their deep representations [13]. HNERMDA [14] constructed the microbe–drug heterogeneous network and adopted the metapath2vec model to learn low-dimensional embeddings. EGATMDA [15] aimed to fully utilize the multisource information of microbes and drugs to discover their associations. This model learns the importance of different heterogeneous networks with a graph-level attention mechanism and then obtains deep representations of microbes and drugs. Meanwhile, Graph2MDA [16] employed the variational graph autoencoder to obtain informative and interpretable latent representations for microbes and drugs based on their multimodal attributed graphs. Besides, MKGCN [17] first extracted the features of microbes and drugs at different graph convolutional network (GCN) layers and then predicted microbe–drug associations with multiple kernel matrices. However, these approaches have some weaknesses. For example, HMDAKATZ only adopted simple metrics to evaluate the association strengths between microbes and drugs, while GCNMDA and EGATMDA selected negative samples at random, which ignores the effects of different negative samples on the prediction model. Meanwhile, MKGCN could not fully capture the complex structure and rich semantics between nodes in the heterogeneous networks.
Recently, self-supervised learning approaches have attracted considerable attention because they reduce the dependency on known labels and enable training on massive unlabeled data [18]. They have also shown a superior capacity for learning discriminative representations of nodes in graphs [19, 20]. Meanwhile, graph contrastive learning (GCL) modules have been widely used to handle pairwise relationship prediction tasks among biological entities in bioinformatics. For example, SGCL-DTI first generated the topology and semantic graphs for drug–target pairs and established a contrastive loss function to guide the learning process in a supervised manner to obtain embeddings of drugs and targets [21]. To predict protein–peptide binding residues, PepBCL established a novel contrastive learning strategy to learn the embeddings of binding residues based on the imbalanced dataset [22]. To predict cancer drug response, GraphCDR first constructed two different drug–cell line association networks and adopted the contrastive learning strategy to enhance its ability to learn the feature representations of nodes [23]. Besides, MIRACLE used a multiview graph contrastive learning strategy to predict drug–drug interactions, which could capture molecule structure in the inter-view and interactions in the intra-view between molecules simultaneously [24]. To fully learn the embeddings of nodes in heterogeneous networks, HeCo generated a network schema view and a meta-path view based on HINs, and applied a cross-view contrastive mechanism to capture the information in local and high-order structures simultaneously [25]. In bioinformatics, appropriately generating different meaningful views is an essential step for all of these approaches.
Standard data augmentation approaches, such as node dropping or edge perturbation, are not trivial for common biological networks because they might damage the original graph structure and degrade the ability of prediction models in learning the feature representations [26, 27]. Meanwhile, as heterogeneous networks usually consist of multiple types of nodes and relations, GCL approaches should comprehensively mine the complex structure and rich semantics for learning the embeddings of nodes.
For the pairwise relationship prediction task, it is still challenging to select the most informative negative samples from the candidate negative sample set [28]. Existing machine learning methods typically treat the known associations (labeled samples) between entities as the positive samples and the remaining unconfirmed associations (unlabeled samples) as the candidate negative samples [29]. As a result, there is an extreme imbalance between the numbers of positive and negative samples. Moreover, with the negative under-sampling strategy, most approaches only randomly select a subset of negative samples from the whole candidate negative sample set [30]. For example, for the drug–target interaction prediction [31], miRNA–disease association prediction [30, 32–34] and microbe–drug association prediction problems [13], these methods randomly selected the same number of negative samples as positive samples. A standard random under-sampling strategy often neglects important and informative samples and introduces meaningless and noisy samples [35]. Although some other models [36–38] improved the negative sampling strategy, they do not fully screen out the most informative negative samples, which play an important role in training the classifiers, and this may largely limit their prediction capability.
Motivated by GCL approaches, we adopt the structure-enhanced contrastive learning strategy to obtain deep representations of microbes and drugs. Since microbes and drugs have multisource information, we first measure their respective similarities from different perspectives and construct the integrated similarity networks. Then, to fully capture the complex structure and rich semantics of the microbe–drug association network, we establish the meta-path-induced networks based on different meta-paths. The similarity networks and meta-path-induced networks thus form the two views for contrastive learning, and we utilize the meta-path-induced networks of microbes and drugs to enhance their feature representations learned from the similarity networks. Besides, we adopt the self-paced negative sampling strategy to select the most informative negative samples, which aims to improve the capability of the prediction model.
In this study, we put forward a novel method that employs Structure-enhanced Contrastive learning and Self-paced negative sampling strategy to identify potential Microbe-Drug Associations (SCSMDA). Firstly, SCSMDA constructs the similarity networks of microbes and drugs, as well as their different meta-path-induced networks. Then, we employ the meta-path-induced networks of microbes and drugs to enhance their feature representations learned from the similarity networks with the contrastive learning strategy. After that, we utilize the self-paced negative sampling strategy to select the most informative negative samples to train the MLP classifier. Lastly, SCSMDA predicts the potential microbe–drug associations with the trained MLP classifier.
The workflow of SCSMDA is displayed in Figure 1. Our main contributions are summarized as follows:
1) SCSMDA constructs the similarity networks with the multisource information of microbes and drugs, and the meta-path-induced networks of microbes and drugs with different meta-paths.
2) SCSMDA employs the structure-enhanced contrastive learning strategy to obtain the discriminative embeddings of microbes and drugs in a self-supervised manner based on their similarity networks and meta-path-induced networks.
3) SCSMDA adopts the self-paced negative sampling strategy to select the most informative negative samples for training the MLP classifier.
4) Experimental results on three datasets indicate that SCSMDA outperforms other baseline approaches in microbe–drug association prediction tasks.

Figure 1. The overall workflow of SCSMDA. In step 1, SCSMDA constructs the similarity networks of microbes and drugs with their multisource information, as well as their different meta-path-induced networks. In step 2, we employ the meta-path-induced networks of microbes and drugs to enhance their feature representations learned from the similarity networks with the contrastive learning strategy. In step 3, SCSMDA adopts the self-paced negative sampling strategy to select the most informative negative samples for training the MLP classifier. In step 4, SCSMDA predicts the potential microbe–drug associations with the trained MLP classifier. In the figure, |$\Phi _1$|, |$\Phi _2$|, |$\Phi _3$| and |$\Phi _4$| denote meta-path MDM, MDMDM, DMD and DMDMD, respectively. SLA represents the semantic level attention.
Materials and methods
Notations | Descriptions |
---|---|
|${\mathcal{G}}$| | Heterogeneous Information Network |
|${\Phi }$| | Meta-path |
|$A$| | microbe–drug association matrix |
|$A_\Phi $| | Meta-path-induced matrix under |$\Phi $| |
|${h}$| | Initial features of nodes |
|${h}^{\prime}$| | Projected feature of nodes |
|${sn}$| | The integrated similarity network |
|${mp}$| | The integrated meta-path-induced network |
|$z_{m_i}^{sn}$| | The embedding of microbe |$m_i$| learned from |$sn$| |
|$z_{m_i}^{mp}$| | The embedding of microbe |$m_i$| learned from |$mp$| |
|$z_{d_j}^{sn}$| | The embedding of drug |$d_j$| learned from |$sn$| |
|$z_{d_j}^{mp}$| | The embedding of drug |$d_j$| learned from |$mp$| |
|$z_{m_i}$| | The final embedding of microbe |$m_i$| |
|$z_{d_j}$| | The final embedding of drug |$d_j$| |
|$\mathcal{H}$| | The Hardness function |
|$\mathcal{N}^{\Phi }_{v_i}$| | Meta-path-based neighbors for |$v_i$| with |$\Phi $| |
|$(i,j)$| | The node pair of microbe |$m_i$| and drug |$d_j$| |
|$y_{ij}$| | The ground truth of the node pair |$(i,j)$| |
|$\hat{y}_{ij}$| | The predicted score of the node pair |$(i,j)$| |
|${Y}^+$| | Positive MDAs in the training set |
|${Y}^-$| | Selected negative MDAs in the training set |
MLP | The Multilayer Perceptron |
In this section, we will first briefly describe the experiment datasets and basic concepts used in SCSMDA. Then, the integrated similarity networks and meta-path-induced networks of microbes and drugs are established. Next, SCSMDA learns the embeddings of microbes and drugs with structure-enhanced contrastive learning strategy. After that, we utilize the self-paced negative sampling strategy to select the most informative negative samples and train the MLP classifier. Lastly, the loss function and some implementation details are presented.
Data collection
Currently, there are three main known microbe–drug association datasets: MDAD [39], aBiofilm [40] and DrugVirus [41]. We collect these public datasets from the research [13] (https://github.com/longyahui/GCNMDA). Specifically, MDAD contains 173 microbes and 1373 drugs involving 2470 associations. The aBiofilm dataset consists of 2884 microbe–drug associations between 140 microbes and 1720 drugs. The DrugVirus dataset contains 95 microbes and 175 drugs with 933 microbe–drug associations between them. The statistics of these datasets are displayed in Table 2.
In each dataset, the associations between microbes and drugs can be represented as one bipartite network. Without loss of generality, the corresponding adjacency matrix can be denoted as |$A \in \mathbb{R} ^{N_m \times N_d}$|, where |$N_m$| and |$N_d$| represent the numbers of microbes and drugs in the bipartite network. |$A_{ij}$| is 1 if there is an association between |$m_i$| and |$d_j$|, and 0 otherwise.
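As an illustration of this definition, the following minimal sketch builds |$A$| from a list of known (microbe, drug) index pairs. The function name `build_association_matrix` and the toy pairs are assumptions made for the example; they are not taken from the SCSMDA code base or from any of the three datasets.

```python
def build_association_matrix(pairs, n_microbes, n_drugs):
    """Return an n_microbes x n_drugs 0/1 matrix A from (microbe, drug) index pairs."""
    A = [[0] * n_drugs for _ in range(n_microbes)]
    for i, j in pairs:
        A[i][j] = 1  # A_ij = 1 when microbe i is associated with drug j
    return A

# Toy data (made up for illustration): 3 microbes, 4 drugs, 5 known associations.
toy_pairs = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 3)]
A = build_association_matrix(toy_pairs, n_microbes=3, n_drugs=4)

n_known = sum(map(sum, A))   # number of positive entries
density = n_known / (3 * 4)  # fraction of microbe-drug pairs that are known associations
```

Applying the same density calculation to the MDAD figures quoted above gives 2470 / (173 × 1373), roughly 1%, which shows how sparse the positive entries are.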
Basic concept
Heterogeneous Information Network (HIN). One heterogeneous information network could be defined as an undirected graph |$\mathcal{G}=(\mathcal{V},\mathcal{E})$| with the entity type mapping function |$\phi : \mathcal{V} \rightarrow \mathcal{A}$| and a relation type mapping |$\varphi : \mathcal{E} \rightarrow \mathcal{R}$|, where |$\mathcal{V}$| and |$\mathcal{A}$| denote the entity set and entity type set, and |$\mathcal{E}$| and |$\mathcal{R}$| denote the relation set and relation type set. Network |$\mathcal{G}$| will be one homogeneous information network if |$\left \lvert \mathcal{A}\right \rvert +\left \lvert \mathcal{R}\right \rvert =2$|. Otherwise, it will be one heterogeneous information network.
The microbe–drug association network (Figure 2A) could be treated as one HIN, since there are two types of nodes which are microbe and drug, and one type of link, which is the association relationship.
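The homogeneous/heterogeneous criterion above can be written as a one-line check. This is only a sketch of the definition; the function name and the type labels (`"associates_with"`, `"similar_to"`) are hypothetical.

```python
def is_heterogeneous(node_types, relation_types):
    """Apply the definition above: a network is homogeneous iff |A| + |R| == 2."""
    return len(set(node_types)) + len(set(relation_types)) > 2

# Microbe-drug association network: two node types, one relation type.
microbe_drug_is_hin = is_heterogeneous({"microbe", "drug"}, {"associates_with"})

# A drug-only similarity network: one node type, one relation type.
drug_similarity_is_hin = is_heterogeneous({"drug"}, {"similar_to"})
```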

Figure 2. A toy example for SCSMDA. (A) Microbe–drug association network. (B) Four meta-paths involved in SCSMDA, which are MDM, MDMDM, DMD and DMDMD. (C) Drug |$D_2$| and its DMD meta-path-based neighbors |$D_1$|, |$D_2$|, |$D_3$| and |$D_4$| based on the microbe–drug association network in (A). (D) The meta-path-induced network with DMD based on the network in (A).
Meta-paths. Generally, one meta-path |$\Phi $| with |$l$| nodes can be defined as |$N_1 \stackrel{R_1}{\longrightarrow }N_2 \stackrel{R_2}{\longrightarrow } \cdots \stackrel{R_{l-1}}{\longrightarrow }N_{l}$|, which is abbreviated as |$N_1N_2\cdots N_{l}$|. The composition relation between node |$N_1$| and |$N_{l}$| is formulated as |$R=R_1\circ R_2 \circ \cdots \circ R_{l-1}$|, where |$\circ $| is the composition operator on relations.
In the microbe–drug HIN (Figure 2A), two drugs can be connected by different meta-paths (Figure 2B), such as DMD and DMDMD. This type of meta-paths usually has a certain biological meaning. For example, DMD indicates that if two drugs interact with one common microbe, they should have a higher similarity with consistent functionality.
Meta-path-based neighbors. Given a node |$v_i$| and a meta-path |$\Phi $|, the meta-path-based neighbors |$\mathcal{N}_{v_i}^{\Phi }$| of |$v_i$| are defined as the nodes that connect with |$v_i$| according to the meta-path |$\Phi $|.
As shown in Figure 2C, for drug |$D_1$|, its DMD meta-path-based neighbors are |$D_1$|, |$D_2$|, |$D_3$| and |$D_4$| based on the network in Figure 2A.
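To make the definition concrete, the sketch below walks a meta-path over a small association network and collects the nodes reached at its end. The edge list is invented for illustration and is not the network of Figure 2A; the convention of reading the node type from the first letter of the node name is likewise an assumption of this example. As in the text, a node counts as its own neighbor when a path leads back to it.

```python
def metapath_neighbors(start, metapath, adj):
    """Nodes reachable from `start` by following `metapath` (e.g. "DMD").

    adj maps a node name to the set of nodes it is associated with.
    Node types are read from the first letter of the node name (toy convention).
    """
    if start[0] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")
    frontier = {start}
    for node_type in metapath[1:]:
        # Move one hop along the meta-path, keeping only nodes of the required type.
        frontier = {nb for node in frontier for nb in adj[node] if nb[0] == node_type}
    return frontier

# Hypothetical toy network (not Figure 2A).
edges = [("M1", "D1"), ("M1", "D2"), ("M2", "D2"), ("M2", "D3"), ("M3", "D4")]
adj = {}
for m, d in edges:
    adj.setdefault(m, set()).add(d)
    adj.setdefault(d, set()).add(m)

dmd_of_d2 = metapath_neighbors("D2", "DMD", adj)
dmdmd_of_d1 = metapath_neighbors("D1", "DMDMD", adj)
mdm_of_m3 = metapath_neighbors("M3", "MDM", adj)
```

In this toy network, `D1` and `D3` share no microbe, so they are not DMD neighbors, but the longer DMDMD meta-path connects them through `D2`.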
Microbe and drug similarity network construction
Microbe similarity network construction
SCSMDA measures the similarity of microbes from two aspects. The first kind of similarity is the microbe functional similarity. For two microbes |$m_i$| and |$m_j$|, their microbe functional similarity is denoted as |$FM(m_i,m_j)$|. SCSMDA measures the microbe functional similarity between all microbe pairs and establishes the Microbe Functional Similarity Network. The detailed calculation process is presented by Kamneva [42] and Long [13].
SCSMDA measures the integrated similarities for all the microbe pairs and then constructs the integrated microbe similarity network.
Drug similarity network construction
Meanwhile, we also measure the similarity of drugs from two aspects. The first one is the drug structure-based similarity proposed by Hattori [43]. For two drugs |$d_i$| and |$d_j$|, their structure-based similarity is represented as |$DS(d_i,d_j)$|. After calculating the similarities between all drug pairs, we establish the Drug Structure-based Similarity Network.
SCSMDA measures the integrated similarities for all the drug pairs and then constructs the integrated drug similarity network.
Meta-path-induced network construction
The microbe–drug association network can be regarded as one HIN with complex structure and rich semantics. Meta-paths could comprehensively reflect the structure of HINs and have been widely employed to capture rich semantic meanings in HINs. Therefore, SCSMDA establishes different meta-path-induced networks for microbes and drugs according to their diverse meta-paths.
A toy example for constructing the meta-path-induced network has been represented in Figure 2D.
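The exact construction formula is not reproduced in this section, so the following is only a sketch under a common assumption: the number of meta-path instances between two nodes is obtained from products of the association matrix |$A$| and its transpose, and two nodes are linked in the induced network when that count is positive. The helper names and the binarization step are assumptions of this example.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def binarize(X):
    return [[1 if v > 0 else 0 for v in row] for row in X]

def metapath_induced_networks(A):
    """Induced adjacency matrices for MDM, MDMDM, DMD and DMDMD (assumed construction)."""
    At = transpose(A)
    mdm_counts = matmul(A, At)   # (i, j): number of M-D-M paths between microbes i and j
    dmd_counts = matmul(At, A)   # (i, j): number of D-M-D paths between drugs i and j
    return {
        "MDM": binarize(mdm_counts),
        "MDMDM": binarize(matmul(mdm_counts, mdm_counts)),
        "DMD": binarize(dmd_counts),
        "DMDMD": binarize(matmul(dmd_counts, dmd_counts)),
    }

# Toy association matrix: 3 microbes (rows) x 4 drugs (columns), made up for illustration.
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
nets = metapath_induced_networks(A)
```

The raw path counts (before binarization) could also serve as a measure of how strongly two nodes are connected by a meta-path, which is the kind of quantity a "connected by enough meta-paths" criterion would threshold.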
Node feature transformation
Embeddings learning from the integrated similarity networks
The ultimate embeddings for microbes |$m_i$| and drugs |$d_j$| learned from the integrated similarity networks can be represented as |$z_{m_i}^{sn}$| and |$z_{d_j}^{sn}$|.
Embedding learning with meta-path-induced networks
SCSMDA generates two different meta-path-induced networks for microbes and drugs, respectively. Since microbes and drugs have similar learning module structures with meta-path-induced networks, we only take microbes as an example to show the process that SCSMDA learns their embeddings with vanilla GCNs.
Similarly, for a drug |$d_j$|, its final embedding learned from the meta-path-induced networks |$A_{\Phi _n}$|, where |$n \in \{3,4\}$|, is represented as |$z_{d_j}^{mp}$|.
Structure-enhanced contrastive learning strategy

The positive pair selecting strategy of SCSMDA. |$G_i$| and |$G_j$| are two different views. |$m_1$| and |$m^{\prime}_1$| are the same node in |$G_i$| and |$G_j$|, and (|$m_1$|, |$m^{\prime}_1$|) is the intrinsic positive pair. |$m^{\prime}_2$| and |$m^{\prime}_5$| are conditional positive samples for |$m_1$| if they are connected to it by enough meta-paths, in which case (|$m_1$|, |$m^{\prime}_2$|) and (|$m_1$|, |$m^{\prime}_5$|) are conditional positive pairs. Meanwhile, (|$m_1$|, |$m^{\prime}_3$|) and (|$m_1$|, |$m^{\prime}_4$|) are conditional negative pairs if their nodes are not connected by enough meta-paths.
SCSMDA learns the embeddings of nodes from the integrated similarity network by aggregating information from their direct neighbors, which allows it to capture the local structure. Meanwhile, SCSMDA also learns embeddings from the meta-path-induced networks with multiple meta-paths, aiming to capture the high-order structure information. In our study, the proposed structure-enhanced contrastive strategy employs the representations of microbes and drugs from the meta-path-induced networks to enhance their embeddings learned from the similarity networks with the contrastive learning strategy. SCSMDA adopts the embeddings of microbes learned from the integrated similarity network as their final embedding.
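The contrastive objective itself is not reproduced in this section. As a hedged illustration of the general idea, the sketch below implements a generic InfoNCE-style cross-view loss with cosine similarity and a temperature |$\tau $|, in the spirit of cross-view methods such as HeCo [25]; it should not be read as the exact SCSMDA loss. The function names are hypothetical, and `positives[i]` is assumed to list, for anchor node `i` in the similarity view, the indices of its positive samples in the meta-path view (the node itself plus any conditional positives).

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_view_loss(z_sn, z_mp, positives, tau=0.5):
    """Mean over anchors i of -log( sum_{j in pos(i)} e^{s_ij/tau} / sum_j e^{s_ij/tau} ),
    where s_ij = cosine(z_sn[i], z_mp[j])."""
    total = 0.0
    for i, anchor in enumerate(z_sn):
        scores = [math.exp(cosine(anchor, other) / tau) for other in z_mp]
        pos = sum(scores[j] for j in positives[i])
        total += -math.log(pos / sum(scores))
    return total / len(z_sn)

# Two nodes with 2-d embeddings in each view (toy values).
z_sn = [[1.0, 0.0], [0.0, 1.0]]
z_mp = [[1.0, 0.0], [0.0, 1.0]]

aligned = cross_view_loss(z_sn, z_mp, positives=[[0], [1]])     # each node paired with itself
misaligned = cross_view_loss(z_sn, z_mp, positives=[[1], [0]])  # each node paired with the other
```

The loss is small when an anchor's positives are the nodes most similar to it in the other view, which is what pushes the similarity-view embedding toward the structure captured by the meta-path view.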
Self-paced negative sampling strategy
In the microbe–drug association datasets, all the known microbe–drug associations form the positive sample set denoted as |$P$|, whereas all the remaining microbe–drug associations are regarded as the candidate negative samples denoted as |$N$|. The number of positive and negative samples in this study has a relationship that |$\vert N\vert \gg \vert P \vert $|.
Previous research typically selects, at random, the same number of negative samples as positive samples from the candidate negative sample set, which does not fully consider the specificity of negative samples. Selecting the most informative samples from the candidate negative sample set is a challenging task that affects the capability of the prediction model. Here we employ the self-paced negative sampling strategy motivated by SPE [46] to choose the most informative negative samples.
The self-paced negative sampling strategy divides samples in |$N$| into three classes with the Hardness function |$\mathcal{H}$|, which are trivial samples, noise samples and borderline samples. The trivial samples are scored with small values by |$\mathcal{H}$| indicating that they are well classified by the classifier, whereas the noise samples are scored with large values by |$\mathcal{H}$| meaning that they may be false negative samples. These two types of samples should be selected as the negative samples with smaller probabilities for training the classifier. Correspondingly, we should focus on the borderline samples with scores around 0.5, since these samples are the most informative and should be selected as the negative samples with larger probabilities for training the classifier.
There are four steps for self-paced negative sampling strategy in SCSMDA, which are listed below.
Step one: SCSMDA predicts the values for all the candidate negative microbe–drug association pairs with the MLP classifier |$f(\cdot )$|.
Step two: SCSMDA cuts all the candidate negative samples into |$k$| bins with respect to the values scored by the hardness function |$\mathcal{H}$|, which can be formulated as:
$$ \begin{align}& B_{l}=\{(x,y)| \frac{l-1}{k} \leq \mathcal{H}(x,y,f) <\frac{l}{k}\}, \mathcal{H} \in [0,1], \tag{26} \end{align} $$
where |$k$| is a hyper-parameter and |$B_l$| is the negative sample set for the |$l$|-th bin, with |$l \in \{1,2, \dots , k\}$|. The hardness function used in SCSMDA is defined as:
$$ \begin{align}& \mathcal{H}(x,y,f)=|f(x)-y|, \tag{27} \end{align} $$
where |$f(x)$| represents the MLP classifier’s output probability score of sample |$x$| and |$y$| is the ground-truth label of sample |$x$|.
Step three: SCSMDA employs the self-paced negative sampling strategy to select the negative samples from the |$k$| bins and obtains the negative sample set, which can be denoted as:
$$ \begin{align}& N_0=\{ x_{lj} | l \in [1,k], j\in [1, S_{B_l}]; l,j \in \mathbb{N^+} \}, \tag{28} \end{align} $$
where |$k$| is the number of bins, |$S_{B_l}$| is the number of negative samples selected from the |$l$|-th bin, and |$x_{lj}$| denotes the |$j$|-th selected sample from the |$l$|-th bin |$B_l$|. Parameter |$S_{B_l}$| is defined as:
$$ \begin{align*} &\begin{cases} S_{B_l}=\frac{w_l}{\sum_t w_t}\cdot|P|, t=1,\cdots,k \\ w_l=\frac{1}{h_l+\alpha}, l =1,\cdots,k \\ \alpha=\tan(\frac{i\pi}{2n})\\ h_l=\sum_{x_{lj} \in B_l} \mathcal{H}(x_{lj},y_{lj}, f)/|B_l|, l=1, \cdots, k, \end{cases} \end{align*} $$
where |$w_l$| represents the sampling weight of the |$l$|-th bin, which is normalized by |$\sum_t w_t$|, |$\alpha $| is called the self-paced factor and |$h_l$| denotes the average hardness contribution of the |$l$|-th bin. Besides, |$i$| denotes the iteration number.
Step four: The selected negative samples |$N_0$| and all the known positive samples |$P$| together form the training set used to train the MLP classifier, after which the next iteration begins.
The algorithm for the self-paced negative sampling strategy is shown in Algorithm 1.
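Steps two and three can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name is hypothetical; |$n$| is taken to be the total number of iterations (it is not defined in the text above); empty bins receive no quota; a sample with hardness exactly 1 is placed in the last bin; a tiny epsilon guards the division when |$h_l+\alpha =0$|; and each quota is rounded and capped by the bin size. All candidates carry label |$y=0$|, so their hardness equals the classifier score.

```python
import math
import random

def self_paced_negative_sample(scores, n_pos, k, i, n, rng):
    """Pick about n_pos negatives from candidates scored by the classifier.

    scores: f(x) in [0, 1] for every candidate negative (y = 0, so H = f(x)).
    k: number of hardness bins; i: current iteration; n: total iterations (assumed).
    Returns a list of indices into `scores`.
    """
    bins = [[] for _ in range(k)]
    for idx, s in enumerate(scores):
        hardness = abs(s - 0.0)               # Eq. (27) with y = 0
        l = min(int(hardness * k), k - 1)     # Eq. (26); H == 1 goes to the last bin
        bins[l].append(idx)

    alpha = math.tan(i * math.pi / (2 * n))   # self-paced factor
    weights = []
    for b in bins:
        if not b:
            weights.append(0.0)               # empty bin: no quota (assumption)
            continue
        h = sum(scores[idx] for idx in b) / len(b)   # average hardness h_l
        weights.append(1.0 / (h + alpha + 1e-12))    # w_l, epsilon avoids division by zero
    total_w = sum(weights)

    selected = []
    for b, w in zip(bins, weights):
        quota = min(len(b), round(w / total_w * n_pos))  # S_{B_l}, capped by bin size
        selected.extend(rng.sample(b, quota))
    return selected

rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]   # stand-in for classifier outputs
early = self_paced_negative_sample(scores, n_pos=100, k=10, i=0, n=10, rng=rng)
late = self_paced_negative_sample(scores, n_pos=100, k=10, i=9, n=10, rng=rng)
```

With |$\alpha =0$| in the first iteration the weights are |$1/h_l$|, so bins with low average hardness receive the largest quotas; as |$i$| grows, |$\alpha $| dominates |$h_l$| and the quotas even out across bins.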
Final decoder
Loss function
Implementation details
SCSMDA initializes the learnable parameters with Glorot initialization [47] and trains the model with Adam [48]. We adopt the grid search strategy to tune parameters for SCSMDA. Specifically, the learning rate is set to 0.0005 and |$\tau $| is tuned to 0.5. The final embedding sizes of drugs and microbes are both 128. The numbers of GCN layers and MLP layers are equal to 1. The best number of positive pairs |$k$| is 10. Besides, during the training process, the dropout values for the encoder on the integrated similarity network and meta-path-induced network are 0.95 and 0.3, respectively, and SCSMDA achieves the highest evaluation values when the number of epochs is 1000.
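For reference, the reported settings can be collected in one place, together with a plain-Python sketch of Glorot initialization. The dictionary keys are hypothetical names chosen for this example, the input dimension of 64 is arbitrary, and the uniform variant of Glorot initialization is assumed because the text does not say which variant is used.

```python
import math
import random

# Values as reported in the text; the key names are hypothetical.
CONFIG = {
    "learning_rate": 0.0005,
    "tau": 0.5,
    "embedding_dim": 128,
    "gcn_layers": 1,
    "mlp_layers": 1,
    "num_positive_pairs": 10,
    "dropout_similarity_encoder": 0.95,
    "dropout_metapath_encoder": 0.3,
    "epochs": 1000,
}

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_out)] for _ in range(fan_in)]

# Example: a 64 x 128 weight matrix (64 is an arbitrary input size for illustration).
W = glorot_uniform(64, CONFIG["embedding_dim"], random.Random(0))
bound = math.sqrt(6.0 / (64 + CONFIG["embedding_dim"]))
```

In PyTorch, the same initialization is available as `torch.nn.init.xavier_uniform_`.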
In addition, we implement our model in a software environment consisting of PyCharm Community Edition 2022.1.1 and Python v3.9.5 with the libraries PyTorch v1.11.0, NumPy v1.22.3, scikit-learn v1.1.1, SciPy v1.9.3 and tqdm v4.64.0. All experiments were performed on a desktop computer with one Intel(R) Core(TM) i5-12600KF CPU and one NVIDIA RTX3060 8GB GPU. The detailed implementation information has been published on GitHub (https://github.com/Yue-Yuu/SCSMDA-master).
Time complexity analysis
As shown in Figure 1, there are three main steps for training SCSMDA: the construction of the similarity and meta-path-induced networks, the structure-enhanced contrastive learning strategy, and the self-paced negative sampling strategy. We analyze the time complexity of each in turn.
In step one, suppose there are |$m$| microbes and |$n$| drugs. SCSMDA first measures the similarity between microbes and between drugs, with time complexity |$O({m^2}/{2})+O({n^2}/{2})$|. For establishing the integrated microbe and drug similarity networks, the time complexity is |$O(m^2)+O(n^2)$|. Besides, for establishing the meta-path-induced networks, the time complexities are |$O(m^2n)$|, |$O(m^3)$|, |$O(n^2m)$| and |$O(n^3)$| under meta-paths |$\Phi _1$|, |$\Phi _2$|, |$\Phi _3$| and |$\Phi _4$|, respectively. As a result, the total time complexity in this step is |$O(m^2/2)+O(n^2/2)+ O(m^2)+O(n^2)+O(m^2n)+O(n^2m)+O(m^3)+O(n^3)=O(m^3)+O(n^3)$|.
In step two, SCSMDA adopts GCNs to learn the embeddings of microbes and drugs. Suppose the number of GCN layers is 1, and the initial and output feature dimensions of nodes are |$C$| and |$F$|; then the time complexity for learning embeddings is |$O(|E|CF)$|, where |$E$| is the edge set of the input network to the GCNs. Besides, since SCSMDA measures the similarity between all the nodes in the contrastive learning process, the time complexity of that process is |$O(m^2)+ O(n^2)$|. Therefore, the total time complexity in this step is |$O(|E|CF)+O(m^2)+ O(n^2)$|.
In step three, SCSMDA selects the most informative samples from the candidate negative sample set. Let |$P$| denote the positive sample set. The number of positive microbe–drug pairs is then |$|P|$|, and the number of candidate negative microbe–drug pairs is |$mn-|P|$|. The time complexity of one epoch is |$O(mn-|P|)$|. Suppose the number of epochs is |$T$|; then the time complexity is |$O(T( mn-|P|))$|. Since |$mn \gg \vert P \vert $|, the total time complexity of this step is |$O(Tmn)$|.
In summary, the total time complexity for training SCSMDA is the sum over the three steps above, which can be written as |$O(m^3) + O(n^3)+O(|E|CF)+ O(n^2)+ O(m^2)+O(Tmn)=O(m^3)+O(n^3)+ O(|E|CF)+O(Tmn)$|. Because |$C$| and |$F$| are constants and |$|E|$| is at most quadratic in the number of nodes, the |$O(|E|CF)$| term is absorbed by the cubic terms, so the overall time complexity is |$O(m^3)+O(n^3)+O(Tmn)$|.
Results
In this section, we first describe the evaluation metrics used in our study. Then, a comprehensive comparison between SCSMDA and the baseline approaches is presented from different aspects. After that, an ablation study and a parameter sensitivity analysis are carried out for SCSMDA. Lastly, we conduct case studies for two drugs of interest.
Experimental setup and evaluation metrics
In this study, we adopt the 5-fold cross-validation (5-CV) strategy [49, 50] to evaluate SCSMDA and the baseline approaches on the MDAD, aBiofilm and DrugVirus datasets. Specifically, for each dataset, all known microbe–drug association pairs are treated as positive samples and form the positive sample set, whereas all remaining unknown pairs are treated as candidate negative samples and form the candidate negative sample set. From the candidate negative sample set, SCSMDA selects as many negative samples as there are positive samples, according to the self-paced negative sampling strategy. The positive samples and the selected negative samples together form the experimental dataset, on which we conduct the 5-CV evaluation.
For the 5-CV experiment, SCSMDA first divides the experimental dataset into five subsets of equal size. Then, each subset is used as the test subset in turn, and the remaining four subsets serve as the training subsets. In this way, we can count the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
We employ five metrics to evaluate SCSMDA and the comparison methods: the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPRC), accuracy (ACC), the Matthews correlation coefficient (MCC) and the F1 score. These metrics are widely used in previous studies [30], so we do not restate their definitions here.
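The fold rotation described above can be sketched as follows. This is a generic illustration rather than the authors' code: `k_fold_indices`, `confusion_counts` and `cross_validate` are hypothetical names, and `fit_predict` stands for any routine that trains on the training indices and returns binary predictions for the test indices.

```python
import random
from typing import Callable, Sequence


def k_fold_indices(n_samples: int, n_folds: int = 5, seed: int = 0) -> list[list[int]]:
    """Shuffle sample indices and deal them into n_folds near-equal subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def cross_validate(
    labels: Sequence[int],
    fit_predict: Callable[[list[int], list[int]], list[int]],
    n_folds: int = 5,
    seed: int = 0,
) -> list[tuple[int, int, int, int]]:
    """Use each fold once as the test subset and collect its confusion counts."""
    folds = k_fold_indices(len(labels), n_folds, seed)
    results = []
    for f, test_idx in enumerate(folds):
        # All folds except the current one form the training subset.
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        y_pred = fit_predict(train_idx, test_idx)
        y_true = [labels[i] for i in test_idx]
        results.append(confusion_counts(y_true, y_pred))
    return results
```

A trivial `fit_predict` that labels every test pair as positive is enough to check the bookkeeping: every positive then counts as a TP and every negative as an FP.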
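For convenience, the three threshold-based metrics can be computed from the confusion counts with their standard textbook definitions. The sketch below is generic and the function names are illustrative; AUC and AUPRC are omitted because they require ranked prediction scores rather than counts. Returning 0.0 for an undefined ratio is a convention chosen here, not something stated in the paper.

```python
import math


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * TP / (2 * TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```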
To minimize the bias of the 5-CV strategy result, we perform the experiment five times for each method and then obtain the mean and standard deviation values of the scores.
Comparison with other baseline methods on AUC and AUPRC metrics
Here, we choose eight competitive approaches for comparison: GCN [44], GAT [51], DTIGAT [52], NIMCGCN [53], MMGCN [54], GCNMDA [13], DTI-CNN [55] and Graph2MDA [16].
GCN [44] is a semi-supervised learning approach. Here we feed the microbe similarity network and the drug similarity network into GCNs and learn their embeddings for predicting the associations.
GAT [51] is a graph neural network with an attention mechanism. We feed the microbe similarity network and the drug similarity network into GATs and obtain their feature representations for the microbe–drug association prediction task.
DTIGAT [52] was originally proposed to predict the interactions between proteins and drugs with an attention mechanism. Here we feed the microbe–drug association network into this model to learn the features of microbes and drugs.
NIMCGCN [53] adopts GCNs to obtain the latent embeddings of miRNAs and diseases from their similarity networks and predicts miRNA–disease associations. We feed the microbe–drug association network into the model to predict microbe–drug associations.
MMGCN [54] employs a GCN encoder to obtain the embeddings of miRNAs and diseases in different similarity views and enhances the learned representations with a multichannel attention mechanism.
GCNMDA [13] builds a heterogeneous network of drugs and microbes, and then employs a GCN-based framework with a conditional random field and an attention mechanism to discover microbe–drug associations.
DTI-CNN [55] extracts the embeddings of drugs and proteins from heterogeneous networks and constructs a convolutional neural network model to infer their interactions from features learned by a denoising autoencoder.
Graph2MDA [16] adopts a variational graph autoencoder to learn the latent representations of microbes and drugs from multimodal attributed graphs and predicts MDAs with a deep neural network model.
We compare SCSMDA with the baseline methods on the AUC and AUPRC metrics; the results on the MDAD, aBiofilm and DrugVirus datasets are shown in Figure 4. SCSMDA achieves the best performance among all the compared approaches. In particular, the AUC values of SCSMDA on the MDAD, aBiofilm and DrugVirus datasets are 0.9576, 0.9639 and 0.8881, respectively, whereas its AUPRC values on these datasets are 0.9476, 0.9539 and 0.8630, respectively.

The ROC and PR curves of SCSMDA as well as the baseline methods for predicting microbe–drug associations on MDAD, aBiofilm and DrugVirus datasets.
DTI-CNN achieves the 2nd-best overall performance among the compared approaches. Specifically, on the MDAD dataset, its AUC and AUPRC values are 0.9332 and 0.9263, which are 2.5% and 2.2% lower than those of SCSMDA. On the aBiofilm and DrugVirus datasets, its AUC values are 0.9467 and 0.8490, which are 1.9% and 4.4% lower than those of SCSMDA. On the AUPRC metric, Graph2MDA and NIMCGCN achieve the 2nd-best performance on the aBiofilm and DrugVirus datasets, with values of 0.9485 and 0.8462, respectively. The results in Figure 4 indicate that SCSMDA is the most competitive approach for microbe–drug association prediction on these datasets.
Comparison with other baseline methods under different ratios
The ratio between the number of positive samples and the number of negative samples may affect the performance of SCSMDA and the baseline approaches. Therefore, to evaluate their performance comprehensively, we conduct the evaluation experiments under three ratios (# positive samples : # negative samples = 1:1, 1:5 and 1:10), repeat each experiment five times, and report the mean and standard deviation of the results. The results on the AUC and AUPRC metrics are presented in Table 3.
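Building the experimental sets for the three ratios amounts to drawing 1, 5 or 10 negatives per positive from the candidate pool. The sketch below assumes uniform random sampling purely for illustration; this section does not state how negatives are drawn in the ratio experiments, and `sample_negatives` is a hypothetical helper.

```python
import random

Pair = tuple[int, int]


def sample_negatives(
    candidates: list[Pair], n_positive: int, ratio: int, seed: int = 0
) -> list[Pair]:
    """Draw ratio * n_positive distinct negatives uniformly at random.

    Uniform sampling is an assumption made for this sketch only.
    """
    k = n_positive * ratio
    if k > len(candidates):
        raise ValueError("not enough candidate negatives for this ratio")
    return random.Random(seed).sample(candidates, k)


# Toy pool of 100 candidate pairs and 8 positives (illustrative numbers).
pool = [(i, j) for i in range(10) for j in range(10)]
sets_by_ratio = {r: sample_negatives(pool, n_positive=8, ratio=r) for r in (1, 5, 10)}
```

As the ratio grows, the positive class becomes rarer in the evaluation set, which is one reason AUPRC values are typically lower at 1:5 and 1:10 than at 1:1.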
Table 3. The performance of SCSMDA for predicting microbe–drug associations under different ratios on MDAD, aBiofilm and DrugVirus datasets

| Ratio | Method | MDAD AUC | MDAD AUPRC | aBiofilm AUC | aBiofilm AUPRC | DrugVirus AUC | DrugVirus AUPRC |
|---|---|---|---|---|---|---|---|
| 1:1 | GCN [44] | 0.8631 ± 0.0059 | 0.8668 ± 0.0058 | 0.8878 ± 0.0066 | 0.8873 ± 0.0095 | 0.8202 ± 0.0093 | 0.7985 ± 0.0174 |
| 1:1 | GAT [51] | 0.8755 ± 0.0049 | 0.8772 ± 0.0046 | 0.8995 ± 0.0045 | 0.8922 ± 0.0058 | 0.8033 ± 0.0028 | 0.7908 ± 0.0018 |
| 1:1 | DTIGAT [52] | 0.9185 ± 0.0023 | 0.9149 ± 0.0066 | 0.9205 ± 0.0024 | 0.9179 ± 0.0041 | 0.8169 ± 0.0102 | 0.8152 ± 0.0105 |
| 1:1 | NIMCGCN [53] | 0.8944 ± 0.0087 | 0.9016 ± 0.0068 | 0.9201 ± 0.0066 | 0.9251 ± 0.0051 | 0.8319 ± 0.0065 | 0.8438 ± 0.0468 |
| 1:1 | MMGCN [54] | 0.8943 ± 0.0022 | 0.9033 ± 0.0051 | 0.9042 ± 0.0032 | 0.9103 ± 0.0056 | 0.7946 ± 0.0110 | 0.7840 ± 0.0139 |
| 1:1 | GCNMDA [13] | 0.9299 ± 0.0055 | 0.9192 ± 0.0094 | 0.9407 ± 0.0023 | 0.9291 ± 0.0044 | 0.8330 ± 0.0063 | 0.8047 ± 0.0088 |
| 1:1 | DTI-CNN [55] | 0.9325 ± 0.0054 | 0.9242 ± 0.0082 | 0.9436 ± 0.0010 | 0.9316 ± 0.0058 | 0.8581 ± 0.0013 | 0.8396 ± 0.0162 |
| 1:1 | SCSMDA (ours) | 0.9573 ± 0.0020 | 0.9464 ± 0.0033 | 0.9658 ± 0.0026 | 0.9450 ± 0.0037 | 0.8834 ± 0.0064 | 0.8637 ± 0.0096 |
| 1:5 | GCN [44] | 0.8830 ± 0.0027 | 0.6829 ± 0.0074 | 0.8808 ± 0.0018 | 0.6715 ± 0.0047 | 0.8291 ± 0.0007 | 0.4845 ± 0.0031 |
| 1:5 | GAT [51] | 0.8717 ± 0.0047 | 0.6325 ± 0.0097 | 0.9021 ± 0.0062 | 0.6867 ± 0.0082 | 0.8169 ± 0.0025 | 0.4725 ± 0.0138 |
| 1:5 | DTIGAT [52] | 0.9097 ± 0.0003 | 0.7462 ± 0.0056 | 0.9156 ± 0.0042 | 0.7565 ± 0.0060 | 0.8001 ± 0.0022 | 0.4630 ± 0.0058 |
| 1:5 | NIMCGCN [53] | 0.8983 ± 0.0039 | 0.7339 ± 0.0051 | 0.9143 ± 0.0115 | 0.7626 ± 0.0118 | 0.8424 ± 0.0040 | 0.5280 ± 0.0061 |
| 1:5 | MMGCN [54] | 0.8964 ± 0.0008 | 0.7295 ± 0.0042 | 0.9072 ± 0.0010 | 0.7584 ± 0.0046 | 0.7791 ± 0.0040 | 0.4764 ± 0.0129 |
| 1:5 | GCNMDA [13] | 0.9274 ± 0.0006 | 0.7119 ± 0.0082 | 0.9374 ± 0.0043 | 0.7623 ± 0.0441 | 0.8366 ± 0.0054 | 0.4788 ± 0.0156 |
| 1:5 | DTI-CNN [55] | 0.9308 ± 0.0015 | 0.7545 ± 0.1011 | 0.9412 ± 0.0006 | 0.7891 ± 0.0014 | 0.8466 ± 0.0006 | 0.5644 ± 0.0045 |
| 1:5 | SCSMDA (ours) | 0.9434 ± 0.0048 | 0.7607 ± 0.0193 | 0.9559 ± 0.0026 | 0.7971 ± 0.0041 | 0.8757 ± 0.0003 | 0.5777 ± 0.0046 |
| 1:10 | GCN [44] | 0.8921 ± 0.0065 | 0.5821 ± 0.0170 | 0.8974 ± 0.0018 | 0.5879 ± 0.0035 | 0.8231 ± 0.0018 | 0.3255 ± 0.0065 |
| 1:10 | GAT [51] | 0.8696 ± 0.0017 | 0.5324 ± 0.0073 | 0.8999 ± 0.0015 | 0.5828 ± 0.0103 | 0.8089 ± 0.0023 | 0.3208 ± 0.0094 |
| 1:10 | DTIGAT [52] | 0.9085 ± 0.0064 | 0.6483 ± 0.0264 | 0.9156 ± 0.0010 | 0.6419 ± 0.0091 | 0.7957 ± 0.0012 | 0.3068 ± 0.0022 |
| 1:10 | NIMCGCN [53] | 0.9009 ± 0.0008 | 0.6256 ± 0.0108 | 0.9119 ± 0.0022 | 0.6579 ± 0.0030 | 0.8414 ± 0.0074 | 0.3503 ± 0.0076 |
| 1:10 | MMGCN [54] | 0.8941 ± 0.0011 | 0.6244 ± 0.0031 | 0.9044 ± 0.0005 | 0.6463 ± 0.0028 | 0.7765 ± 0.0048 | 0.3596 ± 0.0086 |
| 1:10 | GCNMDA [13] | 0.9310 ± 0.0028 | 0.5939 ± 0.0234 | 0.9415 ± 0.0010 | 0.6201 ± 0.0033 | 0.8304 ± 0.0055 | 0.3139 ± 0.0139 |
| 1:10 | DTI-CNN [55] | 0.9356 ± 0.0011 | 0.7071 ± 0.0010 | 0.9332 ± 0.0017 | 0.6997 ± 0.0081 | 0.8649 ± 0.0020 | 0.3943 ± 0.0080 |
| 1:10 | SCSMDA (ours) | 0.9377 ± 0.0015 | 0.6921 ± 0.0069 | 0.9481 ± 0.0009 | 0.6853 ± 0.0049 | 0.8729 ± 0.0017 | 0.4042 ± 0.0016 |
Note: The best results are marked in bold and the 2nd-best ones are marked as underlined.
For the 1:1 ratio, SCSMDA ranks first on all three datasets. Specifically, its AUC and AUPRC values are 0.9573 and 0.9464 on the MDAD dataset, 0.9658 and 0.9450 on the aBiofilm dataset, and 0.8834 and 0.8637 on the DrugVirus dataset. Meanwhile, DTI-CNN achieves the 2nd-best performance on these three datasets. Its AUC values are 0.9325, 0.9436 and 0.8581, and its AUPRC values are 0.9242, 0.9316 and 0.8396 on MDAD, aBiofilm and DrugVirus, respectively.
For the 1:5 ratio, SCSMDA and DTI-CNN rank first and second on these three datasets. In particular, for the AUC metric, SCSMDA obtains scores of 0.9434, 0.9559 and 0.8757, whereas DTI-CNN achieves 0.9308, 0.9412 and 0.8466, respectively. For the AUPRC metric, SCSMDA obtains 0.7607, 0.7971 and 0.5777, and DTI-CNN obtains 0.7545, 0.7891 and 0.5644, respectively.
For the 1:10 ratio, SCSMDA achieves the highest AUC scores, which are 0.9377, 0.9481 and 0.8729 on the MDAD, aBiofilm and DrugVirus datasets, respectively. SCSMDA also achieves the best AUPRC on the DrugVirus dataset, with 0.4042, and the 2nd-highest AUPRC on the MDAD and aBiofilm datasets, with 0.6921 and 0.6853. DTI-CNN ranks first on AUPRC for these two datasets, with scores of 0.7071 and 0.6997 on MDAD and aBiofilm. DTI-CNN achieves the 2nd-best performance on the AUC of MDAD, the AUC of aBiofilm, the AUC of DrugVirus and the AUPRC of DrugVirus, with corresponding scores of 0.9356, 0.9332, 0.8649 and 0.3943, respectively. All the results are listed in Table 3; they show that SCSMDA outperforms the baseline approaches in most settings, the exception being AUPRC on MDAD and aBiofilm at the 1:10 ratio.
Model ablation study
SCSMDA learns the embeddings of microbes and drugs with the structure-enhanced contrastive learning strategy, and selects the most informative samples with the self-paced negative sampling strategy. We conduct a model ablation study to investigate the effect of each component of SCSMDA. We consider three components: the similarity-network-based embedding learning component (SN), the meta-path-induced network embedding learning component (MP) and the self-paced negative sampling strategy component (SP). The ablation study compares SCSMDA without the SN component (SCS w/o SN), SCSMDA without the MP component (SCS w/o MP), SCSMDA without the SP component (SCS w/o SP) and SCSMDA with all these components. The results are presented in Figure 5.

The ablation study for SCSMDA. SCSMDA w/o SN, SCSMDA w/o MP and SCSMDA w/o SP denote SCSMDA without the similarity-network-based embedding learning component, the meta-path-induced network embedding learning component and the self-paced negative sampling strategy component, respectively.
The results on all three datasets show that SN, MP and SP are all essential components of SCSMDA. Specifically, the full SCSMDA model achieves the best performance on the five evaluation metrics. On the MDAD dataset, its ACC, AUC, AUPRC, MCC and F1 scores are 0.8791, 0.9573, 0.9464, 0.7261 and 0.8528, respectively. On the aBiofilm dataset, the corresponding scores are 0.8919, 0.9658, 0.9450, 0.7393 and 0.8592. On the DrugVirus dataset, they are 0.8133, 0.8834, 0.8637, 0.6141 and 0.7981.
Among the ablated models, SCSMDA w/o SP achieves the 2nd-best performance overall, whereas SCSMDA w/o SN performs worst. The detailed results for the other models are displayed in Figure 5 and are not repeated here. Overall, the node embeddings learned from the similarity networks play the major role in the performance of SCSMDA, and the structure-enhanced contrastive learning strategy is effective in improving it further.
The statistical significance report on AUC values
Statistical significance testing is an effective way to verify the credibility and stability of the results of SCSMDA. Therefore, we employ the one-way ANOVA model [56, 57] to investigate the statistical significance of the results of all the MDA prediction approaches. Specifically, all these MDA prediction approaches are run in the 5-CV experiments to obtain their corresponding AUC values (Table 4). The analysis results are shown in Figure 6.
| Dataset | Iteration | GCN | GAT | DTIGAT | NIMCGCN | MMGCN | GCNMDA | DTI-CNN | Graph2MDA | SCSMDA (ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| MDAD | 1 | 0.8685 | 0.8873 | 0.8692 | 0.8892 | 0.8934 | 0.9326 | 0.9326 | 0.9077 | 0.9562 |
| MDAD | 2 | 0.8715 | 0.8899 | 0.9134 | 0.9001 | 0.8938 | 0.9261 | 0.9361 | 0.9022 | 0.9583 |
| MDAD | 3 | 0.8702 | 0.8856 | 0.9136 | 0.9018 | 0.8940 | 0.9297 | 0.9358 | 0.8617 | 0.9603 |
| MDAD | 4 | 0.8729 | 0.8695 | 0.9145 | 0.899 | 0.8937 | 0.9328 | 0.9319 | 0.9089 | 0.9563 |
| MDAD | 5 | 0.8738 | 0.8616 | 0.9139 | 0.8933 | 0.8941 | 0.9280 | 0.9303 | 0.8756 | 0.9617 |
| aBiofilm | 1 | 0.8987 | 0.8962 | 0.9192 | 0.9009 | 0.9083 | 0.9382 | 0.9443 | 0.9164 | 0.9667 |
| aBiofilm | 2 | 0.8997 | 0.8758 | 0.9196 | 0.9117 | 0.9077 | 0.9398 | 0.9454 | 0.9212 | 0.9614 |
| aBiofilm | 3 | 0.9009 | 0.8898 | 0.9207 | 0.9147 | 0.9081 | 0.9424 | 0.9448 | 0.9125 | 0.9661 |
| aBiofilm | 4 | 0.8999 | 0.9038 | 0.9206 | 0.9193 | 0.9084 | 0.9412 | 0.9427 | 0.8894 | 0.9664 |
| aBiofilm | 5 | 0.9031 | 0.9048 | 0.9198 | 0.8964 | 0.9082 | 0.9422 | 0.9406 | 0.9272 | 0.9669 |
| DrugVirus | 1 | 0.8349 | 0.8036 | 0.8184 | 0.8427 | 0.7931 | 0.8349 | 0.8612 | 0.7725 | 0.8934 |
| DrugVirus | 2 | 0.8353 | 0.7956 | 0.8203 | 0.8415 | 0.7937 | 0.7901 | 0.8611 | 0.7981 | 0.8841 |
| DrugVirus | 3 | 0.8356 | 0.7959 | 0.8190 | 0.8440 | 0.7962 | 0.8413 | 0.8566 | 0.7802 | 0.8845 |
| DrugVirus | 4 | 0.8349 | 0.7876 | 0.8230 | 0.8372 | 0.8237 | 0.8264 | 0.8574 | 0.7899 | 0.8888 |
| DrugVirus | 5 | 0.8349 | 0.7902 | 0.8164 | 0.8346 | 0.8215 | 0.8171 | 0.8611 | 0.7991 | 0.8865 |
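The F statistic underlying a one-way ANOVA can be computed directly from per-iteration AUC values such as those in Table 4. The sketch below is a generic textbook computation, not the authors' analysis script: `one_way_anova_f` is an illustrative name, and only three methods on MDAD are included to keep the example short. Turning F into a p-value additionally requires the F distribution, for which a statistics library (for example `scipy.stats.f_oneway`) would normally be used.

```python
def one_way_anova_f(groups: list[list[float]]) -> float:
    """Return the one-way ANOVA F statistic (between-group / within-group mean square).

    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)       # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)   # n_total - k degrees of freedom
    return ms_between / ms_within


# MDAD AUC values over the five iterations, copied from Table 4.
mdad_auc = {
    "GCNMDA": [0.9326, 0.9261, 0.9297, 0.9328, 0.9280],
    "DTI-CNN": [0.9326, 0.9361, 0.9358, 0.9319, 0.9303],
    "SCSMDA": [0.9562, 0.9583, 0.9603, 0.9563, 0.9617],
}
f_mdad = one_way_anova_f(list(mdad_auc.values()))
```

A large F value indicates that the differences between the methods' mean AUC values are large relative to the run-to-run variation within each method.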
Dataset . | Iteration . | GCN . | GAT . | DTIGAT . | NIMCGCN . | MMGCN . | GCNMDA . | DTI-CNN . | Graph2MDA . | SCSMDA (ours) . |
---|---|---|---|---|---|---|---|---|---|---|
MDAD | 1 | 0.8685 | 0.8873 | 0.8692 | 0.8892 | 0.8934 | 0.9326 | 0.9326 | 0.9077 | 0.9562 |
2 | 0.8715 | 0.8899 | 0.9134 | 0.9001 | 0.8938 | 0.9261 | 0.9361 | 0.9022 | 0.9583 | |
3 | 0.8702 | 0.8856 | 0.9136 | 0.9018 | 0.8940 | 0.9297 | 0.9358 | 0.8617 | 0.9603 | |
4 | 0.8729 | 0.8695 | 0.9145 | 0.899 | 0.8937 | 0.9328 | 0.9319 | 0.9089 | 0.9563 | |
5 | 0.8738 | 0.8616 | 0.9139 | 0.8933 | 0.8941 | 0.9280 | 0.9303 | 0.8756 | 0.9617 | |
aBiofilm | 1 | 0.8987 | 0.8962 | 0.9192 | 0.9009 | 0.9083 | 0.9382 | 0.9443 | 0.9164 | 0.9667 |
2 | 0.8997 | 0.8758 | 0.9196 | 0.9117 | 0.9077 | 0.9398 | 0.9454 | 0.9212 | 0.9614 | |
3 | 0.9009 | 0.8898 | 0.9207 | 0.9147 | 0.9081 | 0.9424 | 0.9448 | 0.9125 | 0.9661 | |
4 | 0.8999 | 0.9038 | 0.9206 | 0.9193 | 0.9084 | 0.9412 | 0.9427 | 0.8894 | 0.9664 | |
5 | 0.9031 | 0.9048 | 0.9198 | 0.8964 | 0.9082 | 0.9422 | 0.9406 | 0.9272 | 0.9669 | |
DrugVirus | 1 | 0.8349 | 0.8036 | 0.8184 | 0.8427 | 0.7931 | 0.8349 | 0.8612 | 0.7725 | 0.8934 |
2 | 0.8353 | 0.7956 | 0.8203 | 0.8415 | 0.7937 | 0.7901 | 0.8611 | 0.7981 | 0.8841 | |
3 | 0.8356 | 0.7959 | 0.8190 | 0.8440 | 0.7962 | 0.8413 | 0.8566 | 0.7802 | 0.8845 | |
4 | 0.8349 | 0.7876 | 0.8230 | 0.8372 | 0.8237 | 0.8264 | 0.8574 | 0.7899 | 0.8888 | |
5 | 0.8349 | 0.7902 | 0.8164 | 0.8346 | 0.8215 | 0.8171 | 0.8611 | 0.7991 | 0.8865 |

Figure 6. Statistical significance report based on the one-way ANOVA model. (A) P-values on the MDAD dataset, (B) P-values on the aBiofilm dataset, (C) P-values on the DrugVirus dataset.
The results show that the P-values between SCSMDA and the baseline approaches (GCNMDA, GCN, GAT, DTIGAT, NIMCGCN, MMGCN, DTI-CNN and Graph2MDA) on the MDAD dataset are 9.9e-7, 6.2e-12, 6.5e-7, 3.4e-4, 9.9e-9, 7.3e-12, 2.2e-7 and 1.1e-4, respectively. All of these values are well below 0.05, indicating that the difference between SCSMDA and each baseline is statistically significant under one-way ANOVA. We also report the P-values between the baseline approaches. The corresponding results on aBiofilm and DrugVirus are displayed in Figure 6B and C and are not repeated here.
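As a rough illustration of the test behind these P-values, the sketch below computes the one-way ANOVA F statistic for SCSMDA versus GCN from the five per-iteration MDAD scores listed in the comparison table above. It assumes that those per-iteration scores are the samples fed to the test, which the text does not state explicitly. The helper name `one_way_anova_f` is hypothetical, and the 5% critical value for F(1, 8), about 5.32, is taken from standard F tables rather than from the paper.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation explained by group membership.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation inside each group.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Per-iteration MDAD scores copied from the comparison table (iterations 1 to 5).
gcn = [0.8685, 0.8715, 0.8702, 0.8729, 0.8738]
scsmda = [0.9562, 0.9583, 0.9603, 0.9563, 0.9617]

f_stat, df_b, df_w = one_way_anova_f(gcn, scsmda)

# Tabulated 5% critical value of the F distribution with (1, 8) degrees of freedom.
F_CRIT_1_8 = 5.32
significant = f_stat > F_CRIT_1_8
print(f"F({df_b}, {df_w}) = {f_stat:.1f}, significant at 5%: {significant}")
```

With only two groups, this F statistic equals the square of the two-sample t statistic, so the same comparison could also be run as a pooled-variance t-test.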
Embedding size analysis on SCSMDA
SCSMDA learns the embeddings of microbes and drugs with the structure-enhanced contrastive learning strategy. Since the embedding size of microbes and drugs plays an important role in SCSMDA, we evaluate its impact with five metrics: ACC, AUC, AUPRC, MCC and F1. Here, we set the embedding size of microbes and drugs to 16, 32, 64, 128, 256 and 512, respectively, and the corresponding results are shown in Table 5.
Table 5. The performance of SCSMDA under different embedding sizes on the MDAD, aBiofilm and DrugVirus datasets.
| Dataset | Embedding size | ACC | AUC | AUPRC | MCC | F1 |
|---|---|---|---|---|---|---|
| MDAD | 16 | 0.8582 ± 0.0052 | 0.9478 ± 0.0031 | 0.9243 ± 0.0045 | 0.7329 ± 0.0121 | 0.8409 ± 0.0027 |
| | 32 | 0.8659 ± 0.0036 | 0.9506 ± 0.0019 | 0.9364 ± 0.0044 | 0.7352 ± 0.0062 | 0.8485 ± 0.0027 |
| | 64 | 0.8701 ± 0.0045 | 0.9548 ± 0.0022 | 0.9409 ± 0.0027 | 0.7365 ± 0.0061 | 0.8504 ± 0.0036 |
| | 128 | 0.8791 ± 0.0054 | 0.9573 ± 0.0020 | 0.9464 ± 0.0033 | 0.7261 ± 0.0025 | 0.8528 ± 0.0008 |
| | 256 | 0.8651 ± 0.0030 | 0.9511 ± 0.0043 | 0.9389 ± 0.0068 | 0.7008 ± 0.0249 | 0.8477 ± 0.0181 |
| | 512 | 0.8304 ± 0.0131 | 0.9446 ± 0.0035 | 0.9330 ± 0.0044 | 0.7093 ± 0.0156 | 0.8491 ± 0.0095 |
| aBiofilm | 16 | 0.8824 ± 0.0013 | 0.9538 ± 0.0028 | 0.9491 ± 0.0082 | 0.7316 ± 0.0013 | 0.8627 ± 0.0081 |
| | 32 | 0.8907 ± 0.0077 | 0.9633 ± 0.0011 | 0.9430 ± 0.0029 | 0.7384 ± 0.0161 | 0.8590 ± 0.0112 |
| | 64 | 0.8915 ± 0.0070 | 0.9644 ± 0.0041 | 0.9485 ± 0.0049 | 0.7367 ± 0.0077 | 0.8576 ± 0.0037 |
| | 128 | 0.8919 ± 0.0017 | 0.9658 ± 0.0026 | 0.9450 ± 0.0037 | 0.7393 ± 0.0041 | 0.8592 ± 0.0031 |
| | 256 | 0.8864 ± 0.0029 | 0.9632 ± 0.0003 | 0.9426 ± 0.0006 | 0.7317 ± 0.0060 | 0.8542 ± 0.0035 |
| | 512 | 0.8762 ± 0.0072 | 0.9560 ± 0.0085 | 0.9388 ± 0.0071 | 0.7249 ± 0.0004 | 0.8371 ± 0.0007 |
| DrugVirus | 16 | 0.8071 ± 0.0100 | 0.8748 ± 0.0088 | 0.8469 ± 0.0121 | 0.6002 ± 0.0172 | 0.7845 ± 0.0059 |
| | 32 | 0.8165 ± 0.0132 | 0.8843 ± 0.0007 | 0.8575 ± 0.0109 | 0.6027 ± 0.0117 | 0.7899 ± 0.0048 |
| | 64 | 0.8196 ± 0.0080 | 0.8861 ± 0.0110 | 0.8572 ± 0.0173 | 0.6109 ± 0.0272 | 0.7955 ± 0.0148 |
| | 128 | 0.8133 ± 0.0082 | 0.8834 ± 0.0064 | 0.8637 ± 0.0096 | 0.6141 ± 0.0063 | 0.7981 ± 0.0016 |
| | 256 | 0.8096 ± 0.0032 | 0.8769 ± 0.0028 | 0.8611 ± 0.0069 | 0.6218 ± 0.0092 | 0.7979 ± 0.0076 |
| | 512 | 0.8031 ± 0.0024 | 0.8713 ± 0.0026 | 0.8624 ± 0.0014 | 0.5974 ± 0.0212 | 0.7881 ± 0.0121 |
Note: The best results are marked in bold.
Specifically, on the MDAD dataset, the ACC, AUC, AUPRC and F1 values reach their highest scores of 0.8791, 0.9573, 0.9464 and 0.8528 when the embedding size is 128, while the highest MCC score is 0.7365 when the embedding size is 64. For the aBiofilm dataset, the highest scores for ACC, AUC, MCC and F1 are 0.8919, 0.9658, 0.7393 and 0.8592 when the embedding size is 128 and the highest value for AUPRC is 0.9458 when the embedding size is 64. For the DrugVirus dataset, SCSMDA performs best on ACC, AUC, AUPRC, MCC and F1 when the embedding size is 64, 64, 128, 256 and 128, respectively. These results show that the embedding size affects the performance of the SCSMDA model and that, overall, SCSMDA achieves its highest scores when the embedding size is 128. As a result, we set the embedding size to 128 for SCSMDA.
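The selection described above can be reproduced for one dataset with a few lines of code. The sketch below is only an illustration: it copies the mean MDAD scores from Table 5 (standard deviations omitted), finds the best embedding size per metric, and then takes a simple majority vote. The majority vote is an assumption about how "overall best" is decided, and the names `mdad_means` and `best_size` are hypothetical.

```python
from collections import Counter

# Mean MDAD scores copied from Table 5, keyed by embedding size.
mdad_means = {
    16:  {"ACC": 0.8582, "AUC": 0.9478, "AUPRC": 0.9243, "MCC": 0.7329, "F1": 0.8409},
    32:  {"ACC": 0.8659, "AUC": 0.9506, "AUPRC": 0.9364, "MCC": 0.7352, "F1": 0.8485},
    64:  {"ACC": 0.8701, "AUC": 0.9548, "AUPRC": 0.9409, "MCC": 0.7365, "F1": 0.8504},
    128: {"ACC": 0.8791, "AUC": 0.9573, "AUPRC": 0.9464, "MCC": 0.7261, "F1": 0.8528},
    256: {"ACC": 0.8651, "AUC": 0.9511, "AUPRC": 0.9389, "MCC": 0.7008, "F1": 0.8477},
    512: {"ACC": 0.8304, "AUC": 0.9446, "AUPRC": 0.9330, "MCC": 0.7093, "F1": 0.8491},
}
metrics = ["ACC", "AUC", "AUPRC", "MCC", "F1"]

# For each metric, pick the embedding size with the highest mean score.
best_size = {
    metric: max(mdad_means, key=lambda size: mdad_means[size][metric])
    for metric in metrics
}

# Majority vote across metrics (an illustrative tie-breaking rule, not the paper's).
overall_size, wins = Counter(best_size.values()).most_common(1)[0]
print(best_size)
print(f"embedding size {overall_size} is best on {wins} of {len(metrics)} metrics")
```

On the MDAD rows, this yields 128 for ACC, AUC, AUPRC and F1 and 64 for MCC, which matches the description in the text.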
Parameter sensitivity analysis
Several crucial parameters affect the performance of the SCSMDA model. Here we mainly focus on five of them: the number of positive pairs, the number of GCN layers, the number of MLP layers, the number of bins and the learning rate. The corresponding experiments are performed and the results are all evaluated with ACC, AUC, AUPRC, MCC and F1.
The 1st parameter is the number of positive pairs for the structure-enhanced contrastive learning strategy. We choose the number of positive pairs from {1,2,4,6,8,10,12,14} and conduct the experiments on all three datasets. The results are presented in Figure 7. Specifically, on the MDAD dataset, the values of ACC, AUC, AUPRC, MCC and F1 first increase gradually and then decrease slightly as the number of positive pairs grows from 1 to 14. When the number of positive pairs is 10, the scores are highest, with values of 0.8791, 0.9573, 0.9464, 0.7261 and 0.8528 on ACC, AUC, AUPRC, MCC and F1, respectively. The results on the aBiofilm and DrugVirus datasets are similar to those on the MDAD dataset and are not repeated here. It should be noted that the evaluation scores are almost the lowest when the number of positive pairs is 1. This further confirms that our novel positive-pair selection strategy helps improve the performance of SCSMDA. As a result, we set the number of positive pairs to 10.

Figure 7. The performance of SCSMDA under different numbers of positive pairs on the MDAD, aBiofilm and DrugVirus datasets.
The 2nd parameter is the number of MLP layers. The MLP is employed as the classifier to predict MDAs, so it directly affects the performance of SCSMDA, and choosing a proper number of layers is critical. The corresponding results (Figure 8) indicate that SCSMDA achieves the best performance when the number of MLP layers is 1. This is consistent with previous studies, which find that too many MLP layers may lead to over-smoothing [58, 59] and seriously affect the performance of the prediction model. The 3rd parameter is the number of GCN layers. GCN is employed to learn the embeddings of microbes and drugs, which is decisive for the prediction accuracy of SCSMDA. The results under different numbers of GCN layers are also presented in Figure 8. The best performance is achieved when the number of GCN layers is 1.

Figure 8. The performance of SCSMDA under different numbers of MLP layers and GCN layers on the MDAD, aBiofilm and DrugVirus datasets.
The last two parameters are the learning rate and the number of bins. The learning rate is a hyperparameter that controls how much the model is changed in response to the estimated error [60]. Choosing a proper learning rate is challenging, since a value that is too small may result in a long training process, while a value that is too large may result in an unstable training process. We therefore search the learning rate over {1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 1e-5} and evaluate the performance of SCSMDA under each value. The results are shown in Figure 9. We observe that the performance of SCSMDA first increases and then slightly decreases as the learning rate goes from 1e-2 to 1e-5. SCSMDA achieves the best results when the learning rate is 5e-4. Lastly, for the number of bins, which is the hyperparameter of the self-paced negative sampling strategy, we choose values from {2, 4, 6, 8, 10, 12} and the corresponding results are also presented in Figure 9. SCSMDA obtains the best scores when the number of bins equals 10.

Figure 9. The performance of SCSMDA under different learning rates and numbers of bins on the MDAD, aBiofilm and DrugVirus datasets.
Visualization and interpretation for the embeddings of microbe–drug pairs learned by SCSMDA
To further demonstrate the ability of SCSMDA to learn node embeddings, we conduct a visualization experiment on the aBiofilm dataset. Specifically, with the learned embeddings of microbes and drugs, the embedding of each microbe–drug pair is generated as the Hadamard product of the two embeddings. If a microbe and a drug have a known association, the microbe–drug pair is labeled as a positive pair. Otherwise, it is labeled as a negative pair. All the embeddings of microbe–drug pairs are projected into a two-dimensional space using the t-SNE tool [61]. The visualization results are displayed in Figure 10.
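The pair-embedding step can be sketched in a few lines. The example below is a minimal illustration with toy vectors, not the authors' implementation: the helper names `hadamard` and `build_pair_embeddings`, the two-dimensional embeddings and the single known association are all hypothetical placeholders.

```python
def hadamard(u, v):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return [a * b for a, b in zip(u, v)]


def build_pair_embeddings(microbe_emb, drug_emb, known_pairs):
    """Return one feature vector and one 0/1 label per microbe-drug pair."""
    features, labels = [], []
    for microbe, m_vec in microbe_emb.items():
        for drug, d_vec in drug_emb.items():
            features.append(hadamard(m_vec, d_vec))
            # Positive pair if the association is known, negative otherwise.
            labels.append(1 if (microbe, drug) in known_pairs else 0)
    return features, labels


# Toy embeddings and associations (placeholders, not data from the paper).
microbe_emb = {"m1": [1.0, 2.0], "m2": [0.0, 4.0]}
drug_emb = {"d1": [0.5, -1.0], "d2": [2.0, 0.5]}
known_pairs = {("m1", "d1")}

features, labels = build_pair_embeddings(microbe_emb, drug_emb, known_pairs)
print(features)
print(labels)
```

The resulting `features` list could then be passed to any t-SNE implementation, for example scikit-learn's `sklearn.manifold.TSNE`, and coloured by `labels`; that library choice is an assumption, since the text only cites the t-SNE tool [61].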

Figure 10. Visualization of the microbe–drug pair embeddings learned by SCSMDA on aBiofilm at different epochs.
It can be seen that the positive pairs and the negative pairs are gradually separated as the number of epochs increases. The embeddings of the positive and negative pairs are mixed together when the epoch number is 1, and their distribution becomes gradually clearer as training proceeds. Finally, the positive pairs (red points) and the negative pairs (blue points) are almost separated when the number of epochs reaches 100. Meanwhile, some red and blue dots are still mixed in certain areas, indicating that the decision boundary is hard to determine in the microbe–drug association prediction task. This observation suggests that the learned embeddings of microbe–drug pairs are discriminative and interpretable, which improves the accuracy of SCSMDA in predicting MDAs.
Running time of SCSMDA and baseline approaches
To evaluate the execution efficiency of SCSMDA and the comparison approaches, we conduct the 5-CV experiment on the three datasets for each prediction model and compare the corresponding running times. The 5-CV experiments are conducted five times independently and the results are displayed in Table 6.
Table 6. Running time (seconds) of SCSMDA and other baseline approaches on the MDAD, aBiofilm and DrugVirus datasets.
| Datasets | Rounds | GCN | GAT | DTIGAT | NIMCGCN | MMGCN | GCNMDA | DTI-CNN | Graph2MDA | SCSMDA (ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| MDAD | 1 | 98 | 293 | 296 | 114 | 116 | 303 | 10 | 786 | 342 |
| | 2 | 109 | 296 | 295 | 121 | 114 | 302 | 10 | 788 | 341 |
| | 3 | 106 | 299 | 289 | 118 | 115 | 302 | 9 | 790 | 346 |
| | 4 | 100 | 296 | 295 | 121 | 117 | 303 | 10 | 788 | 340 |
| | 5 | 109 | 297 | 295 | 119 | 116 | 311 | 9 | 787 | 340 |
| | AVE | 104 | 296 | 294 | 118 | 116 | 304 | 10 | 788 | 342 |
| aBiofilm | 1 | 127 | 393 | 379 | 161 | 170 | 399 | 11 | 1255 | 417 |
| | 2 | 143 | 387 | 381 | 162 | 147 | 394 | 9 | 1266 | 589 |
| | 3 | 142 | 386 | 385 | 164 | 147 | 394 | 10 | 1256 | 418 |
| | 4 | 144 | 386 | 383 | 187 | 148 | 395 | 11 | 1261 | 417 |
| | 5 | 145 | 387 | 382 | 163 | 147 | 395 | 10 | 1269 | 411 |
| | AVE | 140 | 388 | 382 | 167 | 152 | 395 | 10 | 1261 | 450 |
| DrugVirus | 1 | 16 | 69 | 69 | 19 | 17 | 28 | 4 | 50 | 132 |
| | 2 | 15 | 71 | 66 | 19 | 16 | 28 | 4 | 53 | 136 |
| | 3 | 14 | 71 | 68 | 17 | 17 | 28 | 4 | 53 | 135 |
| | 4 | 15 | 66 | 67 | 18 | 17 | 28 | 4 | 53 | 136 |
| | 5 | 15 | 70 | 70 | 17 | 16 | 28 | 4 | 52 | 130 |
| | AVE | 15 | 70 | 68 | 18 | 17 | 28 | 4 | 52 | 134 |
Note: AVE denotes the average running time over the five 5-CV experiments for each model.
The results indicate that DTI-CNN requires the shortest running time, whereas Graph2MDA needs the longest on MDAD and aBiofilm. The average running time of DTI-CNN on the MDAD, aBiofilm and DrugVirus datasets is 10, 10 and 4 s, respectively, and that of Graph2MDA is 788, 1261 and 52 s. For our proposed model SCSMDA, the average running time on MDAD, aBiofilm and DrugVirus is 342, 450 and 134 s, respectively. The results illustrate that our proposed method can complete the training and prediction tasks within an acceptable time.
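The AVE rows of Table 6 can be checked directly from the per-round timings. The short sketch below does this for the SCSMDA column, assuming the averages are arithmetic means rounded to the nearest second; the variable names are illustrative. It also computes the median, which is less affected by a single slow round.

```python
from statistics import mean, median

# SCSMDA running times in seconds for rounds 1 to 5, copied from Table 6.
scsmda_seconds = {
    "MDAD": [342, 341, 346, 340, 340],
    "aBiofilm": [417, 589, 418, 417, 411],
    "DrugVirus": [132, 136, 135, 136, 130],
}

# Arithmetic mean per dataset, rounded to whole seconds as in the AVE rows.
averages = {name: round(mean(runs)) for name, runs in scsmda_seconds.items()}

# Median per dataset, for comparison with the mean.
medians = {name: median(runs) for name, runs in scsmda_seconds.items()}

print(averages)
print(medians)
```

On aBiofilm the mean (450 s) is noticeably higher than the median (417 s) because the second round took 589 s, while the other four rounds all finished in 411 to 418 s.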
Case study
To verify the ability of SCSMDA to find novel MDAs, we perform case studies on two popular antimicrobial drugs, ciprofloxacin and moxifloxacin, following previous research [15]. Specifically, for each target drug, all of its known microbe–drug associations are set to unknown, and then all the candidate microbes are sorted in descending order of the scores predicted by SCSMDA. Lastly, we screen out the top-20 ranked microbes and verify them against the published literature. The case study results for ciprofloxacin and moxifloxacin are displayed in Tables 7 and 8.
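The masking and ranking steps of this protocol can be sketched as follows. This is a simplified illustration under stated assumptions: the association set, the drug and microbe identifiers and the predicted scores are toy placeholders, the helper names `mask_drug` and `top_k_microbes` are hypothetical, and the model training that would happen between the two steps is omitted.

```python
def mask_drug(associations, drug):
    """Drop every known (microbe, drug) pair that involves the target drug."""
    return {(m, d) for (m, d) in associations if d != drug}


def top_k_microbes(scores, k=20):
    """Return the k candidate microbes with the highest predicted scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [microbe for microbe, _ in ranked[:k]]


# Toy known associations (placeholders, not data from the paper).
known = {
    ("microbe_a", "drug_x"),
    ("microbe_b", "drug_x"),
    ("microbe_a", "drug_y"),
}

# Step 1: hide all associations of the target drug before training.
training_pairs = mask_drug(known, "drug_x")

# Step 2: after training, rank candidates by predicted score.
# These scores are made-up stand-ins for the model output.
predicted = {"microbe_a": 0.91, "microbe_b": 0.35, "microbe_c": 0.78}
top2 = top_k_microbes(predicted, k=2)

print(training_pairs)
print(top2)
```

In the case studies, `k` would be 20 and each returned microbe would then be checked manually against the literature.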
Microbe name | Rank | Evidence | Microbe name | Rank | Evidence |
---|---|---|---|---|---|
Candida albicans | 1 | PMID:31471074 | Listeria monocytogenes | 11 | PMID:28355096 |
Streptococcus mutans | 2 | PMID:30468214 | Bacillus cereus | 12 | PMID:8448312 |
Salmonella enterica | 3 | PMID:26933017 | Burkholderia pseudomallei | 13 | PMID:24502667 |
Staphylococcus epidermidis | 4 | PMID:28481197 | Streptococcus epidermidis | 14 | Unconfirmed |
Burkholderia cenocepacia | 5 | PMID:27799222 | Campylobacter jejuni | 15 | PMID:11920303 |
Bacillus subtilis | 6 | PMID:15194135 | Agrobacterium tumefaciens | 16 | Unconfirmed |
Serratia marcescens | 7 | PMID:23751969 | Vibrio vulnificus | 17 | PMID:24978586 |
Acinetobacter baumannii | 8 | PMID:25147676 | Staphylococcus epidermidis | 18 | PMID:10632381 |
Streptococcus sanguis | 9 | PMID:11347679 | Candida tropicalis | 19 | Unconfirmed |
Vibrio harveyi | 10 | PMID:27247095 | Actinomyces oris | 20 | Unconfirmed |
Microbe name | Rank | Evidence | Microbe name | Rank | Evidence |
---|---|---|---|---|---|
Escherichia coli | 1 | PMID:31542319 | Burkholderia cenocepacia | 11 | PMID:28355096 |
Streptococcus mutans | 2 | PMID:29160117 | Serratia marcescens | 12 | Unconfirmed |
Staphylococcus aureus | 3 | PMID:12654680 | Burkholderia pseudomallei | 13 | PMID:24502667 |
Pseudomonas aeruginosa | 4 | PMID:31691651 | Streptococcus epidermidis | 14 | Unconfirmed |
Staphylococcus epidermidis | 5 | PMID:11249827 | Acinetobacter baumannii | 15 | PMID:12951327 |
Vibrio harveyi | 6 | Unconfirmed | Salmonella enterica | 16 | PMID:22151215 |
Staphylococcus epidermidis | 7 | PMID:31516359 | Vibrio cholerae | 17 | Unconfirmed |
Enterococcus faecalis | 8 | PMID:31763048 | Vibrio vulnificus | 18 | PMID:10632381 |
Listeria monocytogenes | 9 | PMID:28739228 | Klebsiella pneumoniae | 19 | PMID:27257956 |
Proteus mirabilis | 10 | PMID:15077996 | Actinomyces oris | 20 | Unconfirmed |
Microbe name . | Rank . | Evidence . | Microbe name . | Rank . | Evidence . |
---|---|---|---|---|---|
Escherichia coli | 1 | PMID:31542319 | Burkholderia cenocepacia | 11 | PMID:28355096 |
Streptococcus mutans | 2 | PMID:29160117 | Serratia marcescens | 12 | Unconfirmed |
Staphylococcus aureus | 3 | PMID:12654680 | Burkholderia pseudomallei | 13 | PMID:24502667 |
Pseudomonas aeruginosa | 4 | PMID:31691651 | Streptococcus epidermidis | 14 | Unconfirmed |
Staphylococcus epidermidis | 5 | PMID:11249827 | Acinetobacter baumannii | 15 | PMID:12951327 |
Vibrio harveyi | 6 | Unconfirmed | Salmonella enterica | 16 | PMID:22151215 |
Staphylococcus epidermidis | 7 | PMID:31516359 | Vibrio cholerae | 17 | Unconfirmed |
Enterococcus faecalis | 8 | PMID:31763048 | Vibrio vulnificus | 18 | PMID:10632381 |
Listeria monocytogenes | 9 | PMID:28739228 | Klebsiella pneumoniae | 19 | PMID:27257956 |
Proteus mirabilis | 10 | PMID:15077996 | Actinomyces oris | 20 | Unconfirmed |
Ciprofloxacin belongs to the quinolone class of antibiotics and is commonly used to treat a variety of bacterial infections, such as urinary tract infections and pneumonia [62]. Previous studies have indicated that ciprofloxacin is closely related to many human microbes. For example, it has been reported that Candida albicans and Staphylococcus aureus together can form biofilms and increase antimicrobial resistance. Daniel [63] assessed the susceptibility of Salmonella to ciprofloxacin and found that it was highly dependent on serotype. In addition, Mercedes [64] found that the activity of ciprofloxacin against Bacillus subtilis depends on the drug’s interaction with its target enzymes. The results for the other predicted microbes are displayed in Table 7; 16 of the top 20 predicted candidate microbes for ciprofloxacin can be confirmed by the literature.
Moxifloxacin is also a common antibiotic, used to treat bacterial infections including pneumonia, conjunctivitis, endocarditis, tuberculosis and sinusitis [65, 66]. It can inhibit the reproduction, growth rate and life cycle of a broad spectrum of bacteria. For example, Escherichia coli is a bacterium that normally lives in the intestines of healthy people and animals. Axel [67] suggested that moxifloxacin potentially has bactericidal activity against Escherichia coli. Staphylococcus aureus is a Gram-positive, spherical bacterium and a member of the Bacillota. Dilek [68] stated that moxifloxacin had enhanced potency against S. aureus. In addition, some studies have confirmed the bactericidal activity of moxifloxacin against S. aureus strains in vitro [69]. The top 20 predicted candidate microbes are displayed in Table 8, and 15 of them can be verified by previous publications. Case studies on these two drugs further indicate that SCSMDA performs well in identifying novel MDAs.
In addition, SCSMDA was used to conduct case studies for each microbe and drug on the three public datasets. The corresponding results are available in the GitHub repository and are not repeated here.
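The ranking step behind such a case study can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `top_k_candidates` and `count_confirmed`, the toy scores and the placeholder evidence strings are all hypothetical, and excluding already-known associations before ranking is an assumption about how novel candidates are selected.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def top_k_candidates(
    scores: Dict[str, float],
    k: int = 20,
    known: Iterable[str] = (),
) -> List[Tuple[int, str, float]]:
    """Rank microbes for one drug by predicted score, highest first.

    `known` holds microbes already associated with the drug in the training
    data; dropping them is an assumption, not a detail taken from the paper.
    """
    excluded = set(known)
    candidates = [(m, s) for m, s in scores.items() if m not in excluded]
    # Sort by descending score; break ties by name so the ranking is stable.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [(rank, m, s) for rank, (m, s) in enumerate(candidates[:k], start=1)]


def count_confirmed(
    ranked: List[Tuple[int, str, float]],
    evidence: Dict[str, Optional[str]],
) -> int:
    """Count ranked microbes that have a literature reference (e.g. a PMID)."""
    return sum(1 for _, microbe, _ in ranked if evidence.get(microbe))


# Toy example with made-up scores and placeholder evidence strings.
toy_scores = {"microbe_a": 0.91, "microbe_b": 0.40, "microbe_c": 0.77, "microbe_d": 0.65}
toy_evidence = {
    "microbe_a": "PMID:placeholder",
    "microbe_c": None,  # no supporting publication found
    "microbe_d": "PMID:placeholder",
}

ranked = top_k_candidates(toy_scores, k=3, known=["microbe_b"])
confirmed = count_confirmed(ranked, toy_evidence)
print(ranked)
print(f"{confirmed} of {len(ranked)} candidates confirmed")
```

With real model outputs, `scores` would hold the classifier's predicted probability for every microbe paired with the drug of interest, and `evidence` would be filled in by manual literature search.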
Conclusion
Recent studies have shown that microbes residing within and upon the human body play critical roles in human health. Accurately identifying microbe–drug associations is a crucial step in precision medicine. Here we propose a novel approach named SCSMDA to predict microbe–drug associations, which outperforms all the baseline approaches. SCSMDA employs the meta-path-induced networks of microbes and drugs to enhance, through a contrastive learning strategy, the feature representations learned from their similarity networks, thereby obtaining deep-level representations. In addition, SCSMDA uses a self-paced negative sampling strategy to select the most informative negative samples, so that the MLP classifier is trained more efficiently.
To comprehensively evaluate the performance of SCSMDA and the baseline methods, we conduct 5-CV experiments on three public datasets. Experimental results show that the proposed method achieves the highest scores on the AUC and AUPRC evaluation metrics. We also conduct comparison experiments under different sampling ratios (# positive samples: # negative samples = 1:1, 1:5 and 1:10), and SCSMDA achieves the best performance on these datasets. In addition, a model ablation experiment further verifies the effectiveness of the structure-enhanced contrastive learning strategy and the self-paced negative sampling strategy, and parameter sensitivity experiments are used to select the best parameters for SCSMDA. Finally, the results of the case studies on two common drugs are supported by published literature, which further confirms the advantages of SCSMDA in discovering novel MDAs.
Future work can proceed in two directions. Firstly, other biological entities such as genes and proteins could be used to build a more comprehensive knowledge graph of microbes and drugs. Learning the embeddings of microbes and drugs from such a knowledge graph may improve the accuracy of the MDA prediction model. Secondly, since predicting associations between biological entities is one of the foundational tasks in computational biology, SCSMDA can be applied to other link prediction problems such as drug–drug interaction and miRNA–disease association prediction.
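To make the sampling-ratio setting concrete, the sketch below draws negatives at a chosen positive-to-negative ratio and computes AUC for a toy set of scores. It is an illustration under assumptions, not the SCSMDA implementation: negatives are drawn uniformly at random from unlabeled pairs (the self-paced strategy is not reproduced here), and `sample_negatives` and `auc_score` are hypothetical helper names.

```python
import random
from typing import List, Sequence, Set, Tuple

Pair = Tuple[int, int]  # (microbe index, drug index)


def sample_negatives(positives: Set[Pair], n_microbes: int, n_drugs: int,
                     ratio: int, seed: int = 0) -> List[Pair]:
    """Uniformly sample `ratio` unlabeled pairs per positive pair."""
    rng = random.Random(seed)
    unlabeled = [(m, d) for m in range(n_microbes) for d in range(n_drugs)
                 if (m, d) not in positives]
    # Never ask for more negatives than there are unlabeled pairs.
    n_wanted = min(len(unlabeled), ratio * len(positives))
    return rng.sample(unlabeled, n_wanted)


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy positives on a 10 x 5 microbe-drug grid (made-up indices).
positives = {(0, 0), (1, 2), (3, 1)}
for ratio in (1, 5, 10):
    negatives = sample_negatives(positives, n_microbes=10, n_drugs=5, ratio=ratio)
    print(f"1:{ratio} -> {len(positives)} positives, {len(negatives)} negatives")

# Toy labels and scores; three of the four positive-negative pairs are ordered correctly.
auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)
```

In a 5-CV setting, the positive pairs would be split into five folds and this sampling and scoring would be repeated per fold; a library implementation of AUC and AUPRC would normally replace the quadratic-time helper used here for clarity.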
SCSMDA constructs the meta-path-induced networks for microbes and drugs by utilizing their different meta-paths with semantic meanings.
SCSMDA employs the structure-enhanced contrastive learning strategy to obtain effective representations of microbes and drugs.
SCSMDA adopts the self-paced negative sampling strategy to select the most informative negative samples for training the MLP classifier.
Results on three public datasets indicate that SCSMDA outperforms seven other baseline methods in the microbe–drug association prediction task.
Acknowledgements
The authors thank the anonymous reviewers for their valuable suggestions.
Funding
National Science Foundation of China (No. 61801432, 62031003).
Author contributions statement
Z.T. conceived the experiment and the whole manuscript. Y.Y. developed the codes and algorithm. Z.T., H.F. and Y.Y. set up the general idea of this study. W.X. and M.G. revised the manuscript. All authors have read and approved the manuscript.
Availability and implementation
The source code and databases are available at https://github.com/Yue-Yuu/SCSMDA-master.
Author Biographies
Zhen Tian, PhD (Harbin Institute of Technology), is a lecturer at the School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China. His current research interests include computational biology, complex network analysis and data mining.
Yue Yu is currently pursuing a master's degree in Computer Science and Technology at Zhengzhou University, Zhengzhou, China. His research interests include knowledge graph embedding, bioinformatics and deep learning.
Haichuan Fang is currently pursuing a master's degree in Engineering at Zhengzhou University, Zhengzhou, China. His research interests include knowledge graph embedding, bioinformatics and deep learning.
Weixin Xie, PhD (Harbin Engineering University, Harbin, China). Her research focuses on biomedical informatics, deep learning and text mining.
Maozu Guo is a professor at the College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China. He received the PhD degree in Computer Science and Technology from Harbin Institute of Technology. His research interests include bioinformatics, machine learning and data mining.
References
Zhang H, John K, Baise D, et al.
Hassani K, Khasahmadi AH.
Peng Z, Huang W, Luo M, et al.
Wang Y, Min Y, Chen X, et al.
Wang X, Liu N, Han H, et al.
Van der Maaten L, Hinton G. Visualizing data using t-SNE.