Jie Pan, Zhen Zhang, Ying Li, Jiaoyang Yu, Zhuhong You, Chenyu Li, Shixu Wang, Minghui Zhu, Fengzhi Ren, Xuexia Zhang, Yanmei Sun, Shiwei Wang, A microbial knowledge graph-based deep learning model for predicting candidate microbes for target hosts, Briefings in Bioinformatics, Volume 25, Issue 3, May 2024, bbae119, https://doi.org/10.1093/bib/bbae119
Abstract
Predicting interactions between microbes and hosts plays a critical role in microbiome population genetics and in microbial ecology and evolution. Systematically characterizing the sophisticated mechanisms and signal interplay between microbes and hosts is a significant challenge for addressing global health risks. Identifying microbe-host interactions (MHIs) can not only provide helpful insights into their fundamental regulatory mechanisms, but also facilitate the development of targeted therapies for microbial infections. In recent years, computational methods have become an appealing alternative due to the high risk and cost of wet-lab experiments. Therefore, in this study, we utilized rich microbial metagenomic information to construct a novel heterogeneous microbial network (HMN)-based model named KGVHI to predict candidate microbes for target hosts. Specifically, KGVHI first builds an HMN by integrating human proteins, viruses and pathogenic bacteria with their biological attributes. KGVHI then adopts a knowledge graph embedding strategy to capture the global topological structure information of the whole network, and a natural language processing algorithm to extract the local biological attribute information from the nodes in the HMN. Finally, we combined the local and global information and fed it into a blended deep neural network (DNN) for training and prediction. Comprehensive experimental results show that our model outperforms state-of-the-art methods on the three corresponding MHI datasets. Furthermore, we conducted two pathogenic bacteria case studies to further show that KGVHI has excellent predictive capability for potential MHI pairs.
INTRODUCTION
The main precipitants of infectious diseases are the emergence and re-emergence of pathogens, which present major challenges to public health worldwide [1]. In the last decade, with the rapid evolution of deep sequencing techniques and the rich information of dense and diverse microbial communities, the microbiome has attracted substantial attention in the fields of human health and biological application [2]. Inter-species interactions of proteins [3], DNA and RNA [4] form a complex web that can help pathogens disrupt target cellular pathways and gene functions [5]. Therefore, studies of microbe-host interactions (MHIs) can help to elucidate the molecular mechanisms underlying infectious diseases and to develop novel biotherapeutics. For example, many studies have found that pathogens typically interact with protein bottlenecks (proteins at central locations of important pathways) and hubs (proteins with many interaction partners) in complex spatiotemporally distributed microbe-host protein–protein interaction (PPI) networks [6]. In addition, several large-scale genomics programs, such as Metagenomics of the Human Intestinal Tract [7] and the Human Microbiome Project [8], have been launched to aid our understanding of the biological and medical implications of the human microbiome [9]. However, due to time and cost constraints, the number of experimentally validated MHI pairs is very limited. As new viral diseases continue to be identified, there is an urgent need for computational methods that predict MHIs and recommend candidate target hosts from microbial proteomes [10].
Existing computational MHI prediction methods have mainly depended on the evolutionary information of protein sequences [11–13]. Generally, the approaches for MHI prediction can be classified into three categories: structure-based [14], domain-based [15] and sequence-based [16]. Structure-based methods are not applicable when the 3-dimensional (3D) structures of the target proteins are unknown. Similarly, domain-based methods need information on protein domains, which may not be available for every protein in a massive protein-interaction network. Therefore, sequence-based approaches are the most commonly used. While protein functions have been shown to be useful for predicting intra-species (e.g. human or plant) PPIs [17, 18], and such protein-specific features exist for some extensively studied bacteria (e.g. Helicobacter pylori [19] and Burkholderia pseudomallei [20]), these features are rare and costly to obtain.
Over the past decade, owing to the rapid development of computer technology, computational models can decrease experimental costs by providing viable screening strategies for biological experiments. For example, Tsukiyama et al. [21] developed a novel LSTM model named LSTM-PHV to predict human-virus interactions (HVIs). With the recent development of transfer learning technologies, Yang et al. [22] combined evolutionary sequence profile features with a multi-layer perceptron and a Siamese convolutional neural network (CNN) architecture to predict HVIs. Sun et al. [23] proposed MMiKG, which collected extensive knowledge of the microbiota-gut-brain axis and depicted it in the form of a KG. Liu-Wei et al. [24] designed a deep learning-based model named DeepViral, which embeds human proteins and infecting viruses in a shared space through their associated phenotypes and functions; DeepViral significantly improves on existing sequence-based methods by jointly learning protein sequence and phenotype features for predicting HVIs. Lian et al. [25] developed a machine learning-based method that combined three conventional sequence-based encoding schemes with two host networks to predict human-Yersinia pestis PPIs. Despite these satisfactory results, such computational methods still suffer from some unavoidable disadvantages. For instance, the topological structure of the full microbial network is not taken into account.
Benefiting from the wide application of sequencing technology and bioinformatics, MHI prediction is no longer limited to viruses but extends to other microbes related to various hosts, covering HVI, human-bacteria interaction (HBI), phage-bacteria interaction (PBI) and so on [26]. Recently, mining knowledge from heterogeneous networks, which have complex data structures with naturally high-dimensional spatial properties, has been recognized as a novel way to improve predictive performance [27]. Thus, graph-based models [28, 29] can model the microbial network structure and extract complex topological information from a global perspective. To date, a number of network embedding algorithms have been presented; they can be broadly grouped into three types: neural network-based [30], matrix factorization-based [31] and random walk-based [32]. All of these algorithms have achieved great success in the field of bioinformatics. However, this type of approach focuses only on sparse connections between different molecular nodes and ignores the types of edges and node attributes [33]. To overcome these limitations, knowledge graph (KG)-based methods [34] have gained increasing traction [35]. KG-based embedding techniques can be roughly divided into two categories: (i) semantic matching models [36] and (ii) translational distance models [37]. Methods of the first type measure factual plausibility by matching the latent semantics of entities and relations, captured by vector space representations. Methods of the second type, such as TransE [38], TransR [39] and TransH [40], measure factual plausibility by the distance between two entities. Furthermore, with the increasing popularity of graph neural networks (GNNs), new methods for KG integration have emerged [41]. These methods leverage deep learning to capture the semantic relations and higher-order structures of graph networks.
Prominent examples of such methods include KGAT [42] and R-GCN [43]. However, only a few models have been exploited for predicting MHI.
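As a concrete illustration of the translational-distance idea, the TransE score of a (head, relation, tail) triple is simply the distance between the translated head and the tail. The sketch below uses toy embeddings of our own choosing, not values from any trained model:

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """TransE plausibility of a (head, relation, tail) triple:
    a true fact should satisfy h + r ≈ t, so a smaller
    translational distance means a more plausible triple."""
    return float(np.linalg.norm(h + r - t, ord=norm))

# Toy 4-dimensional embeddings; the first tail satisfies h + r = t exactly.
h = np.array([1.0, 0.0, 0.0, 0.0])
r = np.array([0.0, 1.0, 0.0, 0.0])
t_good = np.array([1.0, 1.0, 0.0, 0.0])
t_bad = np.array([0.0, 0.0, 1.0, 1.0])
```

Under this score, `t_good` would be ranked above `t_bad` as the tail of the triple, which is exactly how translational distance models judge factual plausibility.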
In this study, we develop a novel KG-based deep learning framework called KGVHI to predict potential interactions between microbes and target hosts by aggregating local biological attribute information and global topological structure information of the microbial interaction network. In particular, we construct a heterogeneous microbial network (HMN) that includes human viruses, bacteria and phages, and apply three distinct modules to extract meaningful geometric information from the HMN. First, the KG module is applied to obtain the global topological structure information of the entities and edges of the HMN. Second, a natural language processing module is adopted to extract the local biological attribute information. Third, we design a blended deep neural network (DNN) module for integrating this information, training and prediction. Experimental results of 5-fold cross-validation (CV) on three different MHI datasets demonstrate that KGVHI achieves performance superior to that of state-of-the-art methods. In addition, case studies on two pathogenic bacteria further demonstrate the effectiveness of KGVHI in identifying microbe-related hosts.
METHODS AND MATERIALS
Overview of KGVHI model
In this paper, we present a KG-based computational model named KGVHI, which can improve the prediction of interactions between various microbes and target hosts. The microbe-host interaction graph is a directed graph, where the nodes represent entities and the edges indicate the relation types between these biomolecules. Let d denote a predefined embedding dimension.
As Figure 1 illustrates, the local biological attribute information of the HMN is represented by processing sequences with natural language processing. Then, KGVHI applies the KGE-based InteractE [44] algorithm to extract global topological structure features from the HMN. As a novel KGE-based method, InteractE uses three key ideas to increase the interactions between relation and entity embeddings. In summary, we developed a new prediction model named KGVHI and constructed a microbe-host KG to learn protein biological information and generate topology-preserving representations of numerous microbial interaction networks.

The workflow diagram depicts an overview of the proposed KGVHI model. The heterogeneous microbial network (HMN) consists of various microbes and target hosts. KGVHI combines the global topological structure features with the local biological attribute information to infer potential interaction pairs via a blended deep neural network (DNN).
Benchmark dataset description
Human-virus interaction dataset
In this study, we separated the whole MHI dataset into three categories: HVI, HBI and PBI. The HVI datasets were collected from the host-pathogen interaction database (HPIDB) [45]. To better train the proposed model, we first preprocessed the collected dataset. Specifically, we removed HVI pairs with a mutual information (MI) score below 0.3 to ensure that all utilized data had high confidence; the MI score represents the confidence level of HVI pairs and is assigned by IntAct [46] and VirHostNet [47]. Then, we used the CD-HIT algorithm with a threshold of 0.95 to remove redundant pairs [48]. Third, we only selected proteins with lengths longer than 30 and shorter than 1000 residues that were composed entirely of standard amino acids. Eventually, the final positive set is composed of 22 383 HVI pairs involving 996 virus proteins and 5882 human proteins.
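The length and composition filters described above can be sketched in a few lines (the helper name `keep_protein` is ours, not from the paper's code):

```python
# The 20 standard amino acids (one-letter codes).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def keep_protein(seq, min_len=30, max_len=1000):
    """Keep sequences longer than 30 and shorter than 1000 residues
    that consist only of the 20 standard amino acids."""
    return min_len < len(seq) < max_len and set(seq) <= STANDARD_AA
```

For example, a 100-residue poly-methionine sequence passes the filter, while a 20-residue sequence or one containing a non-standard residue such as U (selenocysteine) is discarded.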
It is well known that the construction of negative samples does not follow a recognized gold-standard procedure. Many previous works applied a random selection strategy to construct the negative data. However, random sampling risks assigning positive data to the negative samples [49]. To address this issue, we employed a dissimilarity-based negative sampling technique to select protein pairs that do not interact. In detail, the Needleman–Wunsch algorithm with BLOSUM30 [50] was applied to compute the sequence similarities of all virus proteins and assign each a similarity vector. Then, we excluded viral proteins with sequence similarity below Ts (more than half of the total) as outliers, where Ts can be calculated as follows:
where
The human proteins were retrieved from the UniProtKB/Swiss-Prot [51] database, with sequences longer than 30 and shorter than 1000 residues. Then, taking into account that human proteins may interact with virus proteins, we removed the proteins whose distance was less than a threshold T. Based on previous research [52], we set T to 0.8, and the negative samples were constructed from the remaining pairs. Finally, we randomly selected 22 383 negative samples from these candidates.
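One plausible reading of this dissimilarity-based sampling scheme is sketched below. The `similarity` callable is a stand-in for the BLOSUM30 Needleman–Wunsch score used in the paper, and all names and the exact filtering rule are our illustrative assumptions:

```python
import random

def sample_negatives(virus_proteins, human_proteins, positives,
                     similarity, threshold=0.8, n_samples=10, seed=0):
    """Dissimilarity-based negative sampling sketch: keep only
    (virus, human) pairs that are not known positives and whose virus
    protein is dissimilar (below `threshold`, T = 0.8 in the text) to
    every virus protein appearing in a known interacting pair, then
    sample uniformly from the remaining candidates."""
    rng = random.Random(seed)
    pos = set(positives)
    candidates = [
        (v, h)
        for v in virus_proteins for h in human_proteins
        if (v, h) not in pos
        and all(similarity(v, pv) < threshold for pv, _ in pos)
    ]
    return rng.sample(candidates, min(n_samples, len(candidates)))

# Toy run with a dummy similarity that treats all proteins as dissimilar.
negatives = sample_negatives(["v1", "v2"], ["h1", "h2"],
                             positives=[("v1", "h1")],
                             similarity=lambda a, b: 0.0)
```

With two virus and two human proteins and one known positive pair, three candidate pairs remain, and the known interaction is never selected as a negative.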
Human-bacteria interaction dataset
The prediction of interactions between humans and target bacterial pathogens is an important step in systematically analyzing the basic mechanisms of bacterial infection. Thus, we also collected an HBI dataset to further demonstrate the generalizability of the proposed KGVHI model. The pathogenic bacteria included in the HMN mainly comprised Yersinia, Bacillus and Francisella tularensis. The HBI dataset was collected from the HPIDB database [45]. After preprocessing the collected dataset, we obtained 8653 HBI pairs, which involve 3502 human proteins and 2563 bacterial proteins.
Phage-bacteria interaction dataset
The third MHI dataset used in this work is PBI. Due to the abuse of antibiotics, bacterial resistance continues to increase. Bacteriophages (phages) are viruses that specifically infect and lyse bacterial cells; phage therapy is therefore a potential solution to this problem. The identification of PBIs can help predict phages for target bacteria. The bacteria explored in this work are the ESKAPE pathogens, which frequently cause bacterial infections due to their multidrug resistance and aggressive phenotypes. We collected the corresponding datasets from MillardLab [53] and the UniProt database [54], which provided 959 phage tail proteins and 522 bacterial receptor-binding proteins.
Capturing global topological structure information from HMN
Graph embedding algorithms can mine hidden information and linear patterns between edges and entities. Nevertheless, traditional graph embedding methods are not suitable for our work because they focus only on the connections between biological nodes and ignore a large number of edge and entity attributes. In this part, we focus on utilizing a KG-based algorithm, InteractE, to extract global topological structure information from the HMN. Benefiting from the multilayer advantages of CNNs, InteractE can increase expressive power while remaining parameter-efficient. Previous work has shown that a model's expressive power improves as the number of possible interactions between embeddings increases. InteractE extends this concept and leverages three ideas (feature permutation, checkered reshaping and circular convolution) to mine the interactions between entity and relation features. In this way, each biological node can be expressed as a specific embedding vector.
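Of the three ideas, checkered reshaping is the easiest to illustrate: entity and relation components are interleaved so that no two adjacent cells of the 2D feature map come from the same embedding. A minimal sketch (our own illustration, not the authors' code):

```python
import numpy as np

def checkered_reshape(e, r, rows, cols):
    """Arrange entity components (e) and relation components (r) in a
    checkerboard pattern on a rows x cols feature map, maximizing the
    number of adjacent entity-relation component pairs."""
    assert len(e) == len(r) == rows * cols // 2
    out = np.empty((rows, cols))
    ei, ri = iter(e), iter(r)
    for i in range(rows):
        for j in range(cols):
            # Even (i + j): entity slot; odd (i + j): relation slot.
            out[i, j] = next(ei) if (i + j) % 2 == 0 else next(ri)
    return out

# Mark entity components as 1 and relation components as 0 to see the pattern.
grid = checkered_reshape(np.ones(4), np.zeros(4), rows=2, cols=4)
```

On the resulting grid, every horizontally or vertically adjacent pair mixes one entity and one relation component, which is what lets the subsequent convolution capture entity-relation interactions.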
Suppose $e_s$ and $e_r$ denote the embeddings of a subject entity and a relation, respectively. InteractE scores a triple $(s, r, o)$ as

$$\psi(s, r, o) = g\left(\mathrm{vec}\left(f\left(\phi(P_k) \star w\right)\right) W\right) e_o,$$

where $\mathrm{vec}(\cdot)$, $e_o$, $\star$ and $W$ denote vector concatenation, the object entity embedding, depth-wise circular convolution and a learnable weight matrix, respectively; $\phi(P_k)$ is the checkered reshaping of the permuted entity and relation features, $w$ is the convolution filter, and $f$ and $g$ are activation functions.

Circular convolution is used because it can capture more feature interactions than standard convolution. In the figure, X represents the input matrix, and the shaded area marks the location where the filter is applied.
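The wrap-around behavior can be demonstrated with a minimal single-channel 2D circular convolution built from periodic padding (a toy sketch, not the paper's implementation):

```python
import numpy as np

def circular_conv2d(x, w):
    """2D circular convolution of input x with filter w: the input is
    padded periodically (wrap-around), so the filter also captures
    interactions across opposite borders, unlike zero padding."""
    kh, kw = w.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="wrap")
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

x = np.arange(9.0).reshape(3, 3)
w = np.ones((3, 3))
out = circular_conv2d(x, w)
# With wrap padding, every 3x3 window over a 3x3 input covers each
# entry exactly once, so every output entry equals the total sum (36).
```

With zero padding, border outputs would be smaller than interior ones; the wrap-around makes every position "see" the full input, which is the extra information circular convolution captures.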
Method of local biological attribute feature embedding from the HMN
In the field of computational biology, natural language processing-based approaches such as doc2vec [57] and word2vec [58] are used to extract biological attribute information from protein sequences or structures. The weights of the neural network in word2vec can represent and encode different linguistic regularities and patterns of the original biological sequences. The word2vec algorithm has two ways (Skip-Gram and CBOW) to learn and extract semantic information from biological sequences. According to a previous study [59], the CBOW-based model uses the surrounding context to predict the current word, which allows it to learn faster than Skip-Gram, while Skip-Gram uses the current word to predict its context, yielding more accurate output.
In this work, we used the Skip-Gram-based word2vec module to capture local biological attribute information from the HMN. We used the idea of k-mers to construct words from each microbial sequence, with each sequence regarded as a sentence. Taking the sequence MTDTLDLE as an example, its 4-mers are MTDT, TDTL, DTLD, TLDL and LDLE. We used the gensim Python package to run the Skip-Gram algorithm for learning the appearance patterns of microbial sequences [60]. We set k to 4 and iterated 1000 times to obtain a better prediction model [61]. Finally, the word2vec model produces a 64-dimensional vector for each k-mer to represent the local biological attribute information.
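The k-mer "sentence" construction can be sketched as follows; the subsequent gensim call is shown only in a comment, because any training settings beyond Skip-Gram (sg=1), a vector size of 64 and 1000 iterations are our assumptions:

```python
def kmerize(seq, k=4):
    """Split a sequence into overlapping k-mer 'words'; the whole
    sequence then plays the role of a sentence for word2vec."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# The example from the text: MTDTLDLE yields five 4-mers.
words = kmerize("MTDTLDLE")

# The k-mer sentences can then be fed to a Skip-Gram word2vec model, e.g.:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, sg=1, vector_size=64, epochs=1000)
```

A sequence of length L yields L - k + 1 overlapping words, so even short proteins produce enough "vocabulary" for the embedding model.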
A deep learning-based classifier by combining CNN and DNN
To provide a comprehensive prediction model, we propose a blended DNN, which combines a CNN with a fully connected neural network (FCNN) to exploit multiple sources of information. The multi-level network is shown in Figure 3. The CNN part of KGVHI is composed of two 2D convolutional layers (CLs), with a kernel size of
where

The architecture of the DNN module in the proposed KGVHI model.
The second module is an FCNN that processes the global and local information and yields the final prediction results. This module first accepts the input features and then passes them through three layers with 32, 64 and 128 neurons. The outputs of the two modules are flattened, concatenated and fed into a dense layer, where a final softmax function generates the prediction score. In addition, we set the learning rate to 1e-2 and used the AdaGrad optimizer to update the parameters. Finally, binary cross-entropy was used as the loss function, defined as:
where y represents the binary class label, and the probability that the results belong to the label y is represented as
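The binary cross-entropy loss can be written out numerically as follows (a NumPy sketch for illustration; the clipping constant `eps` is our addition for numerical safety):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -mean(y*log(p) + (1-y)*log(1-p)), where y is the binary
    class label and p is the predicted probability of the positive class."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.1, 0.8, 0.2])
loss = binary_cross_entropy(y, p)   # small loss for mostly-correct predictions
```

The loss approaches zero as predicted probabilities approach the true labels and grows without bound as confident predictions become wrong, which is what makes it a suitable training objective for the blended DNN.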
EXPERIMENTS AND RESULTS
Prediction measures
To fully validate the prediction performance of the proposed KGVHI model, eight commonly used evaluation metrics were adopted, including accuracy (Acc), sensitivity (Sen, or Recall), precision (Prec), Matthews correlation coefficient (MCC) and F1-score (F1). These measures can be calculated by:
where true positive (TP) denotes positive samples identified as positive, true negative (TN) denotes negative samples predicted to be negative, false positive (FP) denotes negative samples predicted to be positive and false negative (FN) denotes positive samples predicted to be negative. Besides, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) are also employed to visualize the prediction performance of the KGVHI model.
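These measures follow directly from the four confusion-matrix counts; a minimal sketch (the function name is ours):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Acc, Sen (recall), Prec, F1 and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)          # sensitivity / recall
    prec = tp / (tp + fp)         # precision
    f1 = 2 * prec * sen / (prec + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Sen": sen, "Prec": prec, "F1": f1, "MCC": mcc}

m = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

For the toy counts above, Acc is 175/200 = 0.875 and sensitivity is 90/100 = 0.9; MCC, which balances all four counts, lies between 0 and 1 for a better-than-random classifier.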
Evaluation of prediction performance
To fully verify the predictive ability of KGVHI, we evaluated it on three different MHI datasets (HVI, HBI and PBI) under a 5-fold CV framework. First, the whole MHI dataset was split into five equally sized subsets. Second, four of them were used to train the model, and the remaining one was used to test it. This procedure was repeated five times until every subset had been used for validation once.
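The splitting procedure can be sketched in a few lines of pure Python (an illustrative re-implementation, not the authors' code):

```python
import random

def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices, split them into n_folds roughly equal
    subsets, and yield (train, test) index lists so that each subset
    serves as the test fold exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(k_fold_splits(100))
```

Each of the five (train, test) pairs is disjoint, and the five test folds together cover every sample exactly once, matching the protocol described above.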
As illustrated in Table 1, the average Acc values that KGVHI yielded on the three MHI datasets (HVI, HBI and PBI) are 95.46%, 92.88% and 92.31%, with SDs of 0.22%, 0.72% and 1.15%, respectively. Besides, KGVHI obtains average AUC values of 0.9892, 0.9788 and 0.9724, with SDs of 0.0007, 0.0039 and 0.0052, respectively, as shown in Figures 4–6. Moreover, the average AUPR values are 0.9866, 0.9744 and 0.9634, with SDs of 0.0013, 0.0048 and 0.0098, respectively. After analyzing these results, we found that KGVHI obtained the highest Acc value, over 95%, on the HVI dataset, while the accuracy on the HBI and PBI datasets is relatively low. We attribute this to the size of the dataset: in general, the larger the dataset, the more effective the prediction model. Another possibility is that KGVHI is more sensitive to human and virus proteins. However, we still obtained a high Acc of 92.31% on the smallest (PBI) dataset. Collectively, the proposed KGVHI model is efficient and robust in predicting different types of MHIs.



| Dataset | Acc (%) | Sen (%) | Prec (%) | MCC (%) | F1 (%) | AUC | AUPR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HVI | 95.46 ± 0.22 | 96.05 ± 0.96 | 94.95 ± 0.85 | 90.95 ± 0.43 | 95.49 ± 0.22 | 0.9892 ± 0.0007 | 0.9866 ± 0.0013 |
| HBI | 92.88 ± 0.72 | 90.57 ± 1.66 | 91.01 ± 1.39 | 85.86 ± 1.40 | 93.04 ± 0.67 | 0.9788 ± 0.0039 | 0.9744 ± 0.0048 |
| PBI | 92.31 ± 1.15 | 88.50 ± 2.02 | 89.34 ± 1.62 | 84.89 ± 2.30 | 92.59 ± 1.11 | 0.9724 ± 0.0052 | 0.9634 ± 0.0098 |
In order to gain a deeper insight into the distribution of the prediction results, we employ the t-distributed stochastic neighbor embedding (t-SNE) algorithm to visualize the training process. As shown in Figure 7, the positive samples were indicated by the aqua color ‘+’ and the negative samples were represented by the purple color ‘−’. By comparison, one can see that when the training epochs increased, KGVHI could roughly isolate and distinguish these biological samples.

The t-SNE transformed 2D visualization of positive and negative samples during different epochs of the proposed KGVHI model.
Best embedding dimensions of KGVHI model
The embedding dimension is a hyperparameter of representation learning that encodes the topological structure of the HMN into a specific vector space. It is challenging but necessary for KGVHI to choose an appropriate embedding dimension that can be applied to multiple prediction tasks: the chosen dimension must be large enough to be effective for modeling, but small enough to be computationally efficient. In this section, we designed an experiment to find the optimal embedding dimension. Specifically, we applied the same embedding method to the MHI dataset with different embedding dimensions (16, 32, 64 and 128) to extract local and global features from the HMN. Table 2 and Figure 8 list the prediction results obtained with these four dimensions. We found that prediction accuracy tends to increase and then decrease as the embedding dimension increases. In the proposed KGVHI model, 64 dimensions extract the richest topological information without introducing too much noise. Finally, it can be concluded that KGVHI is stable in predicting potential target hosts for various microorganisms.

Prediction performance of KGVHI with a 5-fold CV framework of different embedding dimensions.
| Dimension | Acc (%) | Sen (%) | Prec (%) | MCC (%) | F1 (%) | AUC | AUPR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 94.29 ± 0.13 | 94.63 ± 1.32 | 94.01 ± 1.09 | 88.60 ± 0.27 | 94.31 ± 0.16 | 0.9855 ± 0.0009 | 0.9832 ± 0.0011 |
| 32 | 94.85 ± 0.33 | 96.05 ± 1.15 | 93.84 ± 1.43 | 89.76 ± 0.63 | 94.92 ± 0.28 | 0.9878 ± 0.0010 | 0.9858 ± 0.0012 |
| 64 | 95.46 ± 0.22 | 96.05 ± 0.96 | 94.95 ± 0.85 | 90.95 ± 0.43 | 95.49 ± 0.22 | 0.9892 ± 0.0007 | 0.9866 ± 0.0013 |
| 128 | 92.20 ± 0.24 | 94.16 ± 0.97 | 90.62 ± 1.09 | 84.48 ± 0.41 | 92.35 ± 0.17 | 0.9748 ± 0.0010 | 0.9701 ± 0.0013 |
Comparison of KGVHI with state-of-the-art methods
To comprehensively illustrate the validity of the proposed method, several KG embedding methods were chosen for comparison with ours, including ConvE [63], DistMult [64] and TransE [40]. In addition, to demonstrate the effectiveness of the proposed deep learning-based classifier, we compared it with some widely used machine learning classifiers, including a support vector machine (SVM; c = 0.045, g = 1.3), random forest (RF, n = 13) and gradient-boosting decision tree (GBDT, n = 5).
These comparison experiments were performed on the HVI dataset. We first compared KGVHI with these methods under a 5-fold CV framework; all prediction results are listed in Table 3. To ensure a fair and thorough comparison, the global features extracted by the KG methods were combined with the same local features. Similarly, when comparing the DNN module with the machine learning-based classifiers, we used the same local and global features and the same dataset splits. Moreover, the parameters of the KGE algorithms were set to their defaults, and the parameters of the machine learning classifiers were fine-tuned. Consequently, these methods may not achieve their best possible performance; however, since our goal is to validate the overall effectiveness of the compared methods, we did not pursue marginal gains through exhaustive parameter tuning. As seen in Figure 9, KGVHI outperforms all comparison methods in AUC and AUPR values. We attribute this performance to the fact that KGVHI effectively combines InteractE and the word2vec algorithm to capture the latent information of the HMN, and to the proposed blended DNN, which fuses the globally and locally derived features.

The performance of KGVHI and comparison methods under a 5-fold CV framework on the HVI dataset.
| Method | Acc (%) | Sen (%) | Prec (%) | MCC (%) | F1 (%) | AUC | AUPR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ConvE | 92.11 ± 0.43 | 93.25 ± 1.16 | 91.18 ± 1.12 | 84.25 ± 0.86 | 92.20 ± 0.42 | 0.9721 ± 0.0026 | 0.9664 ± 0.0042 |
| DistMult | 88.87 ± 0.70 | 92.05 ± 2.86 | 86.70 ± 2.87 | 78.05 ± 0.94 | 89.22 ± 0.37 | 0.9518 ± 0.0029 | 0.9395 ± 0.0067 |
| TransE | 86.42 ± 0.60 | 88.07 ± 1.66 | 85.31 ± 1.77 | 72.93 ± 1.20 | 86.64 ± 0.50 | 0.9333 ± 0.0047 | 0.9225 ± 0.0058 |
| SVM | 92.19 ± 2.24 | 96.11 ± 1.36 | 89.16 ± 2.73 | 84.65 ± 4.37 | 92.50 ± 2.09 | 0.9811 ± 0.0070 | 0.9810 ± 0.0069 |
| RF | 84.70 ± 0.42 | 83.13 ± 1.17 | 85.86 ± 1.44 | 69.46 ± 0.91 | 84.46 ± 0.32 | 0.9231 ± 0.0050 | 0.9164 ± 0.0064 |
| GBDT | 86.67 ± 0.52 | 99.14 ± 0.48 | 79.37 ± 0.71 | 75.75 ± 0.88 | 88.15 ± 0.40 | 0.9180 ± 0.0055 | 0.8699 ± 0.0053 |
| KGVHI | 95.46 ± 0.22 | 96.05 ± 0.96 | 94.95 ± 0.85 | 90.95 ± 0.43 | 95.49 ± 0.22 | 0.9892 ± 0.0007 | 0.9866 ± 0.0013 |
Ablation experiments
The comparison results above demonstrate the validity of KGVHI, whose robustness benefits from its design: a KG-based embedding method that generates global topological features and a natural language processing-based method that captures local biological attribute features. To explore the contribution of these modules, we carried out an ablation study on the HVI dataset using the following variants of KGVHI.
(i) KGVHI_G is a variant of KGVHI that only uses the global topological structure features.
(ii) KGVHI_L is a variant of KGVHI that only uses the local biological attribute features.
The prediction results of KGVHI and its two variants are listed in Table 4. The prediction power of KGVHI is impaired when either module is removed, indicating that both components are essential. As shown in Figure 10, removing the local features affected the KGVHI_G variant significantly, reducing the ACC and MCC values by 18.72% and 36.65%, respectively; the ACC and MCC values of KGVHI_L were reduced by 15.16% and 30.26%, respectively. Comparing KGVHI with KGVHI_L shows that global topological structure information substantially enhances the prediction ability of the model, while the comparison between KGVHI_G and KGVHI_L further demonstrates that local biological attribute information is essential for predicting MHIs. Removing global information likewise degrades performance, suggesting that the topological information of the HMN must be taken into account.
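The two ablation variants amount to feeding the classifier only one of the two feature blocks instead of their concatenation. The following minimal sketch (with toy random features and hypothetical dimensions; the real model is a blended DNN, not shown here) illustrates how the inputs for each variant can be assembled:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two feature blocks of a node pair:
# global topological embeddings (KG module) and local biological
# attribute embeddings (NLP module). Dimensions are illustrative.
n_pairs, d_global, d_local = 8, 16, 16
global_feat = rng.normal(size=(n_pairs, d_global))
local_feat = rng.normal(size=(n_pairs, d_local))

def build_input(variant: str) -> np.ndarray:
    """Assemble the classifier input for each ablation variant."""
    if variant == "KGVHI":      # full model: concatenate both blocks
        return np.concatenate([global_feat, local_feat], axis=1)
    if variant == "KGVHI_G":    # global topological features only
        return global_feat
    if variant == "KGVHI_L":    # local biological attribute features only
        return local_feat
    raise ValueError(f"unknown variant: {variant}")

assert build_input("KGVHI").shape == (n_pairs, d_global + d_local)
```

Each variant's input is then passed to the same downstream classifier, so any performance gap is attributable to the missing feature block.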

Figure 10. The comparison results of the different variants on the HVI dataset.
Table 4. Comparison results of KGVHI and its variants on the HVI dataset under 5-fold CV

| Model | ACC (%) | Sen (%) | Prec (%) | MCC (%) | F1 (%) | AUC | AUPR |
|---|---|---|---|---|---|---|---|
| KGVHI_G | 76.74 ± 0.97 | 69.08 ± 4.74 | 81.83 ± 3.16 | 54.30 ± 1.93 | 74.75 ± 1.77 | 0.8216 ± 0.0099 | 0.8217 ± 0.0109 |
| KGVHI_L | 80.30 ± 0.43 | 80.73 ± 3.03 | 80.13 ± 1.77 | 60.69 ± 0.84 | 80.37 ± 0.77 | 0.8699 ± 0.0045 | 0.8563 ± 0.0074 |
| KGVHI | 95.46 ± 0.22 | 96.05 ± 0.96 | 94.95 ± 0.85 | 90.95 ± 0.43 | 95.49 ± 0.22 | 0.9892 ± 0.0007 | 0.9866 ± 0.0013 |
Case study
To further verify the predictive ability of KGVHI, we carried out case studies on two common bacteria, Acinetobacter baumannii and Staphylococcus aureus. A. baumannii, a gram-negative pathogenic bacterium, is one of the most common causes of nosocomial infection. It can cause severe bacterial diseases such as meningitis, pneumonia, endocarditis, peritonitis, and urinary tract and skin infections. The widespread use and misuse of antibiotics have driven the emergence of multidrug-resistant A. baumannii, which has attracted increasing attention from microbiology researchers. S. aureus is a highly pathogenic gram-positive bacterium that is common in clinical settings. It typically colonizes the skin, nasal cavity, throat and septic sores of humans and animals, and is also ubiquitous in air, sewage and other environments.
For each bacterial species, phages with known interactions with that bacterium were first removed. The remaining candidate phages were then scored by KGVHI and ranked in descending order of predicted score; in other words, KGVHI relies only on known microbiome knowledge and the local and global information extracted from the training sets. After prediction, the top 30 ranked samples were checked against public databases and the available literature. As shown in Tables 5 and 6, 27 of the top 30 predicted target phages for A. baumannii and 29 of the top 30 for S. aureus have been validated by previous studies. These results further demonstrate the generalizability and validity of our model, and we hope it can help uncover more potential MHI pairs.
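The case-study procedure (exclude known partners, score the remaining candidates, rank them and keep the top 30) can be sketched as follows; the phage identifiers and scores below are purely hypothetical placeholders, not model outputs:

```python
def rank_candidates(scores, known_partners, top_k=30):
    """Rank candidate phages for one bacterial species.

    scores: dict mapping phage id -> predicted interaction score.
    known_partners: phages with already known interactions; these are
    excluded before ranking so only novel candidates remain.
    Returns the top_k (phage id, score) pairs in descending score order.
    """
    candidates = {p: s for p, s in scores.items() if p not in known_partners}
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# Hypothetical scores for illustration only.
scores = {"phage_a": 0.97, "phage_b": 0.99, "phage_c": 0.42, "phage_d": 0.88}
top = rank_candidates(scores, known_partners={"phage_b"}, top_k=2)
assert top == [("phage_a", 0.97), ("phage_d", 0.88)]
```

The top-ranked candidates are then checked manually against public databases and the literature, as in Tables 5 and 6.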
Table 5. The top 30 candidate phages predicted by KGVHI for A. baumannii

| Rank | UniProt id | Score | Evidence | Rank | UniProt id | Score | Evidence |
|---|---|---|---|---|---|---|---|
| 1 | ARB06757.1 | 0.99984 | Confirmed | 16 | A0A068CGF5 | 0.99701 | Confirmed |
| 2 | A0A0A0RMQ0 | 0.99982 | Confirmed | 17 | ARQ94727.1 | 0.99653 | Confirmed |
| 3 | A0A2S1GTT1 | 0.99978 | Confirmed | 18 | A0A221SBN3 | 0.99647 | Confirmed |
| 4 | A0A2S1GTS5 | 0.99976 | Confirmed | 19 | U5PW98 | 0.99646 | Confirmed |
| 5 | A0A386KK43 | 0.99973 | Confirmed | 20 | A0A190XCC0 | 0.99627 | Confirmed |
| 6 | A0A172Q097 | 0.99955 | Confirmed | 21 | A0A386KAA1 | 0.99595 | Confirmed |
| 7 | AWD93192.1 | 0.99943 | Confirmed | 22 | YP_009206147.1 | 0.99560 | NA |
| 8 | A0A346FJ10 | 0.99925 | Confirmed | 23 | APD19509.1 | 0.99560 | NA |
| 9 | ARQ94726.1 | 0.99904 | Confirmed | 24 | A0A0P0IE19 | 0.99505 | Confirmed |
| 10 | A0A220NQG3 | 0.99893 | Confirmed | 25 | A0A075DXN1 | 0.99494 | Confirmed |
| 11 | CEK40295.1 | 0.99780 | NA | 26 | AXY82734.1 | 0.99469 | Confirmed |
| 12 | A0A386KM25 | 0.99768 | Confirmed | 27 | A0A0P0IJ73 | 0.99466 | Confirmed |
| 13 | A0A220NQG5 | 0.99756 | Confirmed | 28 | J7I0X3 | 0.99464 | Confirmed |
| 14 | AFV51556.1 | 0.99725 | Confirmed | 29 | E5KJQ6 | 0.99421 | Confirmed |
| 15 | AXY82661.1 | 0.99722 | Confirmed | 30 | A0A0A0RR02 | 0.99391 | Confirmed |
Table 6. The top 30 candidate phages predicted by KGVHI for S. aureus

| Rank | UniProt id | Score | Evidence | Rank | UniProt id | Score | Evidence |
|---|---|---|---|---|---|---|---|
| 1 | A0A2D1GPH4 | 0.99664 | Confirmed | 16 | YP_002332534.1 | 0.99518 | Confirmed |
| 2 | A0A2R4P8T8 | 0.99661 | Confirmed | 17 | A0A1D6Z279 | 0.99515 | Confirmed |
| 3 | A0A345AQC0 | 0.99655 | Confirmed | 18 | A0A0E4BZU6 | 0.99505 | Confirmed |
| 4 | Q8SDP3 | 0.99647 | Confirmed | 19 | YP_009006774.1 | 0.99469 | Confirmed |
| 5 | QAY02680.1 | 0.99638 | Confirmed | 20 | M9NSU7 | 0.99464 | Confirmed |
| 6 | AUG85753.1 | 0.99636 | Confirmed | 21 | YP_001004393.1 | 0.99456 | Confirmed |
| 7 | APD20961.1 | 0.99630 | Confirmed | 22 | A0A220BYL5 | 0.99446 | Confirmed |
| 8 | AWD93112.1 | 0.99619 | Confirmed | 23 | AUS03378.1 | 0.99443 | Confirmed |
| 9 | A0A345AQC1 | 0.99616 | Confirmed | 24 | YP_001004325.1 | 0.99438 | Confirmed |
| 10 | APD20962.1 | 0.99612 | Confirmed | 25 | A3RE05 | 0.99422 | NA |
| 11 | A0A2R4P8U9 | 0.99595 | Confirmed | 26 | A0A2K9V4F5 | 0.99421 | Confirmed |
| 12 | YP_009278564.1 | 0.99587 | Confirmed | 27 | A0A0E3XCZ0 | 0.99420 | Confirmed |
| 13 | A0A1D6Z282 | 0.99566 | Confirmed | 28 | A0A0H3U4S3 | 0.99420 | Confirmed |
| 14 | APD20939.1 | 0.99548 | Confirmed | 29 | YP_009006776.1 | 0.99419 | Confirmed |
| 15 | S4V7L6 | 0.99534 | Confirmed | 30 | D2K045 | 0.99419 | Confirmed |
DISCUSSION AND CONCLUSION
Recent studies have clearly demonstrated that microorganisms play vital roles in host health and disease through their interaction mechanisms. Predicting MHIs can facilitate drug development and the design of novel phage therapies. Compared with traditional wet-lab experiments, computational methods offer a more convenient way to identify target viruses for known human proteins or novel phages for known bacteria. However, few computational methods have been applied to these hard problems, possibly because only a small number of interaction pairs have been experimentally validated and are available for training computational models.
In this paper, we propose a KG-based model named KGVHI to predict potential MHIs by combining local biological attribute information with global topological structure information; the model is both inductive and scalable. KGVHI not only considers phage and bacterial biochemical information, but also employs the idea of a KG to extract global topological structure information from the HMN. Specifically, the word2vec algorithm is used to extract biological features from the HMN; it is worth noting that word2vec can preserve rich semantic information in biological sequences. In recent years, owing to rapid advances in high-throughput technologies, metagenomic data on microbiota have accumulated quickly; nevertheless, few data are available for predicting host-associated microbes. In such a case, KGVHI applies a KG algorithm, InteractE, to integrate neighboring features from the HMN, thus mitigating the negative impact of insufficient data. This multivariate representation ensures that KGVHI fully realizes the potential of the HMN, which increases the prediction accuracy for microbes and their target hosts. On the other hand, KGVHI also takes advantage of a blended DNN to conduct the prediction task accurately. The blended DNN has more hidden layers and neurons, which further increases the ability of KGVHI to process multivariate heterogeneous information and to perform robustly in predicting MHIs. In addition, comparison experiments with various machine learning and KG-based methods demonstrated that the presented method performs well and is applicable to different microbe-host bioinformatics tasks. We also carried out case studies on gram-positive and gram-negative bacteria to further show that KGVHI has practical applications.
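A common way to apply word2vec to biological sequences is to split each sequence into overlapping k-mer "words" so that a sequence becomes a sentence. The exact tokenization and k value used by KGVHI are not specified here, so this is only a minimal sketch of the general technique:

```python
def kmer_sentence(seq: str, k: int = 3) -> list:
    """Split a protein/DNA sequence into overlapping k-mer 'words',
    turning the sequence into one 'sentence' for a word2vec-style model."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sentence = kmer_sentence("MKVLAT", k=3)
assert sentence == ["MKV", "KVL", "VLA", "LAT"]

# With a corpus of such sentences, a word2vec model (e.g. gensim's
# models.Word2Vec) would learn a dense embedding per k-mer, and a
# sequence embedding can be obtained by averaging its k-mer vectors.
```

Averaging k-mer vectors is one simple pooling choice; the paper's pooling strategy may differ.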
Although KGVHI shows strong predictive performance, several challenges remain. For example, the negative samples are selected by a dissimilarity-based negative sampling technique, which may introduce noise that limits predictive ability. Another limitation is that the current MHI setting considers only three types of relationships: human-virus, human-bacteria and phage-bacteria, while other possible interactions exist in MHIs. In this regard, we will try to collect more types of MHI pairs to construct a more comprehensive HMN from which KGVHI can capture more expressive biological features. However, as the microbial network becomes more complex, redundancy and noise in the functional information become more severe, which may create new difficulties for the attribute fusion capabilities of prediction models. To address this issue, we intend to explore novel KG methods to provide new insight for downstream prediction tasks, and in the next step we will try to integrate more microbial attributes into the HMN to improve the predictive performance of KGVHI.
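The dissimilarity-based negative sampling mentioned above can be sketched as follows: among all unlabeled microbe-host pairs, those with the lowest similarity scores are taken as negatives. The similarity matrix and thresholding here are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def dissimilar_negatives(sim, positives, n_neg):
    """Pick candidate negative pairs with the lowest similarity scores.

    sim: (n_microbes, n_hosts) similarity matrix.
    positives: set of (i, j) known interacting pairs, never sampled.
    Returns the n_neg most dissimilar unlabeled pairs.
    """
    pairs = [(i, j) for i in range(sim.shape[0])
             for j in range(sim.shape[1]) if (i, j) not in positives]
    pairs.sort(key=lambda ij: sim[ij])   # most dissimilar first
    return pairs[:n_neg]

sim = np.array([[0.9, 0.1],
                [0.4, 0.8]])
negs = dissimilar_negatives(sim, positives={(0, 0)}, n_neg=2)
assert negs == [(0, 1), (1, 0)]
```

The risk noted in the text follows directly: a low similarity score does not guarantee a true non-interaction, so some sampled negatives may be mislabeled, adding label noise.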
KGVHI constructs a novel HMN and uses a novel algorithm to build the negative dataset.
KGVHI integrates global topological structure information with local biological attribute information to predict candidate microbes for target hosts.
A knowledge graph embedding strategy captures the global topological structure information of the HMN.
FUNDING
Science & Technology Fundamental Resources Investigation Program (2022FY101100); National Science Fund for Distinguished Young Scholars of China (62325308); National Natural Science Foundation of China (32170114 and 31770152).
DATA AVAILABILITY
The source code and all data are available at https://github.com/NWUJiePan/KGVHI
Jie Pan is a PhD student at Northwest University. His research interests include bioinformatics, microbiology, machine learning and network analysis.
Zhen Zhang is a PhD student at Northwest University. His research interests include bioinformatics, machine learning and complex networks.
Ying Li, Jiaoyang Yu, Chenyu Li, Shixu Wang and Minghui Zhu are master's students at Northwest University. Their research interests include phage and bacterial whole-genome data analysis and transcriptomics.
Zhuhong You is a professor at Northwestern Polytechnical University. His research interests include data mining, machine learning, computational biology and bioinformatics.
Xuexia Zhang is a professor at North China Pharmaceutical Group and the National Microbial Medicine Engineering & Research Center. Her research interests include microbiology and biochemical pharmacy.
Fengzhi Ren is a professor at North China Pharmaceutical Group and the National Microbial Medicine Engineering & Research Center. Her research interests include microbial genetics and microbial communities.
Yanmei Sun is a professor at Northwest University. Her research interests include host-microbiome interactions and microbial communities.
Shiwei Wang is a professor at Northwest University. His research interests include physiological studies of Pseudomonas aeruginosa and the molecular biology of phages and their applications.