A novel framework for phage-host prediction via logical probability theory and network sparsification

Abstract

Bacterial resistance has emerged as one of the greatest threats to human health, and phages have shown tremendous potential in addressing drug-resistant bacteria by lysing their hosts. The identification of phage–host interactions (PHI) is therefore crucial for treating bacterial infections. Some existing computational methods for predicting PHI perform suboptimally because they draw on only a few types of information. Although some supporting information sources have emerged, the generalizability of models that use them is limited by the small scale of the underlying databases. Additionally, most existing models overlook the sparsity of association data, which also severely impacts their predictive performance. In this study, we propose a dual-view sparse network model (DSPHI) to predict PHI, which leverages logical probability theory and network sparsification. Specifically, we first construct similarity networks from the sequences of phages and hosts, respectively, and then sparsify these networks so that the model focuses on key information during learning, thereby improving prediction efficiency. Next, we use logical probability theory to compute high-order logical information between phages (hosts), which we refer to as mutual information. Subsequently, we connect this information in node form to the sparse phage (host) similarity network, resulting in a phage (host) heterogeneous network that better integrates the two information views, thereby reducing the complexity of model computation and enhancing information aggregation. The hidden features of phages and hosts are then learned through graph learning algorithms. Experimental results demonstrate that mutual information is an effective information source for predicting PHI, and that sparsifying the similarity networks significantly improves the model’s predictive performance.

phage–host interactions, logical probability theory, network sparsification, metagenomic data, graph convolutional network

Introduction

Bacteriophages, or phages, are the most common type of virus and specifically infect bacteria and archaea [1]. Phages recognize specific receptors on the surface of host cells and bind to them, injecting their genetic material into the host. They then exploit the host’s resources to replicate, translate, and assemble new phage particles [2]. Bacterial-viral interactions constitute a complex and multifaceted process, encompassing gene transfer, substance cycling, and microbial evolution, and play a crucial role in maintaining microbial ecological stability [3]. During phage–host interactions (PHI), the phage’s genome and proteome constantly adapt alongside the evolution of the host to ensure efficient infection and replication in specific bacterial or archaeal hosts, without harming other microorganisms or cells [1]. The unique ability of phages to recognize and kill their hosts provides new insights into the treatment of bacterial diseases, and in recent years has offered particular opportunities for the diagnosis and treatment of multidrug-resistant bacteria [4]. Phage therapy has demonstrated significant potential as a replacement for antibiotics, with numerous successful cases of phage-treated bacterial infections reported worldwide [5]. The biological dependence of phages on their hosts for survival makes the study of PHI a primary aspect of phage research [6]. With the advancement of high-throughput sequencing technologies, a large number of viruses have been discovered. Relying solely on laboratory culture methods to study PHI is not only costly and time-consuming but also unable to cover all possible host ranges under experimental conditions. Therefore, there is an urgent need for a rapid and effective computational method for predicting PHI.

To date, several computational methods for phage host prediction have been designed. Existing computational methods are primarily classified into sequence-based methods [7–9], feature-based methods [10–12], and network-based methods. Sequence-based methods are further divided into alignment-based and non-alignment-based methods [13]. Alignment-based methods, such as BLAST [7] and CL4PHI [14], determine the interaction relationship by aligning the gene sequences or protein sequences of phages and hosts; they cannot accurately predict interactions between entities with large sequence differences. Non-alignment-based methods, such as VirHostMatcher [9], WIsH [15], PHP [8], and PHIST [16], predict directly by analyzing the features of genomes or proteomes, but the process is cumbersome and consumes a lot of computational resources. Feature-based learning methods, such as RaFAH [11], vHULK [12], and PHIAF [10], use machine learning or deep learning algorithms to learn the features of phages and hosts for prediction. While intuitive and interpretable, these methods face challenges in handling high-dimensional sparse data and lack end-to-end learning. Network-based methods, such as HostG [13], CHERRY [17], and GSPHI [18], are end-to-end approaches that place phages and hosts into a heterogeneous graph for feature embedding and prediction, leading to stronger generalization ability.

Although existing methods have achieved some success in predicting PHI, several issues remain unaddressed. The first is the high sparsity of the dataset, caused by the specificity that phages exhibit when infecting their hosts. To date, there are approximately 34 000 verified and publicly available virus–host interaction relationships [19]. The Virus-Host DB [19] is currently the largest database of PHI, involving approximately 4770 host species and 28 780 viruses. However, this number is small compared with the number of viruses and bacteria that have been publicly disclosed, so using computational methods to explore interactions is both suitable and necessary [20]. Because the number of validated PHI is limited, computational results may include false positives; nevertheless, they are quite effective in narrowing the search scope and exploring potential associations [20, 21]. Moreover, among these associations, |$99.99\%$| of phages have a host range of no more than 5, and over |$92\%$| of hosts correspond to no more than 10 phages. This structure indicates the great potential of phages in specifically recognizing and lysing drug-resistant bacteria, and it underscores the importance of accounting for the high sparsity of the data when predicting PHI relationships.

Secondly, the currently available effective information is limited, and the scale of the information database is relatively small. Currently available sources of information mainly include abundance information [21–23], sequence information [21, 23], quorum sensing system information [24], and association information [25]. Sequence information includes gene sequences, protein sequences, receptors, protein-binding receptors, and |$CRISPR$| sequences [2, 6], which are the most widely used and recognized type of biological information [20]. However, among all existing information sources, databases for all other types of information except gene sequences are relatively small. For example, the latest publicly available database for protein-binding receptors contains only 1230 entries [26], and the largest publicly available database for protein receptor information contains information for just over 400 receptors from 22 bacterial species [27]. It is essential to study the interaction mechanisms between phages and hosts and develop new, more universally applicable, and effective information. We aim to uncover an effective method that involves more comprehensive phage information to improve the accuracy of PHI prediction tasks.

To address the aforementioned challenges, we implemented two solutions. First, to account for the high sparsity of the data structure during modeling, we performed a sparsification operation on the fully connected network during its construction. Second, we extracted a new type of valuable information from metagenomic data, referred to as higher-order logic information (also known as higher-order mutual information). A large number of experiments have shown that interactions between phages exist, mostly occurring within the host [28]. Some phages help other phages evade host immunity, while enzymes produced by some phages help others accelerate the process of lysing the host, driving these types of phages to appear together [29]. Considering the parasitic survival mode of phages, we assume that phages that share a host will appear together in the environment where the host exists. Based on this assumption, we aim to extract the information carried by this co-occurrence phenomenon from metagenomic data. Specifically, we annotate metagenomic data to obtain the abundance information of phages and hosts, and then determine the presence status of these entities in each sample environment, denoting presence as 1 and absence as 0. Subsequently, we treat the status information of phages and hosts in each environment as random variables. We use a probabilistic logic approach to calculate the logical probability scores between these random variables to measure the logical relationships between the corresponding entities.

DSPHI integrates high-order mutual information, sequence information, and association information, while sparsifying the network to create a dual-view heterogeneous network. DSPHI has several advantages: firstly, mutual information, as an effective information source, is derived from metagenomic data, enhancing its universality. Secondly, the sparsification of DSPHI is designed based on the sparsity of association networks, significantly reducing redundancy and noise while improving the model’s runtime and space efficiency. Thirdly, by incorporating high-order mutual information as nodes in the network, DSPHI reduces model complexity compared to hypergraph networks, enhancing information aggregation capabilities. Lastly, DSPHI includes a mature framework for annotating metagenomic data and extends the database for Kraken2, making it more accessible to researchers in need.

Materials and methods

Dataset

The interactions between phages and hosts were obtained from the Phage-Host Daily (PHD) database (https://phdaily.info/report) [30]. PHD integrates data from various databases, including NCBI (https://www.ncbi.nlm.nih.gov/genomes/), GenBank [31], Virus-Host DB, RefSeq [32], and MVP [33], compiling comprehensive information on PHI. As of 2023, the database contains 16 890 interaction pairs, with |$96.24\%$| of viruses infecting only one host.

Metagenomic data were sourced from NCBI, specifically from the 2nd Diabetes Gut Microbiome of Chinese individuals [34], comprising 370 individuals’ gut metagenomic data (PRJNA422434). The data can be downloaded from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422434.

Identification of phages from metagenomic data involves various tools, such as VirSorter2 [35], VirusSeeker [36], DeepVirFinder [37], Kraken2 [38], and MetaPhinder [39], among others. In our analysis, we opted for Kraken2 [38, 40], KneadData [41], and Bracken [42] due to their superior performance in metagenomic data analysis.

We used KneadData for quality control of raw metagenomic data (using Trimmomatic [43]) and removal of host information (using Bowtie2 [44]). Trimmomatic filters out adapter sequences and low-quality reads. Bowtie2 is a sequence alignment tool suitable for aligning sequencing reads to a long reference genome. After processing with KneadData, the data no longer contain host sequences and can be used for species annotation. FastQC was run before and after KneadData to assess the effectiveness of the quality control. OTU tables were generated based on sequence similarity clustering. Then, we used Kraken2 for species classification of the sequencing data. Kraken2 performs taxonomic classification using k-mers. Finally, we used Bracken to estimate the abundance of each taxon, obtaining abundance data for phages and bacteria. To reduce annotation errors and improve computational efficiency, species with abundance values below 10 were filtered out. During this process, we used the standard database provided by Kraken2, which is approximately 50 GB in size and contains approximately 96 000 species of bacteria, 18 600 species of viruses, and other microorganisms such as archaea.

We compared this database with the PHD database and found that 9589 virus species present in the PHD were not retrieved in the Kraken2 database. We downloaded the gene sequence files and taxonomic information for these 9589 virus species from NCBI and expanded the Kraken2 database with these data in the format required by Kraken2. After expanding the database, we performed species classification and abundance calculation.

After thorough processing, we obtained abundance information at the species level for both phages and bacteria. From the metagenomic data, species-level abundance information for 8688 bacteria and 2969 phages was obtained.

We selected interaction pairs that appeared simultaneously in both the PHD and the annotated abundance databases to construct our required PHI database. This database contains 1909 viruses, 278 hosts, and 2002 PHI pairs. Notably, |$97\%$| of phages were found to infect only one bacterial species, and |$76\%$| of bacterial hosts had interactions with five or fewer phages. Finally, based on the acquired interactions, we obtained complete genomes of phages and hosts from NCBI and Virus-Host DB for sequence feature learning.

Phage–host associations

To better construct a heterogeneous network, we transformed the phage–host association information into an adjacency matrix |$A_{n\times m}$|. Here |$n$| represents the number of phages, and |$m$| represents the number of hosts, so that rows index phages and columns index hosts. The values in the adjacency matrix are binary. Binary information is easier to process and optimize in heterogeneous networks, facilitating more efficient information transmission. If phage |$i$| can infect bacterial host |$j$|, then |$A_{ij}=1$|; otherwise, |$A_{ij}=0$|.
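As a minimal sketch of this step, the binary association matrix can be filled from a list of known phage–host pairs. The function name and the toy identifiers are illustrative assumptions and do not come from the PHD database; rows index phages and columns index hosts, following the |$A_{ij}$| convention.

```python
def build_association_matrix(pairs, phages, hosts):
    """Return a binary phage-by-host matrix as a list of lists."""
    phage_index = {name: i for i, name in enumerate(phages)}
    host_index = {name: j for j, name in enumerate(hosts)}
    matrix = [[0] * len(hosts) for _ in phages]
    for phage, host in pairs:
        # A[i][j] = 1 when phage i is known to infect host j.
        matrix[phage_index[phage]][host_index[host]] = 1
    return matrix

# Toy identifiers, purely illustrative.
phages = ["phage_a", "phage_b", "phage_c"]
hosts = ["host_x", "host_y"]
pairs = [("phage_a", "host_x"), ("phage_b", "host_x"), ("phage_c", "host_y")]
A = build_association_matrix(pairs, phages, hosts)
```

Every pair that is not listed stays at 0, which is why the resulting matrix is as sparse as the verified association data.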

Higher-order mutual information

Bowers [45] used a probabilistic logic-based method to infer second-order logical relationships between proteins, achieving good results. We applied this method to calculate the logical relationships between species, as shown in Fig. 1. To conduct uncertain reasoning, we converted abundance data into binary data, where 0 indicates the absence of a species in a sample and 1 indicates its presence, obtaining the states of all microorganisms in each sample.

Figure 1

Mutual information acquisition process. (a) The process of obtaining species abundance tables of phages and bacteria from metagenomic data is shown. (b) The process of mining mutual information by means of probabilistic logic method according to the state of species in each sample.


Information entropy is used in information theory to represent the uncertainty or information content of a random variable. According to the definition of entropy, the larger the entropy, the greater the uncertainty of the random variable. Similarly, joint entropy is a measure of uncertainty of the joint distribution of two random variables.

For any two microorganisms |$A$| and |$B$|, let |$A=\{x_{1},x_{2},...,x_{n}\}$| and |$B=\{y_{1},y_{2},...,y_{n}\}$| denote the states of the two species in the different samples, and let |$\{P(x_{1}),P(x_{2}),...,P(x_{n})\}$| and |$\{P(y_{1}),P(y_{2}),...,P(y_{n})\}$| denote the probabilities of these states, where |$n=370$|. The entropy of microorganism |$A$| is represented as follows:

$$ \begin{align}& f(A)=-\sum_{i}P(x_{i})\log P(x_{i})\end{align} $$

(1)

The joint entropy of microorganisms A and B is represented as follows:

$$ \begin{align}& f(A,B)=-\sum_{i,j}P(x_{i},y_{j})\log P(x_{i},y_{j})\end{align} $$

(2)

The uncertainty coefficient |$U$| is used to represent the probability of logical relationships:

$$ \begin{align}& U(A|B)=\frac{f(A)+f(B)-f(A,B)}{f(A)}\end{align} $$

(3)

|$U(A|B)$| represents how much knowing the state of |$B$| reduces the uncertainty about |$A$|. The logical relationship between the two is called first-order logic, denoted as |$U(A|F_{1}(B))$|, where |$F_{1}(B)$| represents the two possible states of |$B$|: presence and absence, i.e. |$U(A|B)$| and |$U(A|\neg B)$|. The logical relationship between three entities is called second-order logic or higher-order logic.
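The quantities in Equations (1) to (3) can be sketched on binary presence vectors as follows. This is an illustration under stated assumptions, not the authors' code: probabilities are estimated as empirical frequencies over samples, the natural logarithm is used, the uncertainty coefficient is taken in its standard normalized form (f(A) + f(B) - f(A,B)) / f(A), and the presence threshold in `binarize` is a hypothetical parameter.

```python
import math
from collections import Counter

def binarize(abundances, threshold=0):
    """Map abundance values to presence (1) / absence (0); threshold is an assumption."""
    return [1 if value > threshold else 0 for value in abundances]

def entropy(states):
    """Empirical Shannon entropy (natural log) of a sequence of discrete states."""
    total = len(states)
    return -sum((c / total) * math.log(c / total) for c in Counter(states).values())

def joint_entropy(states_a, states_b):
    """Entropy of the paired states observed sample by sample."""
    return entropy(list(zip(states_a, states_b)))

def uncertainty_coefficient(states_a, states_b):
    """U(A|B) = (f(A) + f(B) - f(A,B)) / f(A); returns 0.0 when A never varies."""
    f_a = entropy(states_a)
    if f_a == 0.0:
        return 0.0
    return (f_a + entropy(states_b) - joint_entropy(states_a, states_b)) / f_a

a = binarize([12, 0, 30, 0])      # -> [1, 0, 1, 0]
b = [1, 1, 0, 0]                  # presence pattern independent of a
u_same = uncertainty_coefficient(a, a)
u_indep = uncertainty_coefficient(a, b)
```

With this normalization the score lies between 0 (knowing B says nothing about A) and 1 (B fully determines A), which is what makes it usable as a probability-like score for a logical relationship.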

The logical relationship between three entities is represented as follows:

$$ \begin{align}& U(C|F_{2}(A,B))=\frac{f(C)+f(F_{2}(A,B))-f(C,F_{2}(A,B))}{f(C)}\end{align} $$

(4)

Here, |$F_{2}(\cdot )$| represents the combination of different states of |$A$| and |$B$|⁠, resulting in eight possible combinations. As shown in Fig. 2b, a more detailed introduction is presented in Appendix 4.
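A second-order score can be sketched by first combining the states of |$A$| and |$B$| with a logic function and then scoring |$C$| against the combined variable. The sketch below uses only the OR combination ("if A appears or B appears, then C appears") as an example; the other seven combinations are not reproduced here, and the function names and toy vectors are assumptions. The normalized form of the uncertainty coefficient is assumed.

```python
import math
from collections import Counter

def entropy(states):
    """Empirical Shannon entropy (natural log) of a sequence of discrete states."""
    total = len(states)
    return -sum((c / total) * math.log(c / total) for c in Counter(states).values())

def second_order_u(c, a, b, logic):
    """U(C | F2(A,B)) with F2 given by `logic`, applied sample by sample."""
    combined = [logic(x, y) for x, y in zip(a, b)]
    f_c = entropy(c)
    if f_c == 0.0:
        return 0.0
    f_joint = entropy(list(zip(c, combined)))
    return (f_c + entropy(combined) - f_joint) / f_c

def logic_or(x, y):
    return int(x or y)

# Toy presence vectors: c is present exactly when a or b is present.
a = [1, 0, 0, 1, 0, 0]
b = [0, 1, 0, 0, 1, 0]
c = [1, 1, 0, 1, 1, 0]
u_or = second_order_u(c, a, b, logic_or)
u_a_only = second_order_u(c, a, b, lambda x, y: x)   # ignores b: first-order U(C|A)
```

In this toy case the pair explains |$C$| completely while either phage alone explains it only partly, which is the kind of triplet a higher-order analysis is meant to surface.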

Figure 2

Quality analysis of data results corresponding to eight logical relationship types. (a) shows the number of logical relationships obtained and the proportion of co-hosts among phages in the triad corresponding to each type, (b) shows the proportion of phages covered by the information corresponding to each logical type and the proportion of co-hosts.


According to the designed computational process, we obtained eight types of mutual information. To evaluate the quality of the different types, we designed several metrics in addition to the quantity of information. Firstly, considering co-infections on a single host, each piece of mutual information contains three phages, and we use 'co_1', 'co_2', and 'co_3' to denote the cases in which these phages infect the same host. Secondly, we used 'co-type' to indicate the proportion of mutual information with shared hosts among the total of each type, and 'cov-phage' to indicate the proportion of phages in mutual information with shared hosts among the total phages. The specific experimental results are shown in Fig. 2. They indicate that the total amount of mutual information for Types 2, 3, and 6 is significantly larger than for the other types, that the 'co-type' ratio for Types 1, 3, 5, and 7 is significantly larger than for the other types, and that Types 2 and 3 cover a larger number of phages than the other types.

Although we explored first-order logical information and second-order logical information separately in the experiment, obtaining 24 000 pieces of first-order logical information, the experimental analysis revealed that only |$27\%$| of the information pertained to cases with shared hosts, with the number of phages involved being less than |$30\%$|⁠. The experimental results were not ideal, likely due to excessive redundancy in the information and low information quality. This may be attributed to the lack of rigor in the formula chosen for calculating first-order logical relationships. This study primarily focuses on the analysis of second-order logical information.

Sparse similar networks

In the task of predicting PHI, following a basic assumption, phages that are more similar are more likely to infect similar bacterial hosts. For example, sequences of Acinetobacter phages belonging to the same genus are often very similar, and most of them infect |$Acinetobacter baumannii$|⁠, or infect |$Acinetobacter calcoaceticus$| and |$Acinetobacter pittii$|⁠, which are similar to A. baumannii [30].

Under this assumption, we construct similarity networks for phages and hosts, respectively, by calculating the sequence similarity between phages and between hosts. We decomposed the complete genomes of phages and hosts into subsequences of length |$k$| and calculated the frequency of each k-mer. This generated a k-mer frequency feature vector for each phage and host. The connection between any two phages (hosts) was then obtained by calculating the cosine similarity between their k-mer frequency features. The k-mer frequency features provide rich information about the genome sequence and are particularly suitable for extracting global information from genomes in cases where precise annotation or full-length alignment is unavailable. The formula is as follows:

$$ \begin{align}& S_{ij}=\frac{\overrightarrow{p_{i}}\cdot\overrightarrow{p_{j}}}{\|\overrightarrow{p_{i}}\|\,\|\overrightarrow{p_{j}}\|}\end{align} $$

(5)

|$\overrightarrow{p_{i}}$| and |$\overrightarrow{p_{j}}$| represent the k-mer frequency feature vectors of phages |$p_{i}$| and |$p_{j}$|, respectively. |$\|\cdot \|$| represents the |$\ell _{2}$| norm of the vector.
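The k-mer frequency features and the cosine similarity of Equation (5) can be sketched as follows. The value k = 3 and the short toy sequences are illustrative assumptions; the text does not state the k used in the experiments, and real inputs would be complete genomes.

```python
import math
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequency of every overlapping k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity of two sparse frequency vectors stored as dicts."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)

# Toy sequences, purely illustrative.
p1 = kmer_frequencies("ACGTACGTAC", 3)
p2 = kmer_frequencies("ACGTACGAAC", 3)
s_self = cosine_similarity(p1, p1)
s_pair = cosine_similarity(p1, p2)
```

Because frequencies are non-negative, the similarity always falls in [0, 1], so every pair of genomes receives some weight and the raw network is fully connected.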

In network-based methods, the similarity network is constructed as a fully connected network, which contradicts the sparsity of verified association information. This phenomenon implies that the similarity network contains a large amount of redundant information. To eliminate the influence of redundant information, we set parameters |$simv$| and |$simh$| to process the similarity values of phages and hosts, respectively. In the processed similarity network, only the nearest neighbors with high similarity to the phages or hosts are retained, and the feature learning process focuses only on the locally important relationships.
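One plausible reading of this step is a simple cut-off on the similarity values. The sketch below assumes that |$simv$| (or |$simh$|) acts as a threshold below which an edge is dropped; the text does not spell out the exact rule, so the thresholding, the function names, and the toy matrix are all assumptions.

```python
def sparsify(similarity, threshold):
    """Keep similarity entries at or above the threshold; set the rest to 0."""
    return [[value if value >= threshold else 0.0 for value in row]
            for row in similarity]

def density(matrix):
    """Fraction of nonzero entries, a quick check of how sparse the network is."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value != 0.0) / len(cells)

# Toy phage similarity matrix and a hypothetical threshold value.
S = [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.4],
     [0.2, 0.4, 1.0]]
simv = 0.5
S_sparse = sparsify(S, simv)
```

The diagonal survives any threshold up to 1 because self-similarity equals 1; a top-k neighbor rule would be an equally reasonable alternative if a fixed node degree were preferred.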

Heterogeneous network integration of mutual information and sequence information

Mutual information represents a second-order logical relationship, while sequence information represents a first-order relationship. Among existing methods, only hypergraph neural networks can integrate first-order and second-order relationships. Examples include hypergraph convolutional neural networks and hypergraph attention neural networks. However, these hypergraph neural networks are computationally expensive and have limitations in interpretability and capturing local information. To better integrate multiple information sources, we propose to integrate mutual information as nodes with entity information, and then combine them with sequence information for modeling.

The eight types of second-order logical relationships each correspond to a specific mutual information measure. We conducted a quality analysis for each type, and the results showed that the mutual information corresponding to Type 3 exhibited the best quality, as shown in Fig. 2.

Although the quality of Type 3 information is the best, it still contains some redundant information. We used the sequence similarity of phages to filter the data. According to the definition of Type 3, if phage A appears or phage B appears, then phage C appears. We calculated the sequence similarity between phages A, B, and C. If the similarity value was below a certain threshold, we discarded the mutual information. After filtering, we obtained 9826 pieces of mutual information, of which |$96.68\%$| involved shared hosts.

After filtering, we found that the number of pieces of information far exceeded the number of phages. This was because some phages appeared frequently in the mutual information. To avoid imbalance issues during feature learning, we merged the mutual information. The merging criterion was: if two pieces of mutual information involved the same phages A and B, then the two pieces of information were merged. After merging, we obtained 2775 pieces of information.

We created a virtual node for each piece of mutual information, and this virtual node was connected to the nodes representing the phages (or hosts) involved in the mutual information. Each piece of mutual information corresponded to a star network, where the central node of the network is a virtual node. Finally, we merged the constructed sparse similarity network with each star network to obtain a heterogeneous network that integrates mutual information and sequence information.
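The merging rule and the star networks can be sketched as follows. The sketch treats the pair (A, B) as unordered, which is an assumption that is natural for an OR-type relation, and it uses a single hypothetical edge weight for all edges of a star; phage identifiers are toy values.

```python
def merge_triples(triples):
    """Group (a, b, c) triples that share the same antecedent pair {a, b}."""
    merged = {}
    for a, b, c in triples:
        key = tuple(sorted((a, b)))
        merged.setdefault(key, set()).update((a, b, c))
    return merged

def star_incidence(merged, phages, weight=1.0):
    """One row per virtual node, one column per phage; nonzero = star edge."""
    index = {name: i for i, name in enumerate(phages)}
    rows = []
    for key in sorted(merged):
        row = [0.0] * len(phages)
        for member in merged[key]:
            row[index[member]] = weight
        rows.append(row)
    return rows

phages = ["p1", "p2", "p3", "p4"]
triples = [("p1", "p2", "p3"), ("p2", "p1", "p4"), ("p1", "p3", "p4")]
merged = merge_triples(triples)
B_P = star_incidence(merged, phages, weight=0.5)
```

Each row of the resulting matrix is one star: the virtual node is the center and the nonzero columns are its phage leaves, so a second-order relation is expressed with ordinary pairwise edges.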

Feature learning

Figure 3 illustrates the workflow of DSPHI. First, we obtain the phage–host association information from the constructed database. We use metagenomic analysis methods and probabilistic logic to obtain the mutual information of PHI. Then, combining the cosine similarity matrices of phages and hosts, we construct a heterogeneous network that includes four types of nodes and four types of edges. Next, we use an attention graph convolutional algorithm to perform graph embedding representation learning on this heterogeneous network. Finally, we decode the mapping operator of the reconstructed phage–host association matrix to obtain the predicted relationships.

Figure 3

The overall framework of DSPHI. (A) Data preprocessing module: calculating high-order mutual information and similarity measures between phages (or hosts). (B) Network construction module: integrating sequence information and high-order mutual information into a network and sparsifying the network. (C) Feature learning module: learning hidden feature representations using a graph convolutional algorithm. (D) PHI prediction module: reconstructing the PHI association matrix.


Graph convolutional networks (GCNs) fuse local and global information by learning representations of nodes and their neighbors, generating high-quality node embeddings. The information propagation rule between layers is

$$ \begin{align}& Z^{(l+1)}=f(Z^{(l)},G)=\sigma(D^{-\frac{1}{2}}GD^{-\frac{1}{2}}Z^{(l)}W^{(l)})\end{align} $$

(6)

|$G$| represents an adjacency matrix composed of relationships between various entities, and |$D$| is the degree matrix of |$G$|. |$Z^{(l)}$| is the feature matrix at layer |$l$|, |$\sigma (\cdot )$| is the nonlinear activation function, and |$W^{(l)}$| is a trainable parameter matrix.

The adjacency matrix of the constructed graph |$G$| is expressed as follows:

$$ \begin{align}& G=\left( \begin{array}{cccc} P & A & B_{P}^{T} & 0 \\ A^{T} & H& 0 & B_{H}^{T} \\ B_{P} & 0 & E & 0 \\ 0 & B_{H} & 0 & E \\ \end{array} \right)\end{align} $$

(7)

|$P$| denotes the cosine similarity of the phage’s DNA sequence, while |$H$| denotes the cosine similarity of the host bacterium’s DNA sequence. |$A$| represents known association information, |$B_{P}$| represents the adjacency matrix of the star-shaped network corresponding to the mutual information of the phage, where the edge weights in the adjacency matrix are regulated by the set parameter |$simw$|⁠. Similarly, |$B_{H}$| represents the adjacency matrix of the star-shaped network corresponding to the mutual information of the host, where the edge weights in the adjacency matrix are regulated by the set parameter |$simz$|⁠.
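The block structure of Equation (7) and one propagation step of Equation (6) can be sketched with NumPy. Two points are assumptions: |$E$| is taken to be an identity matrix over the virtual nodes (the text does not define it), and ReLU is used as the activation. All matrices are toy values.

```python
import numpy as np

def assemble_graph(P, H, A, B_P, B_H):
    """Stack the blocks of Equation (7) into one square adjacency matrix."""
    n_p, n_h = A.shape
    v_p, v_h = B_P.shape[0], B_H.shape[0]
    return np.block([
        [P, A, B_P.T, np.zeros((n_p, v_h))],
        [A.T, H, np.zeros((n_h, v_p)), B_H.T],
        [B_P, np.zeros((v_p, n_h)), np.eye(v_p), np.zeros((v_p, v_h))],
        [np.zeros((v_h, n_p)), B_H, np.zeros((v_h, v_p)), np.eye(v_h)],
    ])

def gcn_layer(G, Z, W):
    """One propagation step: ReLU(D^{-1/2} G D^{-1/2} Z W)."""
    degree = G.sum(axis=1).astype(float)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    inv_sqrt[nonzero] = degree[nonzero] ** -0.5
    G_norm = inv_sqrt[:, None] * G * inv_sqrt[None, :]
    return np.maximum(G_norm @ Z @ W, 0.0)

# Toy inputs: 3 phages, 2 hosts, 1 virtual node per view.
P = np.eye(3)
H = np.eye(2)
A = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
B_P = np.array([[0.5, 0.5, 0.0]])
B_H = np.array([[0.5, 0.5]])
G = assemble_graph(P, H, A, B_P, B_H)
Z1 = gcn_layer(G, np.eye(G.shape[0]), np.ones((G.shape[0], 4)))
```

Because the virtual nodes are ordinary rows and columns of |$G$|, a standard graph convolution is enough to mix sequence similarity, known associations, and mutual information in one pass.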

The edges in the constructed network are derived from feature information. To avoid the duplication of information and to better learn the topology of the heterogeneous network, we set the initial inputs for phages and hosts as follows:

$$ \begin{align}& \left( \begin{array}{cc} 0 & A \\ A^{T} & 0 \\ \end{array} \right)\end{align} $$

(8)

We assess the importance of mutual information to nodes by setting parameters |$simw$| and |$simz$|⁠, using them as the initial edge weights between virtual nodes and phage (host) nodes. The features of the first-order neighbors of virtual nodes are subjected to average pooling, and the result after pooling serves as the initial features for virtual nodes.

$$ \begin{align}& \left( \begin{array}{c} M_{P} \\ M_{H} \\ \end{array} \right)\end{align} $$

(9)
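The initial inputs of Equations (8) and (9) can be sketched as follows. The sketch assumes an unweighted mean over the phages (hosts) attached to each virtual node; whether the edge weights enter the pooling is not stated, and the function names and toy matrices are hypothetical.

```python
import numpy as np

def mean_pool(B, X):
    """Average the feature rows of the nodes attached to each virtual node."""
    member = (B > 0).astype(float)
    counts = member.sum(axis=1, keepdims=True)
    counts[counts == 0] = 1.0          # isolated virtual nodes keep zero features
    return member @ X / counts

def initial_features(A, B_P, B_H):
    """Return (X, M): X is the block matrix of Eq. (8), M stacks M_P over M_H."""
    n_p, n_h = A.shape
    X = np.block([[np.zeros((n_p, n_p)), A],
                  [A.T, np.zeros((n_h, n_h))]])
    M_P = mean_pool(B_P, X[:n_p])      # virtual nodes of the phage view
    M_H = mean_pool(B_H, X[n_p:])      # virtual nodes of the host view
    return X, np.vstack([M_P, M_H])

A = np.array([[1, 0], [0, 1]], dtype=float)
B_P = np.array([[0.5, 0.5]])           # one virtual node linked to both phages
B_H = np.array([[0.3, 0.0]])           # one virtual node linked to host 0 only
X, M = initial_features(A, B_P, B_H)
```

Starting the entity features from the association matrix alone keeps the similarity information in the edges only, so it is not counted twice.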

In the graph convolution process, the information aggregated in each layer varies. We introduce an attention mechanism to assign different weights to each convolutional layer for generating the final embedding representation.

Taking phage |$P_{i}$| as an example, |$Z^{1}_{i}, Z^{2}_{i},\cdots ,Z^{l}_{i}$| represent the output representations of |$P_{i}$| at each layer. Therefore, the final output representation is

$$ \begin{align}& Z_{i}=\sigma\left(\sum^{l}_{j=1}\alpha_{ij}Z^{j}_{i}\ \right)\end{align} $$

(10)
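Equation (10) can be sketched as a weighted sum over layer outputs. How the attention coefficients |$\alpha_{ij}$| are produced is not described here, so the sketch simply assumes raw per-node, per-layer scores that are normalized with a softmax, and it uses ReLU for |$\sigma$|; both choices are assumptions.

```python
import numpy as np

def combine_layers(layer_outputs, scores):
    """Mix per-layer embeddings with softmax-normalized attention scores.

    layer_outputs: list of L arrays of shape (nodes, dim)
    scores:        array of shape (nodes, L) with unnormalized attention scores
    """
    Z = np.stack(layer_outputs, axis=1)                  # (nodes, L, dim)
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(shifted)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)      # (nodes, L)
    mixed = (alpha[:, :, None] * Z).sum(axis=1)           # (nodes, dim)
    return np.maximum(mixed, 0.0), alpha

layers = [np.ones((2, 3)), 3.0 * np.ones((2, 3))]
Z_final, alpha = combine_layers(layers, np.zeros((2, 2)))
```

With equal scores every layer receives the same weight and the result reduces to a plain average of the layer outputs; learned scores let shallow and deep neighborhoods contribute unequally.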

To ensure a more objective comparison of the work, we employed the commonly used Adam optimization algorithm for training the model. The main equation for the reconstruction process is

$$ \begin{align}& A^{\prime}=sigmoid(Z_{P}W^{\prime}Z_{H}^{T})\end{align} $$

(11)

Here, |$W^{\prime}$| represents the trainable parameter matrix, |$A^{\prime}$| represents the reconstructed phage–host association matrix, |$Z_{P}$| and |$Z_{H}$| represent the model-generated embedded representations of the phage and host, respectively.
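The reconstruction in Equation (11) is a bilinear decoder followed by a sigmoid. A minimal NumPy sketch with toy shapes and randomly initialized values (illustrative assumptions only) is:

```python
import numpy as np

def decode(Z_P, W, Z_H):
    """Reconstruct association scores: sigmoid(Z_P W Z_H^T)."""
    logits = Z_P @ W @ Z_H.T
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
Z_P = rng.normal(size=(3, 4))    # 3 phages, embedding dimension 4
Z_H = rng.normal(size=(2, 4))    # 2 hosts
W = rng.normal(size=(4, 4))      # trainable in the real model, random here
A_pred = decode(Z_P, W, Z_H)
```

Each entry of the output is a score in (0, 1) for one phage–host pair, so the matrix has the same shape as the association matrix it reconstructs.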

Cross-entropy loss function

The cross-entropy loss function used in this study is particularly suitable for classification tasks. Compared to other loss functions such as mean squared error, the cross-entropy loss function is easier to optimize. Here, the positive instance set and the negative instance set are denoted as |$y^{+}$| and |$y^{-}$|, respectively. The dataset contains |$N_{P}$| types of phages and |$N_{H}$| types of hosts, and the number of positive samples in the prediction task is much smaller than the number of negative samples. To minimize the impact of this imbalance, we added a new parameter |$\gamma $|, whose value is obtained from the ratio of positive to negative samples. The specific formula is as follows:

$$ \begin{align}& L=-\frac{1}{N_{P}\times N_{H}}\left(\gamma\times\sum_{(i,j)\in y^{+}}log\hat{y}^{(ij)}+\sum_{(i,j)\in y^{-}}log(1-\hat{y}^{(ij)})\right)\end{align} $$

(12)
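Equation (12) can be sketched in plain Python. In the example, |$\gamma$| is set to the number of negatives divided by the number of positives so that positives are up-weighted; this is one interpretation of "the ratio of positive to negative samples" and should be read as an assumption, as should the clipping constant `eps`.

```python
import math

def weighted_cross_entropy(y_true, y_pred, gamma, eps=1e-12):
    """Mean over all phage-host cells of the gamma-weighted cross-entropy."""
    total = 0.0
    count = 0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p in zip(row_true, row_pred):
            p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
            total += gamma * math.log(p) if t == 1 else math.log(1.0 - p)
            count += 1
    return -total / count

y_true = [[1, 0], [0, 0]]
y_pred = [[0.5, 0.5], [0.5, 0.5]]
n_pos = sum(sum(row) for row in y_true)
n_neg = sum(len(row) for row in y_true) - n_pos
gamma = n_neg / n_pos                         # 3.0 for this toy matrix
loss = weighted_cross_entropy(y_true, y_pred, gamma)
```

With this choice the single positive cell contributes as much to the loss as the three negative cells combined.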

Experiment

Experiment settings

This section mainly presents the settings of multiple parameters, baseline models, and evaluation metrics involved in the experimental process. The biological characteristic of phages in specifically recognizing their hosts leads to a significant class imbalance in the dataset, where the number of positive samples is far smaller than that of negative samples. In our experiments, we randomly divided the positive samples into five equal parts. Four parts were combined with samples labeled as 0 (negative samples) to form the training set. The remaining positive samples, together with the samples labeled as 0, constituted the testing set for model evaluation. Additionally, to further enhance the evaluation of the model, we employed a downsampling approach to balance the number of positive and negative samples and compared the results with those of the baseline models.
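The split of positive samples can be sketched as follows. The seed, the helper names, and the toy matrix are assumptions; the sketch only shows how held-out positives are hidden from the training matrix while the remaining entries are kept.

```python
import random

def split_positive_folds(positive_pairs, n_folds=5, seed=0):
    """Shuffle the positive (phage, host) index pairs and deal them into folds."""
    pairs = list(positive_pairs)
    random.Random(seed).shuffle(pairs)
    return [pairs[i::n_folds] for i in range(n_folds)]

def training_matrix(A, held_out):
    """Copy A and set the held-out positive entries to 0 for training."""
    masked = [row[:] for row in A]
    for i, j in held_out:
        masked[i][j] = 0
    return masked

A = [[1, 0], [1, 0], [0, 1], [1, 1], [0, 1]]
positives = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v == 1]
folds = split_positive_folds(positives)
train_A = training_matrix(A, folds[0])   # fold 0 is the test fold in this round
```

Repeating this with each fold held out in turn gives the five rounds whose results are averaged.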

Hyper-parameter setting

For the hyperparameters involved in the experiments, we used grid search, which is comprehensive and intuitive. The search range of the embedding dimension |$k$| is {32, 64, 128, 256, 512}, and that of the number of graph convolution layers |$L$| is {1, 2, 3, 4, 5, 6}. The search range of the optimizer’s initial learning rate |$lr$| is {0.1, 0.01, 0.001}. The final parameter settings are shown in Table 1.

The ranges for the edge weights |$simw$| and |$simz$| between nodes representing mutual information and phage (host) nodes are both from 0.1 to 2. The training epoch selection range is {500, 1000, 2000, 3000, 4000}. The relevant experimental procedures are shown in the appendix.

Table 1

The parameters of DSPHI

Structure	Parameters
Layer	Amount (L): 3
Dimension	Units (k): 128
Activation function	ReLU
Node dropout	0.4
Edge dropout	0.7
Loss function	Cross-entropy loss function
Optimizer	Adam
Epoch	4000
Initial learning rate	0.01

Baselines and evaluation metrics

To objectively evaluate the performance of the DSPHI model, we selected several state-of-the-art baseline prediction models from the last three years and conducted experiments on the same dataset.

CL4PHI [14]: a contrastive learning-based method for PHI prediction, focusing on identifying phage-host relationships at the species and genus levels.

GCNAT [46]: a novel deep learning algorithm that combines graph convolutional networks with graph attention mechanisms for predicting metabolic and disease-related correlations.

PHIAF [10]: a classical model utilizing convolutional neural networks that incorporates data augmentation with generative adversarial networks and combines sequence features for PHI prediction.

PHP [8]: a Gaussian model for predicting prokaryotic viral hosts in genomics, based on the non-homology method, representing a classic model in the field.

PredPHI [47]: a model exclusively using protein sequences as features and combining them with a CNN model for PHI prediction.

MKGCN [48]: a deep learning model based on graph convolutional neural networks that leverages multiple kernel fusion to infer microbiota-drug correlations.

In the experiments, a five-fold cross-validation method was utilized for model training and prediction. The dataset was divided into five subsets, with any four subsets used as training sets and the remaining one as the test set. Five sets of predictions were obtained, and the average of these five results was considered as the final prediction result. To evaluate the model performance comprehensively, the numbers of true positive (⁠|$TP$|⁠) and true negative (⁠|$TN$|⁠) samples, as well as false negative (⁠|$FN$|⁠) and false positive (⁠|$FP$|⁠) samples, were first calculated. Based on these four basic statistics, three metrics were computed: area under the receiver operating characteristic curve (⁠|$AUC$|⁠), area under the precision-recall curve (⁠|$AUPR$|⁠), and accuracy (⁠|$ACC$|⁠). This protocol was consistent with the standards commonly adopted in professional journals.
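The three metrics named above can be computed from predicted scores as in the sketch below. This is an illustration in plain Python rather than the authors' evaluation code. Two choices are assumptions: AUPR is estimated with average precision (one common estimator of the area under the precision-recall curve), and ACC uses a decision threshold of 0.5. The labels and scores are toy values.

```python
def accuracy(labels, scores, threshold=0.5):
    """ACC = (TP + TN) / (TP + TN + FP + FN) at a fixed threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return (tp + tn) / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample.

    Tied scores are ordered arbitrarily by the stable sort.
    """
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example: 2 positives among 8 samples.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1]

acc = accuracy(labels, scores)            # 7 of 8 correct
auc = roc_auc(labels, scores)             # 11 of 12 positive-negative pairs ordered correctly
aupr = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

On imbalanced data such as this, ACC is dominated by the negative class, which is one reason AUPR is reported alongside AUC and ACC. In a five-fold setting each metric would be computed per fold and then averaged.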

Comparison of predictive performance

All of the models above were evaluated using five-fold cross-validation, with AUC and ACC as the common evaluation metrics.

As shown in Fig. 4, the DSPHI model clearly outperformed the other models on both metrics.

Figure 4

The performance of DSPHI and other methods using five-fold cross-validation.

For a more comprehensive assessment of model performance, experiments were conducted at the genus level for all models. The results indicate that, compared to the baseline models, the DSPHI model exhibits a clear advantage at both the species and genus levels, as shown in Fig. 5.

Figure 5

The AUC of DSPHI and other methods at the genus and species levels.

Finally, we performed paired sample t-tests to compare the AUPR values of DSPHI with other tools, and the results indicated that DSPHI achieved statistically significant improvements over the competing methods. Detailed experimental results are provided in the appendix.
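For readers unfamiliar with the paired-sample t-test mentioned above, the statistic can be computed as in the sketch below. The per-fold AUPR values are hypothetical numbers chosen only to make the example run; they are not results from the paper. The function name `paired_t_statistic` is also an assumption. The sketch returns the t statistic and degrees of freedom; a p-value would normally come from a statistics library such as SciPy (`scipy.stats.ttest_rel`).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Hypothetical per-fold AUPR values for two methods (illustration only).
aupr_method_a = [0.36, 0.37, 0.35, 0.38, 0.36]
aupr_method_b = [0.30, 0.32, 0.31, 0.31, 0.30]

t_value, dof = paired_t_statistic(aupr_method_a, aupr_method_b)

# Two-sided critical value of the t distribution for 4 degrees of freedom
# at the 0.05 level.
T_CRIT_DF4 = 2.776
is_significant = abs(t_value) > T_CRIT_DF4
```

Pairing by fold matters because both methods are evaluated on the same splits, so fold-to-fold variation cancels in the differences.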

Ablation experiment and analysis

In addition to tuning the basic parameters of the model, there are several ways to improve the model by adjusting hyperparameters and improving the input. To better validate the effectiveness of these improvements, we conducted the following experiments.

The impact of network sparsification

First, we sparsified the constructed phage similarity network and host similarity network, controlled by the hyperparameters |$simv$| and |$simh$|, respectively. As shown in Fig. 6, all performance metrics improve as these two parameters increase. However, when the networks become too sparse, the results are no longer optimal. This indicates that sparsification is effective and necessary for the PHI prediction task. The optimal value for |$simh$| was found to be 1, indicating that the sequence information of the host did not have a positive impact on the experimental results. This is why we used an identity matrix to represent the host’s adjacency matrix.
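One simple way to read the sparsification step is as thresholding: similarity entries below the hyperparameter are set to zero. The sketch below assumes exactly that rule, with similarities in [0, 1] and self-similarity equal to 1; the sparsification rule actually used by DSPHI is defined in the methods section and may differ. The function name `sparsify` and the toy matrix are hypothetical.

```python
def sparsify(sim, threshold):
    """Keep similarity entries >= threshold and zero out the rest."""
    return [[value if value >= threshold else 0.0 for value in row]
            for row in sim]


# Toy symmetric similarity matrix for three hosts (hypothetical values).
host_sim = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]

moderately_sparse = sparsify(host_sim, 0.6)  # only the 0.8 link survives
fully_sparse = sparsify(host_sim, 1.0)       # only self-similarity survives
```

Under this assumption, a threshold of 1 removes every off-diagonal entry, which matches the observation above that the optimal |$simh$| of 1 leaves the host network as an identity matrix.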

Figure 6

The impact of hyperparameters |$simv$| and |$simh$|⁠.

Next, we present detailed results of several variants of DSPHI on three metrics. Here, |$DSPHI$|-|$no$| indicates no sparsification, |$DSPHI$|-|$Bi$| indicates a bipartite network model containing only association relationships, and |$DSPHI$|-|$simh$| and |$DSPHI$|-|$simv$| indicate sparsification of the host similarity network and phage similarity network, respectively. The specific experimental results are shown in Table 2.

Table 2

The prediction results based on network sparsity

DSPHI-sparse network	AUPR	AUC	ACC
DSPHI-no	0.1502	0.8207	0.9093
DSPHI-Bi	0.1407	0.8780	0.9516
DSPHI-simh	0.3197	0.8937	0.9710
DSPHI-simv	0.3368	0.9348	0.9804
DSPHI	0.3643	0.9439	0.9839

The impact of mutual information

This study identified a total of eight types of logical relationships, and experiments were conducted for each type, with results shown in Table 3. The results indicate that data from Type 3 was most helpful for the experiments, consistent with the quality analysis results. Additionally, the information corresponding to Types 5 and 7 showed relatively good results in terms of both the proportion of shared hosts and the coverage of phages, ranking just below the results for Type 3.

Table 3

The prediction results based on each logical relation type

Type	All	co-host(%)	phage coverage(%)	AUC	ACC
Type1	1337	0.7723	0.3662	0.9115	0.9681
Type2	22 753	0.0801	0.6358	0.9039	0.9707
Type3	13 358	0.8818	0.5348	0.9359	0.9710
Type4	209	0.0574	0.0409	0.9153	0.9665
Type5	1295	0.7560	0.2614	0.9214	0.9786
Type6	28 170	0.3436	0.4097	0.9094	0.9672
Type7	2379	0.8548	0.3332	0.9280	0.9753
Type8	692	0.4336	0.0854	0.9193	0.9693

We conducted experiments using Type 3 mutual information. To validate the necessity of each processing step, we designed six variants of DSPHI for ablation experiments: |$DSPHI$|-|$No$| uses no mutual information; |$DSPHI$|-|$Al$| uses all Type 3 mutual information; |$DSPHI$|-|$Cl$| uses the mutual information after purification and filtering; |$DSPHI$|-|$Me$| uses the integrated mutual information; and |$DSPHI$|-|$P$| and |$DSPHI$|-|$H$| use only phage mutual information and only host mutual information, respectively. The experimental results are shown in Table 4. The results indicate that purifying and integrating mutual information are essential. Moreover, phage mutual information has a greater impact on the experimental results, while host mutual information has a relatively small impact. We believe this may be related to the highly specific host recognition of phages, since phages that can infect multiple hosts are rare.

Table 4

The prediction results based on mutual information

DSPHI-mutual	AUPR	AUC	ACC
DSPHI-No	0.3197	0.8937	0.9710
DSPHI-Al	0.3240	0.9359	0.9710
DSPHI-Cl	0.3269	0.9396	0.9782
DSPHI-Me	0.3221	0.9386	0.9810
DSPHI-P	0.3465	0.9417	0.9800
DSPHI-H	0.3121	0.9086	0.9780
DSPHI	0.3643	0.9439	0.9839

In the experiments, we set two hyperparameters, |$simw$| and |$simz$|⁠, for the mutual information of phages and hosts, respectively. These hyperparameters are used to adjust the edge weights between virtual nodes and real nodes. The specific experimental results are shown in Fig. 7. The results indicate that as the value of |$simw$| approaches 1 and the value of |$simz$| approaches 0, the AUC value of the experimental results increases. Other results are presented in the appendix.
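The idea of attaching mutual information to a similarity network as virtual nodes, with a tunable edge weight, can be sketched as follows. This is a minimal illustration under assumptions: each mutual-information relation is represented as a tuple of member node indices, every member is linked to the relation's virtual node with the same weight, and virtual nodes carry no self-loop. The function name `add_virtual_nodes` and the toy data are hypothetical and do not reproduce the authors' construction.

```python
def add_virtual_nodes(adj, relations, weight):
    """Return an enlarged adjacency matrix with one virtual node per relation.

    adj       : n x n adjacency (similarity) matrix of the real nodes
    relations : list of tuples of real-node indices that share a relation
    weight    : edge weight between a virtual node and each of its members
    """
    n = len(adj)
    size = n + len(relations)
    out = [[0.0] * size for _ in range(size)]
    # Copy the original network into the top-left block.
    for i in range(n):
        for j in range(n):
            out[i][j] = float(adj[i][j])
    # Connect each virtual node to its member nodes (undirected edges).
    for r, members in enumerate(relations):
        v = n + r
        for i in members:
            out[i][v] = weight
            out[v][i] = weight
    return out


# Toy sparse phage similarity network and one hypothetical relation that
# links phages 0, 1 and 2.
phage_adj = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
relations = [(0, 1, 2)]

hetero_adj = add_virtual_nodes(phage_adj, relations, weight=1.0)
```

In this representation a relation with s members adds one node and s edges, and the weight plays the role that |$simw$| (phages) or |$simz$| (hosts) plays in the text: setting it close to 0 effectively detaches the virtual nodes from the network.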

Figure 7

The impact of hyperparameters |$simz$| and |$simw$|⁠.

The impact of network construction methods

This section presents the ablation experiment on the network construction method. Because second-order logical relationships are involved, the traditional approach is to construct a hypergraph. To isolate the effect of virtual nodes, we compare DSPHI with the |$DSPHI$|-|$Hyper$| model. The results show that modeling with virtual nodes yields higher AUPR and AUC, although |$DSPHI$|-|$Hyper$| attains a slightly higher ACC. The experimental results are shown in Table 5.

Table 5

Prediction results based on the network construction method

DSPHI-network	AUPR	AUC	ACC
DSPHI-Hyper	0.3295	0.8958	0.9855
DSPHI	0.3643	0.9439	0.9839

The case study

Case study for mutual information

Figure 8

The mutual information of phages with respect to the phages corresponding to the top five bacteria.

Case study for DSPHI

To further assess the performance of the DSPHI model in predicting PHI, we designed a case study. We simulated new phages and new hosts by removing their known association information in turn, treating them as newly discovered, and observing the model’s predictive ability. We selected two phages, |$phage$| |$R18C$| and |$Escherichia$| |$1720a$|-|$02$|, as shown in Table 8, and two hosts, |$Klebsiella pneumoniae$| and |$Escherichia coli$|, as shown in Tables 6 and 7. The host range of |$phage$| |$R18C$| is 3, and that of |$Escherichia$| |$1720a$|-|$02$| is 2. For the phages, we selected the top 7 results with the highest predicted probabilities, and for the hosts, we selected the top 20 predicted results. The results showed that DSPHI demonstrated good predictive performance for both new phages and new hosts, especially in predicting hosts for phages.
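The masking and ranking procedure of this case study can be sketched as follows. The sketch only covers the bookkeeping: hiding one phage's known associations and ranking candidate hosts by predicted probability. The association matrix, host names, and predicted scores are hypothetical toy values, and `mask_phage` and `top_k` are illustrative helper names; in the actual study the scores would come from DSPHI retrained on the masked data.

```python
def mask_phage(assoc, phage_idx):
    """Return a copy of the association matrix with one phage's row zeroed,
    so the phage looks newly discovered to the model."""
    masked = [row[:] for row in assoc]
    masked[phage_idx] = [0] * len(assoc[phage_idx])
    return masked


def top_k(scores, names, k):
    """Return the k names with the highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [names[j] for j in order[:k]]


# Toy association matrix: rows are phages, columns are hosts.
hosts = ["host_A", "host_B", "host_C", "host_D"]
assoc = [
    [1, 0, 1, 0],  # phage 0 infects host_A and host_C
    [0, 1, 0, 0],  # phage 1 infects host_B
]

masked = mask_phage(assoc, phage_idx=0)

# Hypothetical predicted probabilities for phage 0 after masking.
predicted = [0.91, 0.12, 0.47, 0.05]
ranking = top_k(predicted, hosts, k=2)

# Known hosts recovered among the top-k predictions.
known = {hosts[j] for j, value in enumerate(assoc[0]) if value == 1}
recovered = [name for name in ranking if name in known]
```

The same procedure applies to a new host by zeroing a column instead of a row and ranking phages.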

Table 6

The top 20 results of DSPHI’s predictions for |$Klebsiella pneumoniae$|

Top	Phage	PMID	Top	Phage	PMID
1	Klebsiella phage RAD2	34431721	11	Klebsiella phage vB_KpnS ZX4	26008965
2	Klebsiella phage ST11-VIM1phi8.2	31752386	12	|$Taipeivirus menlow$|	31023815
3	Klebsiella phage vB_1086	35985450	13	|$Delmidovirus copri$|	N/A
4	Klebsiella phage vB_KleM KB2	37330608	14	|$Webervirus mezzogao$|	29122857
5	Klebsiella phage vB_KpnM_17-11	35865823	15	|$Drulisvirus altogao$|	29122857
6	|$Drulisvirus minorna$|	N/A	16	Diorhovirus copri	N/A
7	Klebsiella phage vB_KpnP-Bp5	33631221	17	Webervirus sin4	31558644
8	|$Lastavirus sopranogao$|	29122857	18	Efquatrovirus SHEF4	N/A
9	|$Pylasvirus pylas$|	31727721	19	Webervirus sweeny	31558643
10	|$Taipeivirus may$|	31072899	20	|$Yonseivirus seifer$|	31727722

Table 7

The top 20 results of DSPHI’s predictions for |$Escherichia coli$|

Top	Phage	PMID	Top	Phage	PMID
1	|$Escherichia phage$| vB_EcoP-ZX5	36558779	11	|$Escherichia phage$| Schickermooser	31109012
2	|$Escherichia phage$| vB_EcoM-P10	34966369	12	Tequatrovirus T4	26081634
3	|$Escherichia phage$| vB_EcoM_C2-3	34762992	13	Traversvirus II	12813092
4	|$Escherichia coli$| phage phiStx2k	38078984	14	|$Goslarvirus goslar$|	31109012
5	|$Escherichia phage$| vB_EcoS-PJ16	37632591	15	Inovirus M13	5257006
6	|$Escherichia phage$| vB_EcoM-RPN242	35598209	16	Kuttervirus SenALZ1	N/A
7	|$Jahgtovirus intestinalis$|	N/A	17	|$Escherichia phage$| OSYSP	38182094
8	|$Escherichia phage$| TL-2011b	22403614	18	Kayfunavirus SH4	N/A
9	Enterobacteria phage P4	7483254	19	|$Escherichia phage$| vB_EcoM-Alf5	28522702
10	|$Escherichia phage$| SRT7	30762120	20	|$Warwickvirus tunus$|	32899836

Table 8

The top 7 results of DSPHI’s predictions for phage R18C and Escherichia 1720a-02

Phage	Predicted host	Evidence (PMID)
phage R18C	|$Shigella sonnei$|	31641840
	|$Erwinia amylovora$|	N/A
	|$Citrobacter rodentium$|	31641840
	|$Citrobacter koseri$|	N/A
	|$Citrobacter portucalensis$|	N/A
	|$Shigella flexneri$|	N/A
	|$Escherichia coli$|	31641840
Escherichia 1720a-02	|$Citrobacter freundii$|	N/A
	|$Escherichia coli$|	32761142
	|$Citrobacter koseri$|	N/A
	|$Citrobacter rodentium$|	34929548
	Escherichia sp.	N/A
	|$Escherichia fergusonii$|	N/A
	|$Shigella sonnei$|	N/A

Discussion

The widespread use of antibiotics has driven the evolution of antibiotic-resistant bacteria. In contrast, phages, due to their high specificity, can target and infect specific bacterial strains without affecting other microorganisms. Thus, the application of phage therapy has the potential to reduce the over-reliance on antibiotics, thereby lowering the selective pressure for resistant bacterial strains. More importantly, phages co-evolve alongside bacterial evolution, a characteristic that antibiotics cannot match.

This study introduces richer semantic information (mutual information) between phages and hosts for PHI prediction, and designs a dual-view sparse network model that integrates mutual information, sequence information, and association information. Mutual information is high-order information calculated directly from the sample environment, which overcomes the small scale and incompleteness of existing biological databases. Considering the sparsity of validated PHI relationships, we construct sparse similarity networks, which allows feature learning to focus on local features. When fusing the two information views, we integrate high-order mutual information with sequence information in the form of nodes. Compared to traditional hypergraphs, this network construction reduces the complexity of the network and makes information aggregation more convenient. It is worth mentioning that, while annotating metagenomic data, we found that the virus databases of existing analysis tools are relatively small. Therefore, we expanded the database based on validated phage–host association information to facilitate use by future researchers. Experimental results show that mutual information is a new and effective type of semantic information and that sparsification is an effective processing method; both improve the model’s prediction performance. Moreover, representing high-order information as nodes is more suitable for PHI prediction than representing it as hyperedges.

In our experiment, compared to the latest models, DSPHI improved the AUC results by |$3\%$|⁠. The model’s predictive performance was already superior to most existing models even with just the network sparsification. The mutual information, which was calculated based on metagenomic data, implies that as we discover new viruses from metagenomic data, it is entirely feasible to compute the corresponding mutual information. This capability significantly enhances the efficiency of studying interactions between newly identified viruses and their hosts.

Although the model proposed in this study achieved good results, some challenges remain. First, when mining mutual information from metagenomic data, we found that the first-order logical relationships contained too much redundant information, so the mining method needs to be improved. This will be the focus of our next work. Second, when using sequence information, we considered only gene information and not protein information, although protein information can be completed and can cover all phages and hosts.

We are also considering whether this way of mining useful information from metagenomic data applies to other bioinformatics problems, such as predicting interactions between microorganisms. This is a question we will explore in the long term.

Key Points

This study proposes a dual-view sparse network model for predicting phage–host interaction relationships, which integrates mutual information, sequence information, and association information.
This study uses a probabilistic logic method to obtain richer semantic information (mutual information) between phages and hosts from metagenomic data for phage–host interaction prediction tasks. This mutual information is high-order information.
When constructing the similarity network based on sequence information, considering the sparsity of validated phage–host relationship information, we sparsify the network.
When integrating high-order mutual information and first-order sequence information, we propose to represent high-order mutual information as nodes. Compared to hypergraphs, this approach is more conducive to information aggregation.
In annotating metagenomic data, we expand the database of the analysis tool Kraken2 based on validated phage–host interaction information.

Acknowledgments

We express our deepest gratitude to the authors of all the code and data cited in this paper, especially Feiyue Sun and Jianqiang Sun for their GCNAT method, which greatly inspired our work. We also thank the reviewers for their helpful and constructive comments.

Conflict of interest: None declared.

Funding

This work was partially supported by the National Natural Science Foundation of China (62372205, 62472192 and 61932008), the National Language Commission Key Research Project (ZDI145-56), the Fundamental Research Funds for Central Universities (KJ02502022-0450), the Natural Science Foundation of Hubei Province of China (2022CFB289), and the Self-determined Research Funds of CCNU from the Colleges’ Basic Research and Operation of MOE (CCNU24JC032).

Code and data availability

We utilized a publicly available dataset of phage–host associations (PHD) and a publicly available metagenomic dataset. The PHD dataset is updated daily and can be accessed at http://phdaily.info or http://combio.pl/phdaily. The metagenomic dataset is labeled as PRJNA422434 and comprises 370 human gut samples. It is accessible at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422434. Related code, dataset splits, and trained models can be found at https://github.com/Ankang-Wei/DSPHI.

Author contributions statement

Ankang Wei: contributed to the methodological ideas of DSPHI, conducted experimental data analysis, visualized the results, and drafted the manuscript. Huanghan Zhan: contributed to data organization and analysis, and drafted the manuscript. Zhen Xiao: contributed to the organization and analysis of real data, and drafted the manuscript. Weizhong Zhao: contributed to the methodological ideas of DSPHI and drafted the manuscript. Xingpeng Jiang: contributed to the methodological ideas, biological insights, and interpretations of DSPHI, and drafted the manuscript. All authors have read and approved the final manuscript.

References

Samson, Magadán, Sabri et al. Revenge of the phages: defeating bacterial defences. Nat Rev Microbiol 2013; 675–
