Abstract

Bacterial resistance has emerged as one of the greatest threats to human health, and phages have shown tremendous potential for combating drug-resistant bacteria by lysing their hosts. Identifying phage–host interactions (PHI) is therefore crucial for addressing bacterial infections. Some existing computational methods for predicting PHI perform suboptimally because they draw on only a few types of information. Although some supporting information has emerged, the generalizability of models that use it is limited by the small scale of the underlying databases. In addition, most existing models overlook the sparsity of association data, which severely impairs their predictive performance. In this study, we propose a dual-view sparse network model (DSPHI) for predicting PHI, which leverages logical probability theory and network sparsification. Specifically, we first construct similarity networks from the sequences of phages and hosts, respectively, and then sparsify these networks so that the model focuses on key information during learning, thereby improving prediction efficiency. Next, we use logical probability theory to compute high-order logical information between phages (hosts), which we refer to as mutual information. We then attach this information, in the form of nodes, to the sparse phage (host) similarity network. The resulting phage (host) heterogeneous network better integrates the two information views, reducing computational complexity and enhancing information aggregation. The hidden features of phages and hosts are then learned with graph learning algorithms. Experimental results demonstrate that mutual information is an effective information source for predicting PHI and that sparsifying the similarity networks significantly improves the model’s predictive performance.

Introduction

Bacteriophages, or phages, are the most common type of virus and specifically infect bacteria and archaea [1]. Phages recognize specific receptors on the surface of host cells and bind to them, injecting their genetic material into the host. They then exploit the host’s resources to replicate, translate, and assemble new phage particles [2]. Bacterial–viral interactions constitute a complex and multifaceted process that encompasses gene transfer, substance cycling, and microbial evolution and plays a crucial role in maintaining microbial ecological stability [3]. During phage–host interactions (PHI), the phage’s genome and proteome constantly adapt alongside the evolution of the host to ensure efficient infection and replication in specific bacterial or archaeal hosts, without harming other microorganisms or cells [1]. The unique ability of phages to recognize and kill their hosts provides new insights into the treatment of bacterial diseases and, in recent years, has offered opportunities for the diagnosis and treatment of multidrug-resistant bacteria in particular [4]. Phage therapy has demonstrated significant potential as a replacement for antibiotics in the treatment of bacterial infections, with numerous successful cases reported worldwide [5]. The biological dependence of phages on their hosts for survival makes the study of PHI a primary aspect of phage research [6]. With the advancement of high-throughput sequencing technologies, a large number of viruses have been discovered. Relying solely on laboratory culture methods to study PHI is not only costly and time-consuming but also cannot cover all possible host ranges under experimental conditions. Therefore, there is an urgent need for a rapid and effective computational method for predicting PHI.

To date, several computational methods for phage host prediction have been designed. Existing computational methods are primarily classified into sequence-based methods [7–9], feature-based methods [10–12], and network-based methods. Sequence-based methods are further divided into alignment-based and non-alignment-based methods [13]. Alignment-based methods, such as BLAST [7] and CL4PHI [14], determine the interaction relationship by aligning the gene or protein sequences of phages and hosts, but they cannot accurately predict entities with large sequence differences. Non-alignment-based methods, such as VirHostMatcher [9], WIsH [15], PHP [8], and PHIST [16], predict directly by analyzing the features of genomes or proteomes, but the process is cumbersome and consumes substantial computational resources. Feature-based methods, such as RaFAH [11], vHULK [12], and PHIAF [10], use machine learning or deep learning algorithms to learn the features of phages and hosts for prediction. While intuitive and interpretable, these methods face challenges in handling high-dimensional sparse data and lack end-to-end learning. Network-based methods, such as HostG [13], CHERRY [17], and GSPHI [18], are end-to-end approaches that place phages and hosts in a heterogeneous graph for feature embedding and prediction, leading to stronger generalization ability.

Although existing methods have achieved certain successes in predicting PHI, some issues remain unaddressed. The first is the high sparsity of the dataset, which is caused by the specificity that phages exhibit when infecting their hosts. To date, approximately 34 000 verified virus–host interaction relationships are publicly available [19]. Virus-Host DB [19] is currently the largest database of PHI, involving approximately 4770 host species and 28 780 viruses. However, this is far smaller than the number of viruses and bacteria that have been publicly disclosed, so using computational methods to explore interactions is both suitable and necessary [20]. Because the number of validated PHI is limited, computational results may include false positives; nevertheless, they are quite effective in narrowing the search scope and exploring potential associations [20, 21]. Moreover, among these associations, |$99.99\%$| of phages have a host range size of no more than 5, and over |$92\%$| of hosts correspond to no more than 10 phages. These structural statistics indicate the great potential of phages for specifically recognizing and lysing drug-resistant bacteria, and they underscore the importance of accounting for the high sparsity of the data when predicting PHI.

Second, the currently available effective information is limited, and the scale of the information databases is relatively small. Currently available sources of information mainly include abundance information [21–23], sequence information [21, 23], quorum sensing system information [24], and association information [25]. Sequence information includes gene sequences, protein sequences, receptors, protein-binding receptors, and CRISPR sequences [2, 6], and it is the most widely used and recognized type of biological information [20]. However, among all existing information sources, the databases for every type of information other than gene sequences are relatively small. For example, the latest publicly available database for protein-binding receptors contains only 1230 entries [26], and the largest publicly available database for protein receptor information contains information for just over 400 receptors from 22 bacterial species [27]. It is therefore essential to study the interaction mechanisms between phages and hosts and to identify new, more universally applicable, and effective information. We aim to develop an effective method that incorporates more comprehensive phage information to improve the accuracy of PHI prediction.

To address the aforementioned challenges, we implemented two solutions. First, to account for the high sparsity of the data structure during modeling, we performed a sparsification operation on the fully connected network during its construction. Second, we extracted a new type of valuable information from the metagenomic data, referred to as higher-order logic information (also known as higher-order mutual information). A large number of experiments have shown that interactions between phages exist and mostly occur within the host [28]. Some phages help other phages evade host immunity, while enzymes produced by some phages help others accelerate the lysis of the host, driving these types of phages to appear together [29]. Considering that phages survive as parasites, we assume that phages that share a host will appear together in the environment where the host exists. Based on this assumption, we aim to extract the information carried by this co-occurrence phenomenon from metagenomic data. Specifically, we annotate metagenomic data to obtain the abundance information of phages and hosts, and then determine the presence status of these entities in each sample environment, denoting presence as 1 and absence as 0. Subsequently, we treat the status information of phages and hosts in each environment as random variables. We use a probabilistic logic approach to calculate the logical probability scores between these random variables to measure the logical relationships between the corresponding entities.

DSPHI integrates high-order mutual information, sequence information, and association information, while sparsifying the network to create a dual-view heterogeneous network. DSPHI has several advantages. First, mutual information, as an effective information source, is derived from metagenomic data, which enhances its universality. Second, the sparsification in DSPHI is designed around the sparsity of association networks, significantly reducing redundancy and noise while improving the model’s runtime and space efficiency. Third, by incorporating high-order mutual information as nodes in the network, DSPHI reduces model complexity compared with hypergraph networks and enhances information aggregation. Finally, DSPHI includes a mature framework for annotating metagenomic data and extends the Kraken2 database, making it more accessible to researchers.

Materials and methods

Dataset

The interactions between phages and hosts were obtained from the Phage-Host Daily (PHD) database (https://phdaily.info/report) [30]. PHD integrates data from various databases, including NCBI (https://www.ncbi.nlm.nih.gov/genomes/), GenBank [31], Virus-Host DB, RefSeq [32], and MVP [33], compiling comprehensive information on PHI. As of 2023, the database contains 16 890 interaction pairs, with |$96.24\%$| of viruses infecting only one host.

Metagenomic data were sourced from NCBI, specifically from the 2nd Diabetes Gut Microbiome of Chinese individuals [34], comprising 370 individuals’ gut metagenomic data (PRJNA422434). The data can be downloaded from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422434.

Identification of phages from metagenomic data involves various tools, such as VirSorter2 [35], VirusSeeker [36], DeepVirFinder [37], Kraken2 [38], and MetaPhinder [39], among others. In our analysis, we opted for Kraken2 [38, 40], KneadData [41], and Bracken [42] due to their superior performance in metagenomic data analysis.

We used KneadData for quality control of the raw metagenomic data (using Trimmomatic [43]) and for removal of host information (using Bowtie2 [44]). Trimmomatic filters out adapter sequences and low-quality reads. Bowtie2 is a sequence alignment tool suitable for aligning sequencing reads to a long reference genome. After processing with KneadData, the data no longer contain host sequences and can be used for species annotation. FastQC was run before and after KneadData to assess the effectiveness of the quality control. OTU tables were generated based on sequence similarity clustering. Then, we used Kraken2, which performs taxonomic classification using k-mers, for species classification of the sequencing data. Finally, we used Bracken for abundance estimation of each taxon, obtaining abundance data for phages and bacteria. To eliminate errors in the annotation process and improve computational efficiency, species with abundance values below 10 were filtered out. During this process, we used the standard database provided by Kraken2, which is approximately 50 GB in size and contains approximately 96 000 species of bacteria, 18 600 species of viruses, and other microorganisms such as archaea.

We compared this database with the PHD database and found that 9589 virus species present in the PHD were not retrieved in the Kraken2 database. We downloaded the gene sequence files and taxonomic information for these 9589 virus species from NCBI and expanded the Kraken2 database with these data in the format required by Kraken2. After expanding the database, we performed species classification and abundance calculation.

After thorough processing, we obtained abundance information at the species level for both phages and bacteria. From the metagenomic data, species-level abundance information for 8688 bacteria and 2969 phages was obtained.

We selected interaction pairs that appeared simultaneously in both the PHD and the annotated abundance databases to construct our required PHI database. This database contains 1909 viruses, 278 hosts, and 2002 PHI pairs. Notably, |$97\%$| of phages were found to infect only one bacterial species, and |$76\%$| of bacterial hosts had interactions with five or fewer phages. Finally, based on the acquired interactions, we obtained complete genomes of phages and hosts from NCBI and Virus-Host DB for sequence feature learning.

Phage–host associations

To better construct a heterogeneous network, we transformed the phage–host association information into an adjacency matrix |$A_{m\times n}$|, where |$m$| represents the number of hosts and |$n$| represents the number of phages. The values in the adjacency matrix are binary, because binary information is easier to process and optimize in heterogeneous networks and facilitates more efficient information transmission. If phage |$j$| can infect bacterial host |$i$|, then |$A_{ij}=1$|; otherwise, |$A_{ij}=0$|.
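As a minimal illustration of this encoding, the following Python sketch builds a binary association matrix from a list of known pairs. The function name, the toy identifiers, and the orientation (hosts as rows, phages as columns, matching the |$m\times n$| shape) are assumptions made for illustration and are not taken from the DSPHI code.

```python
def build_association_matrix(pairs, phages, hosts):
    """Return an m x n binary matrix (rows: hosts, columns: phages)."""
    phage_index = {p: j for j, p in enumerate(phages)}
    host_index = {h: i for i, h in enumerate(hosts)}
    matrix = [[0] * len(phages) for _ in hosts]
    for phage, host in pairs:
        # A known infection sets the corresponding entry to 1.
        matrix[host_index[host]][phage_index[phage]] = 1
    return matrix


# Hypothetical toy identifiers, not taken from the PHI dataset.
phages = ["phage_a", "phage_b", "phage_c"]
hosts = ["host_x", "host_y"]
pairs = [("phage_a", "host_x"), ("phage_b", "host_x"), ("phage_c", "host_y")]
A = build_association_matrix(pairs, phages, hosts)
```

Every entry that is not backed by a known pair stays 0, which is what makes the matrix as sparse as the verified association data.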

Higher-order mutual information

Bowers [45] used a probabilistic logic-based method to infer second-order logical relationships between proteins, achieving good results. We applied this method to calculate the logical relationships between species, as shown in Fig. 1. To conduct uncertain reasoning, we converted the abundance data into binary data, where 0 indicates the absence of a species in a sample and 1 indicates its presence, thereby obtaining the states of all microorganisms in each sample.
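The binarization step can be sketched in a few lines of Python. The function name, the table layout (one list of per-sample abundances per species), and the presence cut-off `threshold` are illustrative assumptions; the text only states that presence is coded as 1 and absence as 0.

```python
def binarize_abundance(abundance, threshold=0.0):
    """Map {species: [abundance per sample]} to {species: [0/1 per sample]}.

    A species counts as present (1) in a sample when its abundance is
    strictly greater than `threshold` (an assumed cut-off).
    """
    return {
        species: [1 if value > threshold else 0 for value in values]
        for species, values in abundance.items()
    }


# Hypothetical abundance table with four samples.
abundance = {
    "phage_a": [12.0, 0.0, 35.5, 0.0],
    "host_x": [150.0, 0.0, 80.0, 4.0],
}
states = binarize_abundance(abundance)
```

Each resulting 0/1 vector is the presence profile that the probabilistic logic analysis treats as a random variable.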

Figure 1. Mutual information acquisition process. (a) The process of obtaining species abundance tables of phages and bacteria from metagenomic data. (b) The process of mining mutual information with the probabilistic logic method according to the state of species in each sample.

Information entropy is used in information theory to represent the uncertainty or information content of a random variable. According to the definition of entropy, the larger the entropy, the greater the uncertainty of the random variable. Similarly, joint entropy is a measure of uncertainty of the joint distribution of two random variables.

For any two microorganisms |$A$| and |$B$|, let |$A=\{x_{1},x_{2},...,x_{n}\}$| and |$B=\{y_{1},y_{2},...,y_{n}\}$| denote the states of the two species in the different samples, and let |$\{P(x_{1}),P(x_{2}),...,P(x_{n})\}$| and |$\{P(y_{1}),P(y_{2}),...,P(y_{n})\}$| denote the probabilities of these states, where |$n=370$|. The entropy of microorganism |$A$| is represented as follows:

|$H(A)=-\sum _{i=1}^{n}P(x_{i})\log P(x_{i})$| (1)

The joint entropy of microorganisms |$A$| and |$B$| is represented as follows:

|$H(A,B)=-\sum _{i=1}^{n}\sum _{j=1}^{n}P(x_{i},y_{j})\log P(x_{i},y_{j})$| (2)

where |$P(x_{i},y_{j})$| is the joint probability of states |$x_{i}$| and |$y_{j}$|. The uncertainty coefficient |$U$| is used to represent the probability of logical relationships:

|$U(A|B)=\frac{H(A)+H(B)-H(A,B)}{H(A)}$| (3)

|$U(A|B)$| represents how much the uncertainty of |$A$| is reduced once the state of |$B$| is determined. The logical relationship between two entities is called first-order logic, denoted as |$U(A|F_{1}(B))$|, where |$F_{1}(B)$| represents the two possible states of |$B$|, presence and absence, i.e. |$U(A|B)$| and |$U(A|\neg B)$|. The logical relationship between three entities is called second-order logic or higher-order logic.
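The entropy, joint entropy, and uncertainty coefficient described above can be sketched as follows. This is a minimal illustration that assumes probabilities are estimated as empirical frequencies over the samples and that logarithms are taken in base 2; the function names are hypothetical, and the form of |$U$| is the standard uncertainty coefficient.

```python
import math
from collections import Counter


def entropy(states):
    """Shannon entropy (in bits) of a sequence of discrete states."""
    n = len(states)
    return -sum((c / n) * math.log2(c / n) for c in Counter(states).values())


def joint_entropy(a, b):
    """Joint entropy of two equally long state sequences."""
    return entropy(list(zip(a, b)))


def uncertainty_coefficient(a, b):
    """U(A|B) = [H(A) + H(B) - H(A, B)] / H(A); 0.0 when H(A) is zero."""
    h_a = entropy(a)
    if h_a == 0:
        return 0.0
    return (h_a + entropy(b) - joint_entropy(a, b)) / h_a


# Toy presence profiles over four samples.
a = [1, 1, 0, 0]
b = [1, 1, 0, 0]
c = [1, 0, 1, 0]
u_ab = uncertainty_coefficient(a, b)  # identical profiles
u_ac = uncertainty_coefficient(a, c)  # statistically independent profiles
```

With these toy profiles, knowing |$B$| removes all uncertainty about |$A$|, whereas knowing the independent profile removes none, which is the behaviour the coefficient is meant to capture.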

The logical relationship between three entities is represented as follows:

|$U(C|F_{2}(A,B))=\frac{H(C)+H(F_{2}(A,B))-H(C,F_{2}(A,B))}{H(C)}$| (4)

Here, |$H(\cdot )$| denotes entropy, |$C$| is the third entity, and |$F_{2}(\cdot )$| represents the combination of different states of |$A$| and |$B$|, resulting in eight possible combinations. These are shown in Fig. 2b, and a more detailed introduction is presented in Appendix 4.
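To make the second-order case concrete, the sketch below scores how well a logical combination of two presence profiles determines a third. It is an illustration under stated assumptions: only two example combinations (OR and AND) are shown rather than the paper's eight types, probabilities are empirical frequencies, and all names are hypothetical.

```python
import math
from collections import Counter


def entropy(states):
    """Shannon entropy (in bits) of a sequence of discrete states."""
    n = len(states)
    return -sum((c / n) * math.log2(c / n) for c in Counter(states).values())


def uncertainty(c, f):
    """U(C|F) = [H(C) + H(F) - H(C, F)] / H(C); 0.0 when H(C) is zero."""
    h_c = entropy(c)
    if h_c == 0:
        return 0.0
    return (h_c + entropy(f) - entropy(list(zip(c, f)))) / h_c


# Two illustrative logic functions; the eight types are not enumerated here.
LOGIC = {
    "A or B": lambda a, b: a | b,
    "A and B": lambda a, b: a & b,
}


def second_order_scores(a, b, c):
    """Score how well each logic combination of A and B determines C."""
    return {
        name: uncertainty(c, [fn(x, y) for x, y in zip(a, b)])
        for name, fn in LOGIC.items()
    }


a = [1, 0, 0, 1, 0, 0]
b = [0, 1, 0, 1, 0, 0]
c = [1, 1, 0, 1, 0, 0]  # present exactly when A or B is present
scores = second_order_scores(a, b, c)
```

In this toy case the OR combination fully determines |$C$|, while neither |$A$| nor |$B$| alone does, which is the kind of relationship that first-order scores cannot reveal.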

Figure 2. Quality analysis of data results corresponding to eight logical relationship types. (a) The number of logical relationships obtained and the proportion of co-hosts among phages in the triad corresponding to each type. (b) The proportion of phages covered by the information corresponding to each logical type and the proportion of co-hosts.

According to the designed computational process, we obtained eight types of mutual information. To evaluate the quality of the different types, we designed several metrics in addition to the quantity of information. First, to account for co-infection of a single host, note that each piece of mutual information contains three phages; we use 'co_1', 'co_2', and 'co_3' to denote the cases in which the three phages infect the same host. Second, we use 'co-type' to indicate the proportion of mutual information with shared hosts among the total of each type, and 'cov-phage' to indicate the proportion of phages in mutual information with shared hosts among the total phages. The specific experimental results are shown in Fig. 2. They indicate that the total amount of mutual information for Types 2, 3, and 6 is significantly larger than for the other types, and that the 'co-type' ratio for Types 1, 3, 5, and 7 is significantly larger than for the other types. Types 2 and 3 also cover a larger number of phages than the other types.

Although we explored first-order logical information and second-order logical information separately in the experiment, obtaining 24 000 pieces of first-order logical information, the experimental analysis revealed that only |$27\%$| of the information pertained to cases with shared hosts, with the number of phages involved being less than |$30\%$|⁠. The experimental results were not ideal, likely due to excessive redundancy in the information and low information quality. This may be attributed to the lack of rigor in the formula chosen for calculating first-order logical relationships. This study primarily focuses on the analysis of second-order logical information.

Sparse similarity networks

In the task of predicting PHI, a basic assumption is that phages that are more similar are more likely to infect similar bacterial hosts. For example, the sequences of Acinetobacter phages belonging to the same genus are often very similar, and most of them infect Acinetobacter baumannii, or infect Acinetobacter calcoaceticus and Acinetobacter pittii, which are similar to A. baumannii [30].

Under this assumption, we construct similarity networks for phages and hosts, respectively, by calculating the sequence similarity between phages and between hosts. We decomposed the complete genomes of phages and hosts into subsequences of length k and calculated the frequency of each k-mer. This generated a k-mer frequency feature vector for each phage and host. The connection between any two phages (hosts) was then obtained by calculating the cosine similarity between their respective k-mer frequency features. The k-mer frequency features provide rich information about the genome sequence and are particularly suitable for extracting global information from genomes in cases where precise annotation or full-length alignment is unavailable. The formula is as follows:

|$sim(p_{i},p_{j})=\frac{\overrightarrow{p_{i}}\cdot \overrightarrow{p_{j}}}{\|\overrightarrow{p_{i}}\|\,\|\overrightarrow{p_{j}}\|}$| (5)

Here, |$\overrightarrow{p_{i}}$| and |$\overrightarrow{p_{j}}$| represent the k-mer frequency feature vectors of phages |$p_{i}$| and |$p_{j}$|, respectively, and |$\|\cdot \|$| represents the |$\ell _{2}$| norm of a vector.
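The k-mer featurization and cosine similarity can be sketched as follows. The value of k (3 here), the restriction to the four-letter DNA alphabet, and the function names are assumptions chosen for illustration; the text above does not specify them.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the normalised frequency vector over all 4**k DNA k-mers."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all-zero."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Two hypothetical short sequences standing in for complete genomes.
similarity = cosine_similarity(
    kmer_frequencies("ACGTACGTAC"), kmer_frequencies("ACGTTCGTAC")
)
```

Because the vectors have a fixed length of 4**k regardless of genome size, genomes of very different lengths can be compared without alignment.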

In network-based methods, the similarity network is constructed as a fully connected network, which contradicts the sparsity of verified association information. This phenomenon implies that the similarity network contains a large amount of redundant information. To eliminate the influence of redundant information, we set parameters |$simv$| and |$simh$| to process the similarity values of phages and hosts, respectively. In the processed similarity network, only the nearest neighbors with high similarity to the phages or hosts are retained, and the feature learning process focuses only on the locally important relationships.
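One simple way to realise such a sparsification is a hard threshold on the similarity values, sketched below. Treating |$simv$| (or |$simh$|) as a cut-off below which edges are dropped, and removing self-loops, are assumptions made for this illustration; the exact rule used by DSPHI may differ.

```python
def sparsify(similarity, threshold):
    """Keep off-diagonal entries >= threshold and set all others to 0.0."""
    n = len(similarity)
    sparse = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and similarity[i][j] >= threshold:
                sparse[i][j] = similarity[i][j]
    return sparse


# Hypothetical 3 x 3 phage similarity matrix.
S = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
S_sparse = sparsify(S, threshold=0.5)
```

A symmetric input yields a symmetric output, so the sparsified matrix can still be used as an undirected adjacency matrix.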

Heterogeneous network integration of mutual information and sequence information

Mutual information represents a second-order logical relationship, while sequence information represents a first-order relationship. Among existing methods, only hypergraph neural networks can integrate first-order and second-order relationships. Examples include hypergraph convolutional neural networks and hypergraph attention neural networks. However, these hypergraph neural networks are computationally expensive and have limitations in interpretability and capturing local information. To better integrate multiple information sources, we propose to integrate mutual information as nodes with entity information, and then combine them with sequence information for modeling.

The eight types of second-order logical relationships each correspond to a specific mutual information measure. We conducted a quality analysis for each type, and the results showed that the mutual information corresponding to Type 3 exhibited the best quality, as shown in Fig. 2.

Although Type 3 information has the best quality, some redundant information remains. We therefore used the sequence similarity of phages to filter the data. According to the definition of Type 3, if phage A appears or phage B appears, then phage C appears. We calculated the sequence similarity between phages A, B, and C and discarded the piece of mutual information if the similarity value was below a certain threshold. After filtering, we obtained 9826 pieces of mutual information, of which 96.68% involved shared hosts.

After filtering, we found that the number of pieces of information far exceeded the number of phages. This was because some phages appeared frequently in the mutual information. To avoid imbalance issues during feature learning, we merged the mutual information. The merging criterion was: if two pieces of mutual information involved the same phages A and B, then the two pieces of information were merged. After merging, we obtained 2775 pieces of information.
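The merging criterion can be sketched as a grouping step. The sketch assumes each piece of mutual information is a triple (A, B, C) and that the pair (A, B) is unordered; both the representation and the function name are illustrative assumptions.

```python
from collections import defaultdict


def merge_mutual_information(triples):
    """Group (A, B, C) triples that share the same unordered pair {A, B}."""
    merged = defaultdict(set)
    for a, b, c in triples:
        merged[frozenset((a, b))].add(c)
    return dict(merged)


# Hypothetical phage identifiers.
triples = [("p1", "p2", "p3"), ("p2", "p1", "p4"), ("p1", "p5", "p3")]
merged = merge_mutual_information(triples)
```

Each key then corresponds to one merged piece of information, so frequently occurring pairs no longer contribute many separate entries.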

We created a virtual node for each piece of mutual information, and this virtual node was connected to the nodes representing the phages (or hosts) involved in the mutual information. Each piece of mutual information corresponded to a star network, where the central node of the network is a virtual node. Finally, we merged the constructed sparse similarity network with each star network to obtain a heterogeneous network that integrates mutual information and sequence information.
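The star networks can be sketched as an edge list in which each piece of mutual information becomes a virtual hub connected to its member phages. The hub naming scheme and the single shared edge weight are assumptions for illustration.

```python
def build_star_edges(merged_information, weight):
    """Return {(virtual_node, phage): weight} with one virtual hub per item."""
    edges = {}
    for index, members in enumerate(merged_information):
        hub = f"mi_{index}"  # hypothetical naming scheme for virtual nodes
        for phage in sorted(members):
            edges[(hub, phage)] = weight
    return edges


# Two hypothetical pieces of merged mutual information.
items = [{"p1", "p2", "p3"}, {"p1", "p5"}]
edges = build_star_edges(items, weight=0.5)
```

Representing each relationship as a hub with ordinary pairwise edges keeps the graph a standard (non-hyper) graph, so a regular graph convolution can be applied to it.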

Feature learning

Figure 3 illustrates the workflow of DSPHI. First, we obtain the phage–host association information from the constructed database and use metagenomic analysis methods and probabilistic logic to obtain the mutual information. Then, combining the cosine similarity matrices of phages and hosts, we construct a heterogeneous network that includes four types of nodes and four types of edges. Next, we use an attention graph convolutional algorithm to perform graph embedding representation learning on this heterogeneous network. Finally, a decoder reconstructs the phage–host association matrix from the learned embeddings to obtain the predicted relationships.

Figure 3. The overall framework of DSPHI. (A) Data preprocessing module: calculating high-order mutual information and similarity measures between phages (or hosts). (B) Network construction module: integrating sequence information and high-order mutual information into a network and sparsifying the network. (C) Feature learning module: learning hidden feature representations using a graph convolutional algorithm. (D) PHI prediction module: reconstructing the PHI association matrix.

Graph convolutional networks (GCNs) fuse local and global information by learning representations of nodes and their neighbors, generating high-quality node embeddings. The information propagation rule between layers is

(6)

Here, |$G$| represents the adjacency matrix composed of the relationships between the various entities, |$Z$| denotes the node features at each layer, |$\sigma (\cdot )$| is the nonlinear activation function, and |$W^{l}$| is a trainable parameter matrix.

The adjacency matrix of the constructed graph |$G$| is expressed as follows:

(7)

|$P$| denotes the cosine similarity of the phage’s DNA sequence, while |$H$| denotes the cosine similarity of the host bacterium’s DNA sequence. |$A$| represents known association information, |$B_{P}$| represents the adjacency matrix of the star-shaped network corresponding to the mutual information of the phage, where the edge weights in the adjacency matrix are regulated by the set parameter |$simw$|⁠. Similarly, |$B_{H}$| represents the adjacency matrix of the star-shaped network corresponding to the mutual information of the host, where the edge weights in the adjacency matrix are regulated by the set parameter |$simz$|⁠.

The edges in the constructed network are derived from feature information. To avoid the duplication of information and to better learn the topology of the heterogeneous network, we set the initial inputs for phages and hosts as follows:

(8)

We assess the importance of mutual information to nodes by setting parameters |$simw$| and |$simz$|⁠, using them as the initial edge weights between virtual nodes and phage (host) nodes. The features of the first-order neighbors of virtual nodes are subjected to average pooling, and the result after pooling serves as the initial features for virtual nodes.

(9)
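The average pooling used to initialise virtual nodes can be sketched as follows; the feature dictionary and function name are hypothetical, and equation (9) itself is not reproduced here.

```python
def virtual_node_features(neighbors, features):
    """Average the feature vectors of a virtual node's first-order neighbours."""
    dim = len(features[neighbors[0]])
    return [
        sum(features[node][d] for node in neighbors) / len(neighbors)
        for d in range(dim)
    ]


# Hypothetical two-dimensional phage features.
features = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [2.0, 2.0]}
pooled = virtual_node_features(["p1", "p2", "p3"], features)
```

Initialising the hub from its neighbours means the virtual node starts with the same dimensionality as the phage (host) nodes and carries no information beyond theirs.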

In the graph convolution process, the information aggregated in each layer varies. We introduce an attention mechanism to assign different weights to each convolutional layer for generating the final embedding representation.

Taking phage |$P_{i}$| as an example, |$Z^{1}_{i}, Z^{2}_{i},\cdots ,Z^{l}_{i}$| represent the output representations of |$P_{i}$| at each layer. Therefore, the final output representation is

(10)
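A layer-wise attention of this kind is commonly implemented as a softmax-weighted sum of the per-layer outputs. The sketch below assumes that form and takes the per-layer attention scores as given inputs; in a trained model they would be learned, and the exact formulation of equation (10) may differ.

```python
import math


def combine_layers(layer_outputs, scores):
    """Softmax the per-layer scores and return the weighted sum of outputs."""
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(layer_outputs[0])
    combined = [
        sum(w * z[d] for w, z in zip(weights, layer_outputs))
        for d in range(dim)
    ]
    return combined, weights


# Hypothetical outputs of two layers for one phage, with equal scores.
combined, weights = combine_layers([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because the weights sum to one, the final embedding stays on the same scale as the individual layer outputs regardless of how many layers are used.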

To ensure a more objective comparison of the work, we employed the commonly used Adam optimization algorithm for training the model. The main equation for the reconstruction process is

(11)

Here, |$W^{\prime}$| represents the trainable parameter matrix, |$A^{\prime}$| represents the reconstructed phage–host association matrix, |$Z_{P}$| and |$Z_{H}$| represent the model-generated embedded representations of the phage and host, respectively.

Cross-entropy loss function

The cross-entropy loss function used in this study is particularly suitable for classification tasks. Compared with other loss functions such as mean squared error, it is easier to optimize. Here, the positive instance set and the negative instance set are denoted as |$y^{+}$| and |$y^{-}$|, respectively. The dataset contains |$N_{P}$| types of phages and |$N_{H}$| types of hosts, and the number of positive samples in the prediction task is much smaller than the number of negative samples. To minimize the impact of this imbalance, we added a parameter |$\gamma $|, whose value is obtained from the ratio of positive to negative samples. The specific formula is as follows:

(12)
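A class-weighted cross-entropy of this kind can be sketched as follows. The sketch assumes that the positive term is scaled by |$\gamma $| and that |$\gamma $| equals the number of negatives divided by the number of positives; the text only states that |$\gamma $| is derived from the ratio of the two classes, so the direction of the ratio and the averaging are assumptions.

```python
import math


def weighted_cross_entropy(labels, probabilities, eps=1e-12):
    """Binary cross-entropy whose positive term is scaled by gamma = #neg / #pos."""
    positives = sum(labels)
    negatives = len(labels) - positives
    gamma = negatives / positives if positives else 1.0
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= gamma * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels), gamma


# One positive and three negatives, all predicted with probability 0.5.
loss, gamma = weighted_cross_entropy([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
```

With this weighting, the single positive sample contributes as much to the loss as the three negatives combined, which counteracts the class imbalance.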

Experiment

Experiment settings

This section mainly presents the settings of multiple parameters, baseline models, and evaluation metrics involved in the experimental process. The biological characteristic of phages in specifically recognizing their hosts leads to a significant class imbalance in the dataset, where the number of positive samples is far smaller than that of negative samples. In our experiments, we randomly divided the positive samples into five equal parts. Four parts were combined with samples labeled as 0 (negative samples) to form the training set. The remaining positive samples, together with the samples labeled as 0, constituted the testing set for model evaluation. Additionally, to further enhance the evaluation of the model, we employed a downsampling approach to balance the number of positive and negative samples and compared the results with those of the baseline models.
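The partitioning of positive samples described above can be sketched as follows. The function name, the fixed random seed, and the round-robin assignment to folds are illustrative assumptions; negative samples are handled separately, as described in the text.

```python
import random


def five_fold_positive_splits(positive_pairs, n_folds=5, seed=0):
    """Yield (train_positives, test_positives) for each fold."""
    shuffled = list(positive_pairs)
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment gives folds whose sizes differ by at most one.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        train = [pair for i, fold in enumerate(folds) if i != k for pair in fold]
        yield train, folds[k]


# Ten hypothetical positive samples, identified by index.
splits = list(five_fold_positive_splits(range(10)))
```

Every positive sample appears in exactly one test fold, so averaging over the five folds evaluates the model on each known interaction once.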

Hyper-parameter setting

For the hyperparameters involved in the experiments, we used grid search, which is comprehensive and intuitive. The search range of the embedding dimension |$k$| was {32, 64, 128, 256, 512}, that of the number of graph convolution layers |$L$| was {1, 2, 3, 4, 5, 6}, and that of the optimizer’s initial learning rate |$lr$| was {0.1, 0.01, 0.001}. The final parameter settings are shown in Table 1.

The edge weights |$simw$| and |$simz$| between nodes representing mutual information and phage (host) nodes both range from 0.1 to 2. The search range of the number of training epochs was {500, 1000, 2000, 3000, 4000}. The relevant experimental procedures are shown in the appendix.

Table 1. The parameters of DSPHI

| Structure | Parameters |
| --- | --- |
| Layer | Amount (L): 3 |
| Dimension | Units (K): 128 |
| Activation function | ReLU |
| Node dropout | 0.4 |
| Edge dropout | 0.7 |
| Loss function | Cross-entropy loss function |
| Optimizer | Adam |
| Epoch | 4000 |
| Initial learning rate | 0.01 |

Baselines and evaluation metrics

To objectively evaluate the performance of the DSPHI model, we selected several state-of-the-art baseline prediction models from the last three years and conducted experiments on the same dataset.

CL4PHI [14]: a contrastive learning-based method for PHI prediction, focusing on identifying phage-host relationships at the species and genus levels.

GCNAT [46]: a novel deep learning algorithm that combines graph convolutional networks with graph attention mechanisms for predicting metabolite–disease associations.

PHIAF [10]: a classical model utilizing convolutional neural networks that incorporates data augmentation with generative adversarial networks and combines sequence features for PHI prediction.

PHP [8]: a Gaussian model for predicting prokaryotic viral hosts in genomics, based on the non-homology method, representing a classic model in the field.

PredPHI [47]: a model exclusively using protein sequences as features and combining them with a CNN model for PHI prediction.

MKGCN [48]: a deep learning model based on graph convolutional neural networks that leverages multiple kernel fusion to infer microbe–drug associations.

In the experiments, five-fold cross-validation was used for model training and prediction. The dataset was divided into five subsets; in each round, four subsets served as the training set and the remaining one as the test set. This yielded five sets of predictions, and the average over the five rounds was taken as the final result. To evaluate the model comprehensively, we first counted the true positive (|$TP$|), true negative (|$TN$|), false negative (|$FN$|), and false positive (|$FP$|) samples. Based on these four statistics, three metrics were computed: the area under the receiver operating characteristic curve (|$AUC$|), the area under the precision-recall curve (|$AUPR$|), and accuracy (|$ACC$|). This protocol follows the evaluation standards commonly adopted in the literature.
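To make the evaluation protocol concrete, the following sketch splits sample indices into five folds and computes ACC, AUC, and AUPR from labels and predicted scores. It assumes a decision threshold of 0.5 for ACC and uses average precision as the AUPR estimate; the text states neither choice, and the labels and scores are made-up toy values.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def confusion_counts(labels, scores, threshold=0.5):
    """Return (TP, FP, TN, FN) at the given score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """ACC = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """AUPR estimated as the mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
folds = five_fold_indices(len(labels))
acc = accuracy(*confusion_counts(labels, scores))
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
```

In a full run, each fold serves once as the test set and the five values of each metric are averaged.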

Comparison of predictive performance

All models were evaluated using five-fold cross-validation, with AUC and ACC as the common evaluation metrics. As shown in Fig. 4, DSPHI clearly outperformed the other models on both metrics.

Figure 4

The performance of DSPHI and other methods using five-fold cross-validation.

For a more comprehensive assessment of model performance, all models were also evaluated at the genus level. As shown in Fig. 5, DSPHI exhibits a clear advantage over the baseline models at both the species and genus levels.

Figure 5

The AUC of DSPHI and other methods at the genus and species levels.

Finally, we performed paired sample t-tests to compare the AUPR values of DSPHI with other tools, and the results indicated that DSPHI achieved statistically significant improvements over the competing methods. Detailed experimental results are provided in the appendix.
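For readers unfamiliar with the test, a paired-sample t-test compares two methods on the same folds by testing whether the mean of the per-fold differences is zero. The sketch below computes the test statistic with the standard library only; the per-fold AUPR values are invented for illustration and are not results from this study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up per-fold AUPR values, for illustration only.
model_a = [0.36, 0.37, 0.35, 0.38, 0.36]
model_b = [0.32, 0.33, 0.32, 0.33, 0.31]

t_stat, dof = paired_t_statistic(model_a, model_b)
# Two-sided critical value of Student's t for 4 degrees of freedom, alpha = 0.05.
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4
```

With five folds there are only four degrees of freedom, so the difference between two methods must be consistent across folds to reach significance.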

Ablation experiment and analysis

Beyond tuning the basic parameters, the model can be improved by adjusting additional hyperparameters and by improving its input. To validate the effectiveness of these improvements, we conducted the following experiments.

The impact of network sparsification

First, we sparsified the constructed phage and host similarity networks, controlling the sparsification with the hyperparameters |$simv$| and |$simh$|, respectively. As shown in Fig. 6, all performance metrics improve as these two parameters increase, although the results degrade again once the networks become too sparse. This indicates that sparsification is effective and necessary for the PHI prediction task. The optimal value of |$simh$| was 1, indicating that the host sequence information did not contribute positively to the results. This is why we used an identity matrix as the host adjacency matrix.

Figure 6

The impact of hyperparameters |$simv$| and |$simh$|⁠.
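One simple way to sparsify a similarity network is to drop every edge whose weight falls below a threshold. The sketch below assumes this thresholding rule purely for illustration (the exact rule used by DSPHI is not restated in this section) and lets the threshold play the role of |$simv$| or |$simh$|. Under that assumption, a threshold of 1 applied to a similarity matrix whose self-similarities equal 1 leaves only the diagonal, that is, an identity matrix.

```python
def sparsify(similarity, threshold):
    """Keep entries >= threshold and set all others to 0."""
    return [[value if value >= threshold else 0.0 for value in row]
            for row in similarity]


def density(matrix):
    """Fraction of non-zero entries."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value != 0.0) / len(cells)


# Toy symmetric similarity matrix with self-similarity 1 (made-up values).
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]

moderate = sparsify(similarity, 0.6)  # keeps only the strongest off-diagonal edge
identity = sparsify(similarity, 1.0)  # only self-loops survive
```

Raising the threshold trades information for focus: the moderate setting keeps the strongest neighbor relation, while the extreme setting discards all similarity information.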

Next, we present detailed results of several variants of DSPHI on three metrics. Here, |$DSPHI$|-|$no$| indicates no sparsification, |$DSPHI$|-|$Bi$| indicates a bipartite network model containing only association relationships, and |$DSPHI$|-|$simh$| and |$DSPHI$|-|$simv$| indicate sparsification of the host similarity network and phage similarity network, respectively. The specific experimental results are shown in Table 2.

Table 2

The prediction results based on network sparsity

| DSPHI-sparse network | AUPR | AUC | ACC |
| --- | --- | --- | --- |
| DSPHI-no | 0.1502 | 0.8207 | 0.9093 |
| DSPHI-Bi | 0.1407 | 0.8780 | 0.9516 |
| DSPHI-simh | 0.3197 | 0.8937 | 0.9710 |
| DSPHI-simv | 0.3368 | 0.9348 | 0.9804 |
| DSPHI | 0.3643 | 0.9439 | 0.9839 |

The impact of mutual information

This study identified a total of eight types of logical relationships, and experiments were conducted for each type, with results shown in Table 3. Type 3 information was the most helpful, yielding the highest AUC, which is consistent with the quality analysis results. The information corresponding to Types 5 and 7 also performed relatively well in terms of both the proportion of shared hosts and the coverage of phages, ranking just below Type 3.

Table 3

The prediction results based on each logical relation type

| Type | All | co-host (%) | phage coverage (%) | AUC | ACC |
| --- | --- | --- | --- | --- | --- |
| Type 1 | 1337 | 0.7723 | 0.3662 | 0.9115 | 0.9681 |
| Type 2 | 22 753 | 0.0801 | 0.6358 | 0.9039 | 0.9707 |
| Type 3 | 13 358 | 0.8818 | 0.5348 | 0.9359 | 0.9710 |
| Type 4 | 209 | 0.0574 | 0.0409 | 0.9153 | 0.9665 |
| Type 5 | 1295 | 0.7560 | 0.2614 | 0.9214 | 0.9786 |
| Type 6 | 28 170 | 0.3436 | 0.4097 | 0.9094 | 0.9672 |
| Type 7 | 2379 | 0.8548 | 0.3332 | 0.9280 | 0.9753 |
| Type 8 | 692 | 0.4336 | 0.0854 | 0.9193 | 0.9693 |

We conducted the following experiments using Type 3 mutual information. To validate the necessity of each processing step, we designed six variants of DSPHI for ablation experiments: |$DSPHI$|-|$No$| uses no mutual information; |$DSPHI$|-|$Al$| uses all Type 3 mutual information; |$DSPHI$|-|$Cl$| uses the mutual information after purification and filtering; |$DSPHI$|-|$Me$| uses the integrated mutual information; and |$DSPHI$|-|$P$| and |$DSPHI$|-|$H$| use only phage mutual information and only host mutual information, respectively. The experimental results are shown in Table 4. They indicate that both purifying and integrating the mutual information are essential. Moreover, phage mutual information has a greater impact on the results than host mutual information. We believe this may be related to the highly specific recognition characteristics of phages: phages that can infect multiple hosts are rare.

Table 4

The prediction results based on mutual information

| DSPHI-mutual | AUPR | AUC | ACC |
| --- | --- | --- | --- |
| DSPHI-No | 0.3197 | 0.8937 | 0.9710 |
| DSPHI-Al | 0.3240 | 0.9359 | 0.9710 |
| DSPHI-Cl | 0.3269 | 0.9396 | 0.9782 |
| DSPHI-Me | 0.3221 | 0.9386 | 0.9810 |
| DSPHI-P | 0.3465 | 0.9417 | 0.9800 |
| DSPHI-H | 0.3121 | 0.9086 | 0.9780 |
| DSPHI | 0.3643 | 0.9439 | 0.9839 |

In the experiments, we set two hyperparameters, |$simw$| and |$simz$|, for the mutual information of phages and hosts, respectively. These hyperparameters adjust the edge weights between virtual nodes and real nodes. As shown in Fig. 7, the AUC increases as |$simw$| approaches 1 and |$simz$| approaches 0. Other results are presented in the appendix.

Figure 7

The impact of hyperparameters |$simz$| and |$simw$|⁠.

The impact of network construction methods

This section presents the ablation experiment on the network construction method. Because second-order logical relationships are involved, the traditional approach would be to construct a hypergraph. To isolate the effect of the virtual nodes, we compared DSPHI with a hypergraph-based variant, |$DSPHI$|-|$Hyper$|. As shown in Table 5, modeling the relationships with virtual nodes yields a markedly higher AUPR and AUC, while the ACC of the two variants is nearly identical.

Table 5

Prediction results based on the network construction method

| DSPHI-network | AUPR | AUC | ACC |
| --- | --- | --- | --- |
| DSPHI-Hyper | 0.3295 | 0.8958 | 0.9855 |
| DSPHI | 0.3643 | 0.9439 | 0.9839 |
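The node-based construction can be illustrated with a small adjacency matrix. In this sketch, each mutual-information relation becomes one extra (virtual) node that is linked to the real nodes it involves with a fixed edge weight, so that an ordinary graph convolution can aggregate over it without hyperedge machinery. The function name, the relation format, and the toy values are assumptions made for illustration, not the DSPHI implementation.

```python
def add_virtual_nodes(adjacency, relations, weight):
    """Append one virtual node per relation to a square adjacency matrix.

    adjacency: n x n list of lists for the real nodes.
    relations: list of tuples of real-node indices that share a relation.
    weight:    edge weight between a virtual node and its member nodes.
    """
    n, m = len(adjacency), len(relations)
    size = n + m
    # Copy the real-node block and pad with zeros for the virtual nodes.
    extended = [list(row) + [0.0] * m for row in adjacency]
    extended += [[0.0] * size for _ in range(m)]
    for offset, members in enumerate(relations):
        v = n + offset
        for node in members:
            extended[v][node] = weight
            extended[node][v] = weight
    return extended


# Three real phage nodes from a sparsified similarity network (toy values).
phage_adjacency = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# One hypothetical relation linking phages 0 and 2.
network = add_virtual_nodes(phage_adjacency, [(0, 2)], weight=0.9)
```

Phages 0 and 2 share no direct similarity edge in this toy network, yet they become two-hop neighbors through the virtual node, which is how the second information view reaches the aggregation step.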

The case study

Case study for mutual information

We ranked the hosts by the number of associated phages and selected the top five bacteria, |$Clostridioides difficile$|, |$Lactococcus lactis$|, |$Salmonella enterica$|, |$Klebsiella pneumoniae$|, and |$Escherichia coli$|, to display their mutual information, as shown in Fig. 8. For each host, at least half of its phages can be found in the corresponding mutual information. This means that mining mutual information can uncover the co-occurrence of phages that share a host, which is highly advantageous for information aggregation in the model.

Figure 8

The mutual information of phages with respect to the phages corresponding to the top five bacteria.

Case study for DSPHI

To further assess the performance of DSPHI in predicting PHI, we simulated newly discovered phages and hosts: we removed the known association information of a selected phage or host in turn, treated it as newly discovered, and examined the model's predictions. We selected two phages, |$phage R18C$| and |$Escherichia 1720a-02$|, as shown in Table 8, and two hosts, |$Klebsiella pneumoniae$| and |$Escherichia coli$|, as shown in Tables 6 and 7. The host range of |$phage R18C$| comprises three hosts, and that of |$Escherichia 1720a-02$| two. For each phage, we report the seven hosts with the highest predicted probabilities; for each host, we report the top 20 predicted phages. The results show that DSPHI predicts well for both new phages and new hosts, especially when predicting hosts for phages.

Table 6

The top 20 results of DSPHI's predictions for |$Klebsiella pneumoniae$|

| Top | Phage | PMID | Top | Phage | PMID |
| --- | --- | --- | --- | --- | --- |
| 1 | Klebsiella phage RAD2 | 34431721 | 11 | Klebsiella phage vB_KpnS ZX4 | 26008965 |
| 2 | Klebsiella phage ST11-VIM1phi8.2 | 31752386 | 12 | Taipeivirus menlow | 31023815 |
| 3 | Klebsiella phage vB_1086 | 35985450 | 13 | Delmidovirus copri | N/A |
| 4 | Klebsiella phage vB_KleM KB2 | 37330608 | 14 | Webervirus mezzogao | 29122857 |
| 5 | Klebsiella phage vB_KpnM_17-11 | 35865823 | 15 | Drulisvirus altogao | 29122857 |
| 6 | Drulisvirus minorna | N/A | 16 | Diorhovirus copri | N/A |
| 7 | Klebsiella phage vB_KpnP-Bp5 | 33631221 | 17 | Webervirus sin4 | 31558644 |
| 8 | Lastavirus sopranogao | 29122857 | 18 | Efquatrovirus SHEF4 | N/A |
| 9 | Pylasvirus pylas | 31727721 | 19 | Webervirus sweeny | 31558643 |
| 10 | Taipeivirus may | 31072899 | 20 | Yonseivirus seifer | 31727722 |

Table 7

The top 20 results of DSPHI's predictions for |$Escherichia coli$|

| Top | Phage | PMID | Top | Phage | PMID |
| --- | --- | --- | --- | --- | --- |
| 1 | Escherichia phage vB_EcoP-ZX5 | 36558779 | 11 | Escherichia phage Schickermooser | 31109012 |
| 2 | Escherichia phage vB_EcoM-P10 | 34966369 | 12 | Tequatrovirus T4 | 26081634 |
| 3 | Escherichia phage vB_EcoM_C2-3 | 34762992 | 13 | Traversvirus II | 12813092 |
| 4 | Escherichia coli phage phiStx2k | 38078984 | 14 | Goslarvirus goslar | 31109012 |
| 5 | Escherichia phage vB_EcoS-PJ16 | 37632591 | 15 | Inovirus M13 | 5257006 |
| 6 | Escherichia phage vB_EcoM-RPN242 | 35598209 | 16 | Kuttervirus SenALZ1 | N/A |
| 7 | Jahgtovirus intestinalis | N/A | 17 | Escherichia phage OSYSP | 38182094 |
| 8 | Escherichia phage TL-2011b | 22403614 | 18 | Kayfunavirus SH4 | N/A |
| 9 | Enterobacteria phage P4 | 7483254 | 19 | Escherichia phage vB_EcoM-Alf5 | 28522702 |
| 10 | Escherichia phage SRT7 | 30762120 | 20 | Warwickvirus tunus | 32899836 |

Table 8

The top 7 results of DSPHI's predictions for phage R18C and Escherichia 1720a-02

| Phage | Predicted host | Evidence (PMID) |
| --- | --- | --- |
| phage R18C | Shigella sonnei | 31641840 |
| | Erwinia amylovora | N/A |
| | Citrobacter rodentium | 31641840 |
| | Citrobacter koseri | N/A |
| | Citrobacter portucalensis | N/A |
| | Shigella flexneri | N/A |
| | Escherichia coli | 31641840 |
| Escherichia 1720a-02 | Citrobacter freundii | N/A |
| | Escherichia coli | 32761142 |
| | Citrobacter koseri | N/A |
| | Citrobacter rodentium | 34929548 |
| | Escherichia sp. | N/A |
| | Escherichia fergusonii | N/A |
| | Shigella sonnei | N/A |
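The held-out evaluation used in this case study amounts to masking one entity's known associations, scoring all candidates, and inspecting the top-ranked ones. A minimal sketch follows; the helper names, host labels, association matrix, and scores are all hypothetical and stand in for the real data and the trained model.

```python
def mask_phage(associations, phage):
    """Return a copy of the association matrix with one phage's row zeroed,
    so that the phage looks newly discovered during training."""
    return [[0] * len(row) if i == phage else list(row)
            for i, row in enumerate(associations)]


def top_k(scores, names, k):
    """Return the k names with the highest scores, best first."""
    ranked = sorted(zip(scores, names), key=lambda pair: -pair[0])
    return [name for _, name in ranked[:k]]


# Toy association matrix: rows are phages, columns are hosts.
hosts = ["host_a", "host_b", "host_c", "host_d"]
associations = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
]
training_matrix = mask_phage(associations, phage=0)

# Stand-in for the model's output after training on training_matrix.
predicted_scores = [0.91, 0.12, 0.78, 0.40]
predictions = top_k(predicted_scores, hosts, k=2)

# Known hosts of the masked phage that reappear among the top predictions.
recovered = [host for host, known in zip(hosts, associations[0])
             if known and host in predictions]
```

Predictions that are not among the masked associations are the ones worth checking against the literature, which is the role of the PMID columns in Tables 6 to 8.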

Discussion

The widespread use of antibiotics has driven the evolution of antibiotic-resistant bacteria. Phages, in contrast, are highly specific: they can target and infect particular bacterial strains without affecting other microorganisms. Phage therapy therefore has the potential to reduce the over-reliance on antibiotics, thereby lowering the selective pressure that favors resistant bacterial strains. More importantly, phages co-evolve with their bacterial hosts, a characteristic that antibiotics cannot match.

This study introduces a richer type of semantic information, mutual information, between phages and hosts for PHI prediction tasks, and designs a dual-view sparse network model that integrates mutual information, sequence information, and association information. Mutual information is high-order information calculated directly from the sample environment, which overcomes the small scale and incompleteness of biological information databases. Considering the sparsity of validated PHI relationships, we construct sparse similarity networks so that feature learning focuses on local features. To fuse the two views, we integrate high-order mutual information with sequence information in the form of nodes. Compared to traditional hypergraphs, this construction reduces the complexity of the network and simplifies information aggregation. It is worth mentioning that, while annotating the metagenomic data, we found the virus databases of existing analysis tools to be relatively small. We therefore expanded the database based on validated phage–host association information to facilitate its use by future researchers. Experimental results show that mutual information is a new and effective type of semantic information and that sparsification is an effective processing step; both improve the model's prediction performance. Moreover, representing high-order information as nodes is better suited to PHI prediction tasks than representing it as hyperedges.

In our experiments, DSPHI improved the AUC by |$3\%$| compared to the latest models. Even with network sparsification alone, its predictive performance already exceeded that of most existing models. Because the mutual information is calculated from metagenomic data, it can readily be computed for new viruses as they are discovered in such data. This capability significantly enhances the efficiency of studying interactions between newly identified viruses and their hosts.

Although the proposed model achieved good results, some challenges remain. First, when mining mutual information from metagenomic data, we found that the first-order logical relationships contained too much redundant information; the mining method needs to be improved in this respect, which will be the focus of our next work. Second, when using sequence information, we considered only gene information and not protein information, even though protein information can be made complete and can cover all phages and hosts.

We are also considering whether this way of mining useful information from metagenomic data applies to other bioinformatics problems, such as predicting interactions between microorganisms. This is a question we will pursue in the long term.

Key Points
  • This study proposes a dual-view sparse network model for predicting phage–host interaction relationships, which integrates mutual information, sequence information, and association information.

  • This study uses a probabilistic logic method to obtain richer, high-order semantic information (mutual information) between phages and hosts from metagenomic data for phage–host interaction prediction tasks.

  • When constructing the similarity network based on sequence information, considering the sparsity of validated phage–host relationship information, we sparsify the network.

  • When integrating high-order mutual information and first-order sequence information, we propose to represent high-order mutual information as nodes. Compared to hypergraphs, this approach is more conducive to information aggregation.

  • In annotating metagenomic data, we expand the database of the analysis tool Kraken2 based on validated phage–host interaction information.

Acknowledgments

We express our deepest gratitude to the authors of all the code and data cited in this paper, especially Feiyue Sun and Jianqiang Sun for their GCNAT method, which greatly inspired our work. We also thank the reviewers for their helpful and constructive comments.

Conflict of interest: None declared.

Funding

This work was partially supported by the National Natural Science Foundation of China (62372205, 62472192 and 61932008), the National Language Commission Key Research Project (ZDI145-56), the Fundamental Research Funds for Central Universities (KJ02502022-0450), the Natural Science Foundation of Hubei Province of China (2022CFB289), and the Self-determined Research Funds of CCNU from the Colleges’ Basic Research and Operation of MOE (CCNU24JC032).

Code and data availability

We utilized a publicly available dataset of phage–host associations (PHD) and a publicly available metagenomic dataset. The PHD dataset is updated daily and can be accessed at http://phdaily.info or http://combio.pl/phdaily. The metagenomic dataset is labeled as PRJNA422434 and comprises 370 human gut samples. It is accessible at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422434. Related code, dataset splits, and trained models can be found at https://github.com/Ankang-Wei/DSPHI.

Author contributions statement

Ankang Wei: contributed to the methodological ideas of DSPHI, conducted experimental data analysis, visualized the results, and drafted the manuscript. Huanghan Zhan: contributed to data organization and analysis, and drafted the manuscript. Zhen Xiao: contributed to the organization and analysis of real data, and drafted the manuscript. Weizhong Zhao: contributed to the methodological ideas of DSPHI and drafted the manuscript. Xingpeng Jiang: contributed to the methodological ideas, biological insights, and interpretations of DSPHI, and drafted the manuscript. All authors have read and approved the final manuscript.

References

1

Samson
 
JE
,
Magadán
 
AH
,
Sabri
 
M
. et al. .  
Revenge of the phages: defeating bacterial defences
.
Nat Rev Microbiol
 
2013
;
11
:
675
87
. .

2

Hampton
 
HG
,
Watson
 
BNJ
,
Fineran
 
PC
.
The arms race between bacteria and their phage foes
.
Nature
 
2020
;
577
:
327
36
. .

3

Chevallereau
 
A
,
Pons
 
BJ
,
van Houte
 
S
. et al. .  
Interactions between bacterial and phage communities in natural environments
.
Nat Rev Microbiol
 
2022
;
20
:
49
62
. .

4

Levin
 
BR
,
Bull
 
JJ
.
Population and evolutionary dynamics of phage therapy
.
Nat Rev Microbiol
 
2004
;
2
:
166
73
. .

5

Nick
 
JA
,
Dedrick
 
RM
,
Gray
 
AL
. et al. .  
Host and pathogen response to bacteriophage engineered against mycobacterium abscessus lung infection
.
Cell
 
2022
;
185
:
1860
1874.e12
. .

6

Salmond
 
GPC
,
Fineran
 
PC
.
A century of the phage: past, present and future
.
Nat Rev Microbiol
 
2015
;
13
:
777
86
. .

7

Madden
 
T
.
The blast sequence analysis tool
.
The NCBI handbook
2013;
2
:425–36.

8

Congyu
 
L
,
Zhang
 
Z
,
Cai
 
Z
. et al. .  
Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics
.
BMC Biol
 
2021
;
19
:
1
11
.

9

Wang
 
W
,
Ren
 
J
,
Tang
 
K
. et al. .  
A network-based integrated framework for predicting virus–prokaryote interactions
.
NAR Genom Bioinform
 
2020
;
2
:lqaa044. .

10

Li
 
M
,
Zhang
 
W
.
PHIAF: prediction of phage-host interactions with Gan-based data augmentation and sequence-based feature fusion
.
Brief Bioinform
 
2022
;
23
:
bbab348
. .

11

Coutinho
 
FH
,
Zaragoza-Solas
 
A
,
López-Pérez
 
M
. et al. .  
RaFAH: host prediction for viruses of bacteria and archaea based on protein content
.
Patterns
 
2021
;
2
:
100274
. .

12

Amgarten
 
D
,
Iha
 
BKV
,
Piroupo
 
CM
. et al. .  
vHulk, a new tool for bacteriophage host prediction based on annotated genomic features and neural networks
.
PHAGE
 
2022
;
3
:
204
12
. .

13

Shang
 
J
,
Sun
 
Y
.
Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning
.
BMC Biol
 
2021
;
19
:
1
15
. .

14

Zhang
 
Y-z
,
Liu
 
Y
,
Bai
 
Z
. et al. .  
Zero-shot-capable identification of phage–host relationships with whole-genome sequence representation by contrastive learning
.
Brief Bioinform
 
2023
;
24
:bbad239. .

15

Galiez
 
C
,
Siebert
 
M
,
Enault
 
F
. et al. .  
WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs
.
Bioinformatics
 
2017
;
33
:
3113
4
. .

16

Zielezinski
 
A
,
Deorowicz
 
S
,
Gudyś
 
A
.
PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences
.
Bioinformatics
 
2022
;
38
:
1447
9
. .

17

Shang
 
J
,
Sun
 
Y
.
CHERRY: a computational method for accurate prediction of virus–prokaryotic interactions using a graph encoder–decoder model
.
Brief Bioinform
 
2022
;
23
:
bbac182
. .

18

Pan
 
J
,
You
 
W
,
Xiaoliang
 
L
. et al. .  
GSPHI: a novel deep learning model for predicting phage-host interactions via multiple biological information
.
Comput Struct Biotechnol J
 
2023
;
21
:
3404
13
. .

19

Mihara
 
T
,
Nishimura
 
Y
,
Shimizu
 
Y
. et al. .  
Linking virus genomes with host taxonomy
.
Viruses
 
2016
;
8
:
66
. .

20

Coclet
 
C
,
Roux
 
S
.
Global overview and major challenges of host prediction methods for uncultivated phages
.
Curr Opin Virol
 
2021
;
49
:
117
26
. .

21

Edwards
 
RA
,
McNair
 
K
,
Faust
 
K
. et al. .  
Computational approaches to predict bacteriophage–host relationships
.
FEMS Microbiol Rev
 
2016
;
40
:
258
72
. .

22

Ma
 
Y
,
You
 
X
,
Mai
 
G
. et al. .  
A human gut phage catalog correlates the gut phageome with type 2 diabetes
.
Microbiome
 
2018
;
6
:
1
12
. .

23

Blaisdell
 
BE
,
Campbell
 
AM
,
Karlin
 
S
.
Similarities and dissimilarities of phage genomes
.
Proc Natl Acad Sci
 
1996
;
93
:
5854
9
. .

24

León-Félix
 
J
,
Villicaña
 
C
.
The impact of quorum sensing on the modulation of phage-host interactions
.
J Bacteriol
 
2021
;
203
:
10
1128
. .

25

Versoza
 
CJ
,
Pfeifer
 
SP
.
Computational prediction of bacteriophage host ranges
.
Microorganisms
 
2022
;
10
:149. .

26

Boeckaerts
 
D
,
Stock
 
M
,
Criel
 
B
. et al. .  
Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins
.
Sci Rep
 
2021
;
11
:
1467
. .

27

Zhang
 
Z
,
Fen
 
Y
,
Zou
 
Y
. et al. .  
Phage protein receptors have multiple interaction partners and high expressions
.
Bioinformatics
 
2020
;
36
:
2975
9
. .

28

Díaz-Muñoz
 
SL
,
Sanjuán
 
R
,
West
 
S
.
Sociovirology: conflict, cooperation, and communication among viruses
.
Cell Host Microbe
 
2017
;
22
:
437
41
. .

29

Segredo-Otero
 
E
,
Sanjuán
 
R
.
Cooperative virus-virus interactions: an evolutionary perspective
.
BioDesign Res
 
2022
;
2022
:9819272. .

30

Albrycht K, Rynkiewicz AA, Harasymczuk M, et al. Daily reports on phage-host interactions. Front Microbiol 2022;13:946070.

31

Benson DA, Cavanaugh M, Clark K, et al. GenBank. Nucleic Acids Res 2012;41:D36–42.

32

Pruitt KD, Brown GR, Hiatt SM, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res 2014;42:D756–63.

33

Gao NL, Zhang C, Zhang Z, et al. MVP: a microbe–phage interaction database. Nucleic Acids Res 2018;46:D700–7.

34

Qin J, Li Y, Cai Z, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012;490:55–60.

35

Guo J, Bolduc B, Zayed AA, et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 2021;9:1–13.

36

Zhao G, Wu G, Lim ES, et al. VirusSeeker, a computational pipeline for virus discovery and virome composition analysis. Virology 2017;503:21–30.

37

Ren J, Song K, Deng C, et al. Identifying viruses from metagenomic data using deep learning. Quant Biol 2020;8:64–77.

38

Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol 2019;20:1–13.

39

Jurtz VI, Villarroel J, Lund O, et al. MetaPhinder: identifying bacteriophage sequences in metagenomic data sets. PLoS One 2016;11:e0163111.

40

Ho SFS, Wheeler NE, Millard AD, et al. Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data. Microbiome 2023;11:1–15.

41

McIver LJ, Abu-Ali G, Franzosa EA, et al. BioBakery: a meta’omic analysis environment. Bioinformatics 2018;34:1235–7.

42

Lu J, Breitwieser FP, Thielen P, et al. Bracken: estimating species abundance in metagenomics data. PeerJ Comput Sci 2017;3:e104.

43

Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30:2114–20.

44

Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9:357–9.

45

Bowers PM, Cokus SJ, Eisenberg D, et al. Use of logic relationships to decipher protein network organization. Science 2004;306:2246–9.

46

Sun F, Sun J, Zhao Q. A deep learning method for predicting metabolite–disease associations via graph neural network. Brief Bioinform 2022;23:bbac266.

47

Li M, Wang Y, Li F, et al. A deep learning-based method for identification of bacteriophage-host interaction. IEEE/ACM Trans Comput Biol Bioinform 2020;18:1801–10.

48

Yang H, Ding Y, Tang J, et al. Inferring human microbe–drug associations via multiple kernel fusion on graph neural network. Knowl-Based Syst 2022;238:107888.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Supplementary data