Abstract

Motivation

Recent computational approaches for predicting phage–host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information or the structure-informed signals related to host specificity.

Results

We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7%–9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5%–6% increase over BLASTp.

Availability and implementation

The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct.

1 Introduction

Described as a silent pandemic, antimicrobial resistance is among the foremost threats to public health. If left unaddressed, it is estimated that, by 2050, the number of deaths linked to antimicrobial resistance will reach 10 million annually and the global economic cost will exceed $100 trillion (Tang et al. 2023, Walsh et al. 2023). As alternatives to conventional antibiotics, phages have recently been used in phage therapy to treat bacterial infections (Górski et al. 2020, Hitchcock et al. 2023) and in phage steering to re-sensitize pathogens to antibiotics (Ashworth et al. 2024). The bactericidal action of phages has also been leveraged in various settings outside the clinical domain, including crop protection (Stone et al. 2019), oilfield reservoir souring prevention (Pannekens et al. 2019), and corrosion mitigation (Harada et al. 2018).

Identifying phage–host pairs is crucial in actualizing these applications; however, in vitro experiments can be time-consuming and expensive. In order to accelerate the shortlisting of candidate pairs, several in silico methods that capitalize on alignment-based and alignment-free techniques have been developed (Iuchi et al. 2023, Nie et al. 2024). Alignment-based approaches (Coutinho et al. 2021, Shang and Sun 2021, Zielezinski et al. 2021) focus on finding shared genomic regions between phages and putative hosts, whereas alignment-free approaches (Leite et al. 2018, Boeckaerts et al. 2021, Li et al. 2021, Lu et al. 2021, Li and Zhang 2022, Zhou et al. 2022) hinge on features such as sequence composition and co-abundance profiles of phages and their hosts.

Among the key determinants of phage–host specificity are receptor-binding proteins or RBPs (Dowah and Clokie 2018, Klumpp et al. 2023). Located at the distal end of phages, RBPs, such as tailspike proteins, adsorb to receptors found on the surface of the host bacteria. As such, these proteins have also been engineered to reprogram the target spectrum of phage-based antibacterials (Dams et al. 2019, Yehl et al. 2019, Zampara et al. 2020), thereby facilitating novel therapeutic applications.

In our previous work (Gonzales et al. 2023), we explored the use of representation learning to convert RBPs into dense vector representations (embeddings) without the need for manual feature engineering. To generate the embeddings, we explored different protein language models, such as ESM-1b (Rives et al. 2021), ProtT5 (Elnaggar et al. 2022), and SeqVec (Heinzinger et al. 2019). These models were pretrained in a self-supervised fashion on sequences from large-scale protein databases (Apweiler et al. 2004, Suzek et al. 2007, Steinegger and Söding 2018, Steinegger et al. 2019) and have been shown to capture relevant physicochemical characteristics (Heinzinger et al. 2019, Rives et al. 2021, Elnaggar et al. 2022, Lin et al. 2023). We found that this approach presents performance improvements over using manually feature-engineered sequence properties.

From a biological perspective, additional benefit may be gained by incorporating structure information, as underscored by its role in determining function and elucidating protein-protein interaction signatures, such as complementary binding surfaces and molecular forces (Ma et al. 2003, Planas-Iglesias et al. 2013). In the particular context of phage–host interaction, recent studies have investigated the structure of RBPs to obtain insights into the mechanisms underlying host recognition and binding machinery. The structural diversity of these proteins has been found to reflect the evolutionary pressure to adapt to the surface proteins of bacteria occupying the same ecological niche (Hawkins et al. 2022, Cambillau and Goulet 2023, Degroux et al. 2023).

Indeed, we found that phages infecting the same host genus have RBPs that are structurally more similar than those infecting different host genera, particularly at low sequence similarity settings. To see this, we paired RBPs targeting hosts from the ESKAPEE genera, which are among the leading causes of nosocomial infections worldwide (Venkateswaran et al. 2023). The pairing was done such that the sequence similarity between the RBPs in each pair is below 40%. Using US-align (Zhang et al. 2022), we then computed the root mean square deviation (RMSD) between their ColabFold (Mirdita et al. 2022)-predicted structures. Afterwards, for every genus, we constructed two groups, each with 500 randomly sampled RBP pairs. In the first group, both RBPs in each pair target the same genus of interest. In the second group, one of the RBPs in each pair targets the genus of interest, while the other RBP targets a different genus. Performing a Mann–Whitney U test showed that, for all but one genus, the difference between the distributions of the RMSD scores in these two groups is statistically significant at a P-value cutoff of 0.05 (Supplementary Table S7).
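The per-genus comparison above can be sketched in pure Python; the RMSD values below are placeholders standing in for the US-align outputs, and a normal approximation is used in place of the exact Mann–Whitney U test (for real analyses, scipy.stats.mannwhitneyu is preferable):

```python
import math

def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic with a normal-approximation two-sided p-value."""
    n1, n2 = len(xs), len(ys)
    # U counts, over all cross pairs, how often an x is smaller than a y
    # (ties contribute 0.5)
    u = sum((x < y) + 0.5 * (x == y) for x in xs for y in ys)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u, p

# Placeholder RMSDs (Å): same-genus RBP pairs vs. different-genus RBP pairs
same_genus = [2.1, 3.4, 2.8, 4.0, 3.1]
diff_genus = [5.2, 6.1, 4.9, 7.3, 5.8]
u, p = mann_whitney_u(same_genus, diff_genus)
print(p < 0.05)  # the two RMSD distributions differ significantly
```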

These findings point towards the utility of integrating structure information in predicting phage–host interaction. A limitation of sequence-only protein language models (i.e. those pretrained only on sequences, without explicitly involving structure data) is that they do not directly capture interactions between residues that are in close proximity when the 3D conformation is considered. This has motivated the development of structure-aware protein language models such as ProstT5 (Heinzinger et al. 2024), SaProt (Su et al. 2024), and PST (Chen et al. 2024), which adopt a custom vocabulary for encoding structure information (Heinzinger et al. 2024, Su et al. 2024) or augment the architecture of an existing sequence-only model in order to inject structure bias into the output embeddings (Chen et al. 2024). While structure-aware embeddings have been used for downstream benchmark tasks such as thermostability, metal ion binding, and protein function prediction (Chen et al. 2024, Heinzinger et al. 2024, Su et al. 2024), their application to computationally predicting phage–host pairs remains unexplored.

In this paper, we introduce PHIStruct, a deep-learning model that predicts phage–host interaction from structure-aware embeddings of receptor-binding proteins. We focused our scope on hosts belonging to the ESKAPEE genera (Enterococcus, Staphylococcus, Klebsiella, Acinetobacter, Pseudomonas, Enterobacter, and Escherichia), which include bacteria that are known to exhibit multidrug resistance (Ayobami et al. 2022, Kritsotakis et al. 2022) and are among the priority pathogens identified by the World Health Organization (2024) and in the Antimicrobial resistance surveillance in Europe 2023 - 2021 data report (2023). Our contributions are as follows:

  • We constructed a dataset of protein structures, computationally predicted via ColabFold (Mirdita et al. 2022), of 7627 nonredundant (i.e. with duplicates removed) receptor-binding proteins from 3350 phages that target hosts from the ESKAPEE genera.

  • We fed the ColabFold-predicted structures to the structure-aware protein language model SaProt (Su et al. 2024) in order to generate embeddings of the receptor-binding proteins. We then trained a two-hidden-layer perceptron that takes in these structure-aware embeddings as input and predicts the host genus.

  • We found that our model, PHIStruct, presents improvements over state-of-the-art tools that take in sequence-only protein embeddings and feature-engineered genomic and protein sequence properties, as well as over BLASTp, especially as the sequence similarity between the training and test set entries decreases. Further evaluation highlights its ability to make high-confidence predictions without a heavy trade-off between precision and recall. When the sequence similarity drops below 40% and the confidence threshold is set to above 50%, our tool outperforms machine learning tools that do not directly integrate structure information by 7%–9% and BLASTp by 5%–6% in terms of class-averaged F1.

2 Materials and methods

An overview of our methodology is provided in Fig. 1. The steps are described in detail in the subsequent subsections.

Figure 1. Methodology. Step 1: We collected phage genome and protein sequences from GenBank (Benson et al. 2007) using INPHARED (Cook et al. 2021). Step 2: Receptor-binding proteins (RBPs) were identified following the methodology in our previous work (Gonzales et al. 2023). Step 3: We fed the RBP sequences to ColabFold (Mirdita et al. 2022) to predict their structures. Step 4: The proteins, alongside their predicted structures, were fed to SaProt. For a protein of length n, the input to SaProt is ⟨(r1, f1), (r2, f2), …, (rn, fn)⟩, where ⟨r1, r2, …, rn⟩ is the sequence representation and ⟨f1, f2, …, fn⟩ is the structure representation from Foldseek (van Kempen et al. 2024). SaProt outputs the structure-aware vector representations (embeddings). Step 5: In constructing our training and test sets, we partitioned our dataset with respect to different train-versus-test sequence similarity thresholds via CD-HIT (Fu et al. 2012). Step 6: We built a two-hidden-layer perceptron that takes in the SaProt embedding of an RBP as input and outputs the host genus from among the ESKAPEE genera. Step 7: We evaluated our model's performance.

2.1 Data collection

We collected 20 941 phage genome sequences, alongside their respective protein sequences (if provided), that were uploaded to GenBank before October 2023. We used INPHARED (Cook et al. 2021), a pipeline for downloading phage sequences and host genus information from GenBank (Benson et al. 2007). For phages where INPHARED did not return a host, we retrieved the isolation host information from their respective GenBank entries, if provided. Afterwards, we discarded phages with nonbacterial hosts, leaving a total of 17 790 phages across 273 host genera.

2.2 Receptor-binding protein identification

Among the 17 790 collected phage sequences, 16 627 already had genome annotations from GenBank. We ran the remaining 1163 sequences without annotations through Prokka (Seemann 2014), which serves as a wrapper for a two-stage pipeline of calling Prodigal (Hyatt et al. 2010) for gene prediction and PHROG (Terzian et al. 2021), a collection of viral protein families, for functional annotation.

With all the collected sequences annotated, we identified the receptor-binding proteins (RBPs). To this end, we matched the gene product annotations against a regular expression that we proposed in our previous work (Gonzales et al. 2023) based on the pattern given by Boeckaerts et al. (2022). Since this pattern also captures RBP-related annotations that are not RBPs per se (e.g. tailspike adaptor proteins), we referred to the exclusion list provided by Boeckaerts et al. (2022) to filter out these cases.

Moreover, in view of the possibility that some proteins tagged as hypothetical might be RBPs, we converted these proteins into embeddings using the protein language model ProtBert (Elnaggar et al. 2022) and then passed the resulting embeddings as input to the extreme gradient boosting model PhageRBPdetect (Boeckaerts et al. 2022) to predict whether the proteins are RBPs.
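The annotation-matching step can be sketched as follows; the regular expression and exclusion terms below are simplified, illustrative stand-ins for the actual pattern and exclusion list, which follow Boeckaerts et al. (2022):

```python
import re

# Illustrative stand-ins for the real RBP pattern and exclusion list
RBP_PATTERN = re.compile(r"tail\s?(spike|fib(er|re))|receptor[- ]binding",
                         re.IGNORECASE)
EXCLUDE_PATTERN = re.compile(r"adaptor|assembly|chaperone", re.IGNORECASE)

def is_rbp_annotation(product: str) -> bool:
    """Return True if a gene-product annotation looks like a bona fide RBP."""
    return bool(RBP_PATTERN.search(product)) and not EXCLUDE_PATTERN.search(product)

annotations = [
    "tail spike protein",
    "tailspike adaptor protein",   # RBP-related, but not an RBP per se
    "receptor-binding protein",
    "hypothetical protein",        # handled separately via PhageRBPdetect
]
print([a for a in annotations if is_rbp_annotation(a)])
```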

We subsequently discarded RBPs with outlying lengths, i.e. those with lengths outside the interval [Q1 − 1.5·IQR, Q3 + 1.5·IQR], where Q1, IQR, and Q3 refer to the first quartile, interquartile range, and third quartile of the RBP lengths, respectively. Finally, we removed duplicate RBPs using CD-HIT (Fu et al. 2012), a tool for clustering protein sequences.
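The length-based outlier filter can be sketched in a few lines; note that quartile conventions vary slightly between libraries (this sketch uses Python's default exclusive method):

```python
import statistics

def filter_outlying_lengths(lengths):
    """Keep only lengths inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(lengths, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in lengths if lo <= x <= hi]

lengths = [480, 510, 505, 495, 520, 2000]  # 2000 a.a. is an outlier
print(filter_outlying_lengths(lengths))    # the outlier is dropped
```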

2.3 Protein structure prediction

We fed the RBP sequences to ColabFold (Mirdita et al. 2022), a protein structure prediction tool that accelerates the inference time of AlphaFold 2 (Jumper et al. 2021) by building multiple sequence alignments using MMseqs2 (Steinegger and Söding 2017) instead of JackHMMer (Johnson et al. 2010). ColabFold predicts the x-, y-, and z-coordinates of all the heavy atoms (carbon, nitrogen, oxygen, and sulfur) of a given protein. Supplementary Table S1 lists the parameters at which we ran ColabFold.

Our dataset consists of the predicted structures of 7627 RBPs from 3350 phages targeting hosts from the ESKAPEE genera (Table 1). The distributions of the RBP lengths and ColabFold’s confidence scores per genus are plotted in Supplementary Figs S1 and S2. We computed the confidence score for each protein by averaging the predicted local distance difference test (pLDDT) scores across its residues; the pLDDT score is a superposition-free estimate of the agreement between a predicted structure and a reference structure (Jumper et al. 2021). For each of the ESKAPEE genera, the mean confidence score is above 77% and the median is above 80%.

Table 1.

Per-genus dataset characteristics.

                           RBP length (a.a.)       pLDDT score
Genus             Count     Mean     Median      Mean      Median
Enterococcus        170      704        620    77.57%      80.87%
Staphylococcus      521      510        481    82.74%      83.24%
Klebsiella         1535      671        658    82.00%      83.48%
Acinetobacter       369      599        607    77.80%      81.53%
Pseudomonas        1180      509        477    77.48%      79.50%
Enterobacter        388      545        512    81.26%      82.24%
Escherichia        3464      622        581    81.39%      82.42%

We also ran our pipeline on phages that target other host genera, thereby expanding our dataset of RBPs and their predicted structures to 19 081 RBPs from 8525 phages across 238 host genera. Supplementary Figure S3a shows the distribution of the RBP lengths; the mean length is 565 amino acids, and the median is 507 amino acids. Supplementary Figure S3b shows the distribution of ColabFold’s confidence scores in its predicted protein structures. The mean confidence score across all the proteins in our dataset is 78.45%, and the median is 81.69%.

2.4 Structure-aware embedding generation

In order to generate the vector representations (embeddings) of the RBPs, we employed SaProt, a structure-aware protein language model that adopts the architecture of ESM-2 (Lin et al. 2023) but with the embedding layer modified in view of its structure-aware alphabet. SaProt's alphabet attempts to capture both sequence and structure information by taking the Cartesian product of the set of all possible residues and Foldseek's (van Kempen et al. 2024) alphabet. Foldseek's alphabet consists of 20 letters that describe the tertiary interaction between residues and their nearest neighbors; this alphabet was learned via a vector quantized variational autoencoder (van den Oord et al. 2017). Formally, suppose we are given a protein of length n. Its sequence representation is ⟨r1, r2, r3, …, rn⟩, where ri is its ith residue. Foldseek encodes its structure information into an n-letter string ⟨f1, f2, f3, …, fn⟩. The input to SaProt is ⟨(r1, f1), (r2, f2), (r3, f3), …, (rn, fn)⟩. To obtain a fixed-length embedding, we averaged the last layer's hidden states over the sequence length, resulting in a 1280-dimensional dense vector representation (structure-aware embedding) for each RBP.
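The mean-pooling step can be sketched as follows; the per-residue hidden states here are random placeholders standing in for SaProt's last-layer outputs:

```python
import random

EMBED_DIM = 1280  # hidden size of the 650M-parameter SaProt/ESM-2 backbone

def mean_pool(hidden_states):
    """Average per-residue hidden states into one fixed-length embedding."""
    n = len(hidden_states)
    return [sum(h[d] for h in hidden_states) / n for d in range(EMBED_DIM)]

# Placeholder: a 300-residue protein, one 1280-dim hidden state per residue
protein = [[random.random() for _ in range(EMBED_DIM)] for _ in range(300)]
embedding = mean_pool(protein)
print(len(embedding))  # 1280
```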

We used the 650-million-parameter version of SaProt that was pretrained on 40 million structures from AlphaFold DB (Varadi et al. 2022).

2.5 Sequence similarity-based training and test set construction

We investigated model performance at different train-versus-test sequence similarity thresholds. A lower train-versus-test sequence similarity indicates that the sequences in the training set are more dissimilar to those in the test set.

To this end, we used CD-HIT (Fu et al. 2012) to group the RBPs into clusters at a similarity threshold s and assign a representative sequence per cluster. Let R be the set of representative sequences with class labels among the ESKAPEE genera. We split R into two sets D1 and D2, with 70% of the sequences assigned to D1 and the remaining 30% to D2; this splitting was stratified based on the class sizes. Clusters with representative sequences in D1 were assigned to the training set, while those with representative sequences in D2 were assigned to the test set. We also randomly sampled RBPs with class labels outside the ESKAPEE genera and added these to our test set; the number of RBPs sampled is equal to the size of the smallest ESKAPEE genus in the test set. The rationale for the inclusion of these RBPs is for the performance evaluation to reflect scenarios where our model is fed with an RBP from a phage infecting a host outside our genera of interest.
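A simplified sketch of this cluster-level assignment follows; cluster membership is assumed to come from CD-HIT, and the stratified 70/30 split of representatives is condensed:

```python
import random
from collections import defaultdict

def cluster_split(clusters, labels, train_frac=0.7, seed=0):
    """Assign whole clusters to train or test via their representatives.

    clusters: dict mapping representative id -> list of member RBP ids
    labels:   dict mapping representative id -> host genus
    """
    rng = random.Random(seed)
    # Stratify representatives by class so each genus keeps its proportions
    by_class = defaultdict(list)
    for rep, genus in labels.items():
        by_class[genus].append(rep)
    train, test = [], []
    for genus, reps in by_class.items():
        rng.shuffle(reps)
        cut = int(train_frac * len(reps))
        for rep in reps[:cut]:
            train.extend(clusters[rep])  # members follow their representative
        for rep in reps[cut:]:
            test.extend(clusters[rep])
    return train, test
```

Because every member of a cluster follows its representative, no test sequence exceeds the clustering similarity threshold s with respect to any training sequence.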

Finally, in order to mitigate class imbalance (Supplementary Table S2), we performed data augmentation via SMOTE-Tomek (Batista et al. 2003). Our training and test set statistics are reported in Table 2; the per-class breakdown is given in Supplementary Table S2.
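The core idea of SMOTE, generating synthetic minority-class samples by interpolating between real neighbors, can be sketched as below; this is a minimal illustration, not the full SMOTE-Tomek pipeline, which additionally removes Tomek links:

```python
import random

def smote_like_oversample(minority, n_new, seed=0):
    """Generate synthetic points on line segments between random pairs
    of minority-class embeddings (a simplified SMOTE step)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()  # position along the segment a -> b
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # toy 2D "embeddings"
print(len(smote_like_oversample(minority, n_new=5)))
```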

Table 2.

Training and test set statistics.

Sequence       Training set size      Training set size     Test set
similarity s   before SMOTE-Tomek     after SMOTE-Tomek     size
100%           5338                   16 942                2340
80%            4865                   14 822                2822
60%            4523                   13 099                3153
40%            4465                   11 662                3203

2.6 Classifier building

We defined phage–host interaction prediction as a multiclass classification task where the input is the SaProt embedding of a given RBP and the output is one of the ESKAPEE host genera.

To this end, we built a multilayer perceptron with two hidden layers (Fig. 2). The first and second hidden layers have 160 (one-eighth of the size of the input layer) and 80 neurons, respectively. We implemented L2 regularization and applied dropout with a rate of 0.2 after the first hidden layer. We employed rectified linear unit as the activation function, softmax as the output function, and cross-entropy as the loss function. We trained the model for 200 epochs with Adam (Kingma and Ba 2015) as the optimizer; other training hyperparameters are listed in Table 3. Hyperparameter tuning was done by performing grid search and optimizing the F1 on a validation set constructed by setting aside 10% of our training set; the hyperparameter search space is given in Supplementary Table S3. We call the resulting model PHIStruct.
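The architecture can be sketched as a plain forward pass; the weights below are random placeholders, whereas in the actual model they are learned with Adam, cross-entropy loss, L2 regularization, and dropout during training:

```python
import math
import random

rng = random.Random(0)

def dense(in_dim, out_dim):
    """Random placeholder weights and biases for one fully connected layer."""
    return ([[rng.gauss(0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)],
            [0.0] * out_dim)

def forward(x, layers):
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
             for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]      # ReLU on hidden layers
    exps = [math.exp(v - max(x)) for v in x]  # softmax output
    return [e / sum(exps) for e in exps]

GENERA = ["Enterococcus", "Staphylococcus", "Klebsiella", "Acinetobacter",
          "Pseudomonas", "Enterobacter", "Escherichia"]
# 1280 -> 160 -> 80 -> 7, matching PHIStruct's layer sizes
layers = [dense(1280, 160), dense(160, 80), dense(80, len(GENERA))]
probs = forward([0.1] * 1280, layers)
print(GENERA[probs.index(max(probs))])
```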

Figure 2. PHIStruct classifier architecture. The number below the label of each layer denotes the size of that layer.

Table 3.

PHIStruct training hyperparameters.

Hyperparameter                                          Value
L2 regularization strength                              10⁻⁴
Batch size                                              128
Learning rate                                           10⁻³
Exponential decay rate for the first moment vector      0.9
Exponential decay rate for the second moment vector     0.999

2.7 Performance evaluation

We evaluated the performance of our model in terms of class-averaged metrics: macro-precision, macro-recall, and macro-F1. To account for our model's confidence in its prediction, we parameterized these metrics based on a confidence threshold k. Let p1 and p2 be the class probabilities for the model's highest-ranked and its second-highest-ranked predictions, respectively. The input is classified under the highest-ranked prediction only if p1 − p2 ≥ k. In this regard, higher values of k prioritize precision, whereas lower values of k prioritize recall. We also parameterized the metrics based on the maximum train-versus-test sequence similarity s.
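The decision rule, classify only when the margin between the top two class probabilities reaches the threshold k, can be sketched as:

```python
def thresholded_prediction(probs, k):
    """Return the top class if its probability margin over the runner-up
    is at least k; otherwise abstain (return None)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top, p1), (_, p2) = ranked[0], ranked[1]
    return top if p1 - p2 >= k else None

probs = {"Klebsiella": 0.62, "Escherichia": 0.25, "Pseudomonas": 0.13}
print(thresholded_prediction(probs, k=0.5))  # abstains: margin 0.37 < 0.5
print(thresholded_prediction(probs, k=0.3))  # classifies as Klebsiella
```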

Let C be the set of ESKAPEE class labels, and let TP_{c,k,s}, TN_{c,k,s}, FP_{c,k,s}, and FN_{c,k,s} denote the number of true positive, true negative, false positive, and false negative classifications for class c ∈ C, respectively, at confidence threshold k and maximum train-versus-test sequence similarity s. We define the class-averaged (macro) metrics formally in Equations (1)–(3).

$$\text{macro-precision}_{k,s} = \frac{1}{|C|} \sum_{c \in C} \frac{TP_{c,k,s}}{TP_{c,k,s} + FP_{c,k,s}} \tag{1}$$

$$\text{macro-recall}_{k,s} = \frac{1}{|C|} \sum_{c \in C} \frac{TP_{c,k,s}}{TP_{c,k,s} + FN_{c,k,s}} \tag{2}$$

$$\text{macro-F1}_{k,s} = \frac{1}{|C|} \sum_{c \in C} \frac{2 \cdot TP_{c,k,s}}{2 \cdot TP_{c,k,s} + FP_{c,k,s} + FN_{c,k,s}} \tag{3}$$

We additionally evaluated our model's performance in terms of weighted metrics, with the weights corresponding to the class sizes. Let N refer to the total number of samples in the test set, and let n_{c,s} refer to the number of test samples under class c ∈ C at maximum train-versus-test sequence similarity s. We define the weighted metrics formally in Equations (4)–(6).

$$\text{weighted-precision}_{k,s} = \sum_{c \in C} \frac{n_{c,s}}{N} \cdot \frac{TP_{c,k,s}}{TP_{c,k,s} + FP_{c,k,s}} \tag{4}$$

$$\text{weighted-recall}_{k,s} = \sum_{c \in C} \frac{n_{c,s}}{N} \cdot \frac{TP_{c,k,s}}{TP_{c,k,s} + FN_{c,k,s}} \tag{5}$$

$$\text{weighted-F1}_{k,s} = \sum_{c \in C} \frac{n_{c,s}}{N} \cdot \frac{2 \cdot TP_{c,k,s}}{2 \cdot TP_{c,k,s} + FP_{c,k,s} + FN_{c,k,s}} \tag{6}$$
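Under these definitions, the macro metrics can be computed from per-class counts as in the following sketch (the counts below are illustrative, not from our experiments):

```python
def macro_metrics(counts):
    """counts: dict mapping class -> (TP, FP, FN).
    Returns (macro_precision, macro_recall, macro_f1)."""
    precs, recs, f1s = [], [], []
    for tp, fp, fn in counts.values():
        precs.append(tp / (tp + fp) if tp + fp else 0.0)
        recs.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    n = len(counts)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Illustrative per-class counts at some fixed k and s
counts = {"Klebsiella": (80, 10, 20), "Escherichia": (50, 25, 25)}
p, r, f1 = macro_metrics(counts)
```

The weighted variants differ only in replacing the uniform 1/|C| average with class-size weights n_{c,s}/N.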

3 Results and discussion

3.1 At lower train-versus-test sequence similarity, PHIStruct outperforms state-of-the-art machine learning tools that take in receptor-binding proteins as input, as well as sequence alignment-based tools

We benchmarked our tool, PHIStruct, against state-of-the-art machine learning tools that also map receptor-binding proteins to host bacteria: Boeckaerts et al.’s (2021) tool, PHIEmbed (Gonzales et al. 2023), and Badam and Rao’s (2024) tool. Boeckaerts et al.’s (2021) tool is a random forest model that takes in manually feature-engineered genomic and protein sequence properties. Our previous work, PHIEmbed, is a random forest model that takes in ProtT5 (Elnaggar et al. 2022) embeddings. Badam and Rao’s (2024) tool is a multilayer perceptron that takes in ESM-1b (Rives et al. 2021) embeddings. To ensure a fair comparison, we retrained each of them on our dataset. We also benchmarked PHIStruct against BLASTp, which we ran at an E-value of 0.05 and with BLOSUM62 as the scoring matrix, and against PSI-BLAST, which we ran for up to five iterations at an E-value of 0.05 and a profile inclusion threshold of 0.002; for these sequence alignment-based tools, the class label of the reported top hit was taken as the predicted host genus. We evaluated the tools’ performance across maximum train-versus-test sequence similarity thresholds s=40% to 100% in steps of 20% and across confidence thresholds k=0% to 90% in steps of 10%.

Our experiments showed that PHIStruct presents improvements over these tools, especially at low sequence similarity settings. Its performance gains were most pronounced at s=40% (Fig. 3). It registered a maximum macro-recall of 63.09%, achieving a margin of 17% over PSI-BLAST, 11% over PHIEmbed, 5% over BLASTp, and 4% over Badam and Rao’s (2024) tool and performing competitively with Boeckaerts et al.’s (2021) tool. It obtained a maximum macro-precision of 69.43%, outperforming Badam and Rao’s (2024) tool by 11%, BLASTp by 16%, and PSI-BLAST by 25%. Moreover, it recorded the highest maximum macro-F1 at 57.67%, outperforming Boeckaerts et al.’s (2021) tool by 2%, BLASTp by 6%, Badam and Rao’s (2024) tool by 8%, PHIEmbed by 14%, and PSI-BLAST by 18%.

Figure 3. Comparison of the performance of PHIStruct with state-of-the-art machine learning and sequence alignment-based tools that map receptor-binding proteins to host bacteria. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

Although Boeckaerts et al.'s (2021) tool and PHIEmbed had higher maximum macro-precision, this was accompanied by a marked loss in recall. This trade-off was less pronounced with PHIStruct, as reflected in the precision–recall curves (Fig. 3a and Supplementary Figs S4a–S6a) and macro-F1 scores (Fig. 3b and Supplementary Figs S4b–S6b). Across all the tested sequence similarity thresholds, the decrease in its macro-F1 from k=0% to 90% was under 2%. Furthermore, at k>50%, it outperformed the machine learning tool with the next-highest macro-F1, i.e. Badam and Rao's (2024) tool, by 7%–9%, BLASTp by 5%–6%, and PSI-BLAST by 17%–18%. These results highlight our tool's ability to make high-confidence predictions without a heavy trade-off between precision and recall.

PHIStruct also performed competitively at higher sequence similarity thresholds. At s=60%, it recorded a maximum macro-recall of 65.83%, macro-precision of 71.77%, and macro-F1 of 62.95%, outperforming the other machine learning tools in terms of maximum macro-recall and F1 and the sequence alignment-based tools in terms of maximum macro-precision (Supplementary Fig. S4). At s=80%, it obtained a maximum macro-recall of 67.08%, macro-precision of 69.10%, and macro-F1 of 63.73%, placing its maximum macro-recall within 1.6% of the top score (Supplementary Fig. S5). At s=100%, it reported a maximum macro-recall of 81.69%, macro-precision of 87.40%, and macro-F1 of 81.12%, placing its maximum macro-recall within 0.4% of the top-performing machine learning tool’s score and its maximum macro-F1 within 0.1% (Supplementary Fig. S6).

In addition, we evaluated our model’s performance when the metrics were weighted based on the class sizes and found that the results were generally similar to those computed with macro-averaged metrics. At s=40%, PHIStruct achieved the highest maximum weighted recall at 64.23%, outperforming PSI-BLAST by 22%, BLASTp by 9%, PHIEmbed by 18%, and Badam and Rao’s (2024) tool by 15%. It registered a maximum weighted precision of 74.50%, outperforming Badam and Rao’s (2024) tool by 4%, BLASTp by 7%, and PSI-BLAST by 17%. It also obtained a maximum weighted F1 of 63.41%, recording a margin of 4% over BLASTp, 12% over Badam and Rao’s (2024) tool, 14% over PHIEmbed, and 17% over PSI-BLAST. It performed competitively with Boeckaerts et al.’s (2021) tool, while also maintaining a more stable F1 (Supplementary Fig. S7).

Moving to the higher sequence similarity thresholds s=60% and 80%, PHIStruct registered the highest maximum weighted recall and F1 among the machine learning tools (Supplementary Figs S8 and S9). At s=100%, its maximum weighted recall and F1 were within 0.3% and 0.1%, respectively, of the top-performing machine learning tool’s scores (Supplementary Fig. S10).

These results demonstrate the utility of PHIStruct for improved prediction of phage–host pairs, especially in use cases where phages of interest have receptor-binding proteins with low sequence similarity to those of known phages. The per-genus results are reported in Supplementary Tables S4–S6.

3.2 Visualizing SaProt embeddings reveals local clusters corresponding to the ESKAPEE host genera

We visualized the SaProt embeddings using uniform manifold approximation and projection, or UMAP (McInnes et al. 2018). To this end, we ranked the components of the embeddings by their importance based on Shapley additive explanations (Lundberg and Lee 2017) and constructed a vector consisting of the top 25% of components with the highest importance, which we then projected onto a 2D space via UMAP. The rationale is to take into account both the input embeddings and the weights assigned by the downstream classifier. As seen in Fig. 4, the formation of local clusters corresponding to the ESKAPEE host genera can be observed. Supplementary Figure S11 shows the UMAP projections of the SaProt embeddings vis-à-vis those of other protein embeddings.
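The feature-selection step preceding the projection can be sketched as below; the importance scores are placeholders standing in for mean absolute SHAP values, and the 2D projection itself would then be done with the umap-learn package:

```python
def top_importance_components(embedding, importances, frac=0.25):
    """Keep the fraction of embedding components with the highest importance."""
    k = int(len(embedding) * frac)
    top = sorted(range(len(importances)), key=lambda i: importances[i],
                 reverse=True)[:k]
    top.sort()  # preserve the original component order
    return [embedding[i] for i in top]

embedding = [0.5, -1.2, 0.3, 2.0, -0.7, 0.9, 1.1, -0.4]     # toy 8-dim vector
importances = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.7, 0.6]     # placeholder SHAP
print(top_importance_components(embedding, importances))    # keeps 2 of 8
```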

Figure 4. Visualization of the SaProt embeddings using uniform manifold approximation and projection (UMAP). We projected the top 25% SaProt embedding components with the highest importance based on Shapley additive explanations.

3.3 Sequence information and structure information are complementary in improving performance, especially at lower train-versus-test sequence similarity

As discussed in Section 2.4: “Structure-aware embedding generation,” prior to the generation of the structure-aware embedding, SaProt requires a protein to be input as the sequence (r1, f1), (r2, f2), …, (rn, fn), where ri is the ith residue and fi is the corresponding structure (Foldseek) token.

To investigate the impact of incorporating structure information in this input step, we benchmarked different masking strategies: (i) masking all residue tokens, i.e. replacing all (ri,fi) with (#,fi), where # is a special token denoting the mask, (ii) masking all structure tokens, i.e. replacing all (ri,fi) with (ri,#), (iii) masking the structure tokens of low-confidence regions, and (iv) not masking any token (which we employed in PHIStruct). For the third masking strategy, low-confidence regions are defined as those having a pLDDT score below 70%, following the threshold in previous studies (Varadi et al. 2022, Su et al. 2024).
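The four strategies amount to simple substitutions over the (residue, structure-token) pairs. The sketch below represents each pair as a tuple and uses "#" as the mask; this is a simplification of SaProt's actual tokenizer, but the pLDDT cutoff of 70 follows the threshold stated above.

```python
MASK = "#"
PLDDT_CUTOFF = 70.0

def mask_pairs(pairs, plddt, strategy):
    """Apply one of the four benchmarked masking strategies.

    pairs: list of (residue, foldseek_token) tuples.
    plddt: per-residue confidence scores (used by 'mask_low_confidence').
    strategy: 'mask_residues' | 'mask_structure' | 'mask_low_confidence' | 'none'.
    """
    if strategy == "mask_residues":        # (i) keep only the structure signal
        return [(MASK, f) for r, f in pairs]
    if strategy == "mask_structure":       # (ii) keep only the sequence signal
        return [(r, MASK) for r, f in pairs]
    if strategy == "mask_low_confidence":  # (iii) drop unreliable structure tokens
        return [(r, MASK if p < PLDDT_CUTOFF else f)
                for (r, f), p in zip(pairs, plddt)]
    return list(pairs)                     # (iv) no masking (used in PHIStruct)

pairs = [("M", "d"), ("K", "v"), ("L", "p")]
plddt = [95.0, 60.0, 88.0]
print(mask_pairs(pairs, plddt, "mask_low_confidence"))
# -> [('M', 'd'), ('K', '#'), ('L', 'p')]
```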

Our experiments showed that including both sequence and structure tokens improves performance, especially at lower train-versus-test sequence similarity thresholds. At s=40%, the highest performance across all three metrics was obtained by not masking any token (Fig. 5). Masking all the structure tokens resulted in a 7%, 2%, and 8% decrease in maximum macro-recall, precision, and F1, respectively, whereas masking all the residue tokens resulted in a 22%, 26%, and 21% decrease in these aforementioned metrics. Not masking any token also returned the highest maximum macro-recall, precision, and F1 at s=60% (Supplementary Fig. S12) and the highest maximum macro-recall and F1 at s=80% (Supplementary Fig. S13). At s=100%, the highest performance across all three metrics was achieved by masking the structure tokens of low-confidence regions (Supplementary Fig. S14).

Moreover, we evaluated the performance when the metrics were weighted based on the class sizes and found that the results were generally similar to those computed with macro-averaged metrics. At s=40%, not masking any token resulted in the highest performance across all three metrics. Masking the structure tokens of low-confidence regions led to a 7% decrease in maximum weighted recall and a 5% decrease in both maximum weighted precision and F1. Masking all the structure tokens led to a 4% decrease in both maximum weighted recall and F1, with the maximum weighted precision staying relatively the same. Masking all the residue tokens led to a 19%, 17%, and 16% decrease in maximum weighted recall, precision, and F1, respectively (Supplementary Fig. S15). At s=60% and 80%, not masking any token returned the highest maximum weighted recall and F1 (Supplementary Figs S16 and S17). At s=100%, it recorded the highest maximum weighted precision, while the highest maximum weighted recall and F1 were achieved by masking the structure tokens of low-confidence regions; however, the difference in the performance between these two approaches was below 1% for all three metrics (Supplementary Fig. S18).

Figure 5. Comparison of the performance of different masking strategies for inputting proteins to SaProt. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

These results suggest that, as the sequence similarity between an RBP of interest and known RBPs decreases, the model benefits from complementing sequence information with structure-informed signals.

3.4 Differences in the receptor-binding protein embeddings in relation to the host genus may be contributing to PHIStruct’s discriminative power

To investigate the differences in the embeddings in relation to the host genus, we paired the RBPs such that the sequence similarity between the RBPs in each pair is below 40%. We then computed the cosine distance between the SaProt embeddings of the RBPs in each pair. Afterwards, for every genus, we constructed two groups, each with 500 randomly sampled RBP pairs. In the first group, both RBPs in each pair target the same genus of interest. In the second group, one of the RBPs in each pair targets the genus of interest, while the other RBP targets a different genus. Performing a Mann–Whitney U test showed that, for all but two genera, the difference between the distributions of the cosine distance values in these two groups is statistically significant at a P-value cutoff of 0.05 (Supplementary Table S8).
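The analysis above can be sketched with simulated embeddings standing in for the SaProt vectors (group sizes of 500 pairs follow the setup described; the vectors themselves are synthetic, with same-genus pairs made more similar by construction):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def cosine_distance(a, b):
    # 1 - cosine similarity between two embedding vectors
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two groups of 500 RBP pairs: same-genus pairs are simulated as perturbed
# copies of a base vector, cross-genus pairs as independent random vectors.
d_same, d_cross = [], []
for _ in range(500):
    base = rng.normal(size=64)
    d_same.append(cosine_distance(base, base + rng.normal(scale=0.3, size=64)))
    d_cross.append(cosine_distance(rng.normal(size=64), rng.normal(size=64)))

# Two-sided Mann-Whitney U test comparing the two distance distributions
stat, p = mannwhitneyu(d_same, d_cross, alternative="two-sided")
print(f"p = {p:.3g}")
```

On real embeddings, a significant p-value for a genus indicates that same-genus pairs sit measurably closer in embedding space than cross-genus pairs, which is the signal the downstream classifier can exploit.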

PHIStruct may be capturing these differences in the structure-aware embeddings in relation to the host genus, thereby contributing to the model’s discriminative power.

3.5 PHIStruct’s use of structure-aware embeddings outperforms the use of sequence-only embeddings, especially at lower train-versus-test sequence similarity

To investigate the impact of the protein representation (i.e. structure-aware versus sequence-only embeddings), we benchmarked PHIStruct against multilayer perceptron models that take in embeddings from ProtT5 (Elnaggar et al. 2022), ESM-1b (Rives et al. 2021), and ESM-2 (Lin et al. 2023), state-of-the-art sequence-only protein language models used in existing phage–host interaction prediction tools (Gonzales et al. 2023, Badam and Rao 2024, Boeckaerts et al. 2024). The specific model versions used in our benchmarking experiments are given in Supplementary Table S9. The downstream multilayer perceptron models share the same architecture as that of PHIStruct: their first and second hidden layers have L/8 and L/16 neurons, respectively, where L is the length of the protein embedding. To ensure a fair comparison, we performed hyperparameter tuning for each model.
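A shape-only sketch of this shared architecture follows, with randomly initialized weights rather than trained parameters; the embedding length L = 1280 and the ReLU activation are illustrative assumptions, while the hidden-layer widths L/8 and L/16 and the seven ESKAPEE output classes follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 1280                    # assumed embedding length for illustration
H1, H2 = L // 8, L // 16    # hidden widths scale with the embedding length
N_GENERA = 7                # ESKAPEE host genera

# Random weights: a shape-only sketch, not the trained PHIStruct model.
W1, b1 = rng.normal(size=(L, H1)) * 0.01, np.zeros(H1)
W2, b2 = rng.normal(size=(H1, H2)) * 0.01, np.zeros(H2)
W3, b3 = rng.normal(size=(H2, N_GENERA)) * 0.01, np.zeros(N_GENERA)

def forward(x):
    h1 = np.maximum(x @ W1 + b1, 0.0)   # ReLU (assumed activation)
    h2 = np.maximum(h1 @ W2 + b2, 0.0)
    logits = h2 @ W3 + b3
    e = np.exp(logits - logits.max())
    return e / e.sum()                  # softmax -> per-genus confidences

probs = forward(rng.normal(size=L))
print(probs.shape)  # -> (7,)
```

The softmax output is what the confidence threshold k is applied to at prediction time.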

Our experiments showed that our use of structure-aware embeddings presents improvements over using sequence-only embeddings, especially at lower train-versus-test sequence similarity thresholds. At s=40%, PHIStruct’s margins over the model with the next-highest performance (i.e. the model using ProtT5) were 7% in terms of maximum macro-precision and 2% in terms of recall and F1 (Fig. 6). At s=60%, it outperformed the model with the next-highest maximum macro-recall (i.e. the model using ProtT5) by 2%, the model with the next-highest maximum macro-precision (i.e. the model using ESM-1b) by 3%, and the model with the next-highest maximum macro-F1 (i.e. the model using ProtT5) by 4% (Supplementary Fig. S19). At s=80%, it obtained the highest maximum macro-recall (Supplementary Fig. S20). At s=100%, it registered the highest maximum macro-precision and F1 (Supplementary Fig. S21).

Figure 6. Comparison of the performance of PHIStruct with same-architecture multilayer perceptron models that take in sequence-only embeddings. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

We also evaluated our model’s performance when the metrics were weighted based on the class sizes and found that the results were generally similar to those computed with macro-averaged metrics. At s=40%, PHIStruct’s margins over the model with the next-highest performance (i.e. the model using ProtT5) were 5%, 4%, and 3% in terms of maximum weighted recall, precision, and F1, respectively (Supplementary Fig. S22). At s=60% and 100%, it registered the highest scores across all three metrics (Supplementary Figs S23 and S25). At s=80%, it registered the highest maximum weighted recall and F1 (Supplementary Fig. S24).

3.6 PHIStruct’s use of SaProt embeddings outperforms the use of other structure-aware protein embeddings, especially at lower train-versus-test sequence similarity

To further investigate the impact of the protein representation (i.e. SaProt embeddings versus other structure-aware embeddings), we benchmarked PHIStruct against multilayer perceptron models that share the same architecture as that of PHIStruct but take in embeddings from other structure-aware protein language models: ProstT5 (Heinzinger et al. 2024) and PST (Chen et al. 2024). ProstT5 adopts the architecture of ProtT5 (Elnaggar et al. 2022) and is fine-tuned to convert bidirectionally between protein sequence and structure. PST adopts the architecture of ESM-2 (Lin et al. 2023) and augments each self-attention block with a two-layer graph isomorphism network (Xu et al. 2019) that serves as a protein structure extractor module. The specific model versions used in our benchmarking experiments are given in Supplementary Table S9. To ensure a fair comparison, we performed hyperparameter tuning for each model.

Our experiments showed that our use of SaProt embeddings presents improvements over using other structure-aware embeddings, especially at lower train-versus-test sequence similarity thresholds. At s=40% and 60%, PHIStruct achieved the highest performance across all three metrics (Fig. 7 and Supplementary Fig. S26). In particular, its performance gains were most pronounced at s=40%, where it outperformed the model with the next-highest performance (i.e. the model using PST) by 3%, 10%, and 4% in terms of maximum macro-recall, precision, and F1, respectively (Fig. 7). At s=80%, it obtained the highest maximum macro-recall and F1, and its maximum macro-precision was within 1.5% of the top score (i.e. from the model using PST) (Supplementary Fig. S27). At s=100%, its maximum macro-recall was within 1.3% of the score of the model using PST and its maximum macro-precision and F1 were both within 0.4% (Supplementary Fig. S28).

Figure 7. Comparison of the performance of PHIStruct with same-architecture multilayer perceptron models that take in structure-aware protein embeddings other than SaProt. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

Furthermore, we evaluated our model’s performance when the metrics were weighted based on the class sizes and found that the results were generally similar to those computed with macro-averaged metrics. At s=40%, it achieved the highest performance across all three metrics, outperforming the model with the next-highest performance (i.e. the model using PST) by 3% in terms of maximum weighted recall and 5% in terms of maximum weighted precision (Supplementary Fig. S29). At s=60% and 80%, it obtained the highest maximum weighted recall and F1, and its maximum weighted precision was within 0.6% of the score of the model using PST (Supplementary Figs S30 and S31). At s=100%, its scores across all three metrics were within 0.6% of those of the model using PST (Supplementary Fig. S32).

3.7 PHIStruct’s use of a multilayer perceptron outperforms the use of other downstream classifiers, especially at lower train-versus-test sequence similarity

To investigate the impact of the choice of downstream classifier (i.e. multilayer perceptron versus other machine learning models), we benchmarked PHIStruct against a support vector machine and a random forest model that take in the same SaProt embeddings. To ensure a fair comparison, we performed hyperparameter tuning for each model.

Our experiments showed that our use of a multilayer perceptron presents improvements over using other machine learning models for the downstream classifier, especially at lower train-versus-test sequence similarity thresholds. From s=40% to 80%, PHIStruct achieved the highest maximum macro-F1 (Fig. 8 and Supplementary Figs S33 and S34). Moreover, at s=40% and s=100%, it also achieved the highest maximum macro-recall (Fig. 8 and Supplementary Fig. S35). At s=60% and s=80%, its maximum macro-recall was within 1.6% of the highest score (Supplementary Figs S33 and S34). Although the random forest model returned the highest maximum macro-precision, this was accompanied by a substantial drop in recall, whereas PHIStruct registered the most stable macro-F1 scores across all tested confidence thresholds (Fig. 8 and Supplementary Figs S33–S35).
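A minimal sketch of this benchmarking setup is shown below, with synthetic stand-ins for the embeddings and labels and with illustrative (not the actual) hyperparameter grids; the real experiments tune each model on the SaProt embeddings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 8))       # stand-in embeddings
y = rng.integers(0, 3, size=90)    # stand-in host labels

# Each downstream classifier is tuned over its own (illustrative) grid
# and scored with macro-F1 under cross-validation.
candidates = {
    "mlp": (MLPClassifier(max_iter=300), {"hidden_layer_sizes": [(8,), (8, 4)]}),
    "svm": (SVC(), {"C": [0.1, 1.0]}),
    "rf":  (RandomForestClassifier(random_state=0), {"n_estimators": [25, 50]}),
}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring="f1_macro").fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```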

Figure 8. Comparison of the performance of PHIStruct with other downstream classifiers that take in the same SaProt embeddings. The maximum train-versus-test sequence similarity is set to s=40%. Performance is measured in terms of class-averaged (macro) metrics. (a) Precision–recall curves. The label of each point denotes the confidence threshold k (%) at which the performance was measured. (b) F1 scores. Higher values of k prioritize precision over recall, whereas lower values prioritize recall.

In addition, we evaluated our model’s performance when the metrics were weighted based on the class sizes and found that the results were generally similar to those computed with macro-averaged metrics. PHIStruct achieved the highest maximum weighted recall and F1, as well as the most stable weighted F1, across all tested sequence similarity thresholds (Supplementary Figs S36–S39).

3.8 Room for improvement

We performed error analysis on PHIStruct’s predictions and found that, at lower train-versus-test sequence similarity thresholds, it tends to misclassify phages that infect Enterobacter (and Klebsiella, albeit to a lesser degree) as infecting Escherichia (Fig. 9). This may be because these three genera belong to the same family, Enterobacteriaceae, whereas the four remaining ESKAPEE genera each belong to different families.

Figure 9. Confusion matrices at maximum train-versus-test sequence similarity s=40%. (a) Confusion matrix at confidence threshold k=0%, normalized over the true class labels. The main diagonal reflects the per-class recall. (b) Confusion matrix at k=90%, normalized over the predicted class labels. The main diagonal reflects the per-class precision. Lower values of k prioritize recall over precision, whereas higher values prioritize precision.
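The two normalizations can be reproduced with a short sketch: dividing each row by the true-class totals puts per-class recall on the diagonal, while dividing each column by the predicted-class totals puts per-class precision there. Labels are encoded as integer indices into the ESKAPEE genus list; the toy example mirrors the Enterobacter-to-Escherichia confusion discussed above.

```python
import numpy as np

GENERA = ["Enterococcus", "Staphylococcus", "Klebsiella", "Acinetobacter",
          "Pseudomonas", "Enterobacter", "Escherichia"]

def confusion(y_true, y_pred, normalize):
    """Confusion matrix normalized over 'true' rows or 'pred' columns."""
    n = len(GENERA)
    m = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    if normalize == "true":    # rows sum to 1 -> diagonal = per-class recall
        return m / np.maximum(m.sum(axis=1, keepdims=True), 1)
    if normalize == "pred":    # columns sum to 1 -> diagonal = per-class precision
        return m / np.maximum(m.sum(axis=0, keepdims=True), 1)
    return m

# Toy example: two Enterobacter (index 5) phages misclassified as Escherichia (6)
y_true = [5, 5, 5, 5, 6, 6]
y_pred = [5, 5, 6, 6, 6, 6]
print(confusion(y_true, y_pred, "true")[5, 5])   # Enterobacter recall -> 0.5
```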

In this regard, possible directions for future work include investigating biological markers that can more strongly distinguish phage–host specificity at lower taxonomic levels. It may also be helpful to develop computational approaches that are explicitly trained to discriminate between biological signals at finer taxonomic resolutions.

Moreover, the identification of receptor-binding proteins is central to our approach. For proteins without existing genome annotations, we employed in silico methods to predict whether they are RBPs, and the quality of these predictions is important to our model’s performance. Recent advancements in RBP screening methods (Gonzalez and Scharf 2021, Ge et al. 2023) and the possibility of improving computational tools with the inclusion of protein structure information are complementary towards constructing high-quality RBP datasets, from which our model can benefit.

Model interpretability remains a challenge as well. While attention-based methods for probing sequence-only embeddings and amino acid associations have been proposed (Vig et al. 2021, Yamaguchi and Saito 2021, Wang et al. 2023, Yagimoto et al. 2024), extending these techniques to structure-aware tokens and embeddings is an area for further research.

4 Conclusion

In this paper, we presented PHIStruct, a deep learning model that capitalizes on structure-aware embeddings of receptor-binding proteins for predicting phage–host interaction. To this end, we predicted the protein structures via ColabFold and generated the structure-aware embeddings via SaProt. Our experiments showed that our proposed approach presents improvements over state-of-the-art methods, especially as the sequence similarity between the training and test set entries decreases. These results highlight the applicability of PHIStruct in improving the prediction of phage–host pairs, especially in settings where newly discovered phages have receptor-binding proteins with low sequence similarity to those of known phages.

With recent investigations showing an association between tailspike proteins and bacterial polysaccharide receptors (Yang et al. 2024), future directions include jointly incorporating these proteins, alongside other phage-encoded and host-encoded proteins involved in different stages of the phage infection process.

Acknowledgements

The authors thank Dr Ann Franchesca B. Laguna for her advice on the methodology.

Author contributions

Mark Edward M. Gonzales (Conceptualization [equal], Methodology [equal], Resources [equal], Software [equal], Validation [equal], Visualization [equal], Writing—original draft [equal], Writing—review & editing [equal]), Jennifer C. Ureta (Resources [equal], Supervision [equal], Writing—review & editing [equal]), and Anish M.S. Shrestha (Conceptualization [equal], Funding acquisition [equal], Methodology [equal], Resources [equal], Supervision [equal], Writing—review & editing [equal])

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was partly funded by the Department of Science and Technology Philippine Council for Health Research and Development (DOST-PCHRD) under the e-Asia JRP 2021 Alternative therapeutics to tackle AMR pathogens (ATTACK-AMR) program. This work was supported with Cloud TPUs from Google’s TPU Research Cloud (TRC) and with computing resources from the Machine Learning eResearch Platform (MLeRP) of Monash University, University of Queensland, and Queensland Cyber Infrastructure Foundation Ltd.

Data availability

The data underlying this article are available at https://github.com/bioinfodlsu/PHIStruct

References

Antimicrobial resistance surveillance in Europe 2023 – 2021 data. Stockholm: European Centre for Disease Prevention and Control and World Health Organization, 2023.
Apweiler R, Bairoch A, Wu C et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2004;32:D115–9.
Ashworth EA, Wright RCT, Shears RK et al. Exploiting lung adaptation and phage steering to clear pan-resistant Pseudomonas aeruginosa infections in vivo. Nat Commun 2024;15:1547.
Ayobami O, Brinkwirth S, Eckmanns T et al. Antibiotic resistance in hospital-acquired ESKAPE-E infections in low- and lower-middle-income countries: a systematic review and meta-analysis. Emerg Microbes Infect 2022;11:443–51.
Badam S, Rao S. Harnessing genome representation learning for decoding phage–host interactions. bioRxiv, 2024, preprint: not peer reviewed.
Batista GEAPA, Bazzan ALC, Monard MC. Balancing training data for automated annotation of keywords: a case study. In: Lifschitz S, NF A Jr, GJ P Jr, Linden R (eds.), WOB, Brazil, 2003, 10–8.
Benson DA, Karsch-Mizrachi I, Lipman DJ et al. GenBank. Nucleic Acids Res 2007;35:D21–5.
Boeckaerts D, Stock M, Criel B et al. Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. Sci Rep 2021;11:1467.
Boeckaerts D, Stock M, De Baets B et al. Identification of phage receptor-binding protein sequences with hidden Markov models and an extreme gradient boosting classifier. Viruses 2022;14:1329.
Boeckaerts D, Stock M, Ferriol-González C et al. Prediction of Klebsiella phage–host specificity at the strain level. Nat Commun 2024;15:4355.
Cambillau C, Goulet A. Exploring host-binding machineries of mycobacteriophages with AlphaFold2. J Virol 2023;97:e0179322.
Chen D, Hartout P, Pellizzoni P et al. Endowing protein language models with structural knowledge. arXiv, 2024, preprint: not peer reviewed.
Cook R, Brown N, Redgwell T et al. INfrastructure for a PHAge REference Database: identification of large-scale biases in the current collection of cultured phage genomes. Phage (New Rochelle) 2021;2:214–23.
Coutinho FH, Zaragoza-Solas A, López-Pérez M et al. RaFAH: host prediction for viruses of bacteria and archaea based on protein content. Patterns (NY) 2021;2:100274.
Dams D, Brøndsted L, Drulis-Kawa Z et al. Engineering of receptor-binding proteins in bacteriophages and phage tail-like bacteriocins. Biochem Soc Trans 2019;47:449–60.
Degroux S, Effantin G, Linares R et al. Deciphering bacteriophage T5 host recognition mechanism and infection trigger. J Virol 2023;97:e0158422.
Dowah ASA, Clokie MRJ. Review of the nature, diversity and structure of bacteriophage receptor binding proteins that target gram-positive bacteria. Biophys Rev 2018;10:535–42.
Elnaggar A, Heinzinger M, Dallago C et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2022;44:7112–27.
Fu L, Niu B, Zhu Z et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.
Ge H, Ye L, Cai Y et al. Efficient screening of adsorbed receptors for Salmonella phage LP31 and identification of receptor-binding protein. Microbiol Spectr 2023;11:e02604-23.
Gonzales MEM, Ureta JC, Shrestha AMS. Protein embeddings improve phage–host interaction prediction. PLoS One 2023;18:e0289030.
Gonzalez F, Scharf BE. Identification of receptor binding proteins in flagellotropic Agrobacterium phage 7-7-1. Viruses 2021;13:1267.
Górski A, Borysowski J, Międzybrodzki R. Phage therapy: towards a successful clinical trial. Antibiotics 2020;9:827.
Harada LK, Silva EC, Campos WF et al. Biotechnological applications of bacteriophages: state of the art. Microbiol Res 2018;212–213:38–58.
Hawkins NC, Kizziah JL, Hatoum-Aslan A et al. Structure and host specificity of Staphylococcus epidermidis bacteriophage Andhra. Sci Adv 2022;8:eade0459.
Heinzinger M, Elnaggar A, Wang Y et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019;20:723.
Heinzinger M, Weissenow K, Sanchez J et al. Bilingual language model for protein sequence and structure. NAR Genom Bioinform 2024;6:lqae150.
Hitchcock NM, Devequi Gomes Nunes D, Shiach JOB et al. Current clinical landscape and global potential of bacteriophage therapy. Viruses 2023;15:1020.
Hyatt D, Chen GL, LoCascio PF et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010;11:119.
Iuchi H, Kawasaki J, Kubo K et al. Bioinformatics approaches for unveiling virus–host interactions. Comput Struct Biotechnol J 2023;21:1774–84.
Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 2010;11:431.
Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
Kingma DP, Ba J. Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR), San Diego, 2015.
Klumpp J, Dunne M, Loessner MJ. A perfect fit: bacteriophage receptor-binding proteins for diagnostic and therapeutic applications. Curr Opin Microbiol 2023;71:102240.
Kritsotakis EI, Lagoutari D, Michailellis E et al. Burden of multidrug and extensively drug-resistant ESKAPEE pathogens in a secondary hospital care setting in Greece. Epidemiol Infect 2022;150:e170.
Leite D, Lopez J, Brochet X et al. Exploration of multiclass and one-class learning methods for prediction of phage-bacteria interaction at strain level. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Los Alamitos, CA, USA: IEEE Computer Society, 2018, 1818–25.
Li M, Zhang W. PHIAF: prediction of phage–host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinform 2022;23:bbab348.
Li M, Wang Y, Li F et al. A deep learning-based method for identification of bacteriophage–host interaction. IEEE/ACM Trans Comput Biol Bioinform 2021;18:1801–10.
Lin Z, Akin H, Rao R et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30.
Lu C, Zhang Z, Cai Z et al. Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol 2021;19:5.
Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, 4768–77.
Ma B, Elkayam T, Wolfson H et al. Protein–protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci USA 2003;100:5772–7.
McInnes L, Healy J, Saul N et al. UMAP: uniform manifold approximation and projection. JOSS 2018;3:861.
Mirdita M, Schütze K, Moriwaki Y et al. ColabFold: making protein folding accessible to all. Nat Methods 2022;19:679–82.
Nie W, Qiu T, Wei Y et al. Advances in phage–host interaction prediction: in silico method enhances the development of phage therapies. Brief Bioinform 2024;25:bbae117.
Pannekens M, Kroll L, Müller H et al. Oil reservoirs, an exceptional habitat for microorganisms. Nat Biotechnol 2019;49:1–9.
Planas-Iglesias J, Bonet J, García-García J et al. Understanding protein–protein interactions using local structural features. J Mol Biol 2013;425:1210–24.
Rives A, Meier J, Sercu T et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021;118:e2016239118.
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014;30:2068–9.
Shang J, Sun Y. Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol 2021;19:250.
Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017;35:1026–8.
Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun 2018;9:2542.
Steinegger M, Mirdita M, Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods 2019;16:603–6.
Stone E, Campbell K, Grant I et al. Understanding and exploiting phage–host interactions. Viruses 2019;11:567.
Su J, Han C, Zhou Y et al. SaProt: protein language modeling with structure-aware vocabulary. In: The Twelfth International Conference on Learning Representations, Vienna, 2024.
Suzek BE, Huang H, McGarvey P et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007;23:1282–8.
Tang KWK, Millar BC, Moore JE. Antimicrobial resistance (AMR). Br J Biomed Sci 2023;80:11387.
Terzian P, Olo Ndela E, Galiez C et al. PHROG: families of prokaryotic virus proteins clustered using remote homology. NAR Genom Bioinform 2021;3:lqab067.
van den Oord A, Vinyals O, Kavukcuoglu K. Neural discrete representation learning. In: Guyon I, Luxburg UV, Bengio S et al. (eds.), Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 2017.
van Kempen M, Kim SS, Tumescheit C et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 2024;42:243–6.
Varadi M, Anyango S, Deshpande M et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 2022;50:D439–44.
Venkateswaran P, Vasudevan S, David H et al. Revisiting ESKAPE pathogens: virulence, resistance, and combating strategies focusing on quorum sensing. Front Cell Infect Microbiol 2023;13:1159798.
Vig J, Madani A, Varshney LR et al. BERTology meets biology: interpreting attention in protein language models. In: International Conference on Learning Representations, 2021.
Walsh TR, Gales AC, Laxminarayan R et al. Antimicrobial resistance: addressing a global threat to humanity. PLoS Med 2023;20:e1004264.
Wang R, Jiang Y, Jin J et al. DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis. Nucleic Acids Res 2023;51:3017–29.
World Health Organization. WHO bacterial priority pathogens list, 2024. 2024.
Xu K, Hu W, Leskovec J et al. How powerful are graph neural networks? In: International Conference on Learning Representations, New Orleans, 2019.
Yagimoto K, Hosoda S, Sato M et al. Prediction of antibiotic resistance mechanisms using a protein language model. Bioinformatics 2024;40:btae550.
Yamaguchi H, Saito Y. Evotuning protocols for transformer-based variant effect prediction on multi-domain proteins. Brief Bioinform 2021;22:bbab234.
Yang Y, Dufault-Thompson K, Yan W et al. Large-scale genomic survey with deep learning-based method reveals strain-level phage specificity determinants. Gigascience 2024;13:giae017.
Yehl K, Lemire S, Yang AC et al. Engineering phage host-range and suppressing bacterial resistance through phage tail fiber mutagenesis. Cell 2019;179:459–69.e9.
Zampara A, Sørensen MCH, Grimon D et al. Exploiting phage receptor binding proteins to enable endolysins to kill gram-negative bacteria. Sci Rep 2020;10:12087.
Zhang C, Shine M, Pyle AM et al. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods 2022;19:1109–15.
Zhou F, Gan R, Zhang F et al. PHISDetector: a tool to detect diverse in silico phage–host interaction signals for virome studies. Genomics Proteomics Bioinf 2022;20:508–23.
Zielezinski A, Barylski J, Karlowski WM. Taxonomy-aware, sequence similarity ranking reliably predicts phage–host relationships. BMC Biol 2021;19:223.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Macha Nikolski