Abstract

Motivation

Protein–protein interactions are essential for a variety of biological phenomena, including mediating biochemical reactions, cell signaling, and the immune response. Proteins seek to form interfaces which reduce overall system energy. Although determination of single polypeptide chain protein structures has been revolutionized by deep learning techniques, complex prediction has still not been perfected. Additionally, experimentally determining structures is expensive in both time and resources. An alternative is computational docking, which takes the solved individual structures of proteins and produces candidate interfaces (decoys). Decoys are then scored using mathematical functions that assess the quality of the system, known as scoring functions. Beyond docking, scoring functions are a critical component of assessing structures produced by many protein generative models. Scoring models are also used as a final filtering step in many generative deep learning models, including those that generate antibody binders and those that perform docking.

Results

In this work, we present improved scoring functions for protein–protein interactions which utilize cutting-edge Euclidean graph neural network architectures to assess protein–protein interfaces. These Euclidean docking score models are known as EuDockScore and EuDockScore-Ab, with the latter being specific to antibody–antigen docks. Finally, we provide EuDockScore-AFM, a model trained on antibody–antigen outputs from AlphaFold-Multimer (AFM), which proves useful in reranking large numbers of AFM outputs.

Availability and implementation

The code for these models is available at https://gitlab.com/mcfeemat/eudockscore.

1 Introduction

Protein–protein interactions are essential to a variety of biological processes, including mediating chemical reactions, signaling, the immune response, and controlling the cell life cycle. Aberrations in protein–protein interactions are associated with a variety of diseases, including cancers (Garner and Janda 2011). To understand these interactions, the structures of individual proteins as well as protein complexes need to be obtained and studied. Historically, these structures were solved using techniques such as X-ray crystallography, nuclear magnetic resonance, or cryogenic electron microscopy. However, these techniques are expensive, time-consuming, and challenging, which has led to the development of a variety of computational techniques for structure prediction (Zhang 2008).

1.1 An introduction to protein–protein docking and scoring

Protein–protein docking encompasses the development of computational pipelines to propose potential interfaced complexes of two or more solved individual protein structures, most often from the Protein Data Bank (PDB) (Burley et al. 2018). The docking process is typically broken down into two stages: the docking stage, which proposes complex structural models known as decoys, and the scoring stage (Tsuchiya et al. 2022). The scoring stage involves a physical model taking in structural variables (e.g. coordinates and atom types) and assigning a score to the complex indicating structural favorability. These scoring functions are used to rank proposed models for selection and further investigation (Guedes et al. 2021).

1.2 An overview of scoring functions

In the past, scoring functions were typically hand-engineered to recapitulate the known physics driving the folding and interactions of proteins. For illustrative purposes, a popular traditional scoring function is the Rosetta energy function (Alford et al. 2017), which includes terms involving the electrostatics of charged atoms, bond lengths, van der Waals forces, etc. These models involved some parameters being fit to available physical datasets (typically small). Although these models are useful, they often fail to capture the correct energetics of protein systems and fail to fully leverage the increasing amount of available biological data. Deep learning allows for the development of highly complex nonlinear functions using available data for model parameter tuning.

Many earlier efforts for protein–protein interface scoring with deep learning involved discretization of protein–protein interfaces into 3D image-like grids to be input into convolutional neural networks (Renaud et al. 2021, Schneider et al. 2021, Wang et al. 2020). Due to the limitations of these models, researchers moved toward using graph neural networks (GNNs) which utilize mathematical graph representations of proteins as model inputs (Wang et al. 2021, Réau et al. 2023).

Continued efforts have been made to develop advanced graph neural network architectures that operate on graph representations of proteins at both the atom and residue levels. Modern architectures typically depend on equivariant mathematical operations derived from physics (Zitnick et al. 2024, Liao and Smidt 2023, Liao et al. 2023), which can operate directly on input 3D coordinates without any additional feature engineering. Typically, these models need only atom types and 3D coordinates to make predictions. Here, equivariance means that any vector outputs of the model rotate appropriately with the inputs, whereas scalar outputs are invariant.
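The equivariance and invariance properties described above can be checked numerically. The following is a minimal sketch and is not part of EuDockScore: `toy_vector_output` and `toy_scalar_output` are hypothetical functions, chosen only because they are trivially equivariant and invariant, respectively, under rotation.

```python
import math

def rotate_z(points, angle):
    """Rotate a list of (x, y, z) points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def toy_vector_output(points):
    """Hypothetical equivariant output: vector from the centroid to the first point."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return tuple(points[0][i] - centroid[i] for i in range(3))

def toy_scalar_output(points):
    """Hypothetical invariant output: sum of all pairwise distances."""
    return sum(math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])

coords = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5), (-1.0, 1.0, 3.0)]
angle = 0.7
rotated = rotate_z(coords, angle)

# Equivariance: applying the function to rotated inputs equals rotating its output.
lhs = toy_vector_output(rotated)
rhs = rotate_z([toy_vector_output(coords)], angle)[0]
vector_is_equivariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(lhs, rhs))

# Invariance: the scalar output is unchanged by the rotation.
scalar_is_invariant = math.isclose(toy_scalar_output(rotated), toy_scalar_output(coords))
```

An equivariant network satisfies the same two checks by construction, for every rotation rather than only the one tested here.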

1.3 Scoring functions in the era of AlphaFold

Generative deep learning models now dominate over traditional docking methods. For example, the bioinformatics field was revolutionized by the publication of AlphaFold2 (Jumper et al. 2021), a deep learning model which can take in protein sequences and produce highly accurate structures. The AlphaFold2 architecture was then extended to generate the structures of multichain protein complexes in the form of AlphaFold-Multimer (AFM) (Evans et al. 2022), providing an alternative avenue to computational docking techniques. Although AFM can produce accurate complex models in a large number of cases, there are many inputs for which it still struggles to correctly determine the interface (Evans et al. 2022), and quality metrics (ipTM, pLDDT, etc.) are still necessary to differentiate high-quality from low-quality models. Another example is AbDiffuser (Martinkus et al. 2024), a deep learning model that generates antibody structures. It uses additional deep learning models that provide information such as the predicted naturalness and predicted stability of its generations. Scoring models also remain a critical component of deep learning-based docking methods such as DiffDock (Corso et al. 2023) and DiffDock-PP (Ketata et al. 2023). Finally, DiffPack (Zhang et al. 2024), a recent protein side-chain packing model, utilizes a confidence model. Thus, scoring models are still very useful supplemental tools in modern structural bioinformatics.

1.4 The specific challenges of antibody–antigen binding physics

Antibodies remain an important area of study in the development of biologics for the treatment of a variety of diseases, and deep learning techniques for the design and development of antibody binders to specific epitopes are currently a very active area of research (Kim et al. 2023), with generative pipelines often using some form of scoring function. Furthermore, antibody interface scoring is particularly difficult due to the immense structural variation in heavy chain H3 loops, which shows little human-detectable pattern (Marks and Deane 2017). Thus, this challenging problem may benefit from a specialized scoring function to help assess model generations.

2 Materials and methods

2.1 Generation of an antibody–antigen structure dataset using docking

Following the training scheme of our previous work, GDockScore (McFee and Kim 2023), we generated a docking-based dataset of antibody–antigen complexes. We used AbDb (Ferdous and Martin 2018), an antibody database extracted from the PDB with comprehensive processing and redundancy analysis, amounting to several thousand nonredundant antibodies. Although we simply fine-tuned the EuDockScore model previously pretrained on the GDockScore and DeepRank decoys, we sought to increase the complexity of the antibody-specific training data through redocking and structural relaxation. To generate our dataset from AbDb, we performed local docking with the RosettaDock protocol (Gray et al. 2003).

We used the CAPRI standard assessment for docking quality as specified in Méndez et al. (2003). We assigned label 0 to non-native decoys and label 1 to decoys of acceptable or better quality. We generated roughly 50 000 docking decoys across AbDb, with a 70%, 15%, and 15% split into training, validation, and test sets. To ensure unbiased performance reporting, we used the redundancy information provided by AbDb (Ferdous and Martin 2018) to make the split so that no redundant structures were shared among the training, validation, and test sets. To further improve the quality of the interfaces and introduce backbone flexibility into our docking models, we ran the RosettaRelax protocol (Tyka et al. 2011) on the docking decoys.
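The labeling and redundancy-aware splitting described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function names are hypothetical, the CAPRI class strings are assumed to be "incorrect", "acceptable", "medium", and "high", and the 70/15/15 fractions are applied to redundancy groups, so the fractions of individual decoys are only approximate.

```python
import random

def binary_label(capri_class):
    """Map a CAPRI quality class to the binary label (acceptable or better = 1)."""
    positives = {"acceptable", "medium", "high"}
    return 1 if capri_class in positives else 0

def split_by_group(groups, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole redundancy groups to train/validation/test.

    Every structure in a group lands in the same split, so no redundant
    structures are shared between splits.
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    assignment = {}
    for i, group in enumerate(unique):
        if i < n_train:
            assignment[group] = "train"
        elif i < n_train + n_val:
            assignment[group] = "validation"
        else:
            assignment[group] = "test"
    return assignment

# Hypothetical redundancy-group identifiers, one per nonredundant cluster.
example_groups = [f"cluster_{i}" for i in range(20)]
example_assignment = split_by_group(example_groups)
```

Each decoy then inherits the split of the redundancy group that its parent complex belongs to.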

2.2 Generation of an antibody–antigen complex dataset using AFM and AFSample

To generate AFM predictions of antibody complexes, we started from the same set of complexes as above, but generated models using the localcolabfold implementation of ColabFold (Mirdita et al. 2022) with templates enabled and all other options left at their default settings. Again, we split the data to avoid redundancies by clustering antigens at 30% sequence similarity with MMseqs2 (Steinegger and Söding 2017) and making sure there were no cluster overlaps between the PDB entries used in model training and validation and those used for testing. To generate a test set with a large enough number of models for meaningful rerankings, we used AFSample (Wallner 2023) to generate a thousand models for each PDB entry assigned to the test set.

2.3 Details of the input representation

The inputs to EuDockScore are the raw 3D coordinates of each amino acid residue, a one-hot encoding of amino acid identity, a binary indicator of whether the amino acid is within 16 Å of an amino acid on the other protein, and the associated amino acid embeddings from our fine-tuned DistilProtBert model. The scalar and vector input features are projected down using equivariant operations and used to initialize the irreducible node features used in the Equiformer architectures (Liao and Smidt 2023, Liao et al. 2023).
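As a rough illustration of two of these inputs, the sketch below computes a one-hot encoding and the 16 Å interface indicator. It assumes one representative coordinate per residue (for example the Cα atom), an alphabet of the 20 standard amino acids, and a less-than-or-equal comparison at the cutoff. None of these details is specified in the text, and the function names are hypothetical.

```python
import math

# Assumed alphabet: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """One-hot encode a single residue over the assumed 20-letter alphabet."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]

def interface_flags(coords_a, coords_b, cutoff=16.0):
    """Flag residues of chain A lying within `cutoff` angstroms of any residue of chain B."""
    return [int(any(math.dist(p, q) <= cutoff for q in coords_b)) for p in coords_a]

# Toy coordinates (one point per residue), in angstroms.
chain_a = [(0.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
chain_b = [(10.0, 0.0, 0.0)]
flags_a = interface_flags(chain_a, chain_b)
flags_b = interface_flags(chain_b, chain_a)
```

The indicator is computed for both partners by swapping the arguments, as in the last two lines.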

2.4 Dataset redundancy analysis for BioGRID data

For the BioGRID dataset used to train the NLP model presented in this paper, we used the MCODE algorithm (Bader and Hogue 2003) to cluster BioGRID and then split the data such that there was no cluster overlap in the interactions in the training and validation set.

2.5 Model implementation, hyperparameter tuning, and training details

Hyperparameter tuning was performed by adjusting the number of spherical channels, i.e. the number of features of each order l for each node in the EquiformerV2 (Liao et al. 2023) architecture, as well as the maximum order lmax of the type-l features of the model. We found that lmax = 2 and 16 spherical channels performed best. Three EquiformerV2 layers are used in the model. The optimal hyperparameter configuration is set as the default configuration in the model implementation. We trained the model using the Adam optimizer (Kingma and Ba 2015). The cross-entropy loss was used during training with binary labels, as specified in McFee and Kim (2023), and the mean-squared-error loss was used for training on continuous targets, as in EuDockScore-AFM.
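For reference, the two loss functions named above can be written out directly. This is a plain-Python sketch of the standard definitions, with hypothetical function names; it is not taken from the EuDockScore code base, which would typically rely on a deep learning framework's built-in losses.

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for one predicted probability p and a binary label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_squared_error(predictions, targets):
    """Mean squared error for continuous targets."""
    return sum((a - b) ** 2 for a, b in zip(predictions, targets)) / len(predictions)

bce_example = binary_cross_entropy(0.5, 1)                    # equals ln(2)
mse_example = mean_squared_error([1.0, 2.0], [1.0, 4.0])      # equals 2.0
```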

2.6 Definition of average precision

The average precision reported is from the implementation provided by the scikit-learn package (Pedregosa et al. 2011):

\[ \mathrm{AP} = \sum_{n} (R_n - R_{n-1}) P_n \tag{1} \]

Here, AP represents average precision, the summation runs over the thresholds indexed by n, $R_n$ is the recall at the current threshold, $R_{n-1}$ is the recall at the previous threshold, and the weighting term $P_n$ is the precision at the current threshold.
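As a concrete illustration of this definition, the sketch below accumulates the precision at each distinct score threshold, weighted by the increase in recall since the previous threshold. It is a from-scratch illustration with a hypothetical function name, not the scikit-learn source code.

```python
def average_precision(y_true, scores):
    """Sum over thresholds of (recall increase) x (precision at that threshold)."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    ap, tp, fp, previous_recall = 0.0, 0, 0, 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score so each threshold is counted once.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_positives
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

ap_example = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

For this toy input, recall only increases at the thresholds 0.8 (precision 1) and 0.35 (precision 2/3), so the result is 0.5 × 1 + 0.5 × 2/3 ≈ 0.83.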

2.7 Data analysis and statistical testing

The receiver operating characteristic (ROC) curves, precision-recall curves, Matthews correlation coefficients (MCCs), and F1 scores in this paper were computed using scikit-learn (Pedregosa et al. 2011). The areas under the curve (AUCs) of the ROC curves were compared using the DeLong test (DeLong et al. 1988) provided by the pROC package (Robin et al. 2011) for R (R Core Team 2021). The Bonferroni correction was used for multiple comparisons.
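The Bonferroni correction mentioned above multiplies each P-value by the number of comparisons (capped at 1) before comparing it with the significance level. A minimal sketch follows; the function name, the significance level of 0.05, and the example P-values are all assumptions made purely for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted P-values and whether each remains significant."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p < alpha for p in adjusted]
    return adjusted, significant

# Hypothetical P-values from three pairwise comparisons.
adjusted_p, is_significant = bonferroni([0.01, 0.04, 0.5])
```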

3 Results

3.1 A Euclidean graph neural network architecture utilizing equivariant attention operations leveraging protein natural language model embeddings

EuDockScore, our latest scoring function, uses the equivariant attention-based architecture developed in the Equiformer and EquiformerV2 models (Liao and Smidt 2023, Liao et al. 2023). These architectures have achieved state-of-the-art performance in many molecular graph learning tasks, showing increased expressivity in comparison to other graph-based architectures. In these architectures, nodes are initialized with scalar and vector features of order l, where the dimensionality of a feature is given by d = 2l + 1, and there may be an arbitrary number of features of each order. Features of different orders can update each other equivariantly or invariantly via tensor product operations, first introduced in Thomas et al. (2018). The Equiformer (Liao and Smidt 2023, Liao et al. 2023) introduces node-level attention by generating attention values from the invariant scalar node features. Relative positional information is encoded by taking the tensor product between node features and the spherical harmonics (Ledesma and Mewes 2020) of relative positional vectors. Three major modifications were made to the EquiformerV2 architecture for this model. First, we moved from an all-atom representation to a residue-level representation for computational efficiency. Second, we input complexes rather than single polypeptide chains. Third, we provide scalar NLP embeddings to the model instead of just raw coordinates and atom/residue identities in the form of a one-hot encoding. Please consult the EquiformerV2 (Liao et al. 2023) manuscript for further details.
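The relation d = 2l + 1 makes it easy to count how many numbers each node carries. The sketch below assumes the same number of channels at every order from 0 to lmax, which is our reading of the spherical-channel setting and may not match the released implementation exactly. The function names are hypothetical.

```python
def irrep_dim(l):
    """Dimensionality of a single order-l feature: d = 2l + 1."""
    return 2 * l + 1

def coefficients_per_node(l_max, channels):
    """Total coefficients per node, assuming `channels` features at each order 0..l_max."""
    return channels * sum(irrep_dim(l) for l in range(l_max + 1))

# With lmax = 2 and 16 channels: 16 * (1 + 3 + 5) = 144 coefficients per node.
node_size = coefficients_per_node(l_max=2, channels=16)
```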

3.2 An interacting protein-pair natural language model to supplement structure-based deep learning models

Large language models that implement the transformer architecture (Vaswani et al. 2017) have revolutionized natural language processing (NLP) and were quickly put to use in learning protein embeddings from the plethora of available sequencing data. In particular, the models introduced by Elnaggar et al. (2022) and, more recently, Meta’s ESM2 (Lin et al. 2023) have been impactful in the field of protein bioinformatics. Single-sequence protein folding models, e.g. ESMFold (Lin et al. 2023), illustrate the power of these models’ sequence embeddings, providing evidence that NLP models trained on sequence data implicitly learn about protein folding physics. Furthermore, protein contact maps have been shown to be recoverable to some extent from NLP embeddings (Wang et al. 2022). Thus, NLP models learn important physical properties of proteins that can be applied to the task of scoring input structures.

Similarly to DeepRank-GNN-esm (Xu and Bonvin 2024), another recent scoring model, we sought to use NLP embeddings to supplement structure-based graph neural network models. We started from DistilProtBert (Geffen et al. 2022), a distilled version of the original ProtBert model (Elnaggar et al. 2022). We took nonredundant interacting protein sequence pairs from BioGRID (Oughtred et al. 2021), concatenated them with a separating token, and fine-tuned DistilProtBert to learn about protein–protein interaction physics. These NLP embeddings are then used as input scalar features.
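The pair concatenation can be sketched as a simple string operation. The exact input format used for fine-tuning is not given in the text, so the following rests on two assumptions: residues are separated by spaces, and the separating token is the BERT-style `[SEP]`. The function name is hypothetical.

```python
def format_pair(seq_a, seq_b, sep_token="[SEP]"):
    """Join two protein sequences into one space-separated string with a separator token."""
    return " ".join(seq_a) + f" {sep_token} " + " ".join(seq_b)

# Toy sequences for illustration only.
pair_text = format_pair("MKT", "GA")
```

A tokenizer would then convert this string into token IDs, with the separator marking the boundary between the two interaction partners.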

3.3 EuDockScore outperforms state-of-the-art models on the challenging CAPRI score_set for protein–protein docking

As in our previous work GDockScore (McFee and Kim 2023), we first pretrained the model using docking decoys generated from solved structures in the PDB. We then fine-tuned the model on the refined decoys generated from BM5 (Vreven et al. 2015) that were used to train the original DeepRank model (Renaud et al. 2021), so that it better captures backbone flexibility in docking and interface formation. The CAPRI score_set (Lensink and Wodak 2014) remains the gold-standard challenging test set for assessing protein–protein docking scoring functions, and we again use it as our primary benchmarking set. We chose to benchmark EuDockScore against our previous model GDockScore and against the latest iteration of the DeepRank (Renaud et al. 2021, Réau et al. 2023) scoring model, DeepRank-GNN-esm (Xu and Bonvin 2024), referred to solely as DeepRank for the remainder of this manuscript for brevity. Furthermore, we also compare our deep learning models to the more traditional machine learning scoring function iScore (Geng et al. 2020). To compare the models, we used the AUC of the ROC, known as the AUCROC, and compared precision-recall curves (Fig. 1a and b). Additionally, we provide the maximum MCC and F1 score across all thresholds. We found that, in comparison to DeepRank and iScore, EuDockScore and GDockScore had significantly better AUCROCs (P < 2.2e−16 for both comparisons), indicating that EuDockScore can more effectively select true positives without undesirable false positives. In comparison to each other, EuDockScore and GDockScore have similar AUCROCs, and the difference is not statistically significant (P = 0.2468). However, EuDockScore has the highest maximum MCC and F1 score of all models, which indicates that there is a threshold at which it yields more true positives and fewer false positives than GDockScore. Although AUCs may be similar, the distribution of the “density” of this area along the ROC curve may differ, and this difference can be important. The maximum MCCs for GDockScore, EuDockScore, DeepRank, and iScore are 0.458, 0.538, 0.166, and 0.279, respectively. The maximum F1 scores for GDockScore, EuDockScore, DeepRank, and iScore are 0.528, 0.588, 0.270, and 0.369, respectively. We further investigated how EuDockScore distributes the scores of near-native and incorrect data points using a violin plot (Fig. 1c). The violin plot shows that EuDockScore very confidently assigns high scores to near-native decoys but is less confident (indicated by a larger spread of the score distribution) on incorrect decoys. A plausible explanation is that binding is highly specific and inherently has very precise features in comparison to the effectively unlimited number of incorrect interfaces.
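The maximum MCC and F1 score across all thresholds can be obtained by sweeping every observed score as a candidate threshold. The following is a self-contained sketch with hypothetical function names and toy data; the values reported in this paper were computed with scikit-learn.

```python
import math

def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient (0 when the denominator vanishes)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def f1(tp, fp, tn, fn):
    """F1 score (tn is unused but kept for a uniform signature)."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def max_over_thresholds(metric, y_true, scores):
    """Best value of `metric` over all observed scores used as thresholds."""
    return max(metric(*confusion_counts(y_true, scores, t)) for t in set(scores))

labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
best_mcc = max_over_thresholds(mcc, labels, toy_scores)
best_f1 = max_over_thresholds(f1, labels, toy_scores)
```

Note that the thresholds maximizing MCC and F1 need not coincide, so the two maxima may correspond to different operating points.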

Figure 1. (a) ROC curves comparing EuDockScore, GDockScore, DeepRank, and iScore. (b) Precision–recall curves of EuDockScore, GDockScore, DeepRank, and iScore. (c) Violin plot comparing the distribution of scores produced by EuDockScore for the CAPRI score_set for near-native and incorrect decoys.

Biologists only want to experimentally validate or study a handful of structure models that they are confident are near-native, so a scoring model’s ability to place near-native decoys among the top N decoys is important. Typically, small values of N such as 5 or 10 are selected, but we choose to graphically display the number of hits in the top N selected structures (Fig. 2) to be more comprehensive. In Fig. 2, we can see that EuDockScore does a better job on some targets (more near-native structures at lower N), such as T30 and T40, but does worse on others. Ensembling improved overall ranking ability (Fig. 2), suggesting that multiple deep learning models contribute useful information to ranking proposed multimeric structural models.
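The top-N analysis and the ensemble average can be sketched in a few lines. The function names and toy values are hypothetical, and the plain mean below assumes that the scores of the individual models are already on comparable scales, since the text does not state whether any normalization was applied before averaging.

```python
def hits_in_top_n(scores, is_hit, n):
    """Number of near-native decoys among the n highest-scoring decoys."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_hit[i] for i in ranked[:n])

def ensemble_average(score_lists):
    """Per-decoy mean of the scores given by several models."""
    return [sum(column) / len(column) for column in zip(*score_lists)]

# Toy example: four decoys, two of which are hits.
model_scores = [0.9, 0.2, 0.8, 0.1]
hit_flags = [0, 1, 1, 0]
curve = [hits_in_top_n(model_scores, hit_flags, n) for n in range(1, 5)]
averaged = ensemble_average([[1.0, 0.0], [0.0, 1.0]])
```

Plotting `curve` against N gives one line of the kind shown in Fig. 2 for a single target and a single scoring model.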

Figure 2. Number of hits in selected models versus top N selected curves for EuDockScore, GDockScore, DeepRank, iScore, and an ensemble average of all four models. Here, a “hit” is defined as a near-native structure with a CAPRI quality of acceptable or better.

3.4 EuDockScore performs well on CASP15 models generated by a variety of methods

To further test the capabilities of our model, we tested dimer models from CASP15, which included models from traditional modeling techniques and from generative modeling, including systems based on AlphaFold. We found that EuDockScore can differentiate high-quality from low-quality models for this set. We achieve an AUCROC of 0.81, an AP of 0.92 (Fig. 3a and b), an MCC of 0.423, and an F1 of 0.868. Here, we define a high-quality model as having a DockQ score ≥ 0.23. DockQ (Basu and Wallner 2016) was selected as it combines many traditional model interface quality metrics into a single comprehensive metric. Furthermore, when inspecting a violin plot, we find that our model provides a “tight” distribution of scores for high-quality models, with increasing spread at lower qualities (Fig. 3c). This result is in agreement with our past results (McFee and Kim 2023) and indicates that the model can confidently identify near-native structures. Referring to Supplementary Fig. S1, there are 145 outliers out of 7421 CASP15 structures in terms of the difference between DockQ and EuDockScore.

Figure 3. (a) ROC curve for EuDockScore for dimers in the CASP15 competition. (b) Precision–recall curve for dimers in the CASP15 competition. (c) Violin plot showing EuDockScore scores for CASP15 for each category of model quality provided by DockQ.

3.5 EuDockScore-Ab: a novel antibody–antigen complex scoring model trained on antibody–antigen docks

Next, we sought to extend EuDockScore beyond standard protein–protein docking to the assessment of antibody–antigen interfaces, a task that remains computationally very challenging. We fine-tuned the previous EuDockScore model on a dataset of docked antibody–antigen structures (see Materials and Methods) to produce EuDockScore-Ab, an antibody-specific model. Using a holdout, nonredundant test set, we see that EuDockScore-Ab can differentiate near-native from incorrect docks in unseen data (Fig. 4a and b). It is important to note that our model is specific to scoring heavy chain and antigen complexes, as we found no benefit in training and using an additional light chain-specific model. Furthermore, we found that the model can extrapolate to AFM generations but does not surpass the performance of AFM in AUCROC or AP (Fig. 5a and b). This is likely due to the end-to-end nature of the AFM training protocol and its overall model size and expressiveness.

Figure 4. (a) ROC curve for EuDockScore-Ab on holdout, nonredundant antibody docks. (b) Precision–recall curve for EuDockScore-Ab on holdout, nonredundant antibody docks.

Figure 5. (a) ROC curves comparing EuDockScore-Ab, ColabFold-Multimer [AFM ranking confidence (0.8 × ipTM + 0.2 × pTM)], and an ensemble of both scoring metrics. (b) Precision–recall curves comparing EuDockScore-Ab, ColabFold-Multimer (AFM ranking confidence), and an ensemble of both scoring metrics.

3.6 EuDockScore-AFM: attempting to distil the scoring abilities of AFM type architectures with ground-truth correction

Since AlphaFold was previously used to rank monomeric structures (Roney and Ovchinnikov 2022) and GDockScore was successful in filtering AFM outputs (Tomezsko et al. 2024), we next decided to distill AFM’s scoring abilities into an independent model. Instead of engineering AlphaFold2 to output single-chain scores by using decoys as templates, as in AF2Rank (Roney and Ovchinnikov 2022), we propose learning a new model.

Instead of simply training to distill AFM, the new training target is the sum of the AFM ranking confidence of the model and its DockQ against the native (true) structure. This combined metric allows our model to “correct” regions of the AFM scoring function using ground-truth structural data. We tested this new model, hereafter referred to as EuDockScore-AFM, on a holdout test set of nonredundant antibody structures generated by AFSample. It is important to note that, as before, we use only the heavy chain and antigen as inputs to our model. Although our model does possess ranking ability according to the AUCROC and the AP of the precision-recall curve (Fig. 6a and b), it does not outperform the stock AFM scoring metric or pDockQ2 (Zhu et al. 2023), a recent scoring metric constructed from AFM quality predictions. Pairwise comparisons of all combinations of models result in statistically significant differences in AUCROC, with P < 2.2e−16 for all comparisons. When we look at EuDockScore-AFM’s ranking ability for individual DockQ quality classifications, there is a trend that higher-quality structures have higher EuDockScore-AFM values (Fig. 6c). For AFSample, pDockQ2, EuDockScore-AFM, and the ensemble, the F1 scores are 0.631, 0.625, 0.324, and 0.587, respectively. The maximum MCCs achieved, in the same order, are 0.642, 0.645, 0.268, and 0.594.
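The training target described above can be written out explicitly. The sketch below uses the AFM ranking confidence weighting of 0.8 × ipTM + 0.2 × pTM and takes the literal sum with DockQ; whether any rescaling was applied to the sum is not stated, so the unscaled form is an assumption, and the function names and input values are hypothetical.

```python
def afm_ranking_confidence(iptm, ptm):
    """AFM ranking confidence: 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm

def combined_target(iptm, ptm, dockq):
    """Regression target: AFM ranking confidence plus DockQ against the native structure."""
    return afm_ranking_confidence(iptm, ptm) + dockq

# Hypothetical values for a single predicted complex.
confidence = afm_ranking_confidence(iptm=0.5, ptm=1.0)   # 0.8*0.5 + 0.2*1.0 = 0.6
target = combined_target(iptm=0.5, ptm=1.0, dockq=0.3)   # 0.6 + 0.3 = 0.9
```

Because both terms lie between 0 and 1, the unscaled target lies between 0 and 2, and a model with confident AFM metrics but a poor DockQ receives a lower target than AFM confidence alone would suggest.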

Figure 6. (a) ROC curves comparing EuDockScore-AFM, AFSample (AFM ranking confidence), pDockQ2, and an ensemble of the scoring metrics. (b) Precision–recall curves comparing EuDockScore-AFM, AFSample (AFM ranking confidence), and an ensemble of the scoring metrics. (c) Violin plots comparing score distributions for each available DockQ classification.

More interestingly, when we compare the ranking abilities of each model on AFSample outputs (Fig. 7), in a similar fashion to Fig. 2, we find that there are many cases in which EuDockScore-AFM has superior ranking ability, such as PDB IDs 1MLC, 5WK3, and 6A77 (Figs 7 and 8). Ensembling all scoring systems also seems to be helpful in some cases, e.g. 3MXW. This suggests that our model is useful for reranking outputs from generative models such as AFM, which has also been shown in additional studies (Tomezsko et al. 2024). Therefore, although current scoring models do not outright replace the quality metrics of generative models such as AFM, they do provide useful information for additional filtering of candidate structure models. In the case of EuDockScore-AFM, we outperform AFSample in almost 50% of the cases and perform similarly in the majority of the remaining cases.

Figure 7. Number of hits in selected top N models versus top N selected models for EuDockScore-AFM, AFSample, pDockQ2, and an ensemble average of these models. Here, an acceptable model or “hit” has DockQ ≥ 0.23.

Figure 8. Example structures where EuDockScore outperformed AFSample. The PDB IDs are (a) 1MLC, (b) 5WK3, and (c) 6A77. Structures are superimposed based on the antigen structure. Here, gray indicates the experimentally defined PDB structure. Blue indicates structures with good EuDockScore-AFM and DockQ, but poor AFM ranking confidence. Red indicates structures with poor EuDockScore-AFM and DockQ, but good AFM ranking confidence.

4 Discussion

Here, we have presented several EuDockScore scoring functions that can be used to accurately score protein–protein docks as well as antibody–antigen complexes. We iterate upon our previous work (McFee and Kim 2023) by using modern, increasingly expressive Euclidean graph networks that can operate on raw coordinates (Liao and Smidt 2023, Liao et al. 2023) and by supplementing the inputs with NLP embeddings from a model based on bidirectional encoder representations from transformers (BERT) (Geffen et al. 2022).

We continue to achieve state-of-the-art performance on the CAPRI score_set (Lensink and Wodak 2014) and find that ensembling with other models, such as DeepRank-GNN-esm (Xu and Bonvin 2024) and iScore, improves overall discriminative power, suggesting the importance of using multiple models to assess the quality of generative modeling predictions. Our base model also performs well on models submitted to CASP15, further supporting the generalizability of our scoring system.

EuDockScore-Ab is a scoring model specific to heavy chain–antigen complexes, trained on redocked and relaxed real antibody complexes. EuDockScore-Ab can differentiate near-native (acceptable or better by traditional CAPRI metrics) from non-native antibody complexes. Additionally, EuDockScore-AFM, a model trained specifically on AFM antibody–antigen complex predictions, shows in many cases an improved ability to select near-native structures from large pools of candidates compared with the standard AFM scoring metrics provided by ColabFold and AFSample, demonstrating the continued applicability of scoring function development. Although our scoring models prove useful, it is clear that a significant amount of work remains to be done in the antibody space to achieve a state-of-the-art replacement for AFM metrics.

The work presented in this article can be extended in many ways. For example, AFM models could be generated for the entire PDB to create an AFM-specific scoring function that is not limited to antibodies. However, this will be a computationally intensive task. Recent literature has also suggested that transformers can be applied directly to atomistic data while achieving approximate equivariance. Transformers are significantly more scalable than current equivariant networks and may allow for improved learning.

Acknowledgements

We would like to thank the Digital Research Alliance of Canada for providing computing resources.

Author contributions

The model architecture was conceptualized by M.M. and P.M.K. Model implementation was conducted by M.M. Data generation and data analysis were performed by M.M. and J.K. Writing was performed by M.M., J.K., and P.M.K. Supervision and funding acquisition were by P.M.K.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

P.M.K. is a cofounder and consultant to several biotechnology companies, including Fable Therapeutics, Resolute Bio, and Navega Therapeutics. P.M.K. also serves on the scientific advisory board of ProteinQure. M.M. and J.K. report no conflicts.

Funding

This work was supported by the CIHR Project [PJT-159750 and PJT-166008].

Data availability

The dataset introduced in this article is available at http://gdockscore.ccbr.proteinsolver.org including the language model weights/data splits. The CAPRI score set files used in this article are available at https://data.sbgrid.org/dataset/684/. The DeepRank dataset is available at https://data.sbgrid.org/dataset/843/. The original AbDb data is available at http://www.abybank.org/abdb/. The CASP15 data, including DockQ values, was downloaded from https://predictioncenter.org/download_area/. The code is available at https://gitlab.com/mcfeemat/eudockscore.

Author notes

Matthew McFee and Jisun Kim contributed equally.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Arne Elofsson