Abstract

Scoring functions (SFs) based on complex machine learning (ML) algorithms have gradually emerged as a promising alternative to overcome the weaknesses of classical SFs. However, extensive efforts have been devoted to developing SFs based on new protein–ligand interaction representations and advanced ML algorithms rather than on the energy components obtained by decomposing existing SFs. Here, we propose a new method named energy auxiliary terms learning (EATL), in which the scoring components are extracted and used as the input for the development of three levels of ML SFs, namely EATL SFs, docking-EATL SFs and comprehensive SFs, with ascending VS performance. The EATL approach not only outperforms classical SFs in absolute performance (ROC) and initial enrichment (BEDROC) but also yields performance comparable to that of other advanced ML-based methods on the diverse subset of the Directory of Useful Decoys: Enhanced (DUD-E). The test on the relatively unbiased actives as decoys (AD) dataset further confirmed the effectiveness of EATL. Furthermore, the idea of learning from SF components to yield improved screening power can also be extended to other available docking programs and SFs.

Introduction

As the most important approach used in structure-based virtual screening (SBVS), molecular docking can predict how a ligand binds to its biomolecular target and estimate the binding affinity between the ligand and target [1–3]. Operationally, the docking process consists of two basic stages: predicting the position, orientation and conformation of the ligand when docked into the binding pocket of a certain protein (pose generation or conformational sampling) and estimating how strongly the docked poses bind to the target (scoring). Correspondingly, the reliability of molecular docking depends on the coverage of pose generation and the accuracy of binding affinity predicted by scoring functions (SFs) [4, 5].

It must be recognized that SFs are built on essential simplifications for the sake of efficiency [6, 7], which is why no individual SF can cover the whole process of protein–ligand recognition. Nevertheless, SFs can still efficiently enrich active molecules from a huge pool of candidate compounds, and many successful applications have been publicly reported [8–12]. In fact, state-of-the-art docking programs already work reasonably well in conformational sampling, but inaccuracy in the binding affinities predicted by SFs remains the major limiting factor for more reliable docking and one of the most important open problems in computational chemistry [13].

SFs can be roughly classified into two major categories from the view of methodology: classical SFs (i.e. force field, empirical, knowledge-based, etc.) and machine learning (ML)-based SFs [14]. Most SFs implemented in docking programs are classical SFs. Their common characteristic is that they normally take the form of a linear regression over a small number of expert-selected features (e.g. non-bonded interaction terms, solvent-accessible surface area (SASA) and atom pairwise contacts), assuming a theory-motivated functional form for the relationship between the binding affinity and the features characterizing a complex, even though such a linear relationship may not always hold [15]. In contrast, ML-based SFs learn from very large volumes of structural and interaction data instead of assuming linearity [16]. Consequently, ML-based SFs can capture nonlinear binding interactions that are difficult to characterize with classical SFs, thus yielding better binding affinity predictions, and they have seen a surge in popularity for their superiority over classical SFs. The past few years have witnessed significant advances in ML-based SFs, and their development can be envisaged along two aspects: exploring novel protein–ligand interaction characterizations and switching to alternative ML algorithms. With respect to characterization, knowledge-based pairwise potentials [17–20], terms extracted from existing SFs [21–23] and protein–ligand interaction-related descriptors/fingerprints [24, 25] are the most popular representation methods. Besides, a wide variety of ML algorithms, including random forest (RF) [16, 26, 27], support vector machine (SVM) [24, 28], artificial neural network (ANN) [29, 30], gradient boosting decision tree (GBDT) [31] and convolutional neural network (CNN) [32], have been used to build ML-based SFs.

For most academic researchers, commonly used docking programs such as MOE [33], Glide [34, 35], GOLD [36], AutoDock [37] and AutoDock Vina [38], which have been developed and continually updated, remain the first choice for VS, considering their usability and availability. Usually, the final score is the only criterion used to decide whether a ligand can bind to its target. However, beyond a total score for the docked ligand, SFs can provide many detailed energy auxiliary terms that reflect features of protein–ligand recognition to some extent. Take Goldscore, one of the SFs employed in GOLD, as an example: along with the total score (Goldscore.Fitness), it supplies additional energy terms for various types of interactions, such as protein–ligand H-bonding, van der Waals interactions and ligand torsional strain. A straightforward question may thus be raised: can classical SFs be improved by optimizing their scoring components through ML algorithms?

Actually, some researchers have examined the potential benefit of combining the scoring components of existing SFs, or interaction energy components, with ML. For example, MIEC-GBDT was a VS classification model whose prediction accuracy reached 87% for luciferase inhibitors [39]. Molecular interaction energy components (MIECs) were extracted from the decomposition of molecular mechanics/generalized Born surface area (MM/GBSA) results, and the GBDT algorithm was then employed to build the MIEC-GBDT classification model. Yasuo et al. proposed a novel approach referred to as the similarity of interaction energy vector score (SIEVE-Score) [40], trained by random forest on the per-residue interaction energy terms extracted from Glide-SP docking results. SIEVE-Score showed much better performance than popular docking programs in distinguishing active compounds, but it also required more known experimental data and expert intervention. Yunierkis et al. put forward CompScore [41], which simply incorporates the information provided by SF components into a consensus scoring scheme [42] to boost virtual screening (VS) performance without training any ML model. These studies demonstrate that energy components may be effective features for representing protein–ligand recognition, and that the existing popular docking programs and SFs are therefore worth further exploration.

In this paper, we present a novel and handy approach based on the scoring components extracted from the output of the SFs implemented in several popular docking programs. The diverse subset of the Directory of Useful Decoys: Enhanced (DUD-E) [43] was used to test our approach. Each molecule in DUD-E was docked into its target using several popular docking programs, and different levels of ML models were then built on the scoring components yielded in that stage; the evaluation focused mainly on screening power, i.e. the capability to identify actives among decoys. Through this study, we expect to gain more insight into the following three questions. (i) Advanced ML algorithms were employed here to develop improved SFs, including EATL SFs, docking-EATL SFs and comprehensive SFs, by analyzing different combinations of the scoring components given by multiple classical SFs; can the screening power of these improved SFs surpass that of the classical SFs? (ii) Many ML-based SFs have proved themselves in the field of VS; can our model achieve results comparable to those of other advanced SFs or methods? (iii) It has been reported that the DUD-E benchmark may suffer from negative bias resulting from its decoy selection criteria, rendering the actives more distinguishable from the decoys; can our proposed method still yield satisfactory performance on another, relatively unbiased dataset?

Materials and methods

Validation dataset

The diverse subset of the DUD-E dataset [43], broadly accepted for benchmarking VS protocols, was used to test our approach. The whole DUD-E dataset consists of 102 target proteins, and the diverse subset of DUD-E contains 8 target proteins belonging to seven broad protein categories; the diverse subset can therefore represent the whole DUD-E dataset to some extent. More information about the subset is shown in Table 1. The active and inactive compounds for each target were taken from the ChEMBL [44] and ZINC [45] databases, respectively. The ratio of inactives (decoys) to actives for each target is 33.1 on average. Structurally similar active compounds had already been filtered out by clustering analysis. The decoys were chosen to physically resemble the ligands, so that they fulfill their role as negative controls that are challenging for docking, while being topologically dissimilar to minimize the likelihood that they are actual actives. In general, there are no identical or excessively similar data for each target. The DUD-E benchmark is available at http://dude.docking.org/

Table 1

Information of the diverse subset of DUD-E

Target  PDB   Description                            Class          Actives  Decoys
AMPC    1l2s  Beta-lactamase                         Other enzymes  62       2902
CXCR4   3odu  C-X-C chemokine receptor type 4        GPCR           122      3414
KIF11   3cjo  Kinesin-like protein 1                 Miscellaneous  197      6912
CP3A4   3nxu  Cytochrome P450 3A4                    CYP450         363      11940
GCR     3bqd  Glucocorticoid receptor                Nuclear        563      15185
AKT1    3cqw  Serine/threonine–protein kinase Akt-1  Kinase         423      16576
HIVRT   3lan  HIV type 1 reverse transcriptase       Other enzymes  639      19134
HIVPR   1xl2  HIV type 1 protease                    Protease       1395     36278

The actives as decoys (AD) dataset, a relatively unbiased version of DUD-E, was recently proposed by Kurtzman et al. in their work investigating the impact of potential biases in DUD-E on the performance of CNN models [46]. The AD dataset consists of the same 102 targets and inherits the active sets of DUD-E, while its decoy sets are entirely different. For each target in the AD dataset, all the actives from the other 101 targets were docked to the given protein target by AutoDock Vina, one target at a time. The top 50 actives of each of the 101 targets ranked by predicted binding affinity (approximately 5050 compounds, 50 × 101) were then selected to comprise the decoy set for the docked target. Owing to this special decoy selection criterion, the AD dataset appears to be a more challenging VS benchmark for scoring evaluation schemes. Here we chose the same eight targets as in the diverse subset for further strategy validation. The complete AD dataset for all 102 targets is freely available at www.lehman.edu/faculty/tkurtzman/files/102_targets_AD_dataset.tar
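
The per-target decoy selection described above can be sketched as follows. This is an illustrative helper, not part of the published AD pipeline: the function and column names (`source_target`, `vina_score`) are our own, and lower Vina scores are assumed to mean stronger predicted binding.

```python
import pandas as pd

def build_ad_decoys(cross_dock: pd.DataFrame, top_n: int = 50) -> pd.DataFrame:
    """Select an AD-style decoy set for one target from cross-docking results.

    `cross_dock` holds Vina scores of the actives from the other 101 targets
    docked into this target, with (hypothetical) columns:
    source_target, compound_id, vina_score (lower = stronger binding).
    """
    return (
        cross_dock.sort_values("vina_score")   # best predicted affinity first
        .groupby("source_target", sort=False)
        .head(top_n)                           # top 50 per source target
        .reset_index(drop=True)
    )
```

With 101 source targets this yields up to 50 × 101 = 5050 decoys per docked target, matching the construction described above.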

Protein–ligand docking and rescoring

All eight targets from the diverse subset were utilized for the validation of our methodology. Compounds were preprocessed by OMEGA [47] to yield appropriate conformations and isomers. Proteins were processed with the protein preparation and energy minimization modules implemented in MOE (version 2018.01). During the docking calculations with MOE, the binding site was defined by the co-crystallized ligand included in the DUD-E file, and the conformational search for each molecule was conducted with the Triangle Matcher algorithm. Thirty docking poses were generated for each compound during conformational sampling, and only the top-scored one was kept for the subsequent rescoring.

The top-scored poses were then rescored with MOE Dock (version 2018.01), GOLD (version 5.3.0) and Schrödinger Glide (version 7.1), respectively, with all parameters at their defaults unless otherwise specified. All the SFs implemented in MOE and GOLD were employed for rescoring, whereas for Glide only the standard precision (SP) mode was employed, considering the balance between efficiency and precision.

Scoring functions and components

The features used for model training are the scoring components extracted from the rescoring output files. The basic information of the 10 SFs and the corresponding scoring components is summarized in Table 2, and additional descriptions can be found in Supplementary Table S1. Of note, all the scoring components are energy terms representing protein–ligand recognition rather than purely ligand-based features, which are insufficient for predicting binding because they lack information about the receptor. We define these scoring-function components as energy auxiliary terms (EATs).
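
Conceptually, building the feature matrix amounts to joining the per-SF component tables on a common compound identifier; an EATL SF uses the components of a single SF, while a docking-EATL SF merges the components of every SF from one docking program. A minimal sketch, assuming hypothetical parsed tables (the values below are made up for illustration and would in practice come from the GOLD rescoring output):

```python
import pandas as pd

# Hypothetical per-SF component tables keyed by compound id.
chemscore = pd.DataFrame({
    "compound_id": ["mol1", "mol2"],
    "Chemscore.Fitness": [32.1, 18.4],
    "Chemscore.Hbond": [2.0, 0.5],
})
goldscore = pd.DataFrame({
    "compound_id": ["mol1", "mol2"],
    "Goldscore.Fitness": [55.3, 40.2],
    "Goldscore.External.HBond": [4.1, 1.2],
})

# Merging the component tables of several SFs from one program gives the
# feature vector of a docking-EATL SF (here a two-SF subset of GOLD-EATL).
gold_eatl_features = chemscore.merge(goldscore, on="compound_id", how="inner")
```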

Table 2

Information of the SFs and scoring components of each docking program

MOE (version 2018.01)
  Affinity-dG:   S, E_conf, E_score1, E_refine, E_place (5 components)
  Alpha-HB:      S, E_conf, E_score1, E_refine, E_place (5 components)
  GBVI/WSA-dG:   S, E_conf, E_score1, E_refine, E_place (5 components)
  London-dG:     S, E_conf, E_score1, E_refine, E_place (5 components)
  ASE:           S, E_conf, E_score1, E_refine, E_place (5 components)

GOLD (version 5.3.0)
  Goldscore:     Goldscore.Fitness, Goldscore.External.HBond, Goldscore.External.Vdw, Goldscore.Internal.Torsion, Goldscore.Internal.Vdw (5 components)
  ChemPLP:       PLP.Fitness, PLP.PLP, PLP.Chemscore.Hbond, PLP.ligand.torsion, PLP.part.buried, PLP.part.hbond, PLP.part.nonpolar, PLP.part.repulsive (8 components)
  ASP:           ASP.Fitness, ASP.ASP, ASP.DEClash, ASP.DEInternal, ASP.Map (5 components)
  Chemscore:     Chemscore.Fitness, Chemscore.DEClash, Chemscore.DEInternal, Chemscore.DG, Chemscore.Hbond, Chemscore.Lipo, Chemscore.Rot (7 components)

Schrödinger (version 7.1)
  GlideScore-SP: docking score, glide ligand efficiency, glide ligand efficiency sa, glide ligand efficiency ln, glide gscore, glide lipo, glide hbond, glide rewards, glide evdw, glide ecoul, glide erotb, glide esite, glide emodel, glide energy, glide einternal (15 components)

Model training

Five-fold cross-validation (CV) [48] was employed to evaluate the performance of the derived models. In practice, the data were first partitioned into five equally sized segments; five iterations of training and validation were then performed such that, in each iteration, four folds of the data were used for learning and the remaining fold for validation. In addition, stratified CV [49] was used to keep the class distribution as close as possible to identical across all folds. As mentioned above, our dataset is unbalanced (the ratio of decoys to actives is approximately 33 on average), so an undersampling strategy was applied to the negative samples to match their number to that of the positive samples in each segment of the training set; in other words, the final training set consists of equal numbers of actives and decoys [50]. To fully utilize the negative data, the undersampling process was repeated 100 times.
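
The repeated undersampling scheme can be sketched as follows; `balanced_subsets` is an illustrative helper of ours, not the authors' KNIME workflow. Each of the 100 subsets keeps all actives of the training fold and pairs them with an equally sized random draw of decoys.

```python
import numpy as np

def balanced_subsets(y_train, n_repeats=100, seed=0):
    """Yield index arrays for balanced training subsets: all actives (label 1)
    plus an equally sized random sample of decoys (label 0). Repeating the
    undersampling n_repeats times lets the negative data be fully utilized."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(np.asarray(y_train) == 1)
    neg = np.flatnonzero(np.asarray(y_train) == 0)
    for _ in range(n_repeats):
        neg_sample = rng.choice(neg, size=len(pos), replace=False)
        yield np.concatenate([pos, neg_sample])
```

One model is then trained per subset, and the 100 per-compound predictions are averaged.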

Extreme gradient boosting (XGBoost) [51], a type of gradient boosting decision tree method, was used to construct the classification model. XGBoost has been widely used in all kinds of data mining fields [52–54] as well as in the development of ML-based SFs [55, 56]. Both the first and second derivatives of the loss function are used, which enables the algorithm to converge faster and to find a good optimum more efficiently. The main hyper-parameters of XGBoost, including the learning rate, the maximum depth of a tree and the number of boosting rounds, were optimized by grid search with 5-fold cross-validation. In our study, the output of an XGBoost classifier is the predicted probability of being a true binder, on a scale of 0 to 1. The final prediction for each compound is the arithmetic mean of the 100 predictions. All the models were implemented in KNIME (version 3.7.1) [57]. In addition, the input feature files and workflow can be downloaded at https://github.com/xiongguoli/data_EATs-learning
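
A minimal sketch of the tuning step, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (with xgboost installed, `xgboost.XGBClassifier` exposes the same fit/predict_proba interface); the synthetic data and grid values are illustrative, not the authors' settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for an EAT feature matrix (scoring components as columns).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Grid over the three hyper-parameters named in the text: learning rate,
# maximum tree depth and number of boosting rounds, with 5-fold CV.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 5],
        "n_estimators": [50, 100],
    },
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
proba = grid.predict_proba(X)[:, 1]  # predicted probability of being a binder
```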

Evaluation metrics

To rigorously evaluate the performance of our models, the enrichment factor (EF) values at the 0.1%, 0.5%, 1%, 2% and 5% levels, the area under the receiver operating characteristic (ROC) curve (AUC), the Boltzmann-enhanced discrimination of receiver operating characteristic (BEDROC) score and the log-scaled area under the ROC curve (LogAUC) were adopted in the screening power assessment. All the metrics were calculated by in-house Python scripts.

The EF values are commonly used accuracy metrics in VS evaluation. The EFx% value is defined as the ratio between the predicted hit rate and the random hit rate when the top x% of ranked compounds are selected as actives. The ROC curve reflects the relationship between sensitivity and specificity [58], and the AUC, the area under the ROC curve, measures the overall performance of docking enrichment. Possible AUC values range from 0 to 1, with 1 corresponding to ideal prediction and 0.5 to random prediction; simply put, the larger the AUC value, the better the model performs. The BEDROC score incorporates an exponential weighting function that adapts it for use in early recognition problems. For each protein, the BEDROC score takes all the compounds into account rather than being limited to a percentile of the chemical library as EFx% is, and the weight given to the top-ranked compounds can be modulated by adjusting the parameter α. In this study, α was set to 80.5, which means the top-ranked 2% of molecules account for 80% of the BEDROC score [59]. LogAUC was introduced to tackle the early enrichment problem by computing the percentage of the ideal area that lies under the semilog ROC curve. Formally, LogAUCλ denotes the log-scaled area computed from λ to 1.0; in this study λ = 0.001, and LogAUC0.001 is simply denoted as LogAUC [60].
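
The EF and BEDROC metrics above can be sketched as follows; this is our own minimal implementation (following the Truchon–Bayly BEDROC formula), not the authors' in-house scripts, and higher scores are assumed to mean predicted binders.

```python
import math

def enrichment_factor(scores, labels, frac):
    """EF at fraction `frac`: hit rate among the top-ranked frac of the
    library divided by the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_top = max(1, int(round(frac * len(ranked))))
    hits_top = sum(lbl for _, lbl in ranked[:n_top])
    return (hits_top / n_top) / (sum(labels) / len(labels))

def bedroc(scores, labels, alpha=80.5):
    """BEDROC score; with alpha = 80.5 the top 2% of the ranking
    contributes about 80% of the score."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    N, n = len(ranked), sum(labels)
    ra = n / N
    # Exponentially weighted sum over the 1-based ranks of the actives
    s = sum(math.exp(-alpha * (i + 1) / N)
            for i, (_, lbl) in enumerate(ranked) if lbl)
    rie = s / (n * (1 - math.exp(-alpha)) / (N * (math.exp(alpha / N) - 1)))
    return (rie * ra * math.sinh(alpha / 2)
            / (math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ra))
            + 1 / (1 - math.exp(alpha * (1 - ra))))
```

For a perfect ranking BEDROC approaches 1, while a ranking with all actives at the bottom gives a score near 0.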

Results and discussion

The performance of EATL SFs

We constructed the EATs learning (EATL) SFs for each target by employing the EATs provided by each classical SF as the input features for training. The ROC curves and associated AUCs of the classical and EATL SFs are illustrated in Figure 1. Comparisons between the classical and EATL SFs in terms of the AUC-ROC values and BEDROC scores are shown in Figures 2 and 3. The detailed results obtained for the diverse subset are provided in Supplementary Table S2.

Figure 1

ROC curves of 10 classical SFs (A) and EATL SFs (B) on 8 targets. ADG, affinity dG; AHB, alpha HB; ASE, ASE; GW, GBVI/WSA dG; LDG, London dG. The dashed line indicates random selection.

Figure 2

Scatter plots of the AUC-ROC (A) and LogAUC (B) results of classical and EATL SFs. The shape and color of each point represent the target and SF, respectively. The dashed line indicates the limit where both scoring schemes perform equally.

Figure 3

Box plot of the BEDROC (α = 80.5) results of classical and EATL SFs. The horizontal lines indicate the median, and the plus signs represent the mean BEDROC score. ADG, affinity dG; AHB, alpha HB; ASE, ASE; GW, GBVI/WSA dG; LDG, London dG. The plot is built with the BEDROC values for all the targets as provided in Supplementary Table S2.

It can be seen from Figure 1A that some classical SFs yield satisfactory performance on certain targets, such as Glide-SP for AMPC (0.849), Alpha-HB for KIF11 (0.878) and Affinity-dG for GCR (0.812), but perform badly on others, with values even lower than the random level (0.5), such as Goldscore for GCR (0.307), London-dG for HIVRT (0.369) and Goldscore for HIVPR (0.374), even though the improvement over the random level is statistically significant (P < 0.05) across the dataset for most classical SFs except London-dG and Goldscore. Among all 10 classical SFs, Glide-SP achieves the best average performance (0.723), whereas the SFs from MOE and GOLD only obtain average values ranging from 0.523 to 0.671, implying that it is hard for most classical SFs to distinguish ligands from decoys in VS campaigns. For our EATL SFs (Figure 1B), by contrast, every AUC value is far higher than 0.5, with 34 of them above 0.8. In addition, some EATL SFs yield remarkable performance on certain targets, such as Glide-EATL for KIF11 (0.915) and HIVPR (0.916). The comparison results for LogAUC are similar to those for AUC, as shown in Supplementary Figure S1.

For comparison, Figure 2 presents scatter plots of the ROC-AUC (A) and LogAUC (B) results for classical SFs versus EATL SFs, where the shape and color of each point represent the target and SF, respectively. Our EATL SFs achieve improved VS performance in terms of both metrics for all the scoring schemes across the dataset, and the improvement is statistically significant (P = 3.4 × 10^-17 and 3.4 × 10^-14 for ROC-AUC and LogAUC, respectively) according to the paired t-test. Further learning of the SF components increases the AUC by an average of 24.67%. More importantly, it is interesting to observe that even the worst mean AUC value provided by the EATL SFs is higher than the best value provided by the classical SFs.
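
The paired t-test above compares each (target, SF) AUC value before and after EATs learning. A minimal sketch of the test statistic (the helper name is ours; `scipy.stats.ttest_rel(auc_eatl, auc_classical)` returns the same t along with its two-sided p-value):

```python
import numpy as np

def paired_t_statistic(auc_classical, auc_eatl):
    """t statistic of the paired t-test on matched AUC pairs: the mean of the
    per-pair differences divided by its standard error."""
    d = np.asarray(auc_eatl) - np.asarray(auc_classical)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```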

Next, BEDROC with α = 80.5 (see Materials and methods) was chosen to address the 'early recognition' problem and to weight the contribution of the top of the rank axis to the final score. As illustrated in Figure 3, sharp differences exist among the classical SFs implemented in different docking programs. Glide-SP still yields the best performance in terms of early recognition ability. The worst early enrichment is obtained with London-dG, and most other classical SFs show similarly poor performance (lower than 0.2). Based on the mean BEDROC values, the early recognition power of the classical SFs ranks as follows: Glide-SP (0.274) > ASP (0.196) > Alpha-HB (0.173) > PLP (0.156) > Chemscore (0.145) > GBVI/WSA-dG (0.133) > ASE (0.113) > Goldscore (0.085) > Affinity-dG (0.083) > London-dG (0.060). Similar to the results in the previous section, further learning of the SF components using XGBoost (in blue) leads to an obvious improvement of the BEDROC scores relative to the classical SFs (in red). This supports our hypothesis that SF components provide meaningful information on protein–ligand interactions, and that learning from these energy auxiliary terms with ML algorithms can compensate for the weaknesses of individual scoring terms to better serve VS campaigns.

The performance of docking-EATL SFs

The comparison of classical and EATL SFs above demonstrated the clear superiority of our feature-based learning method in screening power. Hence, we also explored whether learning all the energy auxiliary terms from the same docking program, defined as docking-EATL SFs, could outperform the previous EATL SFs. The ROC and semilog ROC curves of the docking-EATL SFs and the best-performing EATL SF for each docking program (ASE for MOE and Chemscore for GOLD) are illustrated in Figures 4 and 5. The EF results are presented in Supplementary Table S3, and the distributions of EF across targets are shown in Figure 6. The BEDROC scores of the tested models are listed in Table 3.

Figure 4

ROC curves of docking-EATL SFs and the best EATL SFs across the eight targets. Chemscore-EATL and ASE-EATL are the best EATL SFs of GOLD and MOE, respectively. The dashed line indicates random selection.

Figure 5

Semilog ROC curves of docking-EATL SFs and the best EATL SFs across the eight targets. Chemscore-EATL and ASE-EATL are the best EATL SFs of GOLD and MOE, respectively. The dashed line indicates random selection.

Figure 6

Distributions of the EFs at the top 0.1%, 0.5%, 1%, 2% and 5% levels for docking-EATL SFs and the best EATL SFs across the eight targets. The horizontal lines indicate the median, and the plus signs represent the mean enrichment factor. The plot is built with the EF values at different levels for the eight targets as provided in Supplementary Table S3.

Table 3

Comparison of BEDROC values for docking-EATL and the best EATL SFs on eight targets

Target  ASE-EATL  Chemscore-EATL  MOE-EATL  GOLD-EATL  Glide-EATL
AMPC    0.056     0.196           0.294     0.371      0.504
CXCR4   0.246     0.280           0.290     0.591      0.578
KIF11   0.567     0.611           0.712     0.761      0.671
CP3A4   0.085     0.166           0.277     0.338      0.324
GCR     0.444     0.341           0.604     0.737      0.442
AKT1    0.158     0.241           0.208     0.532      0.552
HIVRT   0.518     0.546           0.778     0.670      0.711
HIVPR   0.440     0.537           0.597     0.772      0.710
Mean    0.314     0.365           0.470     0.597      0.562
Median  0.343     0.311           0.446     0.631      0.565
Figure 7

ROC and semilog ROC curves of CompF and CompM across the dataset.

A detailed analysis of the data presented in Figures 4 and 5 shows that the ROC-AUC and LogAUC results share highly similar trends; for simplicity, our discussion will mainly focus on ROC-AUC. Here, MOE-EATL and GOLD-EATL, trained with all the energy auxiliary terms provided by each docking program, achieve remarkable AUC performance, exceeding the corresponding best EATL SFs (ASE-EATL and Chemscore-EATL, respectively) by 15.4% and 14.7% on average. The absolute ROC-AUC on AMPC can be improved from 0.641 to 0.876 by learning all the scoring components within MOE rather than those within ASE. Similarly, an improvement of over 0.2 (from 0.687 to 0.953) can be observed on the target HIVPR by learning all the scoring components within GOLD instead of those within Chemscore. According to the averaged AUCs, the docking-EATL SFs rank as follows: GOLD-EATL (0.901) > MOE-EATL (0.895) > Glide-EATL (0.883); when the medians are taken as the criterion, the order changes to GOLD-EATL (0.918) > Glide-EATL (0.900) > MOE-EATL (0.892). It should be noted that Glide, the docking module of Schrödinger, differs from the other tested docking programs, which offer multiple SFs to choose from. Considering precision and computation time, only the standard precision (SP) mode was selected for scoring. The SF in this mode is unique, and the number of scoring components used for model training is far smaller than those of GOLD-EATL and MOE-EATL. Nevertheless, the performance of Glide-EATL is by no means inferior to that of its counterparts, indicating that the 15 components from the Glide-SP docking mode capture sufficient protein–ligand binding information to distinguish true ligands from decoys.

We also evaluated the performance of the docking-EATL SFs with respect to BEDROC. Consistent with the previous findings, the docking-EATL SFs outperform the corresponding top-ranked EATL SFs, and the improvement in BEDROC is statistically significant (P = 0.08 and 0.009 for MOE and GOLD, respectively). In addition, the improvements in BEDROC obtained by learning more scoring components reach up to 49.6% and 63.5% on average for MOE and GOLD, respectively, considerably higher than those in AUC. The largest improvement in BEDROC is observed for the AMPC target, with a 425% increase (over 5-fold) from ASE-EATL (0.056) to MOE-EATL (0.294). We also find that the classical ASE SF achieves a BEDROC of only 0.001 on this target, suggesting that ASE has some inherent defects there, so that the components provided by ASE may be unable to characterize the protein–ligand interaction patterns properly. However, our EATL method can compensate for the inherent defects of classical SFs and provide acceptable performance at both the SF and docking-program levels. Overall, considering the mean and median BEDROC scores, the docking-EATL SFs rank as GOLD-EATL > Glide-EATL > MOE-EATL.

The distributions of the enrichment factors across the targets are shown in Figure 6. Closely examining the EFs at the top 0.1%, 0.5%, 1%, 2% and 5% levels (Supplementary Table S3) shows that the docking-EATL SFs have a clear advantage over the EATL SFs in terms of initial enrichment. GOLD-EATL and Glide-EATL appear more suitable for VS campaigns because of their narrow fluctuation ranges and larger medians, suggesting that they can provide generally excellent initial enrichment for almost all kinds of proteins. One can also see from Figure 6 that MOE-EATL fails to generate enrichment as good as the other two docking-EATL SFs and even performs similarly to the two EATL SFs, although it yields acceptable results in terms of mean ROC-AUC. In fact, the enrichment factors may be the more reliable metric here, because in practice most attention is paid to the top-ranked molecules after a large chemical database has been ranked from the highest to the lowest probability of binding to a certain target.
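The enrichment factor has a simple definition: the hit rate within the top x fraction of the ranked library divided by the hit rate of the whole library. A minimal sketch (function name and toy data are ours, not from the paper):

```python
import math

def enrichment_factor(scores, labels, fraction):
    """EF_x = (actives found in the top fraction / total actives) / fraction taken."""
    n = len(scores)
    n_top = max(1, math.ceil(n * fraction))                    # size of the top slice
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])  # highest score first
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / sum(labels)) / (n_top / n)

# Toy library: 3 actives among 10 compounds, all three ranked at the top.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.0]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(scores, labels, 0.20))  # top 20% (2 compounds) holds 2/3 of the actives
```

Note that EF at fraction x is bounded above by min(1/x, N/n) for n actives among N compounds, which is why the EF0.1% values reported for different targets have different attainable maxima.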

A closer look at the results in terms of all the evaluation metrics shows that all the EATL SFs discussed in this section perform somewhat worse on the targets CP3A4 and AMPC. We also checked the data of the other scoring schemes mentioned above, and the same phenomenon could be observed for most of them. The inherent characteristics of the protein may contribute to the poor VS performance on CP3A4. CP3A4 is a major drug-metabolizing enzyme that plays an important role in the metabolism of approximately 30% of clinically used drugs [61, 62]. Its binding pocket is considerably wider and more flexible than those of most other proteins, ranging in volume from 950 Å³ to 2000 Å³ in the absence and presence of bound ligands [63, 64], so that diverse substrates and inhibitors can be accommodated in the active site, which makes scoring a difficult task. Therefore, it is not surprising that learning popular SF components on CP3A4 does not give results as good as those on the other targets. As for AMPC, the loss in accuracy may be explained by bias in the training data: this library contains only 62 ligands, and four-fifths of the actives are available in the training set. Accordingly, the XGBoost model may not be able to distinguish actives from decoys and assign true binders higher prediction values from such limited information.

Another interesting finding is that the impact of the target on the same scoring scheme is significant; in other words, no scoring scheme here can obtain good AUC values on all eight targets simultaneously. Taking GOLD-EATL, which shows the most consistently excellent performance, as an example, its BEDROC scores range from 0.371 (AMPC) to 0.772 (HIVPR). The same phenomenon is observed for the other scoring schemes and evaluation metrics. It is therefore pivotal to make a detailed assessment of scoring schemes before VS campaigns to guarantee that the SFs used for docking are reliable. Future research efforts should be devoted to the development of target-customized SFs.

The performance of comprehensive EATL SFs

We also compared the compounds predicted by the three docking-EATL SFs. Taking the target HIVPR as an example, the top 1% of predicted compounds (343 compounds) was extracted from each of the three docking-EATL SFs for further analysis. A total of 238, 303 and 288 actives were identified by MOE-EATL, GOLD-EATL and Glide-EATL, respectively, in the top 1% lists. Among these actives, only 63 compounds were identified by all three docking-EATL SFs simultaneously, 164 by two docking-EATL SFs, and 312 by a single docking-EATL SF. The results for the other targets are shown in Supplementary Figure S2. These findings indicate that the overlaps among the predictions of the three docking-EATL SFs are not particularly high, and therefore there is room to improve VS performance by appropriately combining these methods.
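The overlap bookkeeping above is plain set arithmetic; a sketch with hypothetical compound IDs (the real per-target lists are in Supplementary Figure S2):

```python
from collections import Counter

# Hypothetical top-1% active lists for one target (IDs are made up for illustration).
top_actives = {
    "MOE-EATL":   {"a1", "a2", "a3", "a5"},
    "GOLD-EATL":  {"a1", "a3", "a4", "a6"},
    "Glide-EATL": {"a1", "a2", "a4", "a7"},
}

# Count, for each active, how many of the three models retrieved it,
# then tally how many actives fall into each agreement level.
hits = Counter(cid for ids in top_actives.values() for cid in ids)
by_agreement = Counter(hits.values())
print(by_agreement)  # key 3: found by all three SFs, 2: by two, 1: unique to one
```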

To further explore whether the combination of these three docking-EATL SFs can boost VS performance, two new experiments were performed. First, all 61 energy auxiliary terms extracted from the three docking programs were used as input features to train a new comprehensive model named CompF; the details of model training are the same as for the other EATL SFs mentioned above. The second experiment is based on the prediction results of the three docking-EATL SFs: for each compound in the library, the arithmetic mean of the predictions given by the three docking-EATL SFs was taken as the final prediction. We named this scoring scheme CompM. The ROC and semilog ROC curves of the two comprehensive EATL SFs are shown in Figure 7, and the initial enrichment results are summarized in Table 4.
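The two schemes differ only in where the combination happens: CompF concatenates the three programs' terms into one feature vector before training, while CompM averages the outputs of three separately trained models. A sketch on synthetic data (the paper trains XGBoost; scikit-learn's gradient boosting is used here as a stand-in, and the 20/26 split between MOE and GOLD terms is illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost

rng = np.random.default_rng(0)
n_mol = 200
# Synthetic energy auxiliary terms; Glide-SP contributes 15 terms, and the
# MOE/GOLD dimensions are illustrative (61 terms in total, as in the paper).
X_moe   = rng.normal(size=(n_mol, 20))
X_gold  = rng.normal(size=(n_mol, 26))
X_glide = rng.normal(size=(n_mol, 15))
y = rng.integers(0, 2, size=n_mol)

# CompF: a single model over the concatenated 61-term feature vector.
X_all = np.hstack([X_moe, X_gold, X_glide])
p_comp_f = GradientBoostingClassifier().fit(X_all, y).predict_proba(X_all)[:, 1]

# CompM: arithmetic mean of the probabilities of three per-program models.
blocks = (X_moe, X_gold, X_glide)
models = [GradientBoostingClassifier().fit(X, y) for X in blocks]
p_comp_m = np.mean([m.predict_proba(X)[:, 1] for m, X in zip(models, blocks)], axis=0)
```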

Table 4

The VS performance of CompF and CompM

Target   AUC              BEDROC           LogAUC           EF0.1%
         CompF    CompM   CompF    CompM   CompF    CompM   CompF     CompM
AMPC     0.919    0.939   0.545    0.574   0.573    0.611   32.258    32.258
CXCR4    0.943    0.954   0.745    0.852   0.663    0.740   25.000    25.000
KIF11    0.962    0.965   0.813    0.814   0.740    0.745   35.533    35.533
CP3A4    0.909    0.914   0.540    0.574   0.536    0.559   19.608    33.613
GCR      0.950    0.958   0.807    0.784   0.734    0.724   37.143    37.143
AKT1     0.944    0.950   0.727    0.713   0.694    0.695   44.248    44.248
HIVRT    0.985    0.989   0.906    0.931   0.836    0.870   30.201    30.201
HIVPR    0.981    0.971   0.895    0.862   0.802    0.751   24.229    24.963
Table 5

Comparison with several other methods in terms of EF0.1% and BEDROC (α = 80.5)

EF0.1%   GOLD-EATL  Glide-EATL  CompF    CompM    CompScore  CNN     DenseFS  SIEVE-Score  RF-Score-VS v3
AMPC     25.806     35.484      32.258   32.258   39.570     2.083   14.583   30.700       32.265
CXCR4    20.000     20.833      25.000   25.000   51.610     5.000   5.000    61.100       60.853
KIF11    33.503     30.457      35.533   35.533   51.340     11.207  4.310    53.400       4.527
CP3A4    17.927     15.126      19.608   33.613   14.040     28.743  44.311   6.700        25.865
GCR      34.571     24.000      37.143   37.143   27.090     12.791  20.930   33.300       32.450
AKT1     29.204     31.563      44.248   44.248   37.570     84.642  89.420   42.100       41.893
HIVRT    28.691     29.027      30.201   30.201   21.770     12.195  12.760   39.800       39.813
HIVPR    23.421     21.953      24.229   24.963   18.200     9.851   8.396    38.300       65.735

BEDROC   GOLD-EATL  Glide-EATL  CompF    CompM    CompScore  GOLD    Glide    Surflex      Flex X
AMPC     0.294      0.371       0.545    0.574    0.66       0.04    0.09     0.00         0.04
CXCR4    0.290      0.591       0.745    0.852    0.71       0.08    0.01     0.27         0.01
KIF11    0.712      0.761       0.813    0.814    0.84       0.55    0.59     0.12         0.08
CP3A4    0.277      0.338       0.540    0.574    0.71       0.21    0.17     0.13         0.08
GCR      0.604      0.737       0.807    0.784    0.48       0.13    0.21     0.30         0.18
AKT1     0.208      0.532       0.727    0.713    0.68       0.42    0.24     0.05         0.11
HIVRT    0.778      0.670       0.906    0.931   0.43        0.42    0.37     0.13         0.19
HIVPR    0.597      0.772       0.895    0.862    0.30       0.30    0.14     0.10         0.05

From Figure 7 and Table 4, it can be observed that CompF and CompM yield similarly outstanding results across the targets, clearly ahead of any single EATL scoring scheme. Despite some improvements of CompM over CompF in terms of ROC-AUC, BEDROC, LogAUC and EF1%, the Wilcoxon matched-pairs signed rank test indicates that the improvements are not statistically significant for any of the four metrics. It should be noted that the performance achieved by CompF and CompM is mainly attributable to learning these SF components with XGBoost rather than to a simple consensus scoring. The EF data in Table 4 and Supplementary Table S3 show that neither CompM nor CompF brings a large boost to the initial enrichment at the 0.1% level, although the two comprehensive EATL SFs are indeed better VS scoring schemes; however, the comprehensive EATL models do perform better at the top 0.5% and 1% levels, as provided in Supplementary Table S3.
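The significance check above is a paired test across the eight targets. Using the per-target ROC-AUC values from Table 4 and SciPy's implementation, the comparison looks like this:

```python
from scipy.stats import wilcoxon

# Per-target ROC-AUC of CompF and CompM from Table 4, paired by target
# (AMPC, CXCR4, KIF11, CP3A4, GCR, AKT1, HIVRT, HIVPR).
auc_comp_f = [0.919, 0.943, 0.962, 0.909, 0.950, 0.944, 0.985, 0.981]
auc_comp_m = [0.939, 0.954, 0.965, 0.914, 0.958, 0.950, 0.989, 0.971]

stat, p = wilcoxon(auc_comp_f, auc_comp_m)  # paired, non-parametric
print(stat, p)  # p > 0.05: the CompF/CompM gap in AUC is not significant
```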

Comparison with other methods

Because MOE-EATL does not perform well in the assessments of screening power, it was excluded from the comparison with other methods that report relatively good accuracy on the same dataset. Ragoza et al. proposed a convolutional neural network (CNN) SF that automatically learns the key features of protein–ligand interactions [65]. Building on that work, Imrie et al. presented a deep learning approach named DenseFS that substantially improves over the baseline CNN of Ragoza et al. in VS on DUD-E [66]. Besides, Yasuo et al. developed a new VS method named similarity of interaction energy vector score (SIEVE-Score), in which the van der Waals, Coulomb and hydrogen bonding interactions are extracted to represent docking poses for ML [40]. In the same work, RF-Score-VS v3, a state-of-the-art ML-based SF, was evaluated for comparison. Recently, Perez-Castillo et al. implemented and tested a new method named CompScore, which uses genetic algorithms to find the best combination of scoring components [41]. The comparisons between EATL and several other scoring schemes were conducted, and the corresponding EF0.1% and BEDROC values are listed in Table 5. Note that the data for GOLD, Glide, Surflex and Flex X were taken from [67].

First, consistent with previous findings, the initial enrichment performance of some models, especially CNN and DenseFS, is largely determined by the target library used for testing. For example, relative to the other scoring schemes, CNN yields a satisfactory enrichment factor on AKT1 (84.642) but poor enrichment on AMPC (2.083). DenseFS and CNN both use direct and comprehensive 3D descriptions of complex structures as input, implying that structure-based features may provide enough information on certain targets but fail to capture adequate information on others. From Table 5 we may conclude that SIEVE-Score and RF-Score-VS v3 are the most broadly applicable scoring schemes for this dataset, despite the existence of some extremely low values. However, according to the Wilcoxon matched-pairs signed rank test (95% confidence interval), the superiority of SIEVE-Score and RF-Score-VS v3 over our two comprehensive EATL SFs is not statistically significant. Furthermore, CompF and CompM clearly outperform the four popular docking programs in terms of BEDROC. It is of great importance for general scoring methods to obtain stable, acceptable performance on most target libraries; in this respect, CompF and CompM show the most comprehensive performance.

Extra-validation on AD dataset

It is important to note that DUD-E may not be as perfect a VS benchmark as previously thought, and a number of bias problems remain in dispute [46, 68–70]. The study of Sieg et al. claimed that the six properties (molecular weight, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds, logP and net charge) originally used to eliminate artificial enrichment in DUD-E were in fact biased [68]. They applied the actives and decoys with matched physicochemical properties to simple ML algorithms and obtained high-performance VS models. However, these results may be insufficient to overturn all the ML methods tested on the DUD-E dataset. After all, the DUD-E benchmark was initially designed for evaluating docking schemes rather than ligand-based virtual screening, and the subtle negative bias may be minimized if proper characteristics are chosen to represent the protein–ligand complexes. Simple ligand-based physicochemical features, 2D topology descriptors or any features explicitly based on them may cause high artificial enrichment and lead ML-SFs astray [43], whereas the features adopted in our strategy are scoring function components that depict protein–ligand recognition events and are therefore less likely to reflect the negative bias in the dataset.

Even so, it is necessary to test the effectiveness of our energy auxiliary terms learning method on an unbiased dataset. The AD dataset can be considered such an unbiased version of DUD-E; its decoys are selected from Vina docking results with high predicted binding affinities. The construction process of the AD dataset, described in Materials and methods, is vastly different from that of DUD-E. In a way, it is more appropriate to call it a challenging version of DUD-E, because some decoys may receive higher predicted binding affinities from Vina than true actives. We repeated the EATL validation procedure on the corresponding eight targets of the AD dataset. For simplicity, only the docking-EATL SFs and comprehensive SFs on the AD dataset are included in this discussion. The evaluation results in terms of ROC-AUC and BEDROC are summarized in Figure 8, and the detailed data are provided in Supplementary Table S4. Additionally, the results of 10 classical SFs on the AD dataset are presented in Supplementary Table S5.
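BEDROC, used throughout with α = 80.5, weights early-retrieved actives exponentially and rescales the result onto [0, 1]. A compact re-implementation of the Truchon–Bailey formula is included here for clarity (RDKit ships an equivalent CalcBEDROC):

```python
import math

def bedroc(scores, labels, alpha=80.5):
    """Boltzmann-enhanced discrimination of ROC (Truchon & Bailey formula)."""
    big_n = len(scores)
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])    # best score first
    ranks = [i + 1 for i, (_, lab) in enumerate(ranked) if lab]  # ranks of the actives
    n = len(ranks)
    ra = n / big_n
    # Relative initial enrichment (RIE) of the observed ranking.
    rie = (sum(math.exp(-alpha * r / big_n) for r in ranks)
           / (n / big_n * (1 - math.exp(-alpha)) / (math.exp(alpha / big_n) - 1)))
    # Rescale RIE onto [0, 1].
    return (rie * ra * math.sinh(alpha / 2)
            / (math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ra))
            + 1 / (1 - math.exp(alpha * (1 - ra))))
```

With a perfect ranking BEDROC approaches 1 and with the actives ranked last it approaches 0; larger α concentrates the weight on the very top of the list.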

From Figure 8A, it can be seen that the AUC values decrease slightly (by about 0.033 on average) when the five EATL SFs are tested on the AD dataset. In addition, the AUC values show higher volatility across the eight AD targets, which may result from two extreme targets, AMPC and CP3A4: all five EATL SFs yield their best absolute screening performance on AMPC and their worst on CP3A4. As discussed previously, the larger and more flexible receptor pocket of the CP3A4 protein makes molecular docking a tough task. For the early enrichment metric BEDROC (Figure 8B), we also observe a moderate decrease relative to the DUD-E targets; however, the fluctuation range across the targets remains comparable to its counterpart. Among the SFs, the comprehensive EATL SFs outperform the docking-EATL SFs, in line with the previous results. On the whole, the degradation of screening performance is clear, indicating that the predictions on the DUD-E dataset may be relatively optimistic compared with those on the AD dataset.

We also analyzed the screening performance of 10 classical SFs (Supplementary Table S5). As expected, the classical SFs have trouble distinguishing actives from decoys on the AD dataset, as reflected by many AUC values below random selection and BEDROC values approaching zero. In other words, these classical SFs tend to mistake decoys for actives and cannot score actives with higher fitness values or lower docking scores. Even so, it should be noted that Glide-SP and Chemscore show relatively good screening power, with average AUC values close to 0.6. The poor screening power should be attributed to the harsh decoy selection criteria of the AD dataset, which also demonstrates that ranking by a single docking score is insufficient for VS work and can even lead to diametrically opposite predictions.

Generally, it is promising to make full use of multiple scoring function components rather than simple docking scores in VS projects. The validation results show that our EATL method is effective in improving the poor screening power of classical SFs on the unbiased dataset. Although the predictions in the DUD-E experiments may be optimistic, it should be recognized that the high screening performance of these ML models results largely from learning true protein–ligand interactions rather than negative bias, as supported by the fact that the five EATL SFs do not suffer a sharp decline in screening power on the AD dataset.

Conclusions

In this study, we proposed a simple, convenient and universal scoring method named EATL, which implements the idea of analyzing the scoring components generated by popular docking programs with an ML algorithm to substantially improve screening power. The analysis was carried out with XGBoost, one of the most powerful ML methods currently available. We tested our method on the diverse subset of DUD-E at different learning levels and compared its absolute performance and initial enrichment with those of a number of classical SFs and state-of-the-art ML-based SFs. We also performed further validation on the relatively unbiased AD dataset, which yielded reduced but acceptable screening performance compared with that on DUD-E.

Figure 8

Box plots of the ROC-AUC (A) and BEDROC (B) results of the five EATL SFs on the eight targets of the DUD-E and AD datasets. The horizontal lines indicate the median, and the plus signs represent the mean value. MOE, MOE-EATL; GOLD, GOLD-EATL; Glide, Glide-EATL. The plot is built from the AUC and BEDROC values for all the targets as provided in Supplementary Table S4.

The EATL SFs are far superior to conventional SFs in terms of well-established VS metrics. We observed that EATL could greatly improve the poor early recognition obtained by classical SFs across the dataset, suggesting that making the best use of the components provided by popular SFs is a promising way to boost VS performance. We also find that complementarity exists between different scoring functions, since the docking-EATL SFs, which use all the components within each docking program as input, obtain improved screening power. Furthermore, the two comprehensive EATL models, CompF and CompM, which employ all the components from the three docking programs, achieve comparably stable performance relative to other advanced ML-based SFs, suggesting that combining multiple docking programs may better compensate for their individual weaknesses. Although the usefulness of rescoring aided by ML methods has already been noted, what distinguishes our method is that it makes the best of conventional scoring systems whose effectiveness and accuracy have been verified by extensive studies. Overall, EATL represents a new, simple and practical analytical method that is not only easy for most VS researchers to use but can also be extended to other docking programs.

In addition, we suggest that any validation of ML-based SBVS methods on DUD-E should be cautious about the features used to characterize protein–ligand interactions and should include an explicit discussion of bias in the dataset. No benchmark can be expected to simulate real scenarios from all perspectives; ultimately, all ML-based methods should be tested experimentally in practice. Here we encourage wider adoption of our EATL strategy in VS work and look forward to the development of other boosted methods and more reliable VS benchmarks that will benefit the SBVS community.

Key Points:
  • Components generated from 10 scoring functions were analyzed by the XGBoost algorithm to improve screening power.

  • EATL-SFs, docking-EATL SFs and comprehensive SFs are superior to classical scoring functions in terms of well-established virtual screening metrics across the diverse subset of DUD-E.

  • Complementarity exists between different scoring functions and docking programs.

  • The improved VS performance of EATL could be reproduced on the unbiased AD dataset.

Abbreviations

VS, virtual screening; SF, scoring function; SBVS, structure-based virtual screening; DUD-E, Directory of Useful Decoys: Enhanced; AD, actives as decoys; ML, machine learning; XGBoost, extreme gradient boosting; ROC-AUC, area under the receiver operating characteristic curve; EF, enrichment factor; BEDROC, Boltzmann-enhanced discrimination of receiver operating characteristic; LogAUC, log-scaled area under the receiver operating characteristic curve.

Guo-Li Xiong is a master student at the Xiangya School of Pharmaceutical Sciences, Central South University, China. Her research focuses on the development of novel structure-based virtual screening methodologies.

Wen-Ling Ye is a master student at the Xiangya School of Pharmaceutical Sciences, Central South University, China. Her research theme is design and discovery of small molecular inhibitors of important protein targets.

Chao Shen is a PhD candidate in the College of Pharmaceutical Sciences, Zhejiang University, China. His research interests lie in the area of computer-aided drug design.

Ai-Ping Lu is currently a professor in the Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong.

Ting-Jun Hou is currently a Professor in the College of Pharmaceutical Sciences, Zhejiang University, China. His research interests can be found at the website of his group: http://cadd.zju.edu.cn

Dong-Sheng Cao is currently an Associate Professor in the Xiangya School of Pharmaceutical Sciences, Central South University, China. His research interests can be found at the website of his group: http://www.scbdd.com

References

1.

Kuntz
ID
.
Structure-based strategies for drug design and discovery
.
Science
1992
;
257
:
1078
82
.

2.

Shoichet
BK
.
Virtual screening of chemical libraries
.
Nature
2004
;
432
:
862
5
.

3.

Cheng
T
,
Li
Q
,
Zhou
Z
, et al.
Structure-based virtual screening for drug discovery: a problem-centric review
.
AAPS J
2012
;
14
:
133
41
.

4.

Shen
C
,
Ding
J
,
Wang
Z
, et al.
From machine learning to deep learning: advances in scoring functions for protein–ligand docking
.
WIREs Comput Mol Sci
2020
;
10
:
e1429
.

5.

Wang
Z
,
Sun
H
,
Yao
X
, et al.
Comprehensive evaluation of ten docking programs on a diverse set of protein-ligand complexes: the prediction accuracy of sampling power and scoring power
.
Phys Chem Chem Phys
2016
;
18
:
12964
75
.

6.

Marrone
TJ
,
Briggs
JM
,
McCammon
JA
.
Structure-based drug design: computational advances
.
Annu Rev Pharmacol Toxicol
1997
;
37
:
71
90
.

7.

Wong
CF
,
McCammon
JA
.
Protein flexibility and computer-aided drug design
.
Annu Rev Pharmacol Toxicol
2003
;
43
:
31
45
.

8.

Gancia
E
,
De Groot
M
,
Burton
B
, et al.
Discovery of LRRK2 inhibitors by using an ensemble of virtual screening methods
.
Bioorg Med Chem Lett
2017
;
27
:
2520
7
.

9.

Wang
F
,
Yang
W
,
Hu
X
.
Discovery of high affinity receptors for dityrosine through inverse virtual screening and docking and molecular dynamics
.
Int J Mol Sci
2018
;
20
:115.

10.

Meekrathok
P
,
Thongsom
S
,
Aunkham
A
, et al.
Novel GH-20 beta-N-acetylglucosaminidase inhibitors: virtual screening, molecular docking, binding affinity, and anti-tumor activity
.
Int J Biol Macromol
2020
;
142
:503–512.

11.

Russo Spena
C
,
De Stefano
L
,
Poli
G
, et al.
Virtual screening identifies a PIN1 inhibitor with possible antiovarian cancer effects
.
J Cell Physiol
2019
;
234
:15708–15716.

12.

Ramirez
D
,
Concha
G
,
Arevalo
B
, et al.
Discovery of novel TASK-3 channel blockers using a Pharmacophore-based virtual screening
.
Int J Mol Sci
2019
;
20
:4014.

13.

Leach
AR
,
Shoichet
BK
,
Peishoff
CE
.
Prediction of protein-ligand interactions. Docking and scoring: successes and gaps
, J Med Chem
2006
;
49
:
5851
5
.

14.

Khamis
MA
,
Gomaa
W
,
Ahmed
WF
.
Machine learning in computational docking
.
Artif Intell Med
2015
;
63
:
135
52
.

15.

Li
H
,
Peng
J
,
Sidorov
P
, et al.
Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data
.
Bioinformatics
2019
;
35
:3989–3995.

16.

Ballester
PJ
,
Mitchell
JB
.
A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking
.
Bioinformatics
2010
;
26
:
1169
75
.

17. Li L, Wang B, Meroueh SO. Support vector regression scoring of receptor-ligand complexes for rank-ordering and virtual screening of chemical libraries. J Chem Inf Model 2011;51:2132–8.

18. Li L, Khanna M, Jo I, et al. Target-specific support vector machine scoring in structure-based virtual screening: computational validation, in vitro testing in kinases, and effects on lung cancer cell proliferation. J Chem Inf Model 2011;51:755–9.

19. Xu D, Meroueh SO. Effect of binding pose and modeled structures on SVMGen and GlideScore enrichment of chemical libraries. J Chem Inf Model 2016;56:1139–51.

20. Durrant JD, McCammon JA. NNScore: a neural-network-based scoring function for the characterization of protein-ligand complexes. J Chem Inf Model 2010;50:1865–71.

21. Li H, Leung KS, Wong MH, et al. Improving AutoDock Vina using random forest: the growing accuracy of binding affinity prediction by the effective exploitation of larger data sets. Mol Inform 2015;34:115–26.

22. Li H, Leung KS, Wong MH, et al. Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study. BMC Bioinformatics 2014;15:291.

23. Zilian D, Sotriffer CA. SFCscore(RF): a random forest-based scoring function for improved affinity prediction of protein-ligand complexes. J Chem Inf Model 2013;53:1923–33.

24. Li GB, Yang LL, Wang WJ, et al. ID-score: a new empirical scoring function based on a comprehensive set of descriptors related to protein-ligand interactions. J Chem Inf Model 2013;53:592–600.

25. Yan Y, Wang W, Sun Z, et al. Protein-ligand empirical interaction components for virtual screening. J Chem Inf Model 2017;57:1793–806.

26. Ballester PJ, Schreyer A, Blundell TL. Does a more precise chemical description of protein-ligand complexes lead to more accurate prediction of binding affinity? J Chem Inf Model 2014;54:944–55.

27. Wang C, Zhang Y. Improving scoring-docking-screening powers of protein-ligand scoring functions using random forest. J Comput Chem 2017;38:169–77.

28. Sun H, Pan P, Tian S, et al. Constructing and validating high-performance MIEC-SVM models in virtual screening for kinases: a better way for actives discovery. Sci Rep 2016;6:24817.

29. Durrant JD, McCammon JA. NNScore 2.0: a neural-network receptor-ligand scoring function. J Chem Inf Model 2011;51:2897–903.

30. Ashtawy HM, Mahapatra NR. BgN-score and BsN-score: bagging and boosting based ensemble neural networks scoring functions for accurate binding affinity prediction of protein-ligand complexes. BMC Bioinformatics 2015;16(Suppl 4):S8.

31. Nguyen DD, Wei GW. DG-GL: differential geometry-based geometric learning of molecular datasets. Int J Numer Method Biomed Eng 2019;35:e3179.

32. Jimenez J, Skalic M, Martinez-Rosell G, et al. KDEEP: protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J Chem Inf Model 2018;58:287–96.

33. Molecular Operating Environment (MOE). Chemical Computing Group ULC, 1010 Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7, 2018.

34. Friesner RA, Banks JL, Murphy RB, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 2004;47:1739–49.

35. Halgren TA, Murphy RB, Friesner RA, et al. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J Med Chem 2004;47:1750–9.

36. Jones G, Willett P, Glen RC, et al. Development and validation of a genetic algorithm for flexible docking. J Mol Biol 1997;267:727–48.

37. Morris GM, Huey R, Lindstrom W, et al. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J Comput Chem 2009;30:2785–91.

38. Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 2010;31:455–61.

39. Chen F, Sun H, Liu H, et al. Prediction of luciferase inhibitors by the high-performance MIEC-GBDT approach based on interaction energetic patterns. Phys Chem Chem Phys 2017;19:10163–76.

40. Yasuo N, Sekijima M. Improved method of structure-based virtual screening via interaction-energy-based learning. J Chem Inf Model 2019;59:1050–61.

41. Perez-Castillo Y, Sotomayor-Burneo S, Jimenes-Vargas K, et al. CompScore: boosting structure-based virtual screening performance by incorporating docking scoring function components into consensus scoring. J Chem Inf Model 2019;59:3655–66.

42. Wang R, Wang S. How does consensus scoring work for virtual library screening? An idealized computer experiment. J Chem Inf Comput Sci 2001;41:1422–6.

43. Mysinger MM, Carchia M, Irwin JJ, et al. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem 2012;55:6582–94.

44. Gaulton A, Hersey A, Nowotka M, et al. The ChEMBL database in 2017. Nucleic Acids Res 2017;45:D945–54.

45. Irwin JJ, Sterling T, Mysinger MM, et al. ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 2012;52:1757–68.

46. Chen L, Cruz A, Ramsey S, et al. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS One 2019;14:e0220113.

47. Hawkins PC, Skillman AG, Warren GL, et al. Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. J Chem Inf Model 2010;50:572–84.

48. Refaeilzadeh P, Tang L, Liu H. Cross-validation. In: Liu L, Özsu MT (eds). Encyclopedia of Database Systems. New York, NY: Springer New York, 2016, 1–7.

49. Sammut C, Webb GI (eds). Stratified cross validation. In: Encyclopedia of Machine Learning and Data Mining. Boston, MA: Springer US, 2017, 1191.

50. Blagus R, Lusa L. Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models. BMC Bioinformatics 2015;16:363.

51. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2016.

52. Babajide Mustapha I, Saeed F. Bioactive molecule prediction using extreme gradient boosting. Molecules 2016;21:983.

53. Chen X, Huang L, Xie D, et al. EGBMMDA: extreme gradient boosting machine for MiRNA-disease association prediction. Cell Death Dis 2018;9:3.

54. Zhao Z, Peng H, Lan C, et al. Imbalance learning for the prediction of N(6)-methylation sites in mRNAs. BMC Genomics 2018;19:574.

55. Lu J, Hou X, Wang C, et al. Incorporating explicit water molecules and ligand conformation stability in machine-learning scoring functions. J Chem Inf Model 2019;59:4540–9.

56. Li H, Peng J, Sidorov P, et al. Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data. Bioinformatics 2019;35:3989–95.

57. Berthold MR, Cebron N, Dill F, et al. KNIME: the Konstanz information miner. ACM SIGKDD Explorations Newsletter 2006;11:26–31.

58. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.

59. Truchon JF, Bayly CI. Evaluating virtual screening methods: good and bad metrics for the "early recognition" problem. J Chem Inf Model 2007;47:488–508.

60. Mysinger MM, Shoichet BK. Rapid context-dependent ligand desolvation in molecular docking. J Chem Inf Model 2010;50:1561–73.

61. Rendic S, Guengerich FP. Survey of human oxidoreductases and cytochrome P450 enzymes involved in the metabolism of xenobiotic and natural chemicals. Chem Res Toxicol 2015;28:38–42.

62. Zanger UM, Schwab M. Cytochrome P450 enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacol Ther 2013;138:103–41.

63. Ekroos M, Sjogren T. Structural basis for ligand promiscuity in cytochrome P450 3A4. Proc Natl Acad Sci U S A 2006;103:13682–7.

64. Lewis DF. Structural characteristics of human P450s involved in drug metabolism: QSARs and lipophilicity profiles. Toxicology 2000;144:197–203.

65. Ragoza M, Hochuli J, Idrobo E, et al. Protein-ligand scoring with convolutional neural networks. J Chem Inf Model 2017;57:942–57.

66. Imrie F, Bradley AR, van der Schaar M, et al. Protein family-specific models using deep neural networks and transfer learning improve virtual screening and highlight the need for more data. J Chem Inf Model 2018;58:2319–30.

67. Chaput L, Martinez-Sanz J, Saettel N, et al. Benchmark of four popular virtual screening programs: construction of the active/decoy dataset remains a major determinant of measured performance. J Cheminform 2016;8:56.

68. Sieg J, Flachsenberg F, Rarey M. In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. J Chem Inf Model 2019;59:947–61.

69. Rifaioglu AS, Nalbat E, Atalay V, et al. DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chem Sci 2020;11:2531–57.

70. Yang J, Shen C, Huang N. Predicting or pretending: artificial intelligence for protein-ligand interactions lack of sufficiently large and unbiased datasets. Front Pharmacol 2020;11:69.


Supplementary data