Abstract

Natural products (NPs) and their derivatives are important resources for drug discovery. Many in silico target prediction methods have been reported; however, very few of them distinguish NPs from synthetic molecules. Because NPs and synthetic molecules differ in many characteristics, it is necessary to build target prediction models specific to NPs. Therefore, we collected the activity data of NPs and their derivatives from public databases and constructed four datasets: the NP dataset, the dataset of NPs and their first-class derivatives, the dataset of NPs and all their derivatives, and the ChEMBL26 compound dataset. Conditions, including activity thresholds and input features, were explored to assess the performance of eight machine learning methods for the target prediction of NPs: support vector machines (SVM), extreme gradient boosting, random forests, K-nearest neighbors, naive Bayes, feedforward neural networks (FNN), convolutional neural networks and recurrent neural networks. As a result, the dataset of NPs and all their derivatives was selected to build the best NP-specific models. Furthermore, consensus models and voting models were applied to improve the prediction performance. Further evaluations on the external validation set demonstrated that (1) the NP-specific model performed better on the target prediction of NPs than the traditional models trained on all compounds of ChEMBL26, and (2) the consensus model of FNN + SVM had the best overall performance, and the voting model could significantly improve recall and specificity.

Introduction

Nature is a valuable repository of novel bioactive entities. In the past 30 years, the percentage of new chemical entities inspired by natural products (NPs) and related molecules has risen to approximately 50%, 74% of which are concentrated in the antitumor field [1]. NPs and their structural analogs have historically made significant contributions to the treatment of diseases, especially cancer and infectious diseases. For example, quinine extracted from the bark of the Rubiaceae Cinchona tree plays an important role in the treatment of malaria [2]. Rosuvastatin, a blockbuster drug for the treatment of high cholesterol, mimics the pharmacophore of the NP mevastatin from the fungus Penicillium citrinum [3]. In general, the discovery of NPs has profoundly affected advances in biology and inspired drug discovery and therapy.

Since the beginning of the 21st century, technological and scientific advances, including analytical techniques, genome mining, engineering and cultivation systems, have greatly promoted NP-based drug discovery [4]. At the same time, a large amount of data on NPs has been produced. According to statistics, >120 different NP databases and collections have been published and reused since 2000 [5]. Many NP molecules are known; for example, COCONUT contains 885 447 NP molecules [6]. However, activity data are relatively scarce: the largest database of NP activity, NPASS, contains only 446 552 quantitative activity records [7].

The therapeutic activity of most drugs depends on their interaction with the targets, and about 95% of the targets are proteins [8]. Recent studies have shown that the NPs approved by the FDA or clinically investigated often target multiple proteins, which is called polypharmacology [9, 10]. Thus, systematically identifying the targets of NPs at the human proteome level would provide unexpected opportunities for drug repositioning and for reducing the toxicity of NPs [11–13]. More activity data of NPs can be obtained using traditional experimental methods, but affinity chromatography and activity-based protein analysis experiments inevitably produce high false alarm rates, and most traditional experimental assays for identifying targets of NPs are expensive and time-consuming [14–16]. Therefore, there is an urgent need for in silico prediction of the active targets of NPs to increase efficiency and save costs.

At present, many methods can be used for the target prediction of small molecules, such as reverse molecular docking [17], network pharmacology [9, 18] and the similarity ensemble approach [19, 20]. Based on a large amount of public biological activity data, machine learning and deep learning algorithms have also been proposed and widely used to speed up the process of protein target identification for small molecules [21, 22]. Many of these virtual methods have also been applied to the target prediction of NPs. For example, Begnini et al. [23] described how the core of macrocyclic NPs can serve as a high-quality in silico screening library for difficult-to-drug targets using a reverse molecular docking strategy. Chen et al. [24] systematically explored the capacity of an alignment-based approach to identify the targets of large and flexible NPs and macrocyclic compounds. Cheng et al. [9, 25] developed statistical network models to link NPs to anti-cancer targets and proteins involved in aging-associated disorders. Cockroft et al. [26] developed a stacked ensemble approach for target prediction, StarFish, which used K-nearest neighbors (KNN), multi-layer perceptron, random forest, logistic regression and model stacking methods, and applied it to an NP set.

NPs differ greatly from synthetic molecules in many respects [3, 27–32], which can be roughly summarized as follows:

(i) Chiral carbon: NPs contain a much larger fraction of sp3-hybridized bridgehead atoms and chiral centers compared with synthetic small molecules, and NPs usually have higher steric complexity.

(ii) Diversity of ring systems: Only about 20% of the ring systems present in NPs can be found in trade drugs.

(iii) Element type: Drugs and combinatorial molecules tend to contain more nitrogen-, sulfur- and halogen- containing groups, while NPs have more oxygen atoms.

(iv) Functional group: NPs differ significantly from synthetic drugs and combinatorial libraries in the ratio of aromatic ring atoms, the number of solvated hydrogen-bond donors and acceptors.

(v) Molecular properties: NP libraries have a wider range of molecular properties, such as molecular mass and octanol–water partition coefficient, compared with synthetic and combinatorial counterparts.

To our knowledge, none of the current methods specifically distinguishes between NPs and synthetic molecules when predicting targets for NPs, nor has there been a large-scale comparison of methods for NP target prediction. For this reason, we use multiple algorithms to build NP-specific models based on the activity data of NPs and their derivatives, which are defined as naturally occurring compounds that have been chemically modified or purely synthetic medicinal compounds inspired by natural compounds [33–36].

In the present study, several NP-specific datasets and a dataset of all compounds from ChEMBL26 were constructed to build a variety of machine learning classification models and to evaluate the difference between NP-specific datasets and traditional mixed datasets on the target prediction task of NPs. A total of eight machine learning algorithms were used: standard feedforward neural networks (FNN), convolutional neural networks (CNN), recurrent neural networks (RNN), support vector machines (SVM), naive Bayes (NB), KNN, random forests (RF) and extreme gradient boosting (XGBoost).

Methods

Data construction

The original NPs were extracted from COCONUT. The derivatives were mined from BindingDB [37], ChEMBL [38] and PubChem Bioassay [39] according to the definition that derivatives are chemically modified from natural compounds. We define compounds that are modified directly from NPs as first-class derivatives; these, together with the compounds obtained by further modifying derivatives, constitute all derivatives of NPs. More detailed information can be found in the Data Collection section and Figure S3 of the Supporting information.

To obtain high-quality training datasets, only assays measured as IC50, Ki or Kd in units of nM against a single protein target were considered [40]. Specifically, we kept the biochemical assays annotated with a confidence score of ≥4 [41]. Afterward, the molecules with quantitative activity records were discretized into four classes based on activity thresholds, namely active, weak active, inactive and weak inactive, as described by Mayr et al. [21]. The activity thresholds of the labels are shown in Table 1. The active and inactive data were used as positive and negative samples for training a model, and compounds labeled weak active or weak inactive were additionally used as weak-activity data to examine the impact of activity thresholds. To guarantee the quality of the model, the compound-target pairs with ambiguous labels were discarded, and the targets with <100 molecules or <3 active/inactive molecules were further removed [42].
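The labelling rule can be illustrated with a minimal sketch, assuming activities are given in nM and converted to −log10(M) before the Table 1 thresholds are applied. The function names `p_activity` and `activity_label` are hypothetical, and returning `None` for a value of exactly 5.0 (which no row of Table 1 covers) is an assumption rather than the authors' documented behavior.

```python
import math


def p_activity(value_nm: float) -> float:
    """Convert an activity in nM (IC50, Ki or Kd) to -log10 of the molar value."""
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def activity_label(p_value: float):
    """Return the Table 1 label for a -log10(M) value, or None if no row matches."""
    if p_value >= 5.5:
        return "active"
    if 5.0 < p_value < 5.5:
        return "weak active"
    if p_value <= 4.5:
        return "inactive"
    if 4.5 < p_value < 5.0:
        return "weak inactive"
    # Exactly 5.0 falls between the two weak classes in Table 1 (assumption: unlabelled).
    return None


# Illustrative values only: 1 uM is 1000 nM, 100 uM is 100000 nM.
label_1um = activity_label(p_activity(1000.0))
label_100um = activity_label(p_activity(100000.0))
```

Under these thresholds a 1 µM measurement falls in the active class and a 100 µM measurement in the inactive class; only the active and inactive labels would feed the datasets without weak-activity data.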

Table 1

Division criteria of the activity thresholds

| Label | Threshold in −log10(M) |
| --- | --- |
| Active | ≥5.5 |
| Weak active | >5.0 and <5.5 |
| Inactive | ≤4.5 |
| Weak inactive | >4.5 and <5.0 |

Following the above steps, we built eight datasets from ChEMBL26 as training sets. Datasets that include weak-activity data carry the prefix 'Weak' to distinguish them from the datasets without such data: (1) the NPs dataset without weak-activity data (NPs), (2) the dataset of NPs and their first-class derivatives without weak-activity data (NPs + Der1), (3) the dataset of NPs and all their derivatives without weak-activity data (NPs + DerALL), (4) the dataset of all compounds from ChEMBL26 without weak-activity data (ChEMBL26), (5) the NPs dataset with weak-activity data (Weak NPs), (6) the dataset of NPs and their first-class derivatives with weak-activity data (Weak NPs + Der1), (7) the dataset of NPs and all their derivatives with weak-activity data (Weak NPs + DerALL) and (8) the dataset of all compounds from ChEMBL26 with weak-activity data (Weak ChEMBL26).

Furthermore, an external validation set was constructed from ChEMBL29, NPASS, BindingDB and PubChem Bioassay, from which the target-compound pairs overlapping with the training dataset were removed.

Molecular fingerprints

Three binary fingerprints, extended connectivity fingerprints (ECFP), functional-class fingerprints (FCFP) and molecular access system (MACCS) keys, were used as chemical descriptors in this study. MACCS is a 166-bit fingerprint based on a well-defined dictionary of substructures (MACCS keys) [43]. ECFP and FCFP descriptors are substructure fingerprints based on the Morgan algorithm [44], which represent feature sets of circular atom neighborhoods by iteratively compiling the surrounding environment of each atom. ECFP and FCFP differ in the atomic characteristics used during initialization: the initial identifier of an atom in ECFP is obtained from several properties of the atom itself, whereas FCFP uses functional group types before substructures are enumerated. We compared the prediction performance of each fingerprint and of their combinations. Fingerprint representations were generated using the RDKit implementation of MACCS (166-bit), 2048-bit ECFP6 (radius = 3) and 2048-bit FCFP6 (radius = 3). A combination of fingerprints is obtained by concatenating the bit strings of the individual fingerprints; for example, a combination of 166-bit MACCS and 2048-bit ECFP6 has 2214 bits.
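The concatenation step can be sketched in a few lines, assuming each fingerprint is already available as a list of 0/1 bits. The helper `concatenate_fingerprints` is a hypothetical name, and the all-zero vectors are placeholders with the bit lengths stated above, not real molecular fingerprints.

```python
def concatenate_fingerprints(*fingerprints):
    """Join several bit vectors end to end into one combined bit vector."""
    combined = []
    for bits in fingerprints:
        combined.extend(bits)
    return combined


# Placeholder bit vectors with the lengths given in the text (not real molecules).
maccs = [0] * 166
ecfp6 = [0] * 2048
fcfp6 = [0] * 2048

maccs_ecfp6 = concatenate_fingerprints(maccs, ecfp6)        # 166 + 2048 bits
all_three = concatenate_fingerprints(ecfp6, fcfp6, maccs)   # 2048 + 2048 + 166 bits
```

With these lengths, the three-fingerprint combination has 4262 bits, so the input dimension of the fingerprint-based models roughly doubles compared with a single 2048-bit fingerprint.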

Cluster cross-validation

Cluster cross-validation is a popular data-partitioning scheme in which compounds are clustered by chemical similarity before the training and test sets are divided [45]. In the experiment, we conducted 3-fold cross-validation with compounds clustered in advance to evaluate the performance of the model. To prevent similar data points from falling into both the training set and the test set, all of the molecules were first clustered by the single-linkage algorithm. The Jaccard distance between binarized Morgan fingerprints with a radius of 2 was used as the metric between any two compounds, and the minimum distance was set to 0.3. In the next step, all molecules belonging to the same cluster were randomly assigned together to one of the three folds. We also compared the performance of cluster cross-validation and random cross-validation (see Table S12).
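The partitioning scheme can be illustrated with a minimal sketch, assuming each fingerprint is represented as a set of on-bit indices and that "minimum distance 0.3" means two molecules closer than 0.3 are linked into the same cluster. Under that assumption, single linkage with a cutoff reduces to finding connected components, done here with a small union-find. All function names are hypothetical, and the three toy fingerprints are invented for illustration.

```python
import random


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Jaccard distance between two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def single_linkage_clusters(fingerprints, cutoff=0.3):
    """Link every pair closer than the cutoff; clusters are the connected components."""
    parent = list(range(len(fingerprints)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if jaccard_distance(fingerprints[i], fingerprints[j]) < cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(fingerprints)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def assign_folds(clusters, n_folds=3, seed=0):
    """Randomly assign each whole cluster to one fold, so similar molecules stay together."""
    rng = random.Random(seed)
    fold_of = {}
    for members in clusters:
        fold = rng.randrange(n_folds)
        for index in members:
            fold_of[index] = fold
    return fold_of


# Toy data: molecules 0 and 1 share 9 of 11 bits (distance 2/11); molecule 2 is unrelated.
toy_fps = [frozenset(range(1, 11)), frozenset(range(2, 12)), frozenset({20, 21, 22})]
toy_clusters = single_linkage_clusters(toy_fps)
toy_folds = assign_folds(toy_clusters)
```

The pairwise loop is quadratic in the number of molecules, which is acceptable for a sketch but would likely need a sparse or batched similarity search at the scale of the ChEMBL26 dataset.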

Nested cluster cross-validation

To obtain a fair evaluation, nested cross-validation was performed for hyperparameter tuning. In nested cross-validation, the training data are split into inner and outer portions. Different hyperparameter combinations were tried in the inner loop to determine which combination achieves the best performance. The hyperparameters, and a performance comparison of RF and XGBoost with default and selected hyperparameters, can be found in Tables S1, S2 and S9. The selected hyperparameter combination is then used in the outer loop to obtain a model for each fold, which avoids hyperparameter selection bias in the performance evaluation. The area under the curve (AUC) of the receiver operating characteristic was calculated to assess model performance. For each hyperparameter combination, we obtained the AUC values from the inner loops, and the mean AUC value of the two inner loops was used as the criterion for selecting the optimal hyperparameter combination for the corresponding outer loop. Finally, we averaged the AUC values of the three outer loops to obtain the most realistic estimate of model performance. At the same time, the optimal hyperparameter combination was determined from the mean AUC values of the six inner loops of the nested cluster cross-validation [22] and then used to train the final models on all data.
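The selection logic can be sketched as follows, assuming three folds, so that each outer loop has two inner loops and there are six inner loops in total. The callable `score(params, train_folds, test_fold)` is a hypothetical stand-in for training a model on the given folds and returning its AUC on the held-out fold; the way the inner loops are formed here is one plausible reading of the description, not a confirmed reproduction of the authors' code.

```python
from statistics import mean


def nested_cv(n_folds, hyperparameters, score):
    """Return (mean outer AUC, hyperparameter chosen from all inner loops)."""
    outer_aucs = []
    inner_aucs = {params: [] for params in hyperparameters}

    for outer in range(n_folds):
        remaining = [f for f in range(n_folds) if f != outer]

        # Inner loops: each remaining fold serves once as the validation fold.
        mean_inner = {}
        for params in hyperparameters:
            aucs = [
                score(params, [f for f in remaining if f != val], val)
                for val in remaining
            ]
            mean_inner[params] = mean(aucs)
            inner_aucs[params].extend(aucs)

        # Outer loop: retrain with the best inner setting, test on the outer fold.
        best = max(hyperparameters, key=mean_inner.get)
        outer_aucs.append(score(best, remaining, outer))

    # Final setting: best mean AUC over all inner loops (six when n_folds is 3).
    final = max(hyperparameters, key=lambda params: mean(inner_aucs[params]))
    return mean(outer_aucs), final


# Deterministic fake scorer for illustration only; real use would fit a model.
def fake_score(params, train_folds, test_fold):
    base = {"shallow": 0.70, "deep": 0.90}[params]
    return base - 0.01 * test_fold


estimated_auc, chosen = nested_cv(3, ["shallow", "deep"], fake_score)
```

Because the outer test fold never takes part in choosing the hyperparameters, the averaged outer AUC is not inflated by the selection step, which is the bias the nested scheme is meant to avoid.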

Machine learning methods

We compared the prediction performances of eight machine learning architectures for NP target fishing, including three deep learning methods (FNN, CNN and RNN) and five traditional machine learning approaches (SVM, XGBoost, RF, KNN and NB).

NB has been widely used in target prediction and is included as a baseline method. SVM and KNN are typical similarity-based classification methods, whereas RF and XGBoost are representative feature-based classification methods, with XGBoost implementing gradient tree boosting. Deep learning methods have recently garnered significant attention in target fishing, and three representative architectures of deep neural networks are considered in this study. Among them, FNN follows the standard feedforward architecture and takes vectorial inputs, CNN, which has advantages in image processing, relies on convolutional layers, and RNN processes sequence data through cyclic connections using memory blocks. The details of each algorithm are provided in the Supporting information. The overall workflow to assess the predictive performances of the machine learning algorithms on different datasets is shown in Figure 1.

Figure 1

The overall workflow to assess the predictive performances of the machine learning algorithms on different datasets.

Results and Discussion

Dataset

Eight datasets and an external validation set were prepared for evaluating the NP target prediction models. The statistics of each dataset can be found in Table 2. A total of 899 single protein targets with varying numbers of data points were identified for ChEMBL26. These targets contain between 100 and 7086 unique compounds each, with a mean of 795, a median of 410 and a first quartile of 191. The smallest dataset, NPs, contained 26 targets, for which the minimum, maximum, mean, median and first quartile of data points were 100, 592, 174, 148 and 122, respectively. Detailed information is provided in the Supplementary Materials.

Table 2

The number of targets, compounds and bioactivities of each dataset

| Dataset | Number of targets | Number of compounds | Number of bioactivities |
| --- | --- | --- | --- |
| ChEMBL26 | 899 | 458 198 | 714 438 |
| NPs + DerALL | 470 | 97 706 | 164 195 |
| NPs + Der1 | 150 | 18 493 | 30 543 |
| NPs | 26 | 3052 | 4521 |
| Weak ChEMBL26 | 1084 | 522 416 | 863 677 |
| Weak NPs + DerALL | 585 | 121 729 | 219 793 |
| Weak NPs + Der1 | 211 | 26 279 | 46 236 |
| Weak NPs | 37 | 4176 | 7250 |
| External validation set | 666 | 7205 | 10 776 |

Fingerprint selection

Feature selection is a key step in machine learning. For small molecules, SMILES strings, molecular fingerprints and molecular graphs can be used as input features. Most machine learning algorithms use molecular fingerprints as input [22, 26, 46, 47]. With the progress of deep learning algorithms, CNN models using molecular graphs as input features and RNN models using SMILES strings as input features have also shown good performance [21]. Therefore, in this work, molecular graphs and SMILES strings are adopted for CNN and RNN, respectively, whereas the remaining six algorithms take molecular fingerprints as input. Generally, the choice of molecular fingerprints affects the performance of a ligand-based target prediction model [47].

Here, we evaluated three fingerprints (ECFP6, FCFP6 and MACCS) and their combinations (ECFP6 + FCFP6, ECFP6 + MACCS, FCFP6 + MACCS and ECFP6 + FCFP6 + MACCS) using six machine learning methods (FNN, SVM, RF, KNN, NB and XGBoost) on six datasets: NPs, NPs + Der1, ChEMBL26, Weak NPs, Weak NPs + Der1 and Weak ChEMBL26. The 26 targets shared by NPs, NPs + Der1 and ChEMBL26 were used for the comparison. Likewise, the 37 targets shared by Weak NPs, Weak NPs + Der1 and Weak ChEMBL26 were used. The results for ChEMBL26 are listed in Table 3, and the results for the other five datasets can be found in Supplementary Tables S3–S7 available online at https://dbpia.nl.go.kr/bib. As shown in Table 3, the mean AUC values of the models using ECFP6 + FCFP6 + MACCS ranked first for four out of six machine learning methods. Among the other five datasets, ECFP6 + FCFP6 + MACCS performed best on Weak NPs + Der1 (Supplementary Table S6) and Weak ChEMBL26 (Supplementary Table S7). The poorer performance of ECFP6 + FCFP6 + MACCS on NPs (Supplementary Table S3), Weak NPs (Supplementary Table S4) and NPs + Der1 (Supplementary Table S5) was possibly due to the small dataset size, which made the evaluation less clear-cut, unstable and biased. In addition, regardless of the dataset, combinations of fingerprints generally performed better than any single fingerprint. Overall, using the combined fingerprints, which encode more molecular properties, yields better performance. Therefore, ECFP6 + FCFP6 + MACCS was chosen as the optimal fingerprint combination and used as the input feature in the subsequent training.
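The "four out of six" count can be checked directly from the mean AUC values in Table 3. The sketch below copies those means (SDs omitted) and, for each method, lists every feature set that reaches the column maximum, counting ties as first place; the tie handling and the function name `top_features` are assumptions made for illustration.

```python
METHODS = ["FNN", "RF", "SVM", "XGBoost", "KNN", "NB"]

# Mean AUC values from Table 3 (ChEMBL26), one list entry per method in METHODS.
TABLE3_MEAN_AUC = {
    "ECFP6": [0.869, 0.890, 0.889, 0.869, 0.828, 0.840],
    "FCFP6": [0.871, 0.884, 0.892, 0.876, 0.825, 0.842],
    "MACCS": [0.878, 0.884, 0.871, 0.877, 0.827, 0.779],
    "ECFP6 + FCFP6": [0.880, 0.893, 0.908, 0.885, 0.832, 0.847],
    "ECFP6 + MACCS": [0.888, 0.896, 0.898, 0.889, 0.838, 0.836],
    "FCFP6 + MACCS": [0.877, 0.900, 0.902, 0.891, 0.838, 0.840],
    "ECFP6 + FCFP6 + MACCS": [0.880, 0.900, 0.911, 0.892, 0.838, 0.844],
}


def top_features(table, methods):
    """For each method, return all feature sets that reach the highest mean AUC."""
    winners = {}
    for column, method in enumerate(methods):
        best = max(row[column] for row in table.values())
        winners[method] = [name for name, row in table.items() if row[column] == best]
    return winners


winners = top_features(TABLE3_MEAN_AUC, METHODS)
combined = "ECFP6 + FCFP6 + MACCS"
first_place_count = sum(combined in names for names in winners.values())
```

Counting ties, the three-fingerprint combination is first for RF, SVM, XGBoost and KNN, while ECFP6 + MACCS leads for FNN and ECFP6 + FCFP6 leads for NB.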

Table 3

Performance comparison of different target prediction methods on ChEMBL26; the table gives the means and SDs of AUC values for the compared algorithms and feature categories or input types; the top 1 AUC values were marked as bold text

| Features | FNN | RF | SVM | XGBoost | KNN | NB |
| --- | --- | --- | --- | --- | --- | --- |
| ECFP6 | 0.869 ± 0.072 | 0.890 ± 0.061 | 0.889 ± 0.072 | 0.869 ± 0.071 | 0.828 ± 0.108 | 0.840 ± 0.079 |
| FCFP6 | 0.871 ± 0.080 | 0.884 ± 0.071 | 0.892 ± 0.047 | 0.876 ± 0.069 | 0.825 ± 0.109 | 0.842 ± 0.065 |
| MACCS | 0.878 ± 0.063 | 0.884 ± 0.062 | 0.871 ± 0.059 | 0.877 ± 0.063 | 0.827 ± 0.080 | 0.779 ± 0.076 |
| ECFP6 + FCFP6 | 0.880 ± 0.074 | 0.893 ± 0.062 | 0.908 ± 0.047 | 0.885 ± 0.062 | 0.832 ± 0.110 | 0.847 ± 0.069 |
| ECFP6 + MACCS | 0.888 ± 0.053 | 0.896 ± 0.062 | 0.898 ± 0.065 | 0.889 ± 0.059 | 0.838 ± 0.113 | 0.836 ± 0.067 |
| FCFP6 + MACCS | 0.877 ± 0.065 | 0.900 ± 0.063 | 0.902 ± 0.045 | 0.891 ± 0.068 | 0.838 ± 0.105 | 0.840 ± 0.060 |
| ECFP6 + FCFP6 + MACCS | 0.880 ± 0.065 | 0.900 ± 0.059 | 0.911 ± 0.046 | 0.892 ± 0.064 | 0.838 ± 0.111 | 0.844 ± 0.062 |

Graph selection

In the case of CNN, the ConvMolFeaturizer and WeaveFeaturizer were compared as input features, referred to as GC and Weave, respectively. The detailed comparison results are listed in Table 4. According to the AUC values, GC performed better than Weave on all six datasets. Thus, GC was used in the subsequent work.

Table 4

The means and SDs of AUC values of GC and Weave for the datasets with and without weak-activity data

| Dataset | GC | Weave |
| --- | --- | --- |
| NPs | 0.824 ± 0.081 | 0.794 ± 0.072 |
| NPs + Der1 | 0.855 ± 0.065 | 0.839 ± 0.061 |
| ChEMBL26 | 0.882 ± 0.056 | 0.855 ± 0.055 |
| Weak NPs | 0.747 ± 0.111 | 0.712 ± 0.098 |
| Weak NPs + Der1 | 0.790 ± 0.072 | 0.780 ± 0.065 |
| Weak ChEMBL26 | 0.834 ± 0.075 | 0.814 ± 0.058 |

Activity threshold selection

In some existing studies, traditional machine learning models are trained with weakly active data removed, whereas deep learning is considered able to distinguish data in the weakly active region [21, 40, 48]. Since NPs often interact with multiple targets through weak binding [49–52], the effect of weak activity data needs to be examined for NPs. Therefore, we used two dataset partitioning methods to explore whether it is better to apply a single threshold value directly or to exclude the weakly active data points. The results of eight algorithms (FNN, GC, SVM, RF, KNN, NB, XGBoost and LSTM) are displayed as boxplots in Figure 2. The orange line in each box represents the median AUC value of the eight models. Figure 2 shows that the models excluding weakly active data (blue boxes) generated significantly higher median AUC values than the models containing weakly active data (green boxes). Unless otherwise stated, all models in the subsequent evaluations use the datasets excluding weakly active data.

Figure 2

The effects of including or excluding weak-activity data on the performance of models built on NPs, NPs + Der1 and ChEMBL26.

Large-scale comparison

After determining the input features and activity thresholds on a small set of intersecting targets (26 targets for datasets without weak-activity data and 37 targets for datasets with weak-activity data), we compared the training results of the eight algorithms on a larger benchmark of four datasets with all of their targets (ChEMBL26 with 899 targets, NPs + DerALL with 470 targets, NPs + Der1 with 150 targets and NPs with 26 targets). The means and standard deviations (SDs) of the AUC values of the eight algorithms on the four datasets are shown in Table 5. FNN achieved the highest mean AUC value (marked as bold text in Table 5) on three datasets, which is consistent with the finding of Mayr et al. [21] that deep learning methods significantly outperform all competing methods. In addition, FNN, GC, SVM and RF performed stably, with mean AUC values >0.8 on every dataset; LSTM, XGBoost and KNN performed poorly on the small datasets (NPs and NPs + Der1); and NB performed poorly on all datasets. We also trained models on the ChEMBL29 benchmark with more hyperparameters, and the results were similar to those of the models built on ChEMBL26 (see Tables S10 and S11).

Table 5

The means and SDs of AUC values of the eight methods in the four datasets, the top 1 AUC values were marked as bold text

| Method | NPs | NPs + Der1 | NPs + DerALL | ChEMBL26 |
| --- | --- | --- | --- | --- |
| FNN | 0.811 ± 0.083 | 0.854 ± 0.109 | 0.890 ± 0.099 | 0.884 ± 0.091 |
| RF | 0.825 ± 0.098 | 0.838 ± 0.130 | 0.873 ± 0.106 | 0.866 ± 0.106 |
| SVM | 0.835 ± 0.080 | 0.837 ± 0.140 | 0.871 ± 0.127 | 0.856 ± 0.134 |
| LSTM | 0.667 ± 0.087 | 0.772 ± 0.117 | 0.859 ± 0.110 | 0.850 ± 0.105 |
| GC | 0.824 ± 0.081 | 0.834 ± 0.113 | 0.851 ± 0.115 | 0.842 ± 0.110 |
| XGBoost | 0.793 ± 0.115 | 0.816 ± 0.134 | 0.841 ± 0.123 | 0.835 ± 0.126 |
| KNN | 0.761 ± 0.109 | 0.785 ± 0.125 | 0.819 ± 0.124 | 0.815 ± 0.115 |
| NB | 0.782 ± 0.109 | 0.717 ± 0.161 | 0.737 ± 0.163 | 0.739 ± 0.159 |
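
Table 6

Statistics of the better and poorer targets of NPs + DerALL, NPs + Der1 and NPs compared with the same targets in ChEMBL26

| Method | Dataset | Better number (26 targets) | Poorer number (26 targets) | Better number (150 targets) | Poorer number (150 targets) | Better number (463 targets) | Poorer number (463 targets) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FNN | NPs | 2 | 24 | | | | |
| FNN | NPs + Der1 | 11 | 15 | 50 | 100 | | |
| FNN | NPs + DerALL | 15 | 11 | 82 | 68 | 248 | 215 |
| GC | NPs | 12 | 14 | | | | |
| GC | NPs + Der1 | 12 | 14 | 55 | 95 | | |
| GC | NPs + DerALL | 17 | 9 | 85 | 65 | 251 | 212 |
| SVM | NPs | 4 | 22 | | | | |
| SVM | NPs + Der1 | 9 | 17 | 46 | 104 | | |
| SVM | NPs + DerALL | 12 | 14 | 79 | 71 | 245 | 218 |
| RF | NPs | 8 | 18 | | | | |
| RF | NPs + Der1 | 11 | 15 | 47 | 103 | | |
| RF | NPs + DerALL | 12 | 14 | 72 | 78 | 230 | 233 |
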
Figure 3

The relationship between the amount of data per target and the AUC value. The abscissa is the amount of data for the target, and the ordinate is the AUC value. The blue dotted lines at a data size of 1000 separate models with stable results from those with unstable results.

We also examined the target prediction results of a particular algorithm across different datasets. To be fair, only the targets shared by the different datasets were compared. There were 26 targets of ChEMBL26, NPs + DerALL and NPs + Der1 that intersected with NPs; 150 targets of ChEMBL26 and NPs + DerALL that intersected with NPs + Der1; and 463 targets of ChEMBL26 that intersected with NPs + DerALL. Based on the AUC value, we compared the performance of NPs + DerALL, NPs + Der1 and NPs with that of ChEMBL26 on each target. The number of targets with higher AUC values than ChEMBL26 was recorded as the Better Number, while the numbers of targets with lower or equal AUC values were recorded as the Poorer Number and Equal Number, respectively. Table 6 shows the results of FNN, GC, SVM and RF, and the results of KNN, NB, XGBoost and LSTM can be found in Supplementary Table S8 available online at https://dbpia.nl.go.kr/bib. For FNN, among the 26 NPs targets, 2 targets had higher AUC values than with ChEMBL26. This number increased from 2 to 12 on NPs + Der1 and then to 15 on NPs + DerALL, which was more than half of the 26 NPs targets. For the 150 targets of NPs + Der1, a greater number of targets (50) presented a better AUC value, and 83 out of 150 targets of NPs + DerALL exceeded ChEMBL26. It should be noted that the NPs + DerALL dataset had more better targets than poorer targets relative to ChEMBL26 for three out of the four stable machine learning methods. These results show that NPs + DerALL, NPs + Der1 and NPs, despite having less data than ChEMBL26, can still yield better models for some targets, which shows the great potential of NP-specific datasets.
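The per-target comparison can be expressed as a short counting routine. This is a minimal sketch, assuming AUC values are stored per target ID and compared after rounding to three decimals (the rounding is an assumption; the text does not state how equality was determined). The function name `compare_targets`, the target IDs and the AUC values in the example are all invented for illustration.

```python
def compare_targets(auc_np, auc_reference, digits=3):
    """Count shared targets where the NP-specific model is better, poorer or equal."""
    counts = {"better": 0, "poorer": 0, "equal": 0}
    # Only targets present in both result sets are compared.
    for target in auc_np.keys() & auc_reference.keys():
        np_value = round(auc_np[target], digits)
        ref_value = round(auc_reference[target], digits)
        if np_value > ref_value:
            counts["better"] += 1
        elif np_value < ref_value:
            counts["poorer"] += 1
        else:
            counts["equal"] += 1
    return counts


# Invented example: three shared targets plus one that exists only in the reference set.
auc_np_model = {"T1": 0.91, "T2": 0.80, "T3": 0.85}
auc_chembl_model = {"T1": 0.88, "T2": 0.84, "T3": 0.85, "T4": 0.90}
example_counts = compare_targets(auc_np_model, auc_chembl_model)
```

Restricting the loop to the intersection of target IDs mirrors the fairness condition above: a target absent from the smaller dataset contributes to none of the three counts.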

Previous work has shown that model performance improves as the size of the training set increases [21, 53, 54]. We therefore investigated the correlation between dataset size and performance by drawing scatter plots of data size against AUC value for all models. As shown in Figure 3, the AUC distribution moves closer to the top from left to right, demonstrating that a larger training set leads to better predictions, which is consistent with previous work [55–59]. In particular, when the amount of data reached 10³–10⁴, the AUC values were concentrated in the range 0.8–1, a relatively high and stable level. In general, NPs (green points) and NPs + Der1 (blue points) rarely reached a data size large enough for stable results, whereas NPs + DerALL (yellow points) met this requirement and thus achieved better performance than ChEMBL26. Therefore, we believe that, when sufficient data become available, NP-specific datasets will have the potential to produce better models for NP target prediction than models built with a hybrid dataset of all ChEMBL molecules, even though the latter contains more data.
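One simple way to check this trend numerically is to group targets by the order of magnitude of their data size and average the AUC within each group. The sketch below assumes a list of (data size, AUC) pairs; the function name `auc_by_size_decade` and the values are hypothetical.

```python
import math
from statistics import mean


def auc_by_size_decade(records):
    """Group (n_data_points, auc) pairs by floor(log10(n)) and average the AUC per group."""
    buckets = {}
    for n_points, auc in records:
        decade = int(math.floor(math.log10(n_points)))
        buckets.setdefault(decade, []).append(auc)
    return {decade: round(mean(values), 3) for decade, values in sorted(buckets.items())}


# Hypothetical (data size, AUC) pairs, one per target model (illustration only).
records = [(40, 0.55), (80, 0.71), (300, 0.78), (700, 0.74), (2500, 0.88), (9000, 0.92)]

summary = auc_by_size_decade(records)
for decade, avg_auc in summary.items():
    print(f"10^{decade} to 10^{decade + 1}: mean AUC {avg_auc}")
```

A mean AUC that rises from one decade to the next would agree with the pattern in the scatter plots.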

Table 7

The number of targets, molecules and bioactivity measurements in the intersection of external validation set

Targets_number    Compound_number    Bioactivity_number
192               4516               5824

Table 8

Numbers of better, equal and poorer targets among the 13 targets (with more than 100 compounds) in the external validation set

Optimal number of targets in NPs + DerALL models
Method     Better_number    Equal_number    Poorer_number
FNN        8                2               3
GC         6                1               6
LSTM       8                1               4
SVM        6                2               5
RF         5                2               6
XGBOOST    6                1               6
KNN        5                1               7
NB         7                1               5

Figure 4

The AUC values of internal validation versus external validation for the NPs + DerALL models.

External validation

Several studies have demonstrated that cross-validation performance differs between in-sample and out-of-sample tests [60, 61]. To better evaluate the models, an external validation set containing no training samples was built and applied to the final models, which were trained on all data. Because most targets of NPs and NPs + Der1 had very little data in the training and external validation sets, and the data were unevenly distributed, only the performance of NPs + DerALL and ChEMBL26 was compared on the external validation set.

First, the results of internal validation, based on nested cluster cross-validation, were compared with those of external validation to evaluate the generalization capacity of our models. The results for NPs + DerALL are displayed in Figure 4. As shown in the boxplot, the internal (green boxes) and external (blue boxes) validation results were comparable, and in most cases the external validation had a higher median AUC value (orange lines in the boxes) than the internal validation. Therefore, our trained models showed good robustness.

Next, we evaluated whether the NP-specific models built with NPs + DerALL performed better for NP target prediction than traditional models built with all the mixed molecules of ChEMBL26. Owing to the limited data size, the AUC value could not be calculated for some targets, so only the targets shared by NPs + DerALL and ChEMBL26 that had complete evaluation values were selected. In the end, 192 targets were selected (Table 7); details of these 192 targets can be found in the Supplement Materials.

Because the size of the external validation set also had a significant impact on the model evaluation, the relationship between the data distribution and the AUC values was further analyzed and is displayed in Figure 5. When the data volume of a target in the external validation set was >100, most models reached a reliable level of performance, with an AUC value of >0.8. In contrast, the results were highly scattered when the data volume of a target in the external validation set was <100. Therefore, only targets with >100 compounds were considered in the following discussion.

Figure 5

The relationship between the data size of the external validation set and the performance of the eight different models. The blue dotted line at a data size of 100 separates models with stable results from those with unstable results.

Table 8 lists the numbers of better, equal and poorer targets among those with a data size of >100. In this case, relatively more targets performed better with NPs + DerALL than with ChEMBL26. In addition, the means and SDs of the AUC values of the eight methods for the targets with an external data size of >100 were calculated and are listed in Table 9. FNN and LSTM performed better on NPs + DerALL, and the FNN model of NPs + DerALL performed best, with the highest mean AUC value of 0.944. For the other methods, the average AUC values on NPs + DerALL were very close to those on ChEMBL26. In summary, the NP-specific models (NPs + DerALL) provide better predictive ability in target prediction for NPs and their derivatives.
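The filtering and summary step described here (keep targets with more than 100 external compounds, then report the mean and SD of their AUC values) could look roughly as follows. This is a sketch under stated assumptions: the function name `summarize_auc` and the data are hypothetical, and the sample SD is used because the text does not say whether the sample or population SD was reported.

```python
from statistics import mean, stdev


def summarize_auc(per_target, min_compounds=100):
    """Return (n_targets, mean AUC, sample SD) for targets above the size cutoff."""
    kept = [auc for n_compounds, auc in per_target if n_compounds > min_compounds]
    return len(kept), mean(kept), stdev(kept)


# Hypothetical (external-set size, AUC) pairs for one method (illustration only).
per_target = [(250, 0.95), (40, 0.60), (130, 0.90), (101, 0.85), (100, 0.99)]

n_targets, auc_mean, auc_sd = summarize_auc(per_target)
print(f"{n_targets} targets: {auc_mean:.3f} ± {auc_sd:.3f}")
```

The cutoff is strict (>100), so a target with exactly 100 compounds is excluded.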

Table 9

The means and SDs of the AUC values of eight methods for the NPs + DerALL and ChEMBL26

Method     NPs + DerALL       ALL (ChEMBL26)
FNN        0.944 ± 0.044      0.933 ± 0.057
GC         0.911 ± 0.085      0.918 ± 0.061
LSTM       0.910 ± 0.073      0.883 ± 0.101
SVM        0.929 ± 0.070      0.933 ± 0.061
RF         0.931 ± 0.066      0.935 ± 0.063
XGBOOST    0.915 ± 0.081      0.913 ± 0.083
KNN        0.888 ± 0.078      0.902 ± 0.069
NB         0.879 ± 0.104      0.874 ± 0.110

Consensus model

By combining multiple learners, ensemble methods can often achieve significantly better generalization performance than a single learner [62–65]. Therefore, we combined the eight different models to build consensus models for NP target prediction and evaluated their performance on the external validation set. The average probability was used as the predicted score for the two-algorithms-combined (28 in total) and three-algorithms-combined (56 in total) consensus models [66, 67]. Eight performance measures, namely AUC, the area under the PR curve (AP), accuracy, precision, specificity, F1-score, kappa and recall, were used to assess the overall performance of the different consensus models in target prediction. Partial results are shown in Figure 6, and the rest are displayed in Supplementary Figure S1 available online at https://dbpia.nl.go.kr/bib. The full evaluation results of the consensus models can be found in the Supplement Materials.
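The probability averaging behind the consensus models, and the counts of 28 pairs and 56 triples from eight algorithms, can be illustrated with a short sketch. It assumes each model outputs one probability per compound for a given target; the function name `consensus_scores` and the probability values are hypothetical.

```python
from itertools import combinations

MODELS = ["FNN", "GC", "LSTM", "SVM", "RF", "XGBoost", "KNN", "NB"]


def consensus_scores(probabilities, members):
    """Average the predicted probabilities of the chosen models, compound by compound."""
    columns = [probabilities[name] for name in members]
    return [sum(values) / len(values) for values in zip(*columns)]


pairs = list(combinations(MODELS, 2))    # two-algorithms-combined models
triples = list(combinations(MODELS, 3))  # three-algorithms-combined models
print(len(pairs), len(triples))

# Hypothetical probabilities for three compounds against one target (illustration only).
probabilities = {
    "FNN": [0.90, 0.20, 0.60],
    "SVM": [0.70, 0.40, 0.30],
}
fnn_svm = consensus_scores(probabilities, ("FNN", "SVM"))
print(fnn_svm)
```

The averaged scores can then be evaluated with the same measures as the single models.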

Figure 6

The values of AUC, AP, F1-score and kappa of the two-algorithms-combined models. The red reference line represents the best value among the single models.

For the two-algorithms-combined models, GC + SVM ranked first in AUC; FNN + SVM ranked first in AP and F1-score; FNN + GC ranked first in kappa and accuracy; FNN + KNN ranked first in precision and specificity; and LSTM + XGBoost ranked first in recall. Different consensus models therefore have different advantages. From a comprehensive view, however, FNN + SVM ranked first or second in five of the eight indicators (AP, kappa, accuracy, precision and F1-score), which indicates that FNN + SVM had the best overall performance. For the three-algorithms-combined models (Supplementary Figure S2 available online at https://dbpia.nl.go.kr/bib), FNN + GC + XGBoost performed best in five indicators (accuracy, AP, precision, F1-score and kappa). However, the best two-algorithms-combined models were superior to the best three-algorithms-combined model in six indicators, the exceptions being accuracy and AP. Overall, FNN + SVM had the best overall performance, and this consensus model improved multiple evaluation scores relative to a single model, which highlights the advantages of ensemble methods for NP target fishing.
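The threshold-based indicators used in this ranking (accuracy, recall, precision, specificity, F1-score and kappa) all follow from the confusion matrix of binary predictions. The sketch below uses the standard textbook definitions, with Cohen's kappa as the consistency index; the helper names and the example labels are hypothetical, and the authors' scripts may compute these differently.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    accuracy = safe_div(tp + tn, n)
    recall = safe_div(tp, tp + fn)          # true positive rate
    precision = safe_div(tp, tp + fp)
    specificity = safe_div(tn, tn + fp)     # true negative rate
    f1 = safe_div(2 * precision * recall, precision + recall)
    # Agreement expected by chance, from the marginal totals.
    chance = safe_div((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    kappa = safe_div(accuracy - chance, 1 - chance)
    return {
        "accuracy": accuracy, "recall": recall, "precision": precision,
        "specificity": specificity, "f1": f1, "kappa": kappa,
    }


# Hypothetical labels for eight compounds against one target (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

AUC and AP are not covered here because they are computed from the ranked probabilities rather than from thresholded labels.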

Multi-voting method

The voting method is another ensemble technique. We used it to predict targets for NPs by combining the eight algorithms. The results are listed in Table 10. Under the Vote_1 scheme, a positive label was given if one or more models gave a positive label, whereas the Vote_8 scheme required all eight models to give a positive label. Although the voting models performed poorly on most metrics, the Vote_1 model had the highest recall of 0.927, which makes it a good choice when the aim is to find more candidate targets. Conversely, the Vote_8 model had the highest specificity, a large improvement from 0.725 (the best of the single models) to 0.923, which demonstrates its ability to raise the true negative rate. In other words, if the aim is to accurately exclude targets on which a molecule has no effect, more votes are necessary.
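A voting scheme of this kind can be sketched as a threshold on the number of positive votes. The text defines only Vote_1 (one or more positive models) and Vote_8 (all eight positive); treating Vote_k as "at least k positive votes" for the intermediate schemes is an assumption of this sketch, and the function name `vote_k` and the labels are hypothetical.

```python
def vote_k(labels_by_model, k):
    """Label a compound positive when at least k models predict positive (assumed rule)."""
    per_compound = zip(*labels_by_model.values())
    return [1 if sum(votes) >= k else 0 for votes in per_compound]


# Hypothetical 0/1 predictions of eight models for four compounds (illustration only).
labels_by_model = {
    "FNN":     [1, 1, 0, 0],
    "GC":      [1, 1, 0, 0],
    "LSTM":    [1, 0, 0, 0],
    "SVM":     [1, 1, 0, 0],
    "RF":      [1, 0, 1, 0],
    "XGBoost": [1, 1, 0, 0],
    "KNN":     [1, 0, 0, 0],
    "NB":      [1, 1, 0, 0],
}

vote_1 = vote_k(labels_by_model, 1)
vote_5 = vote_k(labels_by_model, 5)
vote_8 = vote_k(labels_by_model, 8)
print(vote_1, vote_5, vote_8)
```

Raising k can only remove positive labels, which is why recall falls and specificity rises from Vote_1 to Vote_8.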

Table 10

Voting results and single-model results of the eight models on accuracy, recall, precision, specificity, balanced average (F1-score) and consistency test index (kappa); the top value of each indicator is marked with an asterisk

Model      Accuracy     Recall       Precision    Specificity  F1-score     Kappa
Vote_1     0.734820     0.927357*    0.690474     0.334298     0.768811*    0.257488
Vote_2     0.756605     0.863349     0.705754     0.457933     0.753761     0.327965
Vote_3     0.768086     0.824571     0.721773     0.532612     0.746595     0.363039
Vote_4     0.785580     0.794663     0.720453     0.606143     0.731923     0.395788
Vote_5     0.784754     0.752909     0.722685     0.683765     0.712288     0.425853
Vote_6     0.786689     0.706961     0.727658     0.760024     0.691929     0.451502
Vote_7     0.777391     0.640642     0.730498     0.837807     0.657155     0.442357
Vote_8     0.727814     0.518230     0.707729     0.923482*    0.570414     0.374407
FNN        0.798308*    0.791062     0.767744*    0.725392     0.760613     0.491853*
KNN        0.729326     0.688061     0.700937     0.657092     0.659602     0.320299
XGBoost    0.786030     0.803387     0.737362     0.615127     0.744092     0.419844
NB         0.720282     0.662418     0.690309     0.643922     0.643178     0.280391
RF         0.762731     0.760221     0.690735     0.598778     0.697785     0.353804
SVM        0.765061     0.778616     0.718526     0.581766     0.720179     0.359531
GC         0.792923     0.776271     0.728246     0.693327     0.733131     0.445779
LSTM       0.767079     0.768647     0.726224     0.620660     0.722073     0.386801

Conclusions

NPs are valuable sources of drugs, and studying the activity of NPs, especially discovering their specific targets, is very important for their development. With the increase in available data, various algorithms have been successfully used in molecular target prediction. However, considering the obvious differences between the characteristics of NPs and synthetic molecules, it is necessary to construct prediction models that are specific for NPs. Therefore, we collected the activity data of NPs and their derivatives to build three NP-specific datasets: NPs, NPs + Der1 and NPs + DerALL. Multiple machine learning methods, including SVM, XGBoost, RF, KNN, NB, FNN, CNN and RNN, were used to construct NP-specific target prediction models, which were then compared with the traditional models constructed from ChEMBL26.

We first examined the effects of the input features and activity thresholds on different datasets with multiple algorithms. The results showed that the combination of ECFP6 + MACCS + FCFP6 fingerprints gave a more comprehensive performance because it integrates more molecular information from the different single fingerprints. For the CNN, the ConvMolFeaturizer did better than the WeaveFeaturizer. In addition, the models excluding weakly active data performed better than the models containing weakly active data. The best conditions obtained above were then used for the large-scale comparison of multiple algorithms on the different datasets (NPs, NPs + Der1, NPs + DerALL and ChEMBL26). First, the deep learning method FNN performed best, with the highest average AUC value on most datasets. Second, although the models of NPs and NPs + Der1 performed poorly and unstably owing to limited data, NPs + DerALL had a better predictive ability than ChEMBL26 with most algorithms. We then took the prediction model based on NPs + DerALL as the representative NP-specific model and evaluated its performance on the external validation set. On the one hand, the AUC values of the external validation were comparable to those of the internal validation, which demonstrated the good generalization ability of our model. On the other hand, the models built with NPs + DerALL had better classification ability and robustness than ChEMBL26 when the external validation data were sufficient (>100 per target). In addition, among the consensus models, the combination of FNN and SVM performed best overall, improving the scores of multiple evaluation indicators compared with the single algorithms. Another ensemble method, which takes the votes of different algorithms, was also applied in this work. The results showed that the fewer votes were required, the better the recall, so fewer votes can be used to obtain more candidate targets. Conversely, the more votes were required, the better the specificity, indicating that more votes can exclude more unlikely targets.

In summary, NP-specific models are more suitable for the target prediction of NPs, ensemble methods can further improve various indicators of prediction, and different types of ensemble methods can be selected according to different requirements.

Key Points
  • Three NP-specific datasets were constructed and compared with the traditional mixed datasets from ChEMBL26 on the target prediction task of NPs using eight machine learning algorithms.

  • The combination of ECFP6 + MACCS + FCFP6 fingerprints gave a more comprehensive performance because it integrates more molecular information from the different single fingerprints.

  • The models excluding weakly active data performed better than the models containing weakly active data.

  • The NP-specific dataset, NPs + DerALL, possessed a better predictive ability than ChEMBL26 on most algorithms both in internal validation and external validation.

  • Among ensemble methods, the combination of FNN and SVM performed best comprehensively and voting models significantly improved the recall and specificity.

Data and code availability

The authors declare that all data supporting the findings of this study are available within the article and ESI files. The Supplement Materials can be accessed from https://doi.org/10.5281/zenodo.6904699 and also from the corresponding authors upon reasonable request. The scripts for training and using our models have been uploaded to GitHub (https://github.com/lianglu-nk/NPTP, https://github.com/lianglu-nk/NPTP_external).

Authors’ contributions

Y.L. carried out the experimental work, analysis and interpretation of the results and wrote the original draft. L.L. conceived and designed the experiments, supervised the project, supported the analysis and interpretation of the results and wrote the original draft. B.K. and X.-F.M. improved computing power and guided the model optimization. J.-P.L. supervised the project and participated in the writing and editing of the manuscript. R.W., M.-Y.S. and Q.W. supported the analysis and participated in the experimental work. All authors discussed the results and edited the manuscript.

Funding

This work was supported by the National Key R&D Program of China [No. 2017YFC1104400].

Lu Liang, PhD, is an assistant research fellow at the College of Pharmacy, Nankai University. Her research interests include the development of cheminformatics tools and computational target prediction approaches.

Ye Liu is a student at the College of Pharmacy, Nankai University. Her research interests include computational target prediction approaches and machine learning.

Bo Kang, PhD, is a professional senior engineer of high-performance computing applications at the National Supercomputer Center in Tianjin. His research interests include large-scale high-performance computing, AI computing, big data analysis and molecular modeling.

Ru Wang is a student at the College of Pharmacy, Nankai University. Her research interests include computational antibody design and machine learning.

Meng-Yu Sun is a student at the College of Pharmacy, Nankai University. Her research interests include computational target prediction approaches and machine learning.

Qi Wu, Master, is an R&D engineer of high-performance computing at the National Supercomputer Center in Tianjin. His research interests include bioinformatics computing, molecular modeling and machine learning.

Xiang-Fei Meng, PhD, is the chief scientist of high-performance computing applications at the National Supercomputer Center in Tianjin. His research interests include large-scale high-performance computing, AI computing and high-throughput computing on materials.

Jian-Ping Lin, PhD, is a professor at the College of Pharmacy, Nankai University. His research interests include molecular dynamics, virtual screening, computational target prediction, drug repositioning, ADMET prediction and the development of cheminformatics tools.

References

1. Katz L, Baltz RH. Natural product discovery: past, present, and future. J Ind Microbiol Biotechnol 2016;43:155–76.
2. Achan J, Talisuna AO, Erhart A, et al. Quinine, an old anti-malarial drug in a modern world: role in the treatment of malaria. Malar J 2011;10(1):1–12.
3. Rodrigues T, Reker D, Schneider P, et al. Counting on natural products for drug design. Nat Chem 2016;8:531–41.
4. Atanasov AG, Zotchev SB, Dirsch VM, et al. Natural products in drug discovery: advances and opportunities. Nat Rev Drug Discov 2021;20:200–16.
5. Sorokina M, Steinbeck C. Review on natural products databases: where to find data in 2020. J Chem 2020;12(1):1–51.
6. Sorokina M, Merseburger P, Rajan K, et al. COCONUT online: collection of open natural products database. J Chem 2021;13(1):1–13.
7. Zeng X, Zhang P, He W, et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res 2018;46:D1217–22.
8. Wang C, Kurgan L. Review and comparative assessment of similarity-based methods for prediction of drug–protein interactions in the druggable human proteome. Brief Bioinform 2018;20:1–22.
9. Fang J, Wu Z, Cai C, et al. Quantitative and systems pharmacology. 1. In silico prediction of drug-target interactions of natural products enables new targeted cancer therapy. J Chem Inf Model 2017;57:2657–71.
10. Fang J, Cai C, Wang Q, et al. Systems pharmacology-based discovery of natural products for precision oncology through targeting cancer mutated genes. CPT Pharmacometrics Syst Pharmacol 2017;6(3):177–87.
11. Fang J, Liu C, Wang Q, et al. In silico polypharmacology of natural products. Brief Bioinform 2017;19:1153–71.
12. Hong H, Lmendrick D, Mattes W, et al. Molecular docking for identification of potential targets for drug repurposing. Curr Top Med Chem 2016;16:3636–45.
13. Ye H, Wei J, Tang K, et al. Drug repositioning through network pharmacology. Curr Top Med Chem 2016;16:3646–56.
14. Kenny HA, Hart PC, Kordylewicz K, et al. The natural product β-escin targets cancer and stromal cells of the tumor microenvironment to inhibit ovarian cancer metastasis. Cancer 2021;13:3931.
15. Rariza PDD, Billones JBB. Retusenol potentially inhibits putative drug targets for tuberculosis, cardiovascular diseases, cancer and HIV: (a reverse docking study). Orient J Chem 2018;34:1795–801.
16. Dunyak BM, Gestwicki JE. Peptidyl-proline isomerases (PPIases): targets for natural products and natural product-inspired compounds. J Med Chem 2016;59:9622–44.
17. Yin L, Zheng L, Xu L, et al. In-silico prediction of drug targets, biological activities, signal pathways and regulating networks of dioscin based on bioinformatics. BMC Complement Altern Med 2015;15(1):1–17.
18. Zhang H, Ma S, Feng Z, et al. Cardiovascular disease chemogenomics knowledgebase-guided target identification and drug synergy mechanism study of an herbal formula. Sci Rep 2016;6(1):1–14.
19. Keiser MJ, Setola V, Irwin JJ, et al. Predicting new molecular targets for known drugs. Nature 2009;462:175–81.
20. Wang Z, Liang L, Yin Z, et al. Improving chemical similarity ensemble approach in target prediction. J Chem 2016;8:20.
21. Mayr A, Klambauer G, Unterthiner T, et al. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 2018;9:5441–51.
22. Cheminform J, Sturm N, Mayr A, et al. Industry-scale application and evaluation of deep learning for drug target prediction. J Chem 2020;12:1–13.
23. Begnini F, Poongavanam V, Over B, et al. Mining natural products for macrocycles to drug difficult targets. J Med Chem 2021;64:1054–72.
24. Chen Y, Mathai N, Kirchmair J, et al. Scope of 3D shape-based approaches in predicting the macromolecular targets of structurally complex small molecules including natural products and macrocyclic ligands. J Chem Inf Model 2020;60(6):2858–75.
25. Chen Y, Kirchmair J. Cheminformatics in natural product-based drug discovery. Mol Inf 2020;39:2000171.
26. Cockroft NT, Cheng X, Fuchs JR. STarFish: a stacked ensemble target fishing approach and its application to natural products. J Chem Inf Model 2019;59:4906–20.
27. Shen B. A new golden age of natural products drug discovery. Cell 2015;163:1297–300.
28. Feher M, Schmidt JM. Property distributions: differences between drugs, natural products, and molecules from combinatorial chemistry. ChemInform 2003;34(17):218–27.
29. Lee ML, Schneider G. Scaffold architecture and pharmacophoric properties of natural products and trade drugs: application in the design of natural product-based combinatorial libraries. J Comb Chem 2001;3:284–9.
30. Stahura FL, Godden JW, Xue L, et al. Distinguishing between natural products and synthetic molecules by descriptor Shannon entropy analysis and binary QSAR calculations. J Chem Inf Comput Sci 2000;40:1245–52.
31. Koehn FE, Carter GT. The evolving role of natural products in drug discovery. Nat Rev Drug Discov 2005;4:206–20.
32. Reker D, Perna AM, Rodrigues T, et al. Revealing the macromolecular targets of complex natural products. Nat Chem 2014;6:1072–8.
33. Newman DJ, Cragg GM. Natural products as sources of new drugs over the last 25 years. J Nat Prod 2007;70:461–77.
34. Bade R, Chan HF, Reynisson J. Characteristics of known drug space. Natural products, their derivatives and synthetic drugs. Eur J Med Chem 2010;45:5646–52.
35. Patridge E, Gareiss P, Kinch MS, et al. An analysis of FDA-approved drugs: natural products and their derivatives. Drug Discov Today 2016;21:204–7.
36. Newman DJ, Cragg GM. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J Nat Prod 2020;83:770–803.
37. Gilson MK, Liu T, Baitaluk M, et al. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 2016;44:D1045–53.
38. Wassermann AM, Bajorath J. BindingDB and ChEMBL: online compound databases for drug discovery. Expert Opin Drug Discovery 2011;6:683–7.
39. Wang Y, Xiao J, Suzek TO, et al. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 2009;37:W623–33.
40. Li X, Li Z, Wu X, et al. Deep learning enhancing kinome-wide polypharmacology profiling: model construction and experiment validation. J Med Chem 2019;63(16):8723–37.
41. Kalliokoski T, Kramer C, Vulpetti A, et al. Comparability of mixed IC50 data - a statistical analysis. PLoS One 2013;8:e61007.
42. Stockwell DRB, Peterson AT. Effects of sample size on accuracy of species distribution models. Ecol Model 2002;148:1–13.
43. Barcza S, Kelly LA, Wahrman SS, et al. Structured biological data in the molecular access system. J Chem Inf Comput Sci 1985;25:55–9.
44. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50:742–54.
45. Martin EJ, Polyakov VR, Tian L, et al. Profile-QSAR 2.0: kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds. J Chem Inf Model 2017;57:2077–88.
46. Miljković F, Rodríguez-Pérez R, Bajorath J. Machine learning models for accurate prediction of kinase inhibitors with different binding modes. J Med Chem 2019;63(16):8738–48.
47. Rahaman O, Gagliardi A. Deep learning total energies and orbital energies of large organic molecules using hybridization of molecular fingerprints. J Chem Inf Model 2020;60:5971–83.
48. Cai C, Guo P, Zhou Y, et al. Deep learning-based prediction of drug-induced cardiotoxicity. J Chem Inf Model 2019;59:1073–84.
49. Zheng C, Wang J, Liu J, et al. System-level multi-target drug discovery from natural products with applications to cardiovascular diseases. Mol Divers 2014;18:621–35.
50. Dvorak Z, Klapholz M, Burris TP, et al. Weak microbial metabolites: a treasure trove for using biomimicry to discover and optimize drugs. Mol Pharmacol 2020;98:343–9.
51. Yao H, Liu J, Xu S, et al. The structural modification of natural products for novel drug discovery. Expert Opin Drug Discovery 2017;12:121–40.
52. Huang S, Chen F, Cheng H, et al. Modification and application of polysaccharide from traditional Chinese medicine such as dendrobium officinale. Int J Biol Macromol 2020;157:385–93.
53. Jollans L, Boyle R, Artiges E, et al. Quantifying performance of machine learning methods for neuroimaging data. Neuroimage 2019;199:351–65.
54. Moghaddam DD, Rahmati O, Panahi M, et al. The effect of sample size on different machine learning models for groundwater potential mapping in mountain bedrock aquifers. Catena 2020;187:104421.
55. Beleites C, Neugebauer U, Bocklitz T, et al. Sample size planning for classification models. Anal Chim Acta 2013;760:25–33.
56. Vabalas A, Gowen E, Poliakoff E, et al. Machine learning algorithm validation with a limited sample size. PLoS One 2019;14(11):1–20.
57. Figueroa RL, Zeng-Treitler Q, Kandula S, et al. Predicting sample size required for classification performance. BMC Med Inform Decis Mak 2012;12:8.
58. Alwosheel A, van Cranenburgh S, Chorus CG. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. J Choice Model 2018;28:167–82.
59. Joulin A, van Der Maaten L, Jabri A, et al. Learning visual features from large weakly supervised data. Computer Vision – ECCV 2016 2016;67–84.
60. Chen X, Yan CC, Zhang X, et al. Drug-target interaction prediction: databases, web servers and computational models. Brief Bioinform 2016;17:696–712.
61. Chen X, Zhou C, Wang CC, et al. Predicting potential small molecule-miRNA associations based on bounded nuclear norm regularization. Brief Bioinform 2021;22(6):1–14.
62. Modi S, Li J, Malcomber S, et al. Integrated in silico approaches for the prediction of Ames test mutagenicity. J Comput Aided Mol Des 2012;26:1017–33.
63. Gini G, Garg T, Stefanelli M. Ensembling regression models to improve their predictivity: a case study in qsar (quantitative structure activity relationships) with computational chemometrics. Appl Artif Intell 2009;23:261–81.
64. Wang CC, Zhu CC, Chen X. Ensemble of kernel ridge regression-based small molecule-miRNA association prediction in human disease. Brief Bioinform 2022;23(1):1–11.
65. Chen X, Guan NN, Sun YZ, et al. MicroRNA-small molecule association identification: from experimental results to computational models. Brief Bioinform 2018;21:47–61.
66. Wu Z, Zhu M, Kang Y, et al. Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets. Brief Bioinform 2020;00:1–17.
67. Hewitt M, Cronin MTD, Madden JC, et al. Consensus QSAR models: Do the benefits outweigh the complexity? J Chem Inf Model 2007;47:1460–8.

Author notes

Lu Liang, Ye Liu and Bo Kang contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data