Large-scale comparison of machine learning algorithms for target prediction of natural products

Table 1

Division criteria of threshold

Label	Thresholds in −log10(M)
Active	≥5.5
Weak active	>5.0 and <5.5
Inactive	≤4.5
Weak inactive	>4.5 and <5.0
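
The thresholds in Table 1 can be expressed as a small labelling function. The following is a minimal sketch, not the authors' code: the function names are hypothetical, a value of exactly 5.0 is left unassigned because the thresholds as listed do not cover it, and treating 'Weak active' as positive and 'Weak inactive' as negative in the 'Weak' datasets is an assumption.

```python
from typing import Optional


def activity_label(p_activity: float) -> Optional[str]:
    """Map an activity value in -log10(M) to a Table 1 label."""
    if p_activity >= 5.5:
        return "Active"
    if 5.0 < p_activity < 5.5:
        return "Weak active"
    if p_activity <= 4.5:
        return "Inactive"
    if 4.5 < p_activity < 5.0:
        return "Weak inactive"
    return None  # exactly 5.0 is not covered by the listed thresholds


def binary_label(p_activity: float, keep_weak: bool) -> Optional[int]:
    """Return 1 (positive), 0 (negative) or None (data point dropped).

    Assumption: in datasets that keep weak data, 'Weak active' counts as
    positive and 'Weak inactive' as negative.
    """
    label = activity_label(p_activity)
    if label is None:
        return None
    if label.startswith("Weak") and not keep_weak:
        return None
    return 1 if label in ("Active", "Weak active") else 0


for value in (6.0, 5.2, 4.7, 4.0):
    print(value, activity_label(value), binary_label(value, keep_weak=False))
```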

Following the above steps, we built eight training datasets from ChEMBL26. To distinguish the datasets that include weakly active data from those that do not, we added the prefix ‘Weak’ to the abbreviations of the former: (1) the NPs dataset without weakly active data (NPs), (2) the dataset of NPs and their first-class derivatives without weakly active data (NPs + Der1), (3) the dataset of NPs and all their derivatives without weakly active data (NPs + DerALL), (4) the dataset of all ChEMBL26 compounds without weakly active data (ChEMBL26), (5) the NPs dataset with weakly active data (Weak NPs), (6) the dataset of NPs and their first-class derivatives with weakly active data (Weak NPs + Der1), (7) the dataset of NPs and all their derivatives with weakly active data (Weak NPs + DerALL) and (8) the dataset of all ChEMBL26 compounds with weakly active data (Weak ChEMBL26).

Furthermore, an external validation set was constructed from ChEMBL29, NPASS, BindingDB and PubChem assays; target-compound pairs overlapping with the training datasets were removed.

Molecular fingerprints

Three binary fingerprints, extended connectivity fingerprints (ECFP), functional-class fingerprints (FCFP) and Molecular ACCess System (MACCS) keys, were used as chemical descriptors in this study. MACCS is a 166-bit fingerprint based on a well-defined dictionary of substructures (MACCS keys) [43]. ECFP and FCFP are substructure fingerprints based on the Morgan algorithm [44]; they represent sets of circular atom neighborhoods by iteratively encoding the environment around each atom. ECFP and FCFP differ in how atoms are characterized at initialization. In ECFP, the initial identifier of an atom is derived from several properties of the atom itself, whereas in FCFP it is derived from the functional class of the atom before substructures are enumerated. We compared the prediction performance of each fingerprint and of their combinations. Fingerprint representations were generated using the RDKit implementations of MACCS (166-bit), 2048-bit ECFP6 (radius = 3) and 2048-bit FCFP6 (radius = 3). A combination of fingerprints is obtained by concatenating the bit strings of the individual fingerprints; for example, the combination of 166-bit MACCS and 2048-bit ECFP6 has 2214 bits.
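
The concatenation of fingerprints can be illustrated with plain bit lists. This is only a sketch under stated assumptions: the bit vectors below are toy placeholders rather than fingerprints computed from real molecules, and `concatenate_fingerprints` is a hypothetical helper name. In practice the bits would come from RDKit; note that RDKit's MACCS key vector is commonly reported to have 167 positions with an unused position 0, so one position would need to be dropped to obtain 166 bits.

```python
from typing import List, Sequence

MACCS_BITS = 166    # length stated in the text
MORGAN_BITS = 2048  # length of ECFP6 and FCFP6 stated in the text


def concatenate_fingerprints(*fingerprints: Sequence[int]) -> List[int]:
    """Join several binary fingerprints into one longer bit list."""
    combined: List[int] = []
    for fingerprint in fingerprints:
        combined.extend(int(bit) for bit in fingerprint)
    return combined


# Toy placeholder fingerprints (not derived from any real molecule).
maccs = [0] * MACCS_BITS
ecfp6 = [0] * MORGAN_BITS
fcfp6 = [0] * MORGAN_BITS
maccs[41] = 1
ecfp6[7] = 1
fcfp6[2047] = 1

maccs_ecfp6 = concatenate_fingerprints(maccs, ecfp6)
all_three = concatenate_fingerprints(ecfp6, fcfp6, maccs)

print(len(maccs_ecfp6))  # 2214
print(len(all_three))    # 4262
```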

Cluster cross-validation

Cluster cross-validation is a popular data-partitioning scheme in which compounds are clustered by chemical similarity before the data are divided into training and test sets [45]. In our experiments, we conducted 3-fold cross-validation on compounds clustered in advance to evaluate model performance. To prevent similar data points from falling into both the training set and the test set, all molecules were first clustered with the single-linkage algorithm. The Jaccard distance between binarized Morgan fingerprints with a radius of 2 was used as the metric between any two compounds, and the minimum distance was set to 0.3. In the next step, each cluster was randomly assigned to one of the three folds, so that all molecules of a cluster fall into the same fold. We also compared the performance of cluster cross-validation and random cross-validation (see Table S12).
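
The clustering and fold assignment described above can be sketched as follows. This is an illustration rather than the authors' implementation: fingerprints are represented as sets of on-bit indices, the function names are hypothetical, and the reading that two molecules are linked when their Jaccard distance is below 0.3 is an assumption about how the 'minimum distance' is applied. Cutting a single-linkage hierarchy at a threshold is equivalent to taking connected components of the graph that links every pair closer than the threshold, which is what the union-find loop does.

```python
import random
from typing import List, Set


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Jaccard distance between two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(fps: List[Set[int]], threshold: float = 0.3) -> List[int]:
    """Cluster ids from single linkage cut at `threshold` (pairs closer than it are linked)."""
    parent = list(range(len(fps)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            if jaccard_distance(fps[i], fps[j]) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(fps))]


def assign_folds(cluster_ids: List[int], n_folds: int = 3, seed: int = 0) -> List[int]:
    """Randomly assign whole clusters to folds; returns one fold index per molecule."""
    rng = random.Random(seed)
    fold_of_cluster = {c: rng.randrange(n_folds) for c in sorted(set(cluster_ids))}
    return [fold_of_cluster[c] for c in cluster_ids]


# Toy fingerprints: molecules 0/1 and 2/3 are near-duplicates, molecule 4 is alone.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11, 12}, {10, 11, 12, 13}, {20, 21}]
cluster_ids = single_linkage_clusters(toy_fps)
folds = assign_folds(cluster_ids)
print(cluster_ids, folds)
```

The pairwise loop is quadratic in the number of molecules, so a dataset of ChEMBL scale would need a more efficient neighbor search; the sketch only shows the partitioning logic.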

Nested cluster cross-validation

To obtain a fair evaluation, nested cross-validation was performed for hyperparameter tuning. In nested cross-validation, the data are split at two levels: an outer loop and an inner loop. Different hyperparameter combinations were tried in the inner loop to determine which combination achieves the best performance. The hyperparameters, and a performance comparison of RF and XGBoost with default and selected hyperparameters, can be found in Tables S1, S2 and S9. The selected hyperparameters were then used in the outer loop to train a model for each fold, which avoids hyperparameter selection bias in the performance evaluation. The area under the receiver operating characteristic curve (AUC) was used to assess model performance. For each hyperparameter combination, we obtained the AUC values from the inner loops, and the mean AUC of the two inner loops was used as the criterion for selecting the optimal hyperparameter combination for the corresponding outer loop. Finally, we averaged the AUC values of the three outer loops to obtain a realistic estimate of model performance. In addition, the optimal hyperparameter combination was determined from the mean AUC over the six inner loops of the nested cluster cross-validation [22] and then used to train the final models on all data.
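
The nested scheme can be sketched with three folds as follows. This is a minimal illustration under assumptions: `evaluate` is a hypothetical callable that trains on the given folds with a hyperparameter setting and returns the AUC on the test fold, `toy_evaluate` is a stand-in with made-up values, and the exact inner split (train on one remaining fold, validate on the other, and vice versa) is our reading of the 'two inner loops' per outer fold. The `roc_auc` helper uses the pairwise-ranking definition of AUC.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def nested_cv(hyperparams: List[str],
              evaluate: Callable[[List[int], List[int], str], float],
              n_folds: int = 3) -> Tuple[float, str]:
    """Return (mean outer AUC, hyperparameter chosen from all inner loops)."""
    outer_scores = []
    inner_scores = {hp: [] for hp in hyperparams}
    for outer in range(n_folds):
        inner = [f for f in range(n_folds) if f != outer]
        per_hp = {}
        for hp in hyperparams:
            # Two inner loops: each remaining fold is the validation fold once.
            scores = [evaluate([f for f in inner if f != val], [val], hp) for val in inner]
            per_hp[hp] = mean(scores)
            inner_scores[hp].extend(scores)
        best = max(hyperparams, key=per_hp.get)
        # Outer loop: train on both inner folds with the selected setting.
        outer_scores.append(evaluate(inner, [outer], best))
    final_hp = max(hyperparams, key=lambda hp: mean(inner_scores[hp]))
    return mean(outer_scores), final_hp


def toy_evaluate(train_folds: List[int], test_folds: List[int], hp: str) -> float:
    """Stand-in for model training and scoring; the values are made up."""
    return {"shallow": 0.80, "deep": 0.85}[hp] + 0.01 * test_folds[0]


mean_auc, final_hp = nested_cv(["shallow", "deep"], toy_evaluate)
print(round(mean_auc, 3), final_hp)  # 0.86 deep
print(roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.5]))  # 0.75
```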

Machine learning methods

We compared the prediction performances of eight machine learning architectures for NP target fishing, including three deep learning methods (FNN, CNN and RNN) and five traditional machine learning approaches (SVM, XGBoost, RF, KNN and NB).

NB has been widely used in target prediction and is included as a baseline method. SVM and KNN are typical similarity-based classification methods, whereas RF and XGBoost are representative feature-based classification methods, with XGBoost implementing gradient tree boosting. Deep learning methods have recently garnered significant attention in target fishing, and three representative deep neural network architectures are considered in this study. Among them, FNN follows the standard feedforward architecture and takes vectorial inputs; CNN, which is well established in image processing, applies the same principle through its convolutional layers; and RNN processes sequence data through recurrent connections using memory blocks. The details of each algorithm are provided in the Supporting Information. The overall workflow to assess the predictive performance of the machine learning algorithms on different datasets is shown in Figure 1.

Figure 1

The overall workflow to assess the predictive performances of the machine learning algorithms on different datasets.

Results and Discussion

Dataset

Eight different datasets and an external validation set were prepared for evaluating the NP target prediction models. The statistics of each dataset can be found in Table 2. A total of 899 single protein targets with varying numbers of data points were identified for ChEMBL26. Each target has between 100 and 7086 unique compounds, with a mean of 795, a median of 410 and a first quartile of 191. The smallest dataset, NPs, contained 26 targets. The minimum, maximum, mean, median and first quartile of the number of data points per target in NPs were 100, 592, 174, 148 and 122, respectively. Detailed information is provided in the Supplementary Materials.

Table 2

The number of targets, compounds and bioactivities of each dataset

Datasets	Targets_number	Compound_number	Bioactivity_number
ChEMBL26	899	458 198	714 438
NPs + DerALL	470	97 706	164 195
NPs + Der1	150	18 493	30 543
NPs	26	3052	4521
Weak ChEMBL26	1084	522 416	863 677
Weak NPs + DerALL	585	121 729	219 793
Weak NPs + Der1	211	26 279	46 236
Weak NPs	37	4176	7250
External validation set	666	7205	10 776

Fingerprint selection

Feature selection is a key step in machine learning. For small molecules, SMILES strings, molecular fingerprints and molecular graphs can all be used as input features. Most machine learning algorithms use molecular fingerprints as input [22, 26, 46, 47]. With the progress of deep learning algorithms, CNN models using molecular graphs and RNN models using SMILES strings as input features have also shown good performance [21]. Therefore, in this work, molecular graphs and SMILES strings are adopted for CNN and RNN, respectively, whereas the remaining six algorithms take molecular fingerprints as input. Generally, the choice of molecular fingerprint affects the performance of a ligand-based target prediction model [47]. Here, we evaluated three fingerprints (ECFP6, FCFP6 and MACCS) and their combinations (ECFP6 + FCFP6, ECFP6 + MACCS, FCFP6 + MACCS and ECFP6 + FCFP6 + MACCS) using six machine learning methods (FNN, SVM, RF, KNN, NB and XGBoost) on six datasets: NPs, NPs + Der1, ChEMBL26, Weak NPs, Weak NPs + Der1 and Weak ChEMBL26. The 26 intersection targets of NPs, NPs + Der1 and ChEMBL26 were used for the comparison; likewise, the 37 intersection targets of Weak NPs, Weak NPs + Der1 and Weak ChEMBL26 were used for the datasets with weakly active data. The results for ChEMBL26 are listed in Table 3, and the results for the other five datasets can be found in Supplementary Tables S3–S7 available online at https://dbpia.nl.go.kr/bib. As shown in Table 3, the averaged AUC values of models using ECFP6 + FCFP6 + MACCS ranked first (including ties) for four out of six machine learning methods. Among the other five datasets, ECFP6 + FCFP6 + MACCS performed best on Weak NPs + Der1 (Supplementary Table S6) and Weak ChEMBL26 (Supplementary Table S7). Its poorer performance on NPs (Supplementary Table S3), Weak NPs (Supplementary Table S4) and NPs + Der1 (Supplementary Table S5) was possibly due to the small size of these datasets, which makes the evaluation less clear-cut, less stable and more biased. In addition, regardless of the dataset, combinations of fingerprints generally performed better than a single fingerprint. Overall, the combined fingerprints, which encode more molecular properties, yield better performance. Therefore, ECFP6 + FCFP6 + MACCS was chosen as the optimal fingerprint combination and used as the input feature in the subsequent training.

Table 3

Performance comparison of different target prediction methods on ChEMBL26; the table gives the means and SDs of AUC values for the compared algorithms and feature categories or input types; the top 1 AUC values were marked as bold text

	FNN	RF	SVM	XGBOOST	KNN	NB
ECFP6	0.869 ± 0.072	0.890 ± 0.061	0.889 ± 0.072	0.869 ± 0.071	0.828 ± 0.108	0.840 ± 0.079
FCFP6	0.871 ± 0.080	0.884 ± 0.071	0.892 ± 0.047	0.876 ± 0.069	0.825 ± 0.109	0.842 ± 0.065
MACCS	0.878 ± 0.063	0.884 ± 0.062	0.871 ± 0.059	0.877 ± 0.063	0.827 ± 0.080	0.779 ± 0.076
ECFP6 + FCFP6	0.880 ± 0.074	0.893 ± 0.062	0.908 ± 0.047	0.885 ± 0.062	0.832 ± 0.110	0.847 ± 0.069
ECFP6 + MACCS	0.888 ± 0.053	0.896 ± 0.062	0.898 ± 0.065	0.889 ± 0.059	0.838 ± 0.113	0.836 ± 0.067
FCFP6 + MACCS	0.877 ± 0.065	0.900 ± 0.063	0.902 ± 0.045	0.891 ± 0.068	0.838 ± 0.105	0.840 ± 0.060
ECFP6 + FCFP6 + MACCS	0.880 ± 0.065	0.900 ± 0.059	0.911 ± 0.046	0.892 ± 0.064	0.838 ± 0.111	0.844 ± 0.062

Graph selection

In the case of CNN, two graph featurizers, ConvMolFeaturizer and WeaveFeaturizer, were compared as input features; they are referred to as GC and Weave, respectively. The detailed comparison results are listed in Table 4. According to the AUC values, GC performed better than Weave on all six datasets. Thus, GC was used in the subsequent work.

Table 4

The means and SDs of AUC values of GC and Weave for the datasets with and without weakly active data

Dataset	GC	Weave
NPs	0.824 ± 0.081	0.794 ± 0.072
NPs + Der1	0.855 ± 0.065	0.839 ± 0.061
ChEMBL26	0.882 ± 0.056	0.855 ± 0.055
Weak NPs	0.747 ± 0.111	0.712 ± 0.098
Weak NPs + Der1	0.790 ± 0.072	0.780 ± 0.065
Weak ChEMBL26	0.834 ± 0.075	0.814 ± 0.058

Activity threshold selection

Some existing studies use traditional machine learning with weakly active data removed, whereas deep learning is considered able to distinguish data in the weakly active region [21, 40, 48]. Since NPs often interact weakly with multiple targets [49–52], it is necessary to examine the effect of weak activity data for NPs. Therefore, we used two dataset partitioning methods to explore whether it is better to apply a single activity threshold directly or to exclude the weakly active data points. The results of the eight algorithms (FNN, GC, SVM, RF, KNN, NB, XGBoost and LSTM) are displayed as boxplots in Figure 2. The orange line in each box represents the median AUC value of the eight models. Figure 2 shows that the models excluding weakly active data (blue boxes) generated significantly higher median AUC values than the models containing weakly active data (green boxes). Unless otherwise stated, all models in the subsequent evaluations used the datasets excluding weakly active data.

Figure 2

The effect of including or excluding weakly active data on model performance for NPs, NPs + Der1 and ChEMBL26.

Large-scale comparison

After determining the input features and activity thresholds on a small set of intersection targets (26 targets for datasets without weakly active data and 37 targets for datasets with weakly active data), we compared the training results of the eight algorithms on a larger benchmark of four datasets with all of their targets (ChEMBL26 with 899 targets, NPs + DerALL with 470 targets, NPs + Der1 with 150 targets and NPs with 26 targets). The means and standard deviations (SDs) of the AUC values of the eight algorithms on the four datasets are shown in Table 5. FNN performed best, with the highest averaged AUC value (marked as bold text in Table 5) on three datasets, which is consistent with the finding of Mayr et al. [21] that deep learning methods significantly outperform all competing methods. In addition, FNN, GC, SVM and RF performed stably, with an averaged AUC value >0.8 on every dataset; LSTM, XGBoost and KNN performed poorly on the small datasets (NPs and NPs + Der1); and NB performed poorly on all datasets. We also trained models on the ChEMBL29 benchmark with more hyperparameters, and the results were similar to those of the models constructed using ChEMBL26 (see Tables S10 and S11).

Table 5

The means and SDs of AUC values of the eight methods in the four datasets, the top 1 AUC values were marked as bold text

	NPs	NPs + Der1	NPs + DerALL	ChEMBL26
FNN	0.811 ± 0.083	0.854 ± 0.109	0.890 ± 0.099	0.884 ± 0.091
RF	0.825 ± 0.098	0.838 ± 0.130	0.873 ± 0.106	0.866 ± 0.106
SVM	0.835 ± 0.080	0.837 ± 0.140	0.871 ± 0.127	0.856 ± 0.134
LSTM	0.667 ± 0.087	0.772 ± 0.117	0.859 ± 0.110	0.850 ± 0.105
GC	0.824 ± 0.081	0.834 ± 0.113	0.851 ± 0.115	0.842 ± 0.110
XGBoost	0.793 ± 0.115	0.816 ± 0.134	0.841 ± 0.123	0.835 ± 0.126
KNN	0.761 ± 0.109	0.785 ± 0.125	0.819 ± 0.124	0.815 ± 0.115
NB	0.782 ± 0.109	0.717 ± 0.161	0.737 ± 0.163	0.739 ± 0.159

Table 6

Statistics of the better and poorer targets of NPs + DerALL, NPs + Der1 and NPs compared with the same targets in ChEMBL26

	26 targets		150 targets		463 targets
	Better_number	Poorer_number	Better_number	Poorer_number	Better_number	Poorer_number
FNN
NPs	2	24
NPs + Der1	11	15	50	100
NPs + DerALL	15	11	82	68	248	215
GC
NPs	12	14
NPs + Der1	12	14	55	95
NPs + DerALL	17	9	85	65	251	212
SVM
NPs	4	22
NPs + Der1	9	17	46	104
NPs + DerALL	12	14	79	71	245	218
RF
NPs	8	18
NPs + Der1	11	15	47	103
NPs + DerALL	12	14	72	78	230	233

Figure 3

The relationship between the number of data points per target and the AUC value. The abscissa is the number of data points for the target, and the ordinate is the AUC value. The blue dotted lines at a data size of 1000 separate models with stable results from those with unstable results.

We also examined the target prediction results of each algorithm across different datasets. For a fair comparison, only the intersection targets of the datasets were considered: 26 targets of ChEMBL26, NPs + DerALL and NPs + Der1 intersected with NPs; 150 targets of ChEMBL26 and NPs + DerALL intersected with NPs + Der1; and 463 targets of ChEMBL26 intersected with NPs + DerALL. Based on the AUC value, we compared the performance of NPs + DerALL, NPs + Der1 and NPs with that of ChEMBL26 on each target. The number of targets with higher AUC values than ChEMBL26 was recorded as the Better Number, while the numbers of targets with lower or equal AUC values were recorded as the Poorer Number and the Equal Number, respectively. Table 6 shows the results of FNN, GC, SVM and RF, and the results of KNN, NB, XGBoost and LSTM can be found in Supplementary Table S8 available online at https://dbpia.nl.go.kr/bib. For FNN, among the 26 NPs targets, only 2 had higher AUC values with the NPs dataset than with ChEMBL26. This number increased to 11 with NPs + Der1 and to 15 with NPs + DerALL, which is more than half of the 26 NPs targets (Table 6). For the 150 targets of NPs + Der1, 50 targets had a better AUC value, and 82 of the 150 targets exceeded ChEMBL26 with NPs + DerALL. It should be noted that, on the 463 shared targets, NPs + DerALL had more better targets than poorer targets for three of the four stable machine learning methods. These results show that models trained on NPs + DerALL, NPs + Der1 and NPs, despite containing less data than ChEMBL26, can still outperform the ChEMBL26 models on many targets, which indicates the great potential of NP-specific datasets.
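
The per-target comparison behind the Better, Equal and Poorer Numbers can be sketched as below. This is an illustration only: the target identifiers and AUC values are invented, `compare_targets` is a hypothetical name, and comparing raw AUC values without rounding (tolerance 0) is an assumption, since the text does not state how equality was decided.

```python
from typing import Dict, Tuple


def compare_targets(auc_np: Dict[str, float],
                    auc_ref: Dict[str, float],
                    tol: float = 0.0) -> Tuple[int, int, int]:
    """Count targets where the NP-specific model is better, equal or poorer.

    Only targets present in both dictionaries (the intersection) are compared.
    """
    better = equal = poorer = 0
    for target in sorted(set(auc_np) & set(auc_ref)):
        diff = auc_np[target] - auc_ref[target]
        if diff > tol:
            better += 1
        elif diff < -tol:
            poorer += 1
        else:
            equal += 1
    return better, equal, poorer


# Invented per-target AUC values for illustration.
auc_np_derall = {"T1": 0.91, "T2": 0.80, "T3": 0.85, "T4": 0.77}
auc_chembl26 = {"T1": 0.88, "T2": 0.84, "T3": 0.85, "T5": 0.90}

print(compare_targets(auc_np_derall, auc_chembl26))  # (1, 1, 1)
```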

Previous work showed that model performance increases with the size of the training set [21, 53, 54]. We also investigated the correlation between dataset size and performance. Scatter plots of data size against AUC value were drawn for all models. As shown in Figure 3, from left to right, the distribution of AUC values moves closer and closer to the top, demonstrating that a larger training set leads to better predictions. This is consistent with previous work [55–59]. In particular, when the amount of data reaches 10³–10⁴, the AUC values are concentrated in the range of 0.8–1, which is a relatively high and stable level. In general, the targets of NPs (green points) and NPs + Der1 (blue points) rarely reached a stable data size, whereas NPs + DerALL (yellow points) met this requirement and thus performed better than ChEMBL26. Therefore, we believe that, once sufficient data become available, NP-specific datasets have the potential to yield better models for NP target prediction than a hybrid dataset of all ChEMBL molecules, even though the latter contains more data.

Table 7

The number of targets, molecules and bioactivity measurements in the intersection of external validation set

Targets_number	Compound_number	Bioactivity_number
192	4516	5824

Table 8

Statistics of the better, equal and poorer targets based on 13 targets (with more than 100 compounds) in the external validation set

	Optimal number of targets in NPs + DerALL models
	Better_number	Equal_number	Poorer_number
FNN	8	2	3
GC	6	1	6
LSTM	8	1	4
SVM	6	2	5
RF	5	2	6
XGBOOST	6	1	6
KNN	5	1	7
NB	7	1	5

Figure 4

The AUC values from internal validation versus external validation.

External validation

Several studies have demonstrated that cross-validation performance differs between in-sample and out-of-sample test pairs [60, 61]. To better evaluate the models, an external validation set containing no training samples was built and applied to the final models, which were trained on all data. Because most targets of NPs and NPs + Der1 have very little data in the training and external validation sets, and the data are unevenly distributed, only NPs + DerALL and ChEMBL26 were compared on the external validation set.

First, the results of the internal validation based on nested cluster cross-validation were compared with those of the external validation to evaluate the generalization capacity of our models. The result for NPs + DerALL is displayed in Figure 4. As shown in the boxplot, the internal validation (green boxes) and the external validation (blue boxes) displayed comparable performance, and in most cases the external validation had higher median AUC values (orange lines in the boxes) than the internal validation. Therefore, our trained models showed good robustness.

Next, we evaluated whether the NP-specific models built with NPs + DerALL performed better for NP target prediction than conventional models built with all the mixed molecules of ChEMBL26. Because of data size limitations, the AUC value could not be calculated for some targets, so only the intersection targets of NPs + DerALL and ChEMBL26 with complete estimates were selected. In the end, 192 targets were selected (Table 7); the details of these 192 targets can be found in the Supplementary Materials.

Because the size of the external validation set also has a significant impact on the model estimate, the relationship between the data distribution and the AUC values was further analyzed and is displayed in Figure 5. When the data volume of a target in the external validation set was >100, most models reached a reliable level, with an AUC value of >0.8. In contrast, the results were highly scattered when the data volume in the external validation set was <100. Therefore, only targets with >100 compounds were considered in the following discussion.

Figure 5

The relationship between the data size of the external validation set and the performance of the eight different models. The blue dotted lines at data size value of 100 differentiate models with stable or unstable results.

Table 8 lists the numbers of better, equal and poorer targets among those with a data size of >100. In this case, relatively more targets performed better with NPs + DerALL than with ChEMBL26. In addition, the means and SDs of the AUC values of the eight methods for the targets with an external data size of >100 are listed in Table 9. FNN and LSTM performed better on NPs + DerALL, and the FNN model of NPs + DerALL performed best, with the highest mean AUC value of 0.944. For the other methods, the averaged AUC values on NPs + DerALL were very close to those on ChEMBL26. In summary, the NP-specific models (NPs + DerALL) can provide better target prediction for NPs and their derivatives.

Table 9

The means and SDs of the AUC values of eight methods for the NPs + DerALL and ChEMBL26

	NPs + DerALL	ChEMBL26
FNN	0.944 ± 0.044	0.933 ± 0.057
GC	0.911 ± 0.085	0.918 ± 0.061
LSTM	0.910 ± 0.073	0.883 ± 0.101
SVM	0.929 ± 0.070	0.933 ± 0.061
RF	0.931 ± 0.066	0.935 ± 0.063
XGBOOST	0.915 ± 0.081	0.913 ± 0.083
KNN	0.888 ± 0.078	0.902 ± 0.069
NB	0.879 ± 0.104	0.874 ± 0.110

Consensus model

By combining multiple learners, ensemble methods can often obtain significantly better generalization performance than a single learner [62–65]. Therefore, we combined the eight different models to build consensus models for NP target prediction and evaluated their performance on the external validation set. The average probability was used as the predicted score for the two-algorithm (28 in total) and three-algorithm (56 in total) consensus models [66, 67]. Eight performance measures, namely AUC, the area under the PR curve (AP), accuracy, precision, specificity, F1-score, kappa and recall, were used to estimate the overall performance of the different consensus models in target prediction tasks. Partial results are shown in Figure 6, and the rest are displayed in Supplementary Figure S1 available online at https://dbpia.nl.go.kr/bib. The full evaluation results of the consensus models can be found in the Supplementary Materials.
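
The counts of 28 and 56 consensus models follow from choosing 2 or 3 of the 8 algorithms, and the consensus score is the mean of the member probabilities. The sketch below illustrates both points; the probability values are invented, and `consensus_scores` is a hypothetical helper name rather than the authors' code.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence

MODELS = ["FNN", "GC", "LSTM", "SVM", "RF", "XGBoost", "KNN", "NB"]


def consensus_scores(probabilities: Dict[str, List[float]],
                     members: Sequence[str]) -> List[float]:
    """Average the predicted probabilities of the member models per data point."""
    columns = [probabilities[m] for m in members]
    return [mean(values) for values in zip(*columns)]


pairs = list(combinations(MODELS, 2))
triples = list(combinations(MODELS, 3))
print(len(pairs), len(triples))  # 28 56

# Invented probabilities for three target-compound pairs.
toy_probabilities = {"FNN": [0.9, 0.2, 0.6], "SVM": [0.7, 0.4, 0.2]}
fnn_svm = consensus_scores(toy_probabilities, ("FNN", "SVM"))
print([round(p, 3) for p in fnn_svm])  # [0.8, 0.3, 0.4]
```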

Figure 6

AUC, AP, F1-score and kappa values of the two-algorithm consensus models. The red reference line marks the best value achieved by a single model.

For the two-algorithm consensus models, GC + SVM ranked first in AUC; FNN + SVM ranked first in AP and F1-score; FNN + GC ranked first in kappa and accuracy; FNN + KNN ranked first in precision and specificity; and LSTM + XGBoost ranked first in recall. Different consensus models therefore have different strengths. Taken together, however, five of the eight indicators (AP, kappa, accuracy, precision and F1-score) of FNN + SVM ranked first or second, which indicates that FNN + SVM had the best overall performance. For the three-algorithm consensus models (Supplementary Figure S2 available online at https://dbpia.nl.go.kr/bib), FNN + GC + XGBoost performed best on five indicators (accuracy, AP, precision, F1-score and kappa). However, the best two-algorithm models were superior to the best three-algorithm model on the six indicators other than accuracy and AP. Overall, FNN + SVM performed best, and this consensus model improved the scores on multiple evaluation metrics relative to the single models, which highlights the advantages of ensemble methods for NP target fishing.
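The six threshold-based indicators compared above (accuracy, recall, precision, specificity, F1-score and kappa) can all be derived from a binary confusion matrix. The following sketch shows the standard definitions for a single target; the function names are hypothetical, and how the scores were aggregated across targets in this study is not assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_pred):
    """Compute the six threshold-based indicators from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# Toy labels for four compound-target pairs (made-up values).
print(threshold_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

AUC and AP are computed from the ranked probabilities rather than from thresholded labels, so they are not covered by this sketch.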

Multi-voting method

Voting is another ensemble technique. We used it to predict targets for NPs by combining the eight algorithms; the results are listed in Table 10. Under the Vote_1 scheme, a positive label was assigned if one or more models predicted positive, whereas the Vote_8 scheme required all eight models to predict positive. Although the voting models performed poorly on most metrics, the Vote_1 model had the highest recall (0.927), which makes it a good choice when the aim is to find more candidate targets. Conversely, the Vote_8 model had the highest specificity, a large improvement from 0.725 (the best single model) to 0.923, which demonstrates its ability to raise the true negative rate. In other words, if the aim is to reliably exclude targets on which a molecule has no effect, more votes are necessary.
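The voting rule can be sketched in a few lines of Python. This is an illustration rather than the authors' implementation: the function name and toy labels are hypothetical, and the sketch assumes that Vote_k assigns a positive label when at least k models predict positive, which matches the Vote_1 and Vote_8 descriptions above.

```python
def vote_k(predictions, k):
    """Label a sample positive (1) if at least `k` models predict positive.

    `predictions` maps a model name to a list of 0/1 labels, one per
    compound-target pair (a hypothetical layout used for illustration).
    """
    per_sample_votes = zip(*predictions.values())
    return [1 if sum(votes) >= k else 0 for votes in per_sample_votes]


# Toy binary predictions from three models for three samples (made-up values).
toy = {
    "FNN": [1, 0, 1],
    "SVM": [1, 0, 0],
    "RF": [0, 0, 1],
}
print(vote_k(toy, 1))  # [1, 0, 1]
print(vote_k(toy, 3))  # [0, 0, 0]
```

Raising k can only turn positive labels into negative ones, which is why recall falls and specificity rises monotonically from Vote_1 to Vote_8 in Table 10.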

Table 10

Voting results and single-model results of the eight models for accuracy, recall, precision, specificity, balanced average (F1-score) and consistency test index (kappa); the top value for each indicator is marked in bold

Model	Accuracy	Recall	Precision	Specificity	F1-score	Kappa
Vote_1	0.734820	0.927357	0.690474	0.334298	0.768811	0.257488
Vote_2	0.756605	0.863349	0.705754	0.457933	0.753761	0.327965
Vote_3	0.768086	0.824571	0.721773	0.532612	0.746595	0.363039
Vote_4	0.785580	0.794663	0.720453	0.606143	0.731923	0.395788
Vote_5	0.784754	0.752909	0.722685	0.683765	0.712288	0.425853
Vote_6	0.786689	0.706961	0.727658	0.760024	0.691929	0.451502
Vote_7	0.777391	0.640642	0.730498	0.837807	0.657155	0.442357
Vote_8	0.727814	0.518230	0.707729	0.923482	0.570414	0.374407
FNN	0.798308	0.791062	0.767744	0.725392	0.760613	0.491853
KNN	0.729326	0.688061	0.700937	0.657092	0.659602	0.320299
XGBoost	0.786030	0.803387	0.737362	0.615127	0.744092	0.419844
NB	0.720282	0.662418	0.690309	0.643922	0.643178	0.280391
RF	0.762731	0.760221	0.690735	0.598778	0.697785	0.353804
SVM	0.765061	0.778616	0.718526	0.581766	0.720179	0.359531
GC	0.792923	0.776271	0.728246	0.693327	0.733131	0.445779
LSTM	0.767079	0.768647	0.726224	0.620660	0.722073	0.386801

Conclusions

NPs are valuable sources of drugs, and studying their activity, especially discovering their specific targets, is very important for their development. As data have accumulated, various algorithms have been successfully applied to molecular target prediction; however, given the clear differences between the characteristics of NPs and synthetic molecules, it is necessary to construct prediction models specific to NPs. We therefore collected the activity data of NPs and their derivatives to build three NP-specific datasets: NPs, NPs + Der1 and NPs + DerALL. Multiple machine learning methods, including SVM, XGBoost, RF, KNN, NB, FNN, CNN and RNN, were used to construct NP-specific target prediction models, which were then compared with the traditional models built on ChEMBL26.

We first examined the effects of the input features and activity thresholds on different datasets with multiple algorithms. The results showed that the combined ECFP6 + MACCS + FCFP6 fingerprint had a more comprehensive performance because it integrates molecular information from the individual fingerprints. For the CNN, the ConvMolFeaturizer performed better than the WeaveFeaturizer. In addition, the models excluding weakly active data performed better than the models containing weakly active data. The best conditions obtained above were then used for the large-scale comparison of multiple algorithms on different datasets (NPs, NPs + Der1, NPs + DerALL and ChEMBL26). First, the deep learning method FNN performed best, with the highest average AUC value on most datasets. Second, although the models built on NPs and NPs + Der1 were poor and unstable owing to limited data, the NPs + DerALL models had better predictive ability than the ChEMBL26 models for most algorithms. We then took the prediction model based on NPs + DerALL as the representative NP-specific model and evaluated its performance on the external validation set. On the one hand, the AUC values of the external validation were comparable to those of the internal validation, which demonstrates the good generalization ability of our model. On the other hand, the models built with NPs + DerALL showed better classification ability and robustness than those built with ChEMBL26 when the validation data were sufficient (>100 per target). In addition, among the consensus models, the combination of FNN and SVM performed best overall, improving scores on multiple evaluation indicators compared with the single algorithms. Another ensemble method, voting across the different algorithms, was also applied in this work. The results showed that the fewer votes required, the higher the recall, so fewer votes can be used to obtain more candidate targets. Conversely, the more votes required, the higher the specificity, indicating that more votes can exclude more unlikely targets.

In summary, NP-specific models are more suitable for the target prediction of NPs, ensemble methods can further improve various prediction indicators, and different types of ensemble methods can be selected according to different requirements.

Key Points

Three NP-specific datasets were constructed and compared with the traditional mixed datasets from ChEMBL26 on the target prediction task of NPs using eight machine learning algorithms.
The combination of ECFP6 + MACCS + FCFP6 fingerprints had a more comprehensive performance because it integrates molecular information from the individual fingerprints.
The models excluding weakly active data performed better than the models containing weakly active data.
The NP-specific dataset, NPs + DerALL, possessed a better predictive ability than ChEMBL26 on most algorithms both in internal validation and external validation.
Among ensemble methods, the combination of FNN and SVM performed best comprehensively and voting models significantly improved the recall and specificity.

Data and code availability

The authors declare that all data supporting the findings of this study are available within the article and ESI files. The Supplementary Materials can be accessed at https://doi.org/10.5281/zenodo.6904699 and are also available from the corresponding authors upon reasonable request. The scripts for training and using our models have been uploaded to GitHub (https://github.com/lianglu-nk/NPTP, https://github.com/lianglu-nk/NPTP_external).

Authors’ contributions

Y.L. carried out the experimental work, analysis and interpretation of the results and wrote the original draft. L.L. conceived and designed the experiments, supervised the project, supported the analysis and interpretation of the results and wrote the original draft. B.K. and X.-F.M. improved computing power and guided the model optimization. J.-P.L. supervised the project and participated in the writing and editing of the manuscript. R.W., M.-Y.S. and Q.W. supported the analysis and participated in the experimental work. All authors discussed the results and edited the manuscript.

Funding

This work was supported by the National Key R&D Program of China [No. 2017YFC1104400].

Lu Liang, PhD, is an assistant research fellow at the College of Pharmacy, Nankai University. Her research interests include the development of cheminformatics tools and computational target prediction approaches.

Ye Liu is a student at the College of Pharmacy, Nankai University. Her research interests include computational target prediction approaches and machine learning.

Bo Kang, PhD, is a professional senior engineer of high-performance computing applications at the National Supercomputer Center in Tianjin. His research interests include large-scale high-performance computing, AI computing, big data analysis and molecular modeling.

Ru Wang is a student at the College of Pharmacy, Nankai University. Her research interests include computational antibody design and machine learning.

Meng-Yu Sun is a student at the College of Pharmacy, Nankai University. Her research interests include computational target prediction approaches and machine learning.

Qi Wu, Master, is an R&D engineer of high-performance computing at the National Supercomputer Center in Tianjin. His research interests include bioinformatics computing, molecular modeling and machine learning.

Xiang-Fei Meng, PhD, is the chief scientist of high-performance computing applications at the National Supercomputer Center in Tianjin. His research interests include large-scale high-performance computing, AI computing and high-throughput computing on materials.

Jian-Ping Lin, PhD, is a professor at the College of Pharmacy, Nankai University. His research interests include molecular dynamics, virtual screening, computational target prediction, drug repositioning, ADMET prediction and the development of cheminformatics tools.

References

1. Katz L, Baltz RH. Natural product discovery: past, present, and future. J Ind Microbiol Biotechnol 2016;43:155–76.

2. Achan J, Talisuna AO, Erhart A, et al. Quinine, an old anti-malarial drug in a modern world: role in the treatment of malaria. Malar J 2011;10(1):1–12.

3. Rodrigues T, Reker D, Schneider P, et al. Counting on natural products for drug design. Nat Chem 2016;8:531–41.

4. Atanasov AG, Zotchev SB, Dirsch VM, et al. Natural products in drug discovery: advances and opportunities. Nat Rev Drug Discov 2021;20:200–16.

5. Sorokina M, Steinbeck C. Review on natural products databases: where to find data in 2020. J Chem 2020;12(1):1–51.

6. Sorokina M, Merseburger P, Rajan K, et al. COCONUT online: collection of open natural products database. J Chem 2021;13(1):1–13.

7. Zeng X, Zhang P, He W, et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res 2018;46:D1217–22.

8. Wang C, Kurgan L. Review and comparative assessment of similarity-based methods for prediction of drug–protein interactions in the druggable human proteome. Brief Bioinform 2018;20:1–22.

9. Fang J, Wu Z, Cai C, et al. Quantitative and systems pharmacology. 1. In silico prediction of drug-target interactions of natural products enables new targeted cancer therapy. J Chem Inf Model 2017;57:2657–71.

10. Fang J, Cai C, Wang Q, et al. Systems pharmacology-based discovery of natural products for precision oncology through targeting cancer mutated genes. CPT Pharmacometrics Syst Pharmacol 2017;6(3):177–87.

11. Fang J, Liu C, Wang Q, et al. In silico polypharmacology of natural products. Brief Bioinform 2017;19:1153–71.

12. Hong H, Lmendrick D, Mattes W, et al. Molecular docking for identification of potential targets for drug repurposing. Curr Top Med Chem 2016;16:3636–45.

13. Ye H, Wei J, Tang K, et al. Drug repositioning through network pharmacology. Curr Top Med Chem 2016;16:3646–56.

14. Kenny HA, Hart PC, Kordylewicz K, et al. The natural product β-escin targets cancer and stromal cells of the tumor microenvironment to inhibit ovarian cancer metastasis. Cancer 2021;13:3931.

15. Rariza PDD, Billones JBB. Retusenol potentially inhibits putative drug targets for tuberculosis, cardiovascular diseases, cancer and HIV: (a reverse docking study). Orient J Chem 2018;34:1795–801.

16. Dunyak BM, Gestwicki JE. Peptidyl-proline isomerases (PPIases): targets for natural products and natural product-inspired compounds. J Med Chem 2016;59:9622–44.

17. Yin L, Zheng L, Xu L, et al. In-silico prediction of drug targets, biological activities, signal pathways and regulating networks of dioscin based on bioinformatics. BMC Complement Altern Med 2015;15(1):1–17.

18. Zhang H, Ma S, Feng Z, et al. Cardiovascular disease chemogenomics knowledgebase-guided target identification and drug synergy mechanism study of an herbal formula. Sci Rep 2016;6(1):1–14.

19. Keiser MJ, Setola V, Irwin JJ, et al. Predicting new molecular targets for known drugs. Nature 2009;462:175–81.

20. Wang Z, Liang L, Yin Z, et al. Improving chemical similarity ensemble approach in target prediction. J Chem 2016;8:20.

21. Mayr A, Klambauer G, Unterthiner T, et al. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 2018;9:5441–51.

22. Cheminform J, Sturm N, Mayr A, et al. Industry-scale application and evaluation of deep learning for drug target prediction. J Chem 2020;12:1–13.

23. Begnini F, Poongavanam V, Over B, et al. Mining natural products for macrocycles to drug difficult targets. J Med Chem 2021;64:1054–72.

24. Chen Y, Mathai N, Kirchmair J, et al. Scope of 3D shape-based approaches in predicting the macromolecular targets of structurally complex small molecules including natural products and macrocyclic ligands. J Chem Inf Model 2020;60(6):2858–75.

25. Chen Y, Kirchmair J. Cheminformatics in natural product-based drug discovery. Mol Inf 2020;39:2000171.

26. Cockroft NT, Cheng X, Fuchs JR. STarFish: a stacked ensemble target fishing approach and its application to natural products. J Chem Inf Model 2019;59:4906–20.

27. Shen B. A new golden age of natural products drug discovery. Cell 2015;163:1297–300.

28. Feher M, Schmidt JM. Property distributions: differences between drugs, natural products, and molecules from combinatorial chemistry. ChemInform 2003;34(17):218–27.

29. Lee ML, Schneider G. Scaffold architecture and pharmacophoric properties of natural products and trade drugs: application in the design of natural product-based combinatorial libraries. J Comb Chem 2001;3:284–9.

30. Stahura FL, Godden JW, Xue L, et al. Distinguishing between natural products and synthetic molecules by descriptor Shannon entropy analysis and binary QSAR calculations. J Chem Inf Comput Sci 2000;40:1245–52.

31. Koehn FE, Carter GT. The evolving role of natural products in drug discovery. Nat Rev Drug Discov 2005;4:206–20.

32. Reker D, Perna AM, Rodrigues T, et al. Revealing the macromolecular targets of complex natural products. Nat Chem 2014;6:1072–8.

33. Newman DJ, Cragg GM. Natural products as sources of new drugs over the last 25 years. J Nat Prod 2007;70:461–77.

34. Bade R, Chan HF, Reynisson J. Characteristics of known drug space. Natural products, their derivatives and synthetic drugs. Eur J Med Chem 2010;45:5646–52.

35. Patridge E, Gareiss P, Kinch MS, et al. An analysis of FDA-approved drugs: natural products and their derivatives. Drug Discov Today 2016;21:204–7.

36. Newman DJ, Cragg GM. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J Nat Prod 2020;83:770–803.

37. Gilson MK, Liu T, Baitaluk M, et al. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 2016;44:D1045–53.

38. Wassermann AM, Bajorath J. BindingDB and ChEMBL: online compound databases for drug discovery. Expert Opin Drug Discovery 2011;6:683–7.

39. Wang Y, Xiao J, Suzek TO, et al. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 2009;37:W623–33.

40. Li X, Li Z, Wu X, et al. Deep learning enhancing kinome-wide polypharmacology profiling: model construction and experiment validation. J Med Chem 2019;63(16):8723–37.

41. Kalliokoski T, Kramer C, Vulpetti A, et al. Comparability of mixed IC50 data - a statistical analysis. PLoS One 2013;8:e61007.

42. Stockwell DRB, Peterson AT. Effects of sample size on accuracy of species distribution models. Ecol Model 2002;148:1–13.

43. Barcza S, Kelly LA, Wahrman SS, et al. Structured biological data in the molecular access system. J Chem Inf Comput Sci 1985;25:55–9.

44. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50:742–54.

45. Martin EJ, Polyakov VR, Tian L, et al. Profile-QSAR 2.0: kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds. J Chem Inf Model 2017;57:2077–88.

46. Miljković F, Rodríguez-Pérez R, Bajorath J. Machine learning models for accurate prediction of kinase inhibitors with different binding modes. J Med Chem 2019;63(16):8738–48.

47. Rahaman O, Gagliardi A. Deep learning total energies and orbital energies of large organic molecules using hybridization of molecular fingerprints. J Chem Inf Model 2020;60:5971–83.

48. Cai C, Guo P, Zhou Y, et al. Deep learning-based prediction of drug-induced cardiotoxicity. J Chem Inf Model 2019;59:1073–84.

49. Zheng C, Wang J, Liu J, et al. System-level multi-target drug discovery from natural products with applications to cardiovascular diseases. Mol Divers 2014;18:621–35.

50. Dvorak Z, Klapholz M, Burris TP, et al. Weak microbial metabolites: a treasure trove for using biomimicry to discover and optimize drugs. Mol Pharmacol 2020;98:343–9.

51. Yao H, Liu J, Xu S, et al. The structural modification of natural products for novel drug discovery. Expert Opin Drug Discovery 2017;12:121–40.

52. Huang S, Chen F, Cheng H, et al. Modification and application of polysaccharide from traditional Chinese medicine such as dendrobium officinale. Int J Biol Macromol 2020;157:385–93.

53. Jollans L, Boyle R, Artiges E, et al. Quantifying performance of machine learning methods for neuroimaging data. Neuroimage 2019;199:351–65.

54. Moghaddam DD, Rahmati O, Panahi M, et al. The effect of sample size on different machine learning models for groundwater potential mapping in mountain bedrock aquifers. Catena 2020;187:104421.

55. Beleites C, Neugebauer U, Bocklitz T, et al. Sample size planning for classification models. Anal Chim Acta 2013;760:25–33.

56. Vabalas A, Gowen E, Poliakoff E, et al. Machine learning algorithm validation with a limited sample size. PLoS One 2019;14(11):1–20.

57. Figueroa RL, Zeng-Treitler Q, Kandula S, et al. Predicting sample size required for classification performance. BMC Med Inform Decis Mak 2012;12:8.

58. Alwosheel A, van Cranenburgh S, Chorus CG. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. J Choice Model 2018;28:167–82.

59. Joulin A, van Der Maaten L, Jabri A, et al. Learning visual features from large weakly supervised data. Computer Vision – ECCV 2016 2016;67–84.

60. Chen X, Yan CC, Zhang X, et al. Drug-target interaction prediction: databases, web servers and computational models. Brief Bioinform 2016;17:696–712.

61. Chen X, Zhou C, Wang CC, et al. Predicting potential small molecule-miRNA associations based on bounded nuclear norm regularization. Brief Bioinform 2021;22(6):1–14.

62. Modi S, Li J, Malcomber S, et al. Integrated in silico approaches for the prediction of Ames test mutagenicity. J Comput Aided Mol Des 2012;26:1017–33.

63. Gini G, Garg T, Stefanelli M. Ensembling regression models to improve their predictivity: a case study in qsar (quantitative structure activity relationships) with computational chemometrics. Appl Artif Intell 2009;23:261–81.

64. Wang CC, Zhu CC, Chen X. Ensemble of kernel ridge regression-based small molecule-miRNA association prediction in human disease. Brief Bioinform 2022;23(1):1–11.

65. Chen X, Guan NN, Sun YZ, et al. MicroRNA-small molecule association identification: from experimental results to computational models. Brief Bioinform 2018;21:47–61.

66. Wu Z, Zhu M, Kang Y, et al. Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets. Brief Bioinform 2020;00:1–17.