Abstract

Therapeutic peptides play important roles in disease treatment, and studying them helps reveal the relationship between peptide sequences and their therapeutic and diagnostic potential. Therapeutic peptides can be divided into different types according to their therapeutic functions, and the different types have different characteristics. Although several computational approaches have been proposed to predict different types of therapeutic peptides, none of them accurately predicts all types. In this study, a predictor called PreTP-EL is proposed, which employs an ensemble learning approach to fuse different features and machine learning techniques in order to capture the different characteristics of the various therapeutic peptides. Experimental results showed that PreTP-EL outperformed other competing methods. Availability and implementation: a user-friendly web server of the PreTP-EL predictor is available at http://bliulab.net/PreTP-EL.

Introduction

Peptides regulate growth, immune function and metabolism in the human body [1, 2] and play important roles in many biological activities. For example, anti-cancer peptides have potential for treating cancer [3], anti-angiogenic peptides can inhibit angiogenesis and benefit the human body [4], and cell-penetrating peptides can act as transporters for drug delivery into cells [5]. Therefore, it is important to identify therapeutic peptides with computational methods.

In recent years, several machine learning methods have been developed to identify therapeutic peptides. Traditional machine learning methods typically consist of two steps: peptide sequence feature extraction and classifier construction.

In the first step, considerable research effort has been devoted to characterizing discriminative features of the peptide sequences. A good feature extraction method yields a more discriminative feature vector and thus more effective prediction of peptide sequences [6, 7]. Amino acid composition (AAC) and dipeptide composition (DC) are two features commonly used to represent the composition of a peptide sequence [8]. Composition of k-spaced amino acid pairs (CKSAAP) [9] and adaptive skip dipeptide composition (ASDC) [10] consider short-range and long-range effects of residues, respectively. Pseudo amino acid composition (Pse-AAC) is based on both composition and sequence-order information [11]. Several feature extraction methods utilize physicochemical information to encode the peptide sequences, such as PPTPP [12]. Moreover, several peptide predictors extract specific features from comprehensive feature sets; for example, PEPred-Suite [13] selects the discriminative features from 99 features using SFS and mRMD.

In the second step, different machine learning methods are used to construct the predictors. For example, ACPredStackL [14] predicts anti-cancer peptides (ACPs) with four basic predictors, including k-nearest neighbours (KNN), support vector machine (SVM), LightGBM and naive Bayes (NB). PPTPP [12] and PEPred-Suite [13] use random forest (RF) as their machine learning method. Most existing methods construct models for a specific type of peptide; for example, ACPred-FL [15] and ACPred-Fuse [16] were established to predict ACPs. However, no predictor can accurately predict all therapeutic peptides, because different types of therapeutic peptides have different characteristics. To solve this problem, we developed an ensemble learning method to predict therapeutic peptides by integrating two machine learning classifiers with nine feature extraction methods.

Methods and materials

Dataset construction

In this study, we used eight therapeutic peptide datasets, covering anti-angiogenic peptides (AAP) [17], anti-bacterial peptides (ABP) [18], anti-cancer peptides (ACP) [15], anti-inflammatory peptides (AIP) [19], anti-viral peptides (AVP) [20], cell-penetrating peptides (CPP) [21], polystyrene surface binding peptides (PBP) [22] and quorum sensing peptides (QSP) [23]. Their detailed information is listed in Table 1.

Table 1

Information of the eight peptide training datasets and the independent test set

Datasets    Training set              Independent test set
            Positive    Negative      Positive    Negative
AAP         107         107           28          28
ABP         800         800           199         199
ACP         250         250           82          82
AIP         1258        1887          420         629
AVP         544         407           60          45
CPP         370         370           92          92
QSP         200         200           20          20
PBP         80          80            24          24

As shown in Table 1, each dataset contains a training set and an independent test set. The positive sequences are experimentally validated therapeutic peptides, such as AAPs, whereas the negative sequences are usually common, featureless peptides or randomly shuffled sequences [13].

Overview of PreTP-EL

In this paper, we propose a novel therapeutic peptide prediction method, PreTP-EL, based on ensemble learning. PreTP-EL combines individual predictors via a genetic algorithm to improve prediction performance. First, nine different features are used to characterize a peptide sequence from different perspectives. Second, the resulting feature vectors are fed into two prediction models, SVM and RF classifiers. Then, the 18 individual predictors are combined using optimal weights learned by the genetic algorithm. Finally, the prediction model decides whether the query sequence is a therapeutic peptide: the higher the score, the more likely the peptide is the corresponding type of therapeutic peptide. The framework of PreTP-EL is illustrated in Figure 1.

Figure 1

The framework of PreTP-EL. There are three main steps: (i) feature extraction. Each training or test peptide sequence is represented as nine feature vectors by nine feature extraction methods. (ii) Training phase. The nine feature vectors are fed into SVM and RF basic models for training, and then 18 individual basic predictors are constructed. The training sequences are embedded into the probability matrix based on 18 predictors, where the rows represent the probability scores obtained by the different predictors. Then, the probability matrix is fed into the genetic algorithm for training. The genetic algorithm is applied to learn the weights corresponding to the different predictors. (iii) Test phase. The therapeutic peptide sequence is predicted by Eq. (17).

Feature extraction methods

Feature extraction is an important step in therapeutic peptide recognition. In this study, we utilize nine features to capture different properties of the peptides. A peptide sequence S can be expressed as:
$S = R_1R_2R_3\cdots R_L$ (1)
where $R_i$ represents the $i$-th residue of peptide S and L is the length of S. We employed nine different feature extraction methods to encode the peptide sequence: Kmer [24], PSSM and PSFM cross transformation (PPCT) [25], Top-n-gram (Tng) [26], distance-based Top-n-gram (DT) [27], distance amino acid pair, or distance-pair (DP) [28], distance-based residue (DR) [27], PC-PseAAC-General (PCG) [29], PSSM distance transformation (PSDT) [30] and PSFM with distance-bigram transformation (PSBT) [31]. Among them, Kmer, Tng, DT, DP, DR and PCG are extracted by BioSeq-Analysis2.0 [32]. The parameters of the different methods on the eight datasets are listed in Table S1. These feature extraction methods are introduced in detail below.

Kmer

Kmer is a feature extraction method commonly used in protein fold recognition [33], remote homology detection [33] and other biological tasks [34]. Here, it represents the sequence information as the occurrences of k adjacent amino acids (k ∈ [1, ⋯, 5]). For example, when k = 2, the feature vector can be calculated as [24]:
(2)
where $N(r)$ is the number of occurrences of k-mer type $r$, and the set of possible types $r$ depends on the value of k. The feature dimension is $20^k$-D.
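As an illustration, the following is a minimal Python sketch of dipeptide (k = 2) counting under the definition above; the function name and the normalization by the number of k-mers (L − k + 1) are our own assumptions rather than details of the original implementation:

    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def kmer_features(sequence, k=2):
        """Return the normalized occurrence vector of all 20**k possible k-mers."""
        kmer_types = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
        counts = dict.fromkeys(kmer_types, 0)
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            if kmer in counts:          # skip non-standard residues
                counts[kmer] += 1
        total = max(len(sequence) - k + 1, 1)
        return [counts[t] / total for t in kmer_types]

    # Example: a 400-dimensional dipeptide feature vector
    vec = kmer_features("ACDKACDR", k=2)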

PPCT

Each sequence is searched against the non-redundant database NRDB90 [36] by PSI-BLAST [35] (three iterations, e-value = 0.001), and the resulting profile is represented as a PSSM and a PSFM. PPCT [25] integrates the evolutionary information of PSSMs and PSFMs with the sequence information of distance amino acid pairs. The correlation between two positions in S can be expressed as follows [25]:
(3)
where $P_{i,x}$ and $P_{i+d,y}$ are elements of $P_{PSSM}$, and $F_{i,y}$ and $F_{i+d,y}$ are elements of $F_{PSFM}$. $f(R_x,R_y|d)$ is the consistent occurrence probability of amino acids $R_x$ and $R_y$ separated by the distance $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$) during evolution. A peptide S can be represented as follows [25]:
(4)
The elements can be calculated by [25]:
(5)

The feature dimension is $(20\times(x-1)+y+400\times d_{MAX})$-D.

Tng

Tng [26] utilizes evolutionary information from the PSSM profile and represents the profile-based building blocks of proteins. A Top-n-gram consists of the n most frequent amino acids in the corresponding columns of the PSSM profile. The feature vector records the occurrence frequencies of the n-gram subsequences in the PSSM profile (n ∈ [1, ⋯, 5]). The feature dimension is $20^n$-D.
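A minimal sketch of the n = 1 case under our reading of this description: each position of the PSSM (assumed here to be an L × 20 matrix with one row per residue) is replaced by its highest-scoring amino acid, and the resulting Top-1-gram sequence is counted into a 20-dimensional frequency vector; the function names are hypothetical:

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def top1gram_sequence(pssm):
        """pssm: (L, 20) array, one row per residue position."""
        return "".join(AMINO_ACIDS[j] for j in np.argmax(pssm, axis=1))

    def top1gram_features(pssm):
        """20-D frequency vector of Top-1-grams derived from the PSSM."""
        seq = top1gram_sequence(pssm)
        return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

    # Example with a random profile for a peptide of length 15
    vec = top1gram_features(np.random.rand(15, 20))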

DT

DT [27] combines evolutionary information from Top-n-gram and distance information between pairs of amino acids. DT utilizes the occurrence times of all possible Top-n-gram pairs to calculate the feature vector at a given distance.

Let $(t_i,t_j)$ denote a pair of Top-n-grams $t_i$ and $t_j$, and let $S'$ be the Top-n-gram sequence of S. Assume that $t_i$ appears before $t_j$ and that the distance between them is $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$). Within the maximum distance $d_{MAX}$, the feature vector of peptide sequence S is obtained from the occurrences of Top-n-gram pairs as follows [27]:
(6)
where ${T^0}_i(S')$ is the number of occurrences of $t_i$ and ${T^d}_{ij}(S')$ is the number of occurrences of the pair $(t_i,t_j)$ at distance $d$. The feature vector is obtained by combining all Top-n-gram pairs as follows [27]:
(7)

The dimension of DT is $(20+20\times20\times d_{MAX})$-D.

DR

DR [27] encodes distance information between pairs of amino acids. It counts the occurrences of all possible residue pairs at each distance $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$). The dimension of DR is $(20+400\times d_{MAX})$-D.
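A minimal sketch of DR-style counting under our reading of the description, where d = 0 is taken to contribute the 20 single-residue counts and each distance d = 1, …, d_MAX contributes 400 ordered-pair counts (this interpretation of d = 0 is an assumption):

    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def dr_features(sequence, d_max=2):
        """Distance-based residue (DR) counts: 20 single-residue counts (d = 0)
        followed by 400 ordered-pair counts for each distance d = 1..d_max."""
        single = [sequence.count(a) for a in AMINO_ACIDS]
        pair_types = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
        pairs = []
        for d in range(1, d_max + 1):
            counts = dict.fromkeys(pair_types, 0)
            for i in range(len(sequence) - d):
                pair = sequence[i] + sequence[i + d]
                if pair in counts:
                    counts[pair] += 1
            pairs.extend(counts[t] for t in pair_types)
        return single + pairs          # dimension: 20 + 400 * d_max

    vec = dr_features("ACDKACDR", d_max=2)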

PCG

General parallel correlation pseudo amino acid composition, abbreviated as PCG [29], extends the formulation of the amino acid composition and is defined as
(8)
where [29]:
(9)
where $f_i$ is the occurrence frequency of the $i$-th amino acid, $\lambda$ ($\lambda\in[1,\cdots,L]$) is a parameter associated with the length of the sequence-order expansion, $\theta_j$ is the $j$-tier sequence-correlation factor and $\omega$ is the weight of the sequence-order effect. The dimension of PCG is $(20+\lambda)$-D.

PSDT

PSDT [30] constructs feature vectors from PSSM information. It consists of two parts: the distance transformation of pairs of the same amino acid (PSSM-SDT) and of pairs of different amino acids (PSSM-DDT), which can be generated as follows [30]:
(10)
(11)
where $P_{i,j}$ is the PSSM score of amino acid $j$ at position $i$, and $d$ ($d\in[1,\cdots,d_{MAX}]$, $d_{MAX}>0$) is the distance between two amino acids. The feature dimension of PSDT is $(400\times d_{MAX})$-D.

PSBT

The PSBT feature vectors are extracted from peptide sequences by combining PSFMs with the distance-bigram transformation [31]. PSBT accounts for the occurrence frequency of amino acid pairs separated by a certain distance and for the evolutionary information in the PSFMs. The feature vector can be obtained as follows [31]:
(12)
where $P_{i,x}$ and $P_{i+d,y}$ are elements of the PSFM, and $f(R_x,R_y|d)$ is the occurrence of amino acids $R_x$ and $R_y$ separated by the distance $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$) during evolution. The feature dimension is $(400+400\times d_{MAX})$-D.

DP

DP combines the pseudo amino acid composition of distance pairs with a reduced alphabet scheme that captures secondary structure and physicochemical property information, and is therefore able to capture the sequence information of peptides with a lower dimension. The pseudo amino acid composition of distance pairs can be formulated as [28]:
(13)
where $R_i$ represents the $i$-th residue of the peptide S and $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$) is the distance between two amino acids. For example, suppose $R_i$ is A, $R_j$ is K and $d=3$; then $f(R_i,R_j|d)$ is the occurrence frequency of the 'A-K' pair whose two members are separated by two residues along the chain.
Larger values of the parameter $d_{MAX}$ lead to a high feature dimension, so DP uses a reduced alphabet to decrease the dimension of the feature vectors. In this study, we use cp(14) [28], an amino acid clustering scheme that replaces the original 20 amino acids. The detailed definition of cp(14) is given below [28]:
(14)
The number of clusters in cp(14) is 14; therefore, the final dimension of DP is $(14+196\times d_{MAX})$-D.

Support vector machine and random forest

SVM [38] and RF [39] are two powerful algorithms widely used in therapeutic peptide recognition and bioinformatics [37]. The SVM constructs a hyperplane as the decision boundary that optimally separates the classes [38]. The scikit-learn package [40] with the radial basis function (RBF) kernel is used to implement the SVM, and the regularization parameter C and the kernel parameter $\gamma$ are optimized by cross-validation.

The RF algorithm is one of the most widely used algorithms in computational biology [12, 13]. It determines the final classification result by voting over multiple decision trees [39]. The detailed formulation of RF has been described by Breiman [41] and is not repeated here. The parameter settings of the two classifiers are given in Tables S2-S3.
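A minimal scikit-learn sketch of how the two basic classifiers could be built and tuned by cross-validation; the parameter grids shown are placeholders, and the actual settings are those listed in Tables S2-S3:

    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # X: (n_samples, n_features) matrix from one feature extraction method; y: 0/1 labels
    def build_svm(X, y):
        # RBF-kernel SVM with C and gamma tuned by cross-validation;
        # probability=True so the predictor can output class probabilities
        grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}  # placeholder grid
        search = GridSearchCV(SVC(kernel="rbf", probability=True), grid,
                              scoring="roc_auc", cv=10)
        return search.fit(X, y).best_estimator_

    def build_rf(X, y):
        grid = {"n_estimators": [100, 300, 500]}                          # placeholder grid
        search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                              scoring="roc_auc", cv=10)
        return search.fit(X, y).best_estimator_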

Genetic algorithm

The genetic algorithm is an effective optimization method in the field of bioinformatics [42]. It employs crossover and mutation operations to optimize a population of individuals [43]. In this study, the fitness of each individual is calculated as the area under the curve (AUC) score, which is used to select the better individuals. The characteristics of the better individuals are passed on to the next generation through crossover and mutation, while the less fit individuals are eliminated [43]. After hundreds of generations, near-optimal weights are obtained.

In our study, we set the population size of the genetic algorithm to 1000, the number of generations to 500, the number of cross-validation folds to 10 and the number of seeds for dataset partitioning to 1.
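A minimal sketch of a genetic algorithm that evolves the ensemble weights with AUC as the fitness; the selection, one-point crossover and mutation schemes shown here are simplified illustrations of the idea rather than the exact implementation:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evolve_weights(prob_matrix, y, pop_size=1000, generations=500, mut_rate=0.1, rng=None):
        """prob_matrix: (18, n_samples) probabilities from the 18 basic predictors."""
        rng = rng or np.random.default_rng(0)
        n = prob_matrix.shape[0]
        pop = rng.random((pop_size, n))
        pop /= pop.sum(axis=1, keepdims=True)            # weights sum to 1

        def fitness(w):
            return roc_auc_score(y, w @ prob_matrix)     # AUC of the weighted ensemble

        for _ in range(generations):
            scores = np.array([fitness(w) for w in pop])
            parents = pop[np.argsort(scores)[-pop_size // 2:]]       # keep the fitter half
            cut = rng.integers(1, n, size=pop_size // 2)
            children = np.where(np.arange(n) < cut[:, None],
                                parents, parents[::-1])              # one-point crossover
            mutate = rng.random(children.shape) < mut_rate
            children = np.where(mutate, rng.random(children.shape), children)
            pop = np.vstack([parents, children])
            pop /= pop.sum(axis=1, keepdims=True)
        return pop[np.argmax([fitness(w) for w in pop])]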

Ensemble learning

As shown in a series of previous peptide studies, such as PEPred-Suite [13], PPTPP [12] and ACPred-Fuse [16], ensemble predictors that fuse a variety of classifiers with different weights can provide better performance. In this study, nine feature extraction methods were fed into two basic classifiers, SVM and RF, to produce 18 diverse predictors (such as RF-Kmer, SVM-Kmer, RF-DT, etc.). To improve therapeutic peptide prediction, the proposed method combines the 18 basic predictors with adaptive weights learned by a genetic algorithm for each peptide dataset.

By combining the 18 individual predictors, we obtain the ensemble classifier as follows:
(15)
where $\mathbb{FL}^E$ represents the ensemble fused predictor and the symbol $\forall$ denotes ensemble learning using the genetic algorithm. $Pre(i)$ indicates the predictor trained on one feature extraction method by SVM or RF, which can be formulated as
(16)
where $\triangleright$ means that the $i$-th predictor trained in the above steps is applied to the peptide sequence S, and $P_i$ is the probability that the sequence S belongs to a certain type of peptide. The query peptide sequence is predicted as follows:
$Y=\sum_{i=1}^{18}{w}_i{P}_i$ (17)
where $w_i$ is the weight of the $i$-th predictor obtained from the genetic algorithm, $\sum_{i=1}^{18}{w}_i=1$, and Y represents the final probability.
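A minimal sketch of the weighted combination in Eq. (17): each trained predictor outputs a probability for the query sequence and the learned weights produce the final ensemble score (the variable names here are placeholders):

    import numpy as np

    def ensemble_predict(feature_vectors, predictors, weights, threshold=0.5):
        """feature_vectors[i] is the input for predictors[i]; weights sum to 1."""
        probs = np.array([clf.predict_proba(x.reshape(1, -1))[0, 1]
                          for clf, x in zip(predictors, feature_vectors)])
        score = float(np.dot(weights, probs))    # Eq. (17): Y = sum_i w_i * P_i
        return score, score >= threshold         # probability and binary decision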

Performance evaluation

Because some of the training datasets are imbalanced, we choose the AUC as the main index to evaluate prediction power, since it does not depend on a fixed threshold [13]. AUC denotes the area under the receiver operating characteristic (ROC) curve and can be obtained as follows:
(18)
(19)
where $n_p$ is the number of positive sequences, $n_f$ is the number of negative sequences, and $r_i$ indicates the rank of the $i$-th positive sequence in the prediction score list.
In addition, six other metrics, including accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (SE), specificity (SP), precision (P) and F1-score (F1), were also calculated, although they are not the leading indicators:
$ACC=\frac{TP+TN}{TP+TN+FP+FN}$ (20)
$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ (21)
$SE=\frac{TP}{TP+FN}$ (22)
$SP=\frac{TN}{TN+FP}$ (23)
$P=\frac{TP}{TP+FP}$ (24)
$F1=\frac{2\times P\times R}{P+R}$ (25)
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, FN the number of false negatives, and R is the recall [44]. Together, AUC, ACC, MCC, SE, SP, P and F1 are used to evaluate the performance of the predictive model.
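A minimal sketch of computing these metrics with scikit-learn; applying a fixed 0.5 threshold to obtain the binary predictions is our assumption:

    import numpy as np
    from sklearn.metrics import (roc_auc_score, accuracy_score, matthews_corrcoef,
                                 recall_score, precision_score, f1_score, confusion_matrix)

    def evaluate(y_true, y_score, threshold=0.5):
        """Compute AUC plus the threshold-based metrics of Eqs (20)-(25)."""
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        y_pred = (y_score >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return {
            "AUC": roc_auc_score(y_true, y_score),
            "ACC": accuracy_score(y_true, y_pred),
            "MCC": matthews_corrcoef(y_true, y_pred),
            "SE":  recall_score(y_true, y_pred),   # sensitivity = TP / (TP + FN)
            "SP":  tn / (tn + fp),                 # specificity = TN / (TN + FP)
            "P":   precision_score(y_true, y_pred),
            "F1":  f1_score(y_true, y_pred),
        }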

Results and discussion

Cross-validation

In our experiments, we adopted 10-fold cross-validation to evaluate performance on the eight benchmark datasets. In each fold, the genetic algorithm is used to generate the weights of the 18 individual predictors. The weights optimized in the 10 folds are then averaged to obtain the final weights for each dataset.
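A minimal sketch of this averaging step, where the genetic algorithm (the hypothetical evolve_weights routine sketched earlier) is run on each training fold and the resulting weight vectors are averaged:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def cross_validated_weights(prob_matrix, y, n_splits=10, seed=1):
        """prob_matrix: (18, n_samples) probabilities of the basic predictors."""
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        fold_weights = []
        for train_idx, _ in skf.split(prob_matrix.T, y):
            w = evolve_weights(prob_matrix[:, train_idx], y[train_idx])  # GA per fold
            fold_weights.append(w)
        return np.mean(fold_weights, axis=0)                             # final weights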

Performance of PreTP-EL

Importance of basic predictors

In this section, we evaluated the importance of the different predictors in distinguishing therapeutic peptides. We compared the proposed method with 18 ensemble learning methods, each of which integrates 17 of the individual predictors with adaptive weights based on the genetic algorithm. The results on the eight benchmark datasets are shown in Figure 2. The proposed method outperforms the other competing methods in terms of AUC, indicating that all 18 predictors are necessary and that removing any one of them reduces the performance.

Figure 2

The performance of the PreTP-EL method and 18 individual predictors in terms of AUC on eight benchmark datasets. GA18 represents the proposed method PreTP-EL based on the 18 individual predictors. GA17-RF-Kmer represents the ensemble learning based on 17 predictors that the proposed method PreTP-EL eliminates the RF-Kmer predictor. The remaining 17 predictors employ the similar construction strategy.

Genetic algorithm improves the performance of the prediction model

To verify the effectiveness of the adaptive weights based on the genetic algorithm, we examined the proposed method with different weighting strategies, including average weights and adaptive weights learned by the genetic algorithm.

To show the performance of the average weights, we compared the ensemble of all 18 individual predictors with average weights (denoted Ave18) against the ensembles of any 17 predictors (denoted Ave17). The results are illustrated in Figure 3, from which we can see the following: (i) a red block means that the corresponding Ave17 performs better than Ave18, i.e. the removed classifier has a negative effect on prediction, and vice versa; the darker the block, the greater the influence of that predictor on predicting the corresponding functional peptide. (ii) The impact of individual predictors is unevenly distributed across the eight benchmark datasets; the same classifier can have a positive effect on some datasets and a negative effect on others.

Figure 3

Differences between Ave17 and Ave18. Each block represents the value of corresponding Ave17 method minus the value of Ave18 method. The x-axis represents 18 kinds of Ave17. Ave17-RF-Kmer means the prediction result of ensemble learning method using 17 classifiers except RF-Kmer. The remaining 17 methods employ the similar construction strategy. The y-axis represents eight peptides datasets.

Furthermore, we employ adaptive weights for the 18 individual predictors to improve prediction performance. The performance of the ensemble methods with average weights and adaptive weights on the training and test datasets is shown in Figures 4 and 5, respectively. The results show that the proposed method with adaptive weights (PreTP-EL) outperforms the method with average weights (PreTP-Ave) in terms of AUC, indicating that the genetic algorithm-based ensemble is an effective method for therapeutic peptide prediction.

Figure 4

Comparison between the PreTP-Ave and PreTP-EL on the eight training datasets in terms of AUC.

Figure 5

Comparison between the PreTP-Ave and PreTP-EL in terms of AUC on the eight independent test datasets.

To investigate the importance of the different basic predictors to the overall prediction accuracy of PreTP-EL, the distribution of the weights is illustrated in Figure 6. The weight distribution varies across the eight peptide benchmark datasets; the darker the color, the more important and discriminative the basic predictor is. Because the functions of the eight peptide datasets are different, the prediction performance of the 18 basic predictors is uneven. Therefore, it is necessary to use adaptive weights based on the genetic algorithm to improve the effectiveness and robustness of therapeutic peptide prediction.

Figure 6

Weights of individual predictors in the ensemble learning. The x-axis represents 18 basic predictors, and the y-axis represents eight peptides datasets.

Analysis of basic predictors

In this section, we evaluated the performance of the proposed method PreTP-EL and the 18 basic individual predictors. The results are illustrated in Figure 7, from which we can observe the following: (i) the basic predictors using sequence features achieve comparable performance in predicting therapeutic peptides; (ii) PreTP-EL outperforms all the basic predictors on most benchmark datasets by integrating multiple features and various basic predictors through the genetic algorithm.

Figure 7

ROC curves of 18 basic predictors and PreTP-EL on eight benchmark datasets.

Figure 8

PCA on independent test dataset of ACP using Bit20(NTCT = 2) and DT.

Comparison of PreTP-EL with existing predictors

To examine the predictive power of PreTP-EL, we further compared it with other existing predictors. The performance of the different predictors, including PPTPP [12], PEPred-Suite [13], AntiAngioPred [17], AntiBP [18], ACPred-FL [15], AIPpred [19], AVPpred [20], CPPred-RF [21], PSBinder [22] and QSPpred [23], is listed in Table 2 and Tables S4-S8. These results show that PreTP-EL achieved better or comparable performance compared with all the other competing methods in terms of AUC, ACC, MCC and P, indicating that combining multiple supervised frameworks contributes to the better performance of PreTP-EL for predicting therapeutic peptides.

Table 2

The performance of the existing predictors and PreTP-EL on the independent test datasets in terms of the AUC

Datasets    PPTPP [12]    PEPred-Suite [13]    Methods^a                 PreTP-EL
AAP         0.770         0.804                0.742 (AntiAngioPred)     0.819
ABP         0.988         0.976                0.976 (AntiBP)            0.992
ACP         0.883         0.949                0.939 (ACPred-FL)         0.950^b
AIP         0.720         0.751                0.795 (AIPpred)           0.810
AVP         0.946         0.949                0.931 (AVPpred)           0.951
CPP         0.965         0.952                0.941 (CPPred-RF)         0.978
PBP         0.740         0.804                0.742 (PSBinder)          0.809
QSP         0.944         0.960                0.903 (QSPpred)           0.965
Ave.^c      0.869         0.893                0.871                     0.909

^a These results are reported in [13]; the corresponding method is given in parentheses after each value.

^b PreTP-EL utilizes one more feature extraction method, Bit20(NTCT = 2) [13], on the ACP dataset.

^c Ave. represents the average value of each predictor over the eight datasets.

As shown in Table 2, PreTP-EL achieved better or comparable performance compared with the other existing methods on all datasets except ACP. To test whether PreTP-EL is inherently unsuited to ACP prediction or is simply limited by the features used on the ACP dataset, we conducted a controlled experiment with an additional feature extraction method. Because the feature Bit20(NTCT = 2) obtained the highest feature representation score in PEPred-Suite [13] on the ACP dataset, we integrated Bit20(NTCT = 2) into PreTP-EL. The results in Table 2 show that PreTP-EL then achieved a slightly higher AUC than PEPred-Suite, indicating that the proposed ensemble framework can improve the performance of therapeutic peptide prediction.

Comparison of PreTP-EL with other existing predictors on the AP dataset

In this section, we further evaluated the performance of PreTP-EL on the AP dataset [45] by using the PreTP-EL model trained on the benchmark dataset, and compared it with other related methods, including AntiCP2.0 [45], AntiCP [46], ACPred [47], ACPred-FL [15], ACPred-Fuse [16], PEPred-Suite [13] and iACP [48]. The results are shown in Table 3, from which we can see that PreTP-EL outperforms all the other predictors in terms of ACC and MCC, further demonstrating the better performance of the proposed method.

Table 3

The performance of PreTP-EL and other existing predictors on the AP dataset^a

Methods              ACC      MCC       SE       SP
PreTP-EL             0.866    0.733     0.843    0.890
AntiCP2.0 [45]       0.754    0.510     0.775    0.734
AntiCP [46]          0.506    0.070     1.000    0.012
ACPred [47]          0.535    0.090     0.856    0.214
ACPred-FL [15]       0.448    −0.120    0.671    0.225
ACPred-Fuse [16]     0.689    0.380     0.692    0.686
PEPred-Suite [13]    0.535    0.080     0.331    0.738
iACP [48]            0.551    0.110     0.779    0.322

^a All the results in this table, except for those of PreTP-EL, were obtained from [45].

Analysis of dataset ACP

In this section, we further examined the ability of Bit20(NTCT = 2) and the sequence-based features to distinguish ACPs from non-ACPs. We used PCA to compare Bit20(NTCT = 2) with the DT feature, which is selected because it received the highest weight from the genetic algorithm. The results are illustrated in Figure 8, which indicates that Bit20(NTCT = 2) is more discriminative and separates ACPs from non-ACPs more clearly. The results in Table 2 and Figure 8 show that the proposed method achieves competitive performance in ACP prediction by utilizing multiple discriminative features and integrating the basic predictors through the genetic algorithm.

Conclusion

In this study, a novel ensemble prediction model called PreTP-EL was proposed to identify multiple types of functional therapeutic peptides. The model combines nine feature extraction methods with two machine learning classifiers and integrates the resulting basic predictors through a genetic algorithm. Furthermore, a user-friendly webserver of PreTP-EL was developed to predict query peptides. The experimental results show that the proposed method outperforms other state-of-the-art methods and achieves stable performance. Therefore, PreTP-EL is a useful tool for therapeutic peptide prediction. Because the prediction of novel peptides with post-translational modifications is very important, in future studies we will develop computational predictors for post-translational modification prediction and further analyze these peptides.

Key Points
  • In this study, we proposed a therapeutic peptide prediction model, PreTP-EL, by integrating two machine learning classifiers with nine feature extraction methods. For different kinds of peptides, we adaptively learn weights for the 18 resulting classifiers with a genetic algorithm.

  • Feature analysis results show that the nine features used in this study are complementary. Predictors based on these features are able to improve the performance of therapeutic peptide prediction.

  • Furthermore, we established a webserver for PreTP-EL to predict unknown peptide sequences using the genetic algorithm-based ensemble; it is available at http://bliulab.net/PreTP-EL. We expect that this webserver will help users identify therapeutic peptides.

Acknowledgements

We are very much indebted to the three anonymous reviewers, whose constructive comments are very helpful for strengthening the presentation of this paper.

Data Availability

The datasets used and analyzed during the current study are available from http://bliulab.net/PreTP-EL.

Funding

The National Natural Science Foundation of China (No. 62102030), the National Key R&D Program of China (No. 2018AAA0100100), and the Beijing Natural Science Foundation (No. JQ19019).

Yichen Guo is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. Her expertise is in bioinformatics, natural language processing and machine learning.

Ke Yan is a postdoctoral researcher at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics and machine learning.

Hongwu Lv is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

Bin Liu, PhD, is a professor at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

References

1. Fosgerau K, Hoffmann T. Peptide therapeutics: current status and future directions. Drug Discov Today 2015;20:122-8.
2. Vázquez-Prieto S, Paniagua E, Ubeira FM, et al. QSPR-perturbation models for the prediction of B-epitopes from immune epitope database: a potentially valuable route for predicting "in silico" new optimal peptide sequences and/or boundary conditions for vaccine development. Int J Pept Res Ther 2016;22:445-50.
3. Borghouts C, Kunz C, Groner B. Current strategies for the development of peptide-based anti-cancer therapeutics. J Pept Sci 2010;11:713-26.
4. Gupta S, Sharma A, Shastri V, et al. Prediction of anti-inflammatory proteins/peptides: an in silico approach. J Transl Med 2017;15:7.
5. Wei L, Xing P, Shi G, et al. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1264-73.
6. Vázquez-Prieto S, Paniagua E, Solana H, et al. A study of the immune epitope database for some fungi species using network topological indices. Mol Divers 2017;21:713-18.
7. Vazquez-Prieto S, Paniagua E, Solana H, et al. Complex network study of the immune epitope database for parasitic organisms. Curr Top Med Chem 2018;17:3249-55.
8. Xiaoli Q, Chen, et al. CPPred-FL: a sequence-based predictor for large-scale identification of cell-penetrating peptides by feature representation learning. Brief Bioinform 2018.
9. Chen X, Qiu, et al. Incorporating key position and amino acid residue features to identify general and species-specific ubiquitin conjugation sites. Bioinformatics 2013;29:1614-22.
10. Wei L, Xing P, Shi G, et al. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1264-73.
11. Shen HB, Chou KC. PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem 2008;373:386-8.
12. Zhang YP, Zou Q. PPTPP: a novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning. Bioinformatics 2020;36:3982-7.
13. Wei L, Zhou C, Su R, et al. PEPred-Suite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics 2019;35:4272-80.
14. Liang X, Li F, Chen J, et al. Large-scale comparative review and assessment of computational methods for anti-cancer peptide identification. Brief Bioinform 2021;22:bbaa312.
15. Wei L, Zhou C, Chen H, et al. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018;34:4007-16.
16. Rao B, Zhou C, Zhang G, et al. ACPred-Fuse: fusing multi-view information improves the prediction of anticancer peptides. Brief Bioinform 2020;21:1846-55.
17. Ettayapuram Ramaprasad AS, Singh S, Gajendra PSR, et al. AntiAngioPred: a server for prediction of anti-angiogenic peptides. PLoS One 2015;10:e0136990.
18. Lata S, Sharma BK, Raghava GP. Analysis and prediction of antibacterial peptides. BMC Bioinformatics 2007;8:263.
19. Manavalan B, Shin TH, Kim MO, et al. AIPpred: sequence-based prediction of anti-inflammatory peptides using random forest. Front Pharmacol 2018;9:276.
20. Thakur N, Qureshi A, Kumar M. AVPpred: collection and prediction of highly effective antiviral peptides. Nucleic Acids Res 2012;40:W199-204.
21. Wei L, Xing P, Su R, et al. CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency. J Proteome Res 2017;16:2044-53.
22. Li N, Kang J, Jiang L, et al. PSBinder: a web service for predicting polystyrene surface-binding peptides. Biomed Res Int 2017;2017:5761517.
23. Rajput A, Gupta AK, Kumar M. Prediction and analysis of quorum sensing peptides based on sequence features. PLoS One 2015;10:e0120066.
24. Gao X, Wang DH, Zhang J, et al. iRBP-Motif-PSSM: identification of RNA-binding proteins based on collaborative learning. IEEE Access 2019;7:168956-62.
25. Wang N, Zhang J, Liu B. IDRBP-PPCT: identifying nucleic acid-binding proteins based on position-specific score matrix and position-specific frequency matrix cross transformation. IEEE/ACM Trans Comput Biol Bioinform 2021.
26. Liu B, Wang X, Lin L, et al. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinformatics 2008;9:510.
27. Liu B, Xu JH, Zou Q, et al. Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics 2014;15:S3.
28. Liu B, Xu J, Lan X, et al. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS One 2014;9:e106691.
29. Chou K-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001;43:246-55.
30. Xu R, Zhou J, Wang H, et al. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC Syst Biol 2015;9(Suppl 1):S10.
31. Zhang J, Liu B. PSFM-DBT: identifying DNA-binding proteins by combing position specific frequency matrix and distance-bigram transformation. Int J Mol Sci 2017;18:1856.
32. Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res 2019;47:e127.
33. Rangwala H, Karypis G. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 2005;21:4239-47.
34. Liu B, Long R, Chou KC. iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics 2016;32:2411-8.
35. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389-402.
36. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998;14:423-9.
37. Hasan MM, Schaduangrat N, Basith S, et al. HLPpred-Fuse: improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation. Bioinformatics 2020;36:3350-6.
38. Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Processing Letters 1999;9:293-300.
39. Breiman L. Random forests, machine learning 45. J Clin Microbiol 2001;2:199-228.
40. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825-30.
41. Breiman L. Random forests. Machine Learning 2001;45:5-32.
42. Kosakovsky Pond SL, Posada D, Gravenor MB, et al. GARD: a genetic algorithm for recombination detection. Bioinformatics 2006;22:3096-8.
43. Maulik U, Bandyopadhyay S. Genetic algorithm-based clustering technique. Pattern Recognition 2000;33:1455-65.
44. Powers DMW. Evaluation: from precision, recall and F-factor to ROC, informedness, markedness and correlation. J Mach Learn Technol 2010:37-63.
45. Agrawal P, Bhagat D, Mahalwal M, et al. AntiCP 2.0: an updated model for predicting anticancer peptides. Brief Bioinform 2021;22:bbaa153.
46. Tyagi A, Kapoor P, Kumar R, et al. In silico models for designing and discovering novel anticancer peptides. Sci Rep 2013;3:2984.
47. Schaduangrat N, Nantasenamat C, Prachayasittikul V, et al. ACPred: a computational tool for the prediction and analysis of anticancer peptides. Molecules 2019;24:1973.
48. Chen W, Ding H, Feng P, et al. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2016;7:16895-909.

Author notes

Yichen Guo and Ke Yan contributed equally to this work.


Supplementary data