Abstract

Therapeutic peptides play important roles in disease treatment, and studying them helps reveal the relationship between peptide sequences and their therapeutic and diagnostic potential. Therapeutic peptides can be divided into different types according to their therapeutic functions, and the different types have different characteristics. Although several computational approaches have been proposed to predict different types of therapeutic peptides, none of them accurately predicts all types. In this study, a predictor called PreTP-EL is proposed, which employs an ensemble learning approach to fuse different features and machine learning techniques in order to capture the different characteristics of the various therapeutic peptides. Experimental results showed that PreTP-EL outperformed other competing methods. Availability and implementation: a user-friendly web server of the PreTP-EL predictor is available at http://bliulab.net/PreTP-EL.

Introduction

Peptides regulate growth, immune function and metabolism in the human body [1, 2] and play important roles in many biological activities. For example, anti-cancer peptides have potential for treating cancer [3], anti-angiogenic peptides can inhibit angiogenesis and benefit the human body [4], and cell-penetrating peptides can act as transporters for drug delivery into cells [5]. Therefore, it is important to identify therapeutic peptides with computational methods.

In recent years, several machine learning methods have been developed to identify therapeutic peptides. Traditional machine learning methods typically consist of two steps: peptide sequence feature extraction and classifier construction.

In the first step, considerable research effort has been devoted to characterizing discriminative features of the peptide sequences. A good feature extraction method yields a more discriminative feature vector and thus more effective prediction of peptide sequences [6, 7]. Amino acid composition (AAC) and dipeptide composition (DC) are two features commonly used to represent the composition of a peptide sequence [8]. Composition of k-spaced amino acid pairs (CKSAAP) [9] and adaptive skip dipeptide composition (ASDC) [10] consider short-range and long-range effects of residues, respectively. Pseudo amino acid composition (Pse-AAC) is based on both composition and sequence-order information [11]. Several feature extraction methods utilize physicochemical information to encode the peptide sequences, such as PPTPP [12]. Moreover, several peptide predictors extract specific features from comprehensive feature sets; for example, PEPred-Suite [13] selects the discriminative features from 99 features using SFS and mRMD.

In the second step, different machine learning methods are used to construct the predictors. For example, ACPredStackL [14] predicts anti-cancer peptides (ACPs) with four basic predictors, including k-nearest neighbours (KNN), support vector machine (SVM), LightGBM and naive Bayes (NB). PPTPP [12] and PEPred-Suite [13] use random forest (RF) as their machine learning method. Most existing methods construct models for a specific type of peptide; for example, ACPred-FL [15] and ACPred-Fuse [16] were established to predict ACPs. However, no predictor can accurately predict all therapeutic peptides, because different types of therapeutic peptides have different characteristics. To solve this problem, we developed an ensemble learning method to predict therapeutic peptides by integrating two machine learning classifiers with nine feature extraction methods.

Methods and materials

Dataset construction

In this study, we used eight therapeutic peptide datasets, covering anti-angiogenic peptides (AAP) [17], anti-bacterial peptides (ABP) [18], anti-cancer peptides (ACP) [15], anti-inflammatory peptides (AIP) [19], anti-viral peptides (AVP) [20], cell-penetrating peptides (CPP) [21], polystyrene surface binding peptides (PBP) [22] and quorum sensing peptides (QSP) [23]. Their detailed information is listed in Table 1.

Table 1

Information of the eight peptide training datasets and the independent test set

Datasets    Training set              Independent test set
            Positive    Negative      Positive    Negative
AAP         107         107           28          28
ABP         800         800           199         199
ACP         250         250           82          82
AIP         1258        1887          420         629
AVP         544         407           60          45
CPP         370         370           92          92
QSP         200         200           20          20
PBP         80          80            24          24

As shown in Table 1, each dataset contains a training set and an independent test set. The positive sequences are experimentally validated therapeutic peptides, such as AAPs, whereas the negative sequences are usually common, featureless peptides or randomly shuffled sequences [13].

Overview of PreTP-EL

In this paper, we propose a novel therapeutic peptide prediction method, PreTP-EL, based on ensemble learning. PreTP-EL combines individual predictors via a genetic algorithm to improve prediction performance. First, nine different features are used to characterize a peptide sequence from different perspectives. Second, the resulting feature vectors are fed into two prediction models, SVM and RF classifiers. Then, the 18 individual predictors are combined using optimal weights learned by the genetic algorithm. Finally, the prediction model decides whether the query sequence is a therapeutic peptide: the higher the score, the more likely the peptide is the corresponding type of therapeutic peptide. The framework of PreTP-EL is illustrated in Figure 1.

Figure 1

The framework of PreTP-EL. There are three main steps: (i) feature extraction. Each training or test peptide sequence is represented as nine feature vectors by nine feature extraction methods. (ii) Training phase. The nine feature vectors are fed into SVM and RF basic models for training, and then 18 individual basic predictors are constructed. The training sequences are embedded into the probability matrix based on 18 predictors, where the rows represent the probability scores obtained by the different predictors. Then, the probability matrix is fed into the genetic algorithm for training. The genetic algorithm is applied to learn the weights corresponding to the different predictors. (iii) Test phase. The therapeutic peptide sequence is predicted by Eq. (17).

Feature extraction methods

Feature extraction is an important step in therapeutic peptide recognition. In this study, we utilize nine features to capture different properties of the peptides. A peptide sequence S can be expressed as:
$S = R_1R_2R_3\cdots R_L$ (1)
where $R_i$ represents the $i$-th residue of peptide S and L is the length of S. We employed nine different feature extraction methods to encode the peptide sequence: Kmer [24], PSSM and PSFM cross transformation (PPCT) [25], Top-n-gram (Tng) [26], distance-based Top-n-gram (DT) [27], distance amino acid pair, or distance-pair (DP) [28], distance-based residue (DR) [27], PC-PseAAC-General (PCG) [29], PSSM distance transformation (PSDT) [30] and PSFM with distance-bigram transformation (PSBT) [31]. Among them, Kmer, Tng, DT, DP, DR and PCG are extracted by BioSeq-Analysis2.0 [32]. The parameters of the different methods on the eight datasets are listed in Table S1. These feature extraction methods are introduced in detail below.

Kmer

Kmer is a feature extraction method commonly used in protein fold recognition [33], remote homology detection [33] and other biological tasks [34]. Here, it represents the sequence information as the occurrences of k adjacent amino acids (k ∈ [1, ⋯, 5]). For example, when k = 2, the feature vector can be calculated as [24]:
(2)
where $N(r)$ is the number of occurrences of k-mer type $r$, and the set of possible types $r$ depends on the value of k. The feature dimension is $20^k$-D.
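As an illustration, the following is a minimal Python sketch of dipeptide (k = 2) counting under the definition above; the function name and the normalization by the number of k-mers (L − k + 1) are our own assumptions rather than details of the original implementation:

    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def kmer_features(sequence, k=2):
        """Return the normalized occurrence vector of all 20**k possible k-mers."""
        kmer_types = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
        counts = dict.fromkeys(kmer_types, 0)
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            if kmer in counts:          # skip non-standard residues
                counts[kmer] += 1
        total = max(len(sequence) - k + 1, 1)
        return [counts[t] / total for t in kmer_types]

    # Example: a 400-dimensional dipeptide feature vector
    vec = kmer_features("ACDKACDR", k=2)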

PPCT

Each sequence is searched against the non-redundant database NRDB90 [36] by PSI-BLAST [35] (three iterations, e-value = 0.001), and the resulting profile is represented as a PSSM and a PSFM. PPCT [25] integrates the evolutionary information of PSSMs and PSFMs with the sequence information of distance amino acid pairs. The correlation between two positions in S can be expressed as follows [25]:
(3)
where $P_{i,x}$ and $P_{i+d,y}$ are elements of $P_{PSSM}$, and $F_{i,y}$ and $F_{i+d,y}$ are elements of $F_{PSFM}$. $f(R_x,R_y|d)$ is the consistent occurrence probability of amino acids $R_x$ and $R_y$ separated by the distance $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$) during evolution. A peptide S can be represented as follows [25]:
(4)
The elements can be calculated by [25]:
(5)

The feature dimension is $(20\times(x-1)+y+400\times d_{MAX})$-D.

Tng

Tng [26] utilizes evolutionary information from the PSSM profile and represents the profile-based building blocks of proteins. A Top-n-gram consists of the n most frequent amino acids in the corresponding columns of the PSSM profile. The feature vector records the occurrence frequencies of the n-gram subsequences in the PSSM profile (n ∈ [1, ⋯, 5]). The feature dimension is $20^n$-D.
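A minimal sketch of the n = 1 case under our reading of this description: each position of the PSSM (assumed here to be an L × 20 matrix with one row per residue) is replaced by its highest-scoring amino acid, and the resulting Top-1-gram sequence is counted into a 20-dimensional frequency vector; the function names are hypothetical:

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def top1gram_sequence(pssm):
        """pssm: (L, 20) array, one row per residue position."""
        return "".join(AMINO_ACIDS[j] for j in np.argmax(pssm, axis=1))

    def top1gram_features(pssm):
        """20-D frequency vector of Top-1-grams derived from the PSSM."""
        seq = top1gram_sequence(pssm)
        return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

    # Example with a random profile for a peptide of length 15
    vec = top1gram_features(np.random.rand(15, 20))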

DT

DT [27] combines evolutionary information from Top-n-gram and distance information between pairs of amino acids. DT utilizes the occurrence times of all possible Top-n-gram pairs to calculate the feature vector at a given distance.

Let $(t_i,t_j)$ denote a pair of Top-n-grams $t_i$ and $t_j$, and let $S'$ be the Top-n-gram sequence of S. Assume that $t_i$ appears before $t_j$ and that the distance between them is $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$). Within the maximum distance $d_{MAX}$, the feature vector of peptide sequence S is obtained from the occurrences of Top-n-gram pairs as follows [27]:
(6)
where ${T^0}_i(S')$ is the number of occurrences of $t_i$ and ${T^d}_{ij}(S')$ is the number of occurrences of the pair $(t_i,t_j)$ at distance $d$. The feature vector is obtained by combining all Top-n-gram pairs as follows [27]:
(7)

The dimension of DT is $(20+20\times20\times d_{MAX})$-D.

DR

DR [27] encodes distance information between pairs of amino acids. It counts the occurrences of all possible residue pairs at each distance $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$). The dimension of DR is $(20+400\times d_{MAX})$-D.
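A minimal sketch of DR-style counting under our reading of the description, where d = 0 is taken to contribute the 20 single-residue counts and each distance d = 1, …, d_MAX contributes 400 ordered-pair counts (this interpretation of d = 0 is an assumption):

    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def dr_features(sequence, d_max=2):
        """Distance-based residue (DR) counts: 20 single-residue counts (d = 0)
        followed by 400 ordered-pair counts for each distance d = 1..d_max."""
        single = [sequence.count(a) for a in AMINO_ACIDS]
        pair_types = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
        pairs = []
        for d in range(1, d_max + 1):
            counts = dict.fromkeys(pair_types, 0)
            for i in range(len(sequence) - d):
                pair = sequence[i] + sequence[i + d]
                if pair in counts:
                    counts[pair] += 1
            pairs.extend(counts[t] for t in pair_types)
        return single + pairs          # dimension: 20 + 400 * d_max

    vec = dr_features("ACDKACDR", d_max=2)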

PCG

General parallel correlation pseudo amino acid composition, abbreviated as PCG [29], extends the formulation of the amino acid composition and is defined as
(8)
where [29]:
(9)
where $f_i$ is the occurrence frequency of the $i$-th amino acid, $\lambda$ ($\lambda\in[1,\cdots,L]$) is a parameter associated with the length of the sequence-order expansion, $\theta_j$ is the $j$-tier sequence-correlation factor and $\omega$ is the weight of the sequence-order effect. The dimension of PCG is $(20+\lambda)$-D.

PSDT

PSDT [30] constructs feature vectors from PSSM information. It consists of two parts: the distance transformation of pairs of the same amino acid (PSSM-SDT) and of pairs of different amino acids (PSSM-DDT), which can be generated as follows [30]:
(10)
(11)
where $P_{i,j}$ is the PSSM score of amino acid $j$ at position $i$, and $d$ ($d\in[1,\cdots,d_{MAX}]$, $d_{MAX}>0$) is the distance between two amino acids. The feature dimension of PSDT is $(400\times d_{MAX})$-D.

PSBT

The PSBT feature vectors are extracted from peptide sequences by combining PSFMs with the distance-bigram transformation [31]. PSBT accounts for the occurrence frequency of amino acid pairs separated by a certain distance and for the evolutionary information in the PSFMs. The feature vector can be obtained as follows [31]:
(12)
where $P_{i,x}$ and $P_{i+d,y}$ are elements of the PSFM, and $f(R_x,R_y|d)$ is the occurrence of amino acids $R_x$ and $R_y$ separated by the distance $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$) during evolution. The feature dimension is $(400+400\times d_{MAX})$-D.

DP

DP combines the pseudo amino acid composition of distance pairs with a reduced alphabet scheme that captures secondary structure and physicochemical property information, and is therefore able to capture the sequence information of peptides with a lower dimension. The pseudo amino acid composition of distance pairs can be formulated as [28]:
(13)
where $R_i$ represents the $i$-th residue of the peptide S and $d$ ($d\in[0,\cdots,d_{MAX}]$, $d_{MAX}>0$) is the distance between two amino acids. For example, suppose $R_i$ is A, $R_j$ is K and $d=3$; then $f(R_i,R_j|d)$ is the occurrence frequency of the 'A-K' pair whose two members are separated by two residues along the chain.
Larger values of the parameter $d_{MAX}$ lead to a high feature dimension, so DP uses a reduced alphabet to decrease the dimension of the feature vectors. In this study, we use cp(14) [28], an amino acid clustering scheme that replaces the original 20 amino acids. The detailed definition of cp(14) is given below [28]:
(14)
The number of clusters in cp(14) is 14; therefore, the final dimension of DP is $(14+196\times d_{MAX})$-D.

Support vector machine and random forest

SVM [38] and RF [39] are two powerful algorithms widely used in therapeutic peptide recognition and bioinformatics [37]. The SVM constructs a hyperplane as the decision boundary that optimally separates the classes [38]. The scikit-learn package [40] with the radial basis function (RBF) kernel is used to implement the SVM, and the regularization parameter C and the kernel parameter $\gamma$ are optimized by cross-validation.

The RF algorithm is one of the most widely used algorithms in computational biology [12, 13]. It determines the final classification result by voting over multiple decision trees [39]. The detailed formulation of RF has been described by Breiman [41] and is not repeated here. The parameter settings of the two classifiers are given in Tables S2-S3.
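A minimal scikit-learn sketch of how the two basic classifiers could be built and tuned by cross-validation; the parameter grids shown are placeholders, and the actual settings are those listed in Tables S2-S3:

    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # X: (n_samples, n_features) matrix from one feature extraction method; y: 0/1 labels
    def build_svm(X, y):
        # RBF-kernel SVM with C and gamma tuned by cross-validation;
        # probability=True so the predictor can output class probabilities
        grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}  # placeholder grid
        search = GridSearchCV(SVC(kernel="rbf", probability=True), grid,
                              scoring="roc_auc", cv=10)
        return search.fit(X, y).best_estimator_

    def build_rf(X, y):
        grid = {"n_estimators": [100, 300, 500]}                          # placeholder grid
        search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                              scoring="roc_auc", cv=10)
        return search.fit(X, y).best_estimator_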

Genetic algorithm

The genetic algorithm is an effective optimization method in the field of bioinformatics [42]. It employs crossover and mutation operations to optimize a population of individuals [43]. In this study, the fitness of each individual is calculated as the area under the curve (AUC) score, which is used to select the better individuals. The characteristics of the better individuals are passed on to the next generation through crossover and mutation, while the less fit individuals are eliminated [43]. After hundreds of generations, near-optimal weights are obtained.

In our study, we set the population size of the genetic algorithm to 1000, the number of generations to 500, the number of cross-validation folds to 10 and the number of seeds for dataset partitioning to 1.
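A minimal sketch of a genetic algorithm that evolves the ensemble weights with AUC as the fitness; the selection, one-point crossover and mutation schemes shown here are simplified illustrations of the idea rather than the exact implementation:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evolve_weights(prob_matrix, y, pop_size=1000, generations=500, mut_rate=0.1, rng=None):
        """prob_matrix: (18, n_samples) probabilities from the 18 basic predictors."""
        rng = rng or np.random.default_rng(0)
        n = prob_matrix.shape[0]
        pop = rng.random((pop_size, n))
        pop /= pop.sum(axis=1, keepdims=True)            # weights sum to 1

        def fitness(w):
            return roc_auc_score(y, w @ prob_matrix)     # AUC of the weighted ensemble

        for _ in range(generations):
            scores = np.array([fitness(w) for w in pop])
            parents = pop[np.argsort(scores)[-pop_size // 2:]]       # keep the fitter half
            cut = rng.integers(1, n, size=pop_size // 2)
            children = np.where(np.arange(n) < cut[:, None],
                                parents, parents[::-1])              # one-point crossover
            mutate = rng.random(children.shape) < mut_rate
            children = np.where(mutate, rng.random(children.shape), children)
            pop = np.vstack([parents, children])
            pop /= pop.sum(axis=1, keepdims=True)
        return pop[np.argmax([fitness(w) for w in pop])]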

Ensemble learning

As shown in a series of previous peptide studies, such as PEPred-Suite [13], PPTPP [12] and ACPred-Fuse [16], ensemble predictors that fuse a variety of classifiers with different weights can provide better performance. In this study, nine feature extraction methods were fed into two basic classifiers, SVM and RF, to produce 18 diverse predictors (such as RF-Kmer, SVM-Kmer, RF-DT, etc.). To improve therapeutic peptide prediction, the proposed method combines the 18 basic predictors with adaptive weights learned by a genetic algorithm for each peptide dataset.

By combining the 18 individual predictors, we obtain the ensemble classifier as follows:
(15)
where $\mathbb{FL}^E$ represents the ensemble fused predictor and the symbol $\forall$ denotes ensemble learning using the genetic algorithm. $Pre(i)$ indicates the predictor trained on one feature extraction method by SVM or RF, which can be formulated as
(16)
where $\triangleright$ means that the $i$-th predictor trained in the above steps is applied to the peptide sequence S, and $P_i$ is the probability that the sequence S belongs to a certain type of peptide. The query peptide sequence is predicted as follows:
$Y=\sum_{i=1}^{18}{w}_i{P}_i$ (17)
where $w_i$ is the weight of the $i$-th predictor obtained from the genetic algorithm, $\sum_{i=1}^{18}{w}_i=1$, and Y represents the final probability.
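A minimal sketch of the weighted combination in Eq. (17): each trained predictor outputs a probability for the query sequence and the learned weights produce the final ensemble score (the variable names here are placeholders):

    import numpy as np

    def ensemble_predict(feature_vectors, predictors, weights, threshold=0.5):
        """feature_vectors[i] is the input for predictors[i]; weights sum to 1."""
        probs = np.array([clf.predict_proba(x.reshape(1, -1))[0, 1]
                          for clf, x in zip(predictors, feature_vectors)])
        score = float(np.dot(weights, probs))    # Eq. (17): Y = sum_i w_i * P_i
        return score, score >= threshold         # probability and binary decision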

Performance evaluation

Because some of the training datasets are imbalanced, we choose the AUC as the main index to evaluate prediction power, since it does not depend on a fixed threshold [13]. AUC denotes the area under the receiver operating characteristic (ROC) curve and can be obtained as follows:
(18)
(19)
where $n_p$ is the number of positive sequences, $n_f$ is the number of negative sequences, and $r_i$ indicates the rank of the $i$-th positive sequence in the prediction score list.
In addition, six other metrics, including accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (SE), specificity (SP), precision (P) and F1-score (F1), were also calculated, although they are not the leading indicators:
$ACC=\frac{TP+TN}{TP+TN+FP+FN}$ (20)
$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ (21)
$SE=\frac{TP}{TP+FN}$ (22)
$SP=\frac{TN}{TN+FP}$ (23)
$P=\frac{TP}{TP+FP}$ (24)
$F1=\frac{2\times P\times R}{P+R}$ (25)
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, FN the number of false negatives, and R is the recall [44]. Together, AUC, ACC, MCC, SE, SP, P and F1 are used to evaluate the performance of the predictive model.
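A minimal sketch of computing these metrics with scikit-learn; applying a fixed 0.5 threshold to obtain the binary predictions is our assumption:

    import numpy as np
    from sklearn.metrics import (roc_auc_score, accuracy_score, matthews_corrcoef,
                                 recall_score, precision_score, f1_score, confusion_matrix)

    def evaluate(y_true, y_score, threshold=0.5):
        """Compute AUC plus the threshold-based metrics of Eqs (20)-(25)."""
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        y_pred = (y_score >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return {
            "AUC": roc_auc_score(y_true, y_score),
            "ACC": accuracy_score(y_true, y_pred),
            "MCC": matthews_corrcoef(y_true, y_pred),
            "SE":  recall_score(y_true, y_pred),   # sensitivity = TP / (TP + FN)
            "SP":  tn / (tn + fp),                 # specificity = TN / (TN + FP)
            "P":   precision_score(y_true, y_pred),
            "F1":  f1_score(y_true, y_pred),
        }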

Results and discussion

Cross-validation

In our experiments, we adopted 10-fold cross-validation to evaluate performance on the eight benchmark datasets. In each fold, the genetic algorithm is used to generate the weights of the 18 individual predictors. The weights optimized in the 10 folds are then averaged to obtain the final weights for each dataset.
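A minimal sketch of this averaging step, where the genetic algorithm (the hypothetical evolve_weights routine sketched earlier) is run on each training fold and the resulting weight vectors are averaged:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def cross_validated_weights(prob_matrix, y, n_splits=10, seed=1):
        """prob_matrix: (18, n_samples) probabilities of the basic predictors."""
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        fold_weights = []
        for train_idx, _ in skf.split(prob_matrix.T, y):
            w = evolve_weights(prob_matrix[:, train_idx], y[train_idx])  # GA per fold
            fold_weights.append(w)
        return np.mean(fold_weights, axis=0)                             # final weights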

Performance of PreTP-EL

Importance of basic predictors

In this section, we evaluated the importance of the different predictors in distinguishing therapeutic peptides. We compared the proposed method with 18 ensemble learning methods, each of which integrates 17 of the individual predictors with adaptive weights based on the genetic algorithm. The results on the eight benchmark datasets are shown in Figure 2. The proposed method outperforms the other competing methods in terms of AUC, indicating that all 18 predictors are necessary and that removing any one of them reduces the performance.

Figure 2

The performance of the PreTP-EL method and 18 individual predictors in terms of AUC on eight benchmark datasets. GA18 represents the proposed method PreTP-EL based on the 18 individual predictors. GA17-RF-Kmer represents the ensemble learning based on 17 predictors that the proposed method PreTP-EL eliminates the RF-Kmer predictor. The remaining 17 predictors employ the similar construction strategy.

Genetic algorithm improves the performance of the prediction model

To verify the effectiveness of the adaptive weights based on the genetic algorithm, we examined the proposed method with different weighting strategies, including average weights and adaptive weights learned by the genetic algorithm.

To show the performance of the average weights, we compared the ensemble of all 18 individual predictors with average weights (denoted Ave18) against the ensembles of any 17 predictors (denoted Ave17). The results are illustrated in Figure 3, from which we can see the following: (i) a red block means that the corresponding Ave17 performs better than Ave18, i.e. the removed classifier has a negative effect on prediction, and vice versa; the darker the block, the greater the influence of that predictor on predicting the corresponding functional peptide. (ii) The impact of individual predictors is unevenly distributed across the eight benchmark datasets; the same classifier can have a positive effect on some datasets and a negative effect on others.

Figure 3

Differences between Ave17 and Ave18. Each block represents the value of corresponding Ave17 method minus the value of Ave18 method. The x-axis represents 18 kinds of Ave17. Ave17-RF-Kmer means the prediction result of ensemble learning method using 17 classifiers except RF-Kmer. The remaining 17 methods employ the similar construction strategy. The y-axis represents eight peptides datasets.

Furthermore, we employ adaptive weights for the 18 individual predictors to improve prediction performance. The performance of the ensemble methods with average weights and adaptive weights on the training and test datasets is shown in Figures 4 and 5, respectively. The results show that the proposed method with adaptive weights (PreTP-EL) outperforms the method with average weights (PreTP-Ave) in terms of AUC, indicating that the genetic algorithm-based ensemble is an effective method for therapeutic peptide prediction.

Figure 4

Comparison between the PreTP-Ave and PreTP-EL on the eight training datasets in terms of AUC.

Figure 5

Comparison between the PreTP-Ave and PreTP-EL in terms of AUC on the eight independent test datasets.

To investigate the importance of the different basic predictors to the overall prediction accuracy of PreTP-EL, the distribution of the weights is illustrated in Figure 6. The weight distribution varies across the eight peptide benchmark datasets; the darker the color, the more important and discriminative the basic predictor is. Because the functions of the eight peptide datasets are different, the prediction performance of the 18 basic predictors is uneven. Therefore, it is necessary to use adaptive weights based on the genetic algorithm to improve the effectiveness and robustness of therapeutic peptide prediction.

Figure 6

Weights of individual predictors in the ensemble learning. The x-axis represents 18 basic predictors, and the y-axis represents eight peptides datasets.

Analysis of basic predictors

In this section, we evaluated the performance of the proposed method PreTP-EL and the 18 basic individual predictors. The results are illustrated in Figure 7, from which we can observe the following: (i) the basic predictors using sequence features achieve comparable performance in predicting therapeutic peptides; (ii) PreTP-EL outperforms all the basic predictors on most benchmark datasets by integrating multiple features and various basic predictors through the genetic algorithm.

Figure 7

ROC curves of 18 basic predictors and PreTP-EL on eight benchmark datasets.

Figure 8

PCA on independent test dataset of ACP using Bit20(NTCT = 2) and DT.

Comparison of PreTP-EL with existing predictors

To examine the predictive power of PreTP-EL, we further compared it with other existing predictors. The performance of the different predictors, including PPTPP [12], PEPred-Suite [13], AntiAngioPred [17], AntiBP [18], ACPred-FL [15], AIPpred [19], AVPpred [20], CPPred-RF [21], PSBinder [22] and QSPpred [23], is listed in Table 2 and Tables S4-S8. These results show that PreTP-EL achieved better or comparable performance compared with all the other competing methods in terms of AUC, ACC, MCC and P, indicating that combining multiple supervised frameworks contributes to the better performance of PreTP-EL for predicting therapeutic peptides.

Table 2

The performance of the existing predictors and PreTP-EL on the independent test datasets in terms of the AUC

Datasets    PPTPP [12]    PEPred-Suite [13]    Methods^a                 PreTP-EL
AAP         0.770         0.804                0.742 (AntiAngioPred)     0.819
ABP         0.988         0.976                0.976 (AntiBP)            0.992
ACP         0.883         0.949                0.939 (ACPred-FL)         0.950^b
AIP         0.720         0.751                0.795 (AIPpred)           0.810
AVP         0.946         0.949                0.931 (AVPpred)           0.951
CPP         0.965         0.952                0.941 (CPPred-RF)         0.978
PBP         0.740         0.804                0.742 (PSBinder)          0.809
QSP         0.944         0.960                0.903 (QSPpred)           0.965
Ave.^c      0.869         0.893                0.871                     0.909

^a These results are reported in [13]; the corresponding method is given in parentheses after each value.

^b PreTP-EL utilizes one more feature extraction method, Bit20(NTCT = 2) [13], on the ACP dataset.

^c Ave. represents the average value of each predictor over the eight datasets.

As shown in Table 2, PreTP-EL achieved better or comparable performance compared with the other existing methods on all datasets except ACP. To test whether PreTP-EL is inherently unsuited to ACP prediction or is simply limited by the features used on the ACP dataset, we conducted a controlled experiment with an additional feature extraction method. Because the feature Bit20(NTCT = 2) obtained the highest feature representation score in PEPred-Suite [13] on the ACP dataset, we integrated Bit20(NTCT = 2) into PreTP-EL. The results in Table 2 show that PreTP-EL then achieved a slightly higher AUC than PEPred-Suite, indicating that the proposed ensemble framework can improve the performance of therapeutic peptide prediction.

Comparison of PreTP-EL with other existing predictors on the AP dataset

In this section, we further evaluated the performance of PreTP-EL on the AP dataset [45] by using the PreTP-EL model trained on the benchmark dataset, and compared it with other related methods, including AntiCP2.0 [45], AntiCP [46], ACPred [47], ACPred-FL [15], ACPred-Fuse [16], PEPred-Suite [13] and iACP [48]. The results are shown in Table 3, from which we can see that PreTP-EL outperforms all the other predictors in terms of ACC and MCC, further demonstrating the better performance of the proposed method.

Table 3

The performance of PreTP-EL and other existing predictors on the AP dataset^a

Methods              ACC      MCC       SE       SP
PreTP-EL             0.866    0.733     0.843    0.890
AntiCP2.0 [45]       0.754    0.510     0.775    0.734
AntiCP [46]          0.506    0.070     1.000    0.012
ACPred [47]          0.535    0.090     0.856    0.214
ACPred-FL [15]       0.448    −0.120    0.671    0.225
ACPred-Fuse [16]     0.689    0.380     0.692    0.686
PEPred-Suite [13]    0.535    0.080     0.331    0.738
iACP [48]            0.551    0.110     0.779    0.322

^a All the results in this table, except for those of PreTP-EL, were obtained from [45].

Analysis of dataset ACP

In this section, we further examined the ability of Bit20(NTCT = 2) and the sequence-based features to distinguish ACPs from non-ACPs. We used PCA to compare Bit20(NTCT = 2) with the DT feature, which is selected because it received the highest weight from the genetic algorithm. The results are illustrated in Figure 8, which indicates that Bit20(NTCT = 2) is more discriminative and separates ACPs from non-ACPs more clearly. The results in Table 2 and Figure 8 show that the proposed method achieves competitive performance in ACP prediction by utilizing multiple discriminative features and integrating the basic predictors through the genetic algorithm.

Conclusion

In this study, a novel ensemble prediction model called PreTP-EL was proposed to identify multiple types of functional therapeutic peptides. The model combines nine feature extraction methods with two machine learning classifiers and integrates the resulting basic predictors through a genetic algorithm. Furthermore, a user-friendly webserver of PreTP-EL was developed to predict query peptides. The experimental results show that the proposed method outperforms other state-of-the-art methods and achieves stable performance. Therefore, PreTP-EL is a useful tool for therapeutic peptide prediction. Because the prediction of novel peptides with post-translational modifications is very important, in future studies we will develop computational predictors for post-translational modification prediction and further analyze these peptides.

Key Points
  • In this study, we proposed a therapeutic peptide prediction model, PreTP-EL, by integrating two machine learning classifiers with nine feature extraction methods. For different kinds of peptides, we adaptively learn weights for the 18 resulting classifiers with a genetic algorithm.

  • Feature analysis results show that the nine features used in this study are complementary. Predictors based on these features are able to improve the performance of therapeutic peptide prediction.

  • Furthermore, we established a webserver for PreTP-EL to predict unknown peptide sequences using the genetic algorithm-based ensemble; it is available at http://bliulab.net/PreTP-EL. We expect that this webserver will help users identify therapeutic peptides.

Acknowledgements

We are very much indebted to the three anonymous reviewers, whose constructive comments are very helpful for strengthening the presentation of this paper.

Data Availability

The datasets used and analyzed during the current study are available from http://bliulab.net/PreTP-EL.

Funding

The National Natural Science Foundation of China (No. 62102030), the National Key R&D Program of China (No. 2018AAA0100100), and the Beijing Natural Science Foundation (No. JQ19019).

Yichen Guo is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. Her expertise is in bioinformatics, natural language processing and machine learning.

Ke Yan is a postdoctoral researcher at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics and machine learning.

Hongwu Lv is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

Bin Liu, PhD, is a professor at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

References

1. Fosgerau K, Hoffmann T. Peptide therapeutics: current status and future directions. Drug Discov Today 2015;20:122-8.
2. Vázquez-Prieto S, Paniagua E, Ubeira FM, et al. QSPR-perturbation models for the prediction of B-epitopes from immune epitope database: a potentially valuable route for predicting "in silico" new optimal peptide sequences and/or boundary conditions for vaccine development. Int J Pept Res Ther 2016;22:445-50.
3. Borghouts C, Kunz C, Groner B. Current strategies for the development of peptide-based anti-cancer therapeutics. J Pept Sci 2010;11:713-26.
4. Gupta S, Sharma A, Shastri V, et al. Prediction of anti-inflammatory proteins/peptides: an in silico approach. J Transl Med 2017;15:7.
5. Wei L, Xing P, Shi G, et al. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1264-73.
6. Vázquez-Prieto S, Paniagua E, Solana H, et al. A study of the immune epitope database for some fungi species using network topological indices. Mol Divers 2017;21:713-18.
7. Vazquez-Prieto S, Paniagua E, Solana H, et al. Complex network study of the immune epitope database for parasitic organisms. Curr Top Med Chem 2018;17:3249-55.
8. Xiaoli Q, Chen, et al. CPPred-FL: a sequence-based predictor for large-scale identification of cell-penetrating peptides by feature representation learning. Brief Bioinform 2018.
9. Chen X, Qiu, et al. Incorporating key position and amino acid residue features to identify general and species-specific ubiquitin conjugation sites. Bioinformatics 2013;29:1614-22.
10. Wei L, Xing P, Shi G, et al. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1264-73.
11. Shen HB, Chou KC. PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem 2008;373:386-8.
12. Zhang YP, Zou Q. PPTPP: a novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning. Bioinformatics 2020;36:3982-7.
13. Wei L, Zhou C, Su R, et al. PEPred-Suite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics 2019;35:4272-80.
14. Liang X, Li F, Chen J, et al. Large-scale comparative review and assessment of computational methods for anti-cancer peptide identification. Brief Bioinform 2021;22:bbaa312.
15. Wei L, Zhou C, Chen H, et al. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018;34:4007-16.
16. Rao B, Zhou C, Zhang G, et al. ACPred-Fuse: fusing multi-view information improves the prediction of anticancer peptides. Brief Bioinform 2020;21:1846-55.
17. Ettayapuram Ramaprasad AS, Singh S, Gajendra PSR, et al. AntiAngioPred: a server for prediction of anti-angiogenic peptides. PLoS One 2015;10:e0136990.
18. Lata S, Sharma BK, Raghava GP. Analysis and prediction of antibacterial peptides. BMC Bioinformatics 2007;8:263.
19. Manavalan B, Shin TH, Kim MO, et al. AIPpred: sequence-based prediction of anti-inflammatory peptides using random forest. Front Pharmacol 2018;9:276.
20. Thakur N, Qureshi A, Kumar M. AVPpred: collection and prediction of highly effective antiviral peptides. Nucleic Acids Res 2012;40:W199-204.
21. Wei L, Xing P, Su R, et al. CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency. J Proteome Res 2017;16:2044-53.
22. Li N, Kang J, Jiang L, et al. PSBinder: a web service for predicting polystyrene surface-binding peptides. Biomed Res Int 2017;2017:5761517.
23. Rajput A, Gupta AK, Kumar M. Prediction and analysis of quorum sensing peptides based on sequence features. PLoS One 2015;10:e0120066.
24. Gao X, Wang DH, Zhang J, et al. iRBP-Motif-PSSM: identification of RNA-binding proteins based on collaborative learning. IEEE Access 2019;7:168956-62.
25. Wang N, Zhang J, Liu B. IDRBP-PPCT: identifying nucleic acid-binding proteins based on position-specific score matrix and position-specific frequency matrix cross transformation. IEEE/ACM Trans Comput Biol Bioinform 2021.
26. Liu B, Wang X, Lin L, et al. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinformatics 2008;9:510.
27. Liu B, Xu JH, Zou Q, et al. Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics 2014;15:S3.
28. Liu B, Xu J, Lan X, et al. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS One 2014;9:e106691.
29. Chou K-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001;43:246-55.
30. Xu R, Zhou J, Wang H, et al. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC Syst Biol 2015;9(Suppl 1):S10.
31. Zhang J, Liu B. PSFM-DBT: identifying DNA-binding proteins by combing position specific frequency matrix and distance-bigram transformation. Int J Mol Sci 2017;18:1856.
32. Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res 2019;47:e127.
33. Rangwala H, Karypis G. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 2005;21:4239-47.
34. Liu B, Long R, Chou KC. iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics 2016;32:2411-8.
35. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389-402.
36. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998;14:423-9.
37. Hasan MM, Schaduangrat N, Basith S, et al. HLPpred-Fuse: improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation. Bioinformatics 2020;36:3350-6.
38. Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Processing Letters 1999;9:293-300.
39. Breiman L. Random forests, machine learning 45. J Clin Microbiol 2001;2:199-228.
40. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825-30.
41. Breiman L. Random forests. Machine Learning 2001;45:5-32.
42. Kosakovsky Pond SL, Posada D, Gravenor MB, et al. GARD: a genetic algorithm for recombination detection. Bioinformatics 2006;22:3096-8.
43. Maulik U, Bandyopadhyay S. Genetic algorithm-based clustering technique. Pattern Recognition 2000;33:1455-65.
44. Powers DMW. Evaluation: from precision, recall and F-factor to ROC, informedness, markedness and correlation. J Mach Learn Technol 2010:37-63.
45. Agrawal P, Bhagat D, Mahalwal M, et al. AntiCP 2.0: an updated model for predicting anticancer peptides. Brief Bioinform 2021;22:bbaa153.
46. Tyagi A, Kapoor P, Kumar R, et al. In silico models for designing and discovering novel anticancer peptides. Sci Rep 2013;3:2984.
47. Schaduangrat N, Nantasenamat C, Prachayasittikul V, et al. ACPred: a computational tool for the prediction and analysis of anticancer peptides. Molecules 2019;24:1973.
48. Chen W, Ding H, Feng P, et al. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2016;7:16895-909.

Author notes

Yichen Guo and Ke Yan contributed equally to this work.


Supplementary data