Abstract

Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to the emerging global concern of antimicrobial resistance. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for the accurate prediction of AMPs. These approaches show high diversity in their data set size, data quality, core algorithms, feature extraction and selection techniques and evaluation strategies. Here, we provide a comprehensive survey of current approaches for AMP identification and highlight the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare different computational methods on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As the predictive performances are affected by the different data sets used by different methods, we additionally perform a 5-fold cross-validation test to benchmark different traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performance than other machine learning methods and are often the algorithms of choice of multiple AMP prediction tools.

Introduction

Antimicrobial peptides (AMPs) are a unique and diverse group of gene-encoded peptide antibiotics found in various species ranging from prokaryotes to eukaryotes, including mammals, amphibians, insects and plants [1]. AMPs are essential components of the immune system in a wide range of organisms, representing the first line of defense against a variety of pathogens [2]. AMPs display a broad spectrum of antimicrobial activities, mainly against bacteria but also against fungi, viruses and cancer cells. They are typically short (6–100 amino acid residues) and act quickly and efficiently against microbes. In recent years, with antibiotic resistance becoming an increasing global concern, AMPs have attracted significant interest as potential substitutes for conventional antibiotics [3]. Their low toxicity to mammals and the minimal development of resistance in target microorganisms make them promising candidates for peptide drugs [1]. The targeting and mechanisms of action of AMPs suggest that they may have evolved to function in specific physiological and anatomical environments so as to minimize potential damage to the host [4, 5]. Some AMPs can bind to protein receptors, whereas others appear to act directly on cell membranes [6], suggesting the existence of multiple modes of action. In addition, some AMPs are multifunctional effector molecules: they play a direct role in killing microorganisms and in amplifying the antibacterial mechanisms of leukocytes, which may bridge innate and adaptive immunity [4, 5]. Compared with traditional antibacterials, AMPs have more flexible mechanisms of action, fewer side effects and a lower propensity to induce drug resistance.

Several models of antibacterial activity have been proposed; some features, such as helicity, flexibility and a cationic nature, have proven to be necessary for antibiotic activity [7–12]. Further, due to the functional significance of AMPs, several curated public resources have been established, providing comprehensive, experimentally verified annotations of AMPs [6, 13, 14]. In addition, the mode of action and activity-specific databases, such as AntiTbPdb [15], have also been made publicly available.

Based on the identification of AMPs, many efforts have been dedicated to the investigation of potential cellular mechanisms [16, 17]. Advances in AMP research have driven continued efforts for developing computational methods for accurate prediction of AMPs, aimed at significantly reducing the time and effort involved in experimental identification. Indeed, compared to labor-intensive and time-consuming experimental characterization of AMPs, computational prediction of AMPs provides a useful and complementary approach by shortlisting likely AMP candidates for subsequent experimental validation.

Thus far, a number of computational approaches have been developed and published for this purpose. These tools can be classified into two major categories in terms of the adopted methodologies. The first category comprises conventional machine learning-based predictors, such as Collection of Anti-Microbial Peptides (CAMP) [3, 18], iAMP-2L [19] and iAMPpred [20]. These tools apply machine learning algorithms to features extracted from the peptide sequences to identify AMPs; among them, the support vector machine (SVM) [21] and random forest (RF) [22] are the two most commonly used algorithms. The second category contains the deep learning-based methods. Deep learning has gained considerable momentum in recent years and is widely used in bioinformatics, particularly for biological sequence analysis [23, 24]. Tools in this second category frequently use encodings of the original sequences, such as one-hot encoding, as the input, in some cases supplemented with sequence information extracted by third-party tools; features are then further extracted, and classification labels are output, through a neural network structure. Such raw sequence encodings are seldom used for training conventional machine learning methods.

A number of attempts have been made to provide benchmark tests of prediction tools [25]; however, each study had certain limitations: the authors did not include a performance evaluation of all reviewed tools, several state-of-the-art prediction tools were not considered and benchmarked, or no detailed algorithm description was provided for each reviewed tool. To overcome these issues, in this article we first provide what is, to the best of our knowledge, a comprehensive survey of current machine learning-based computational methods for AMP prediction. We discuss a wide range of aspects, including but not limited to the data sets used, the core algorithms selected by individual methods, the feature selection techniques employed and the performance evaluation strategies. Furthermore, we construct an independent data set and multiple validation data sets from different databases and compare different AMP prediction methods.

Materials and Methods

Construction of the independent test data set

In order to objectively evaluate the predictive performance of the existing available tools, an independent test data set must be constructed. For the positive data set, we integrated all AMPs from different comprehensive AMP public databases, including ADAM [26], ADAPTABLE [27], APD3 [17], CAMP [3, 18], dbAMP [28], DRAMP [29, 30], LAMP [14, 31], MilkAMP [32] and YADAMP [6]. Peptide sequences with length greater than 100 or less than 10 were not considered, and sequences with non-standard residues such as ‘B’, ‘J’, ‘O’, ‘U’, ‘X’ or ‘Z’ were eliminated, since such peptides are rare and cannot be predicted by some tools [33]. Furthermore, to reduce homology bias and redundancy, the CD-HIT tool [34, 35] was used to filter out sequences sharing ≥40% sequence identity with any other sequence in the same subset. The sequences used as training data in the reviewed approaches were then removed. After all these steps, 1536 positive AMP samples were obtained, constituting the positive part of our independent test data set. The major steps for constructing the independent positive data set are illustrated in Supplementary Figure S1 available online at http://bib.oxfordjournals.org/.

To generate the negative data set, the following steps were taken: (1) peptide sequences were downloaded from UniProt (http://www.uniprot.org), and all entries containing the keyword ‘antimicrobial’ or related keywords (e.g. ‘antibacterial’, ‘antifungal’, ‘anticancer’, ‘antiviral’, ‘antiparasitic’, ‘antibiotic’, ‘antibiofilm’, ‘effector’ or ‘excreted’) were removed; (2) sequences longer than 100 or shorter than 10 amino acid residues and sequences containing the non-standard residues ‘B’, ‘J’, ‘O’, ‘U’, ‘X’ or ‘Z’ were eliminated, as was done for the construction of the positive data set; and (3) to reduce homology bias and redundancy, the CD-HIT program [34, 35] was employed to remove sequences that shared ≥40% pairwise sequence identity with samples from the positive data set or with any other sequence in the same subset. Finally, 3062 negative samples were obtained. To balance the numbers of positive and negative samples, we randomly selected 1536 of these 3062 samples as our final negative data set.
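To make the pre-processing concrete, the following is a minimal Python sketch of the length and residue filters described above (the FASTA file name is a placeholder, and homology reduction with CD-HIT at the 40% identity threshold is a separate step run on the filtered output):

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def keep(seq: str) -> bool:
    """Length filter (10-100 residues) and standard-residue filter."""
    return 10 <= len(seq) <= 100 and set(seq) <= VALID_RESIDUES

def read_fasta(path):
    """Minimal FASTA reader yielding (header, sequence) pairs."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# "candidate_amps.fasta" is a hypothetical file name, not from the study.
filtered = [(h, s) for h, s in read_fasta("candidate_amps.fasta")
            if keep(s.upper())]
```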

Construction of the validation data sets

In order to compare the performance of different tools on different AMP databases, we constructed six data sets for additional performance evaluation based on six commonly used AMP public databases (APD3 [17], CAMP [3, 18], dbAMP [28], DRAMP [29, 30], LAMP [14, 31], YADAMP [6]).

We again removed sequences with length less than 10 or greater than 100 and sequences containing non-standard residues to construct the positive data sets, and we used CD-HIT [34, 35] to filter out sequences sharing ≥40% sequence identity. To construct the negative data sets, we used the same approach as described in the previous section. Finally, we obtained 494, 203, 522, 1408, 1054 and 324 positive samples for the APD3, CAMP, dbAMP, DRAMP, LAMP and YADAMP tests, respectively, each paired with an equal number of negative samples. The major steps of constructing the six validation positive data sets are shown in Supplementary Figure S1 available online at http://bib.oxfordjournals.org/.

State-of-the-art computational approaches for AMP prediction

More than 30 computational methods have been developed for AMP identification to date. These methods differ in a variety of key aspects, including the algorithms employed, the adopted feature selection techniques and more. Table 1 summarizes 34 computational approaches currently available for AMP prediction in terms of the algorithm selected, the feature selection method employed, the performance evaluation strategy, web server availability, maximum data upload, file upload availability, email delivery of results and software availability.

Table 1

A comprehensive summary of the reviewed approaches for AMP prediction

| Type | Tool | Year | Algorithm | Feature selection | Evaluation strategy | Web server availability | Max data upload | File upload availability | Email of result | Software availability |
|---|---|---|---|---|---|---|---|---|---|---|
| Machine learning-based methods | AMPer | 2007 | HMMs | None | 10-fold CV | No | N.A. | N.A. | N.A. | Yes |
| | CAMP | 2010, 2016 | SVM, RF, ANN, DA | RFE (RF Gini) | 10-fold CV, independent test | Yes | N.A. | Yes | No | No |
| | Porto et al. | 2010 | SVM | None | 10-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | Song et al. | 2011 | k-NN, BLASTP | mRMR, IFS | Jack-knife validation, independent test | No | N.A. | N.A. | N.A. | No |
| | Torrent et al. | 2011 | ANN | None | Independent test | No | N.A. | N.A. | N.A. | No |
| | Fernandes et al. | 2012 | ANFIS | ANFIS | Independent test | No | N.A. | N.A. | N.A. | No |
| | ClassAMP | 2012 | RF, SVM | RFE (RF Gini) | 10-fold CV, independent test | Yes | N.A. | Yes | No | No |
| | CS-AMPPred | 2012 | SVM | None | 5-fold CV, independent test | No | N.A. | N.A. | N.A. | Yes |
| | C-PAmP | 2013 | SVM | None | 10-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | iAMP-2L | 2013 | FKNN | None | Jack-knife validation, independent test | Yes | 500 | No | No | No |
| | Paola et al. | 2013 | SVM | None | 10-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | Randou et al. | 2013 | LR | None | Independent test | No | N.A. | N.A. | N.A. | No |
| | dbaasp | 2014 | z-Score | None | Independent test | No | N.A. | N.A. | N.A. | No |
| | ADAM | 2015 | SVM, profile HMMs | None | N.A. | Yes | N.A. | No | No | No |
| | Camacho et al. | 2015 | NB, RF, SVM | None | 5-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | Ng et al. | 2015 | SVM, BLASTP | None | Jack-knife validation, independent test | No | N.A. | N.A. | N.A. | No |
| | MLAMP | 2016 | RF | None | Jack-knife validation, independent test | Yes | 5 | Yes | Yes | No |
| | iAMPpred | 2017 | SVM | None | 10-fold CV, independent test | Yes | N.A. | No | No | No |
| | AmPEP | 2018 | RF | None | 10-fold CV, independent test | Yes | N.A. | Yes | Yes | Yes |
| | CLN-MLEM2 | 2018 | MLEM2, IRIM | None | 10-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | MOEA-FW | 2018 | RF, k-NN, SVM, ANN | None | 10-fold CV | No | N.A. | N.A. | N.A. | No |
| | AMAP | 2019 | SVM, XGBoost, one-versus-rest classifier fusion | None | LOCO, 5-fold CV, independent test | Yes | N.A. | No | No | No |
| | MAMPs-Pred | 2019 | RF, LC-RF, PS-RF | None | 10-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | dbAMP | 2019 | RF | None | 5-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | AMPfun | 2019 | DT, RF, SVM | FS | 10-fold CV, independent test | Yes | N.A. | No | No | No |
| | Ampir | 2020 | SVM | RFE | Independent test | No | N.A. | N.A. | N.A. | Yes |
| | Chung et al. | 2020 | RF | OneR, FS | 5-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | Fu et al. | 2020 | ADA | None | 5-fold CV, independent test | No | N.A. | N.A. | N.A. | No |
| | AmpGram | 2020 | RF | None | 5-fold CV, benchmark test | Yes | 50 | Yes | No | Yes |
| | IAMPE | 2020 | RF, SVM, XGBoost, k-NN, NB | None | 10-fold CV, independent test | Yes | N.A. | Yes | No | No |
| Deep learning-based methods | AMP Scanner V2 | 2018 | LSTM | None | 10-fold CV, independent test | Yes | 50 000 | Yes | No | No |
| | APIN | 2019 | CNN | None | 10-fold CV, independent test | No | N.A. | N.A. | N.A. | Yes |
| | Deep-AmPEP30 | 2020 | CNN | None | 10-fold CV, independent test | Yes | N.A. | Yes | Yes | No |
| | AMPlify | 2020 | Bi-LSTM, attention mechanism | None | 5-fold CV, independent test | No | N.A. | N.A. | N.A. | Yes |

Full names of the algorithms: N.A., not available; HMMs, hidden Markov models; SVM, support vector machine; RF, random forest; ANN, artificial neural networks; DA, discriminant analysis; DT, decision tree; LR, logistic regression; k-NN, k-nearest neighbor; BLASTP, basic local alignment search tool (protein); ANFIS, adaptive neuro-fuzzy inference system; NB, naive Bayes; FKNN, fuzzy k-nearest neighbor; profile HMMs, profile hidden Markov models; MLEM2, modified learning from examples module; IRIM, interesting rule induction module; XGBoost, extreme gradient boosting; PS-RF, pruned sets-random forests; LC-RF, label combination-random forests; ADA, AdaBoost; LSTM, long short-term memory; CNN, convolutional neural networks; Bi-LSTM, bidirectional LSTM.


The AMP identification task can generally be divided into two problem formulations. The first formulates AMP identification as a binary classification task, i.e. determining whether a peptide is an AMP or not, and the majority of the methods listed in Table 1 belong to this category. The second treats AMP identification as a multi-class classification task, with the goal of determining the specific functional types of AMPs, such as AMPs targeting Gram-positive bacteria, AMPs targeting Gram-negative bacteria or other AMP types. In this work, we mainly focus on reviewing, surveying and benchmarking the first formulation and only briefly discuss the second. For AMP identification, we classify approaches into two groups based on the adopted computational methodology. The first group is based on traditional machine learning algorithms that use sequence-derived features to train the models; according to our survey, most tools listed in Table 1 belong to this group. In recent years, deep learning has become widely popular due to its strong ability to learn and extract informative features, and the second group of approaches is accordingly developed based on deep learning. A timeline and a general flow chart of computational methods for AMP prediction are shown in Figures 1 and 2, respectively.

Figure 1. Timeline of current computational approaches for AMP prediction.

Figure 2. Overview of current computational approaches for AMP prediction. (A) Pre-processing to build the data set; (B) structure of the CNNs used in this study.

Figure 3. Prediction distributions of tests of reviewed computational approaches using the independent test data set. Red: non-AMPs; blue: AMPs.

Features calculated and extracted for model construction

Extracting feature information from sequences is crucial for building a computational model, and a substantial body of work addresses feature construction [11, 12, 24, 36–42]. To construct robust and accurate approaches for AMP prediction, various features have been designed and extracted for encoding peptide sequences. According to their characteristics, we identified five major types of features in current computational approaches for AMP prediction (Table 2): (1) composition features, (2) position features, (3) structure features, (4) physicochemical properties and (5) similarity features. As part of our survey, we collected the most representative features of each type. Amino acid composition (AAC) and pseudo amino acid composition (PseAAC) are the most commonly used features, although some structure features are also often used. Composition features can be calculated directly from the peptide sequence and are easy to obtain, whereas some features require third-party software; for instance, structure features can be calculated using the software Tango [43]. After feature encoding, the initial feature set sometimes has a high dimensionality, which may result in biased model training. Therefore, feature selection needs to be performed to reduce the dimensionality of the initial feature set before constructing the computational models.
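As an illustration, the two composition features AAC and DPC are straightforward to compute directly from a sequence; the minimal sketch below (plain Python, not taken from any reviewed tool) encodes a peptide as its 20 AAC frequencies and 400 DPC frequencies:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Amino acid composition: relative frequency of each of the 20 residues."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def dpc(seq: str) -> list[float]:
    """Dipeptide composition: relative frequency of each of the 400 residue pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [pairs[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS]

features = aac("GIGKFLHSAKKFGKAFVGEIMNS")  # magainin 2, a well-known AMP
```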

Table 2

Different types of features employed by the reviewed approaches for AMP prediction

| Feature type | Feature | Reference |
|---|---|---|
| Composition features | AAC | [20, 28, 64, 82, 83, 98, 103, 104, 112, 113] |
| | Normalized amino acid composition (NAAC) | [20] |
| | AAPC | [105] |
| | DPC | [3, 18, 64, 71, 112, 122] |
| | TPC | [3, 18, 64] |
| | Peptide length | [49, 89] |
| | N-gram composition found by counting (NCC) | [103] |
| | N-gram composition found by t-test (NTC) | [103] |
| | Motifs composition (MC) | [103] |
| Position features | N-gram binary profiling of position found by counting (NCB) | [103] |
| | N-gram binary profiling of position found by t-test (NTB) | [103] |
| | Motifs binary profiling of position (MB) | [103] |
| | PSSM profile | [108] |
| | CMV | [71] |
| Structure features | α-Helix | [20, 49–51, 64, 65, 85, 89, 98] |
| | β-Sheet | [20, 49–51, 64, 65, 85, 89, 98] |
| | β-Turn | [20, 49–51, 64, 85, 89, 98] |
| | Loop formation | [65] |
| | Random coil | [51, 85] |
| Physicochemical properties | Isoelectric point | [20, 49, 81, 85] |
| | Charge | [3, 18, 20, 51, 65, 81, 98] |
| | Molecular mass | [50] |
| | Atom count | [50] |
| | Size | [81] |
| | Amino acid acidity and basicity | [81] |
| | Aromaticity | [81] |
| | Sulfur | [81] |
| | Oxygen, nitrogen, hydrogen and carbon atom contents | [81] |
| | Net charge at the physiological pH | [63] |
| | μH | [63] |
| | Aliphatic index | [3, 18, 81] |
| | Amphipathicity | [51, 65] |
| | Ratio between hydrophobic and charged residues | [63] |
| | Hydrophilicity | [3, 18] |
| | Hydrophobic moment | [63, 65] |
| | Hydrophobicity | [3, 18, 20, 51, 63, 65, 89, 98] |
| | Instability index | [3, 18] |
| | Disordering | [51] |
| | Solvent accessibility | [98] |
| | Surface tension | [98] |
| | Normalized van der Waals volume | [64, 98] |
| | Conformational similarity | [64] |
| | Polarity | [64, 81, 98] |
| | Polarizability | [64, 98] |
| | Flexibility | [65] |
| | Normalized Moreau–Broto autocorrelation (NMBroto) | [71] |
| | Moran autocorrelation (Moran) | [71] |
| | Geary autocorrelation (Geary) | [71] |
| | CNMR | [81] |
| | Composition/transition/distribution (CTD) | [3, 18, 64, 71, 92, 95, 98, 103, 104] |
| | PseAAC | [19, 20, 66, 71, 80, 83, 96, 98, 103, 113] |
| | PseKRAAC | [113] |
| | 3-mer composition | [82] |
| | Sequence order coupling number | [71] |
| | Quasi sequence order | [71] |
| Similarity features | BLOSUM-50 | [3, 18, 64] |
| | LZ complexity pairwise similarity scores | [74] |
| In vivo propensity | In vivo aggregation propensity | [85, 89] |
| | In vivo stability | [64] |

Feature selection strategy

As previously mentioned, before model construction, feature selection is a nontrivial step that measures the importance of all features and eliminates the less informative ones. Six of the 34 surveyed predictors in Table 1 adopted a feature selection procedure. Commonly used feature selection algorithms include the incremental feature selection (IFS) method [44, 45] based on maximum relevance minimum redundancy (mRMR) [46], the fast correlation-based filter (FCBF) method based on entropy [47], the rigorous recursive feature elimination (RFE) method based on the RF Gini score [48] and the forward selection method based on one-rule attribute evaluation (OneR). There are also studies focusing on new feature selection methods for AMP prediction, such as Fernandes et al. [49] and MOEA-FW [50].
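For instance, RFE driven by RF Gini importance, as used by CAMP and ClassAMP, can be approximated with scikit-learn; the following is a sketch on synthetic data rather than the tools' actual code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical descriptor matrix (peptides x features) and binary AMP labels.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, 200)

# RFE repeatedly refits the forest and drops the features with the lowest
# Gini importance (exposed by scikit-learn as feature_importances_).
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=10, step=5)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of retained features
```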

Machine learning-based AMP predictors

As listed in Table 1, apart from dbaasp [51], which optimizes z-score-based features for predicting linear cationic AMPs, most computational approaches for AMP identification are built on well-established machine learning algorithms. These algorithms include the hidden Markov model (HMM) [52], SVM [21], RF [22], eXtreme Gradient Boosting (XGBoost) [53], discriminant analysis (DA) [54], decision tree (DT) [55], Bayesian network (BN) [56], fuzzy k-nearest neighbor (fuzzy k-NN) [57], artificial neural network (ANN) [58], logistic regression (LR) [58] and AdaBoost (ADA) [59]. Based on our survey, SVM and RF stand out as the two most commonly used machine learning algorithms (Table 1). Below, we briefly describe these tools.

HMM-based predictors

AMPer [52] is a database and an automated discovery tool for AMPs based on HMMs. First, an initial AMP set was constructed by performing pairwise comparisons between the AMPs in its reference data set and the peptides from the UniProt database (Swiss-Prot and TrEMBL) using the Basic Local Alignment Search Tool (BLAST) [60]. Clusters of similar peptides were then built from the pairwise BLAST alignments using a threshold value, and multiple alignments were created for each cluster using ClustalW [61]. From these clusters, HMMs were created using the HMMER software [62]. Through these HMMs, new AMPs can be identified and added to the clusters according to their matching scores, after which the HMMs are updated.

SVM-based predictors

SVM is one of the two most commonly used machine learning algorithms for AMP prediction. CAMP [3, 18] is a pioneering tool that provides an AMP database together with an AMP prediction web server based on four different algorithms: SVM, RF, DA and ANN. CAMP adopted rigorous RFE based on the RF Gini score [48] for feature selection. Later, Porto et al. [63] proposed an AMP identification model based on the SVM algorithm. This work used four physicochemical properties as features, upon which the authors compared the predictive performance of different SVM kernels (polynomial, radial, linear and sigmoidal) using the 10-fold cross-validation test. ClassAMP [64] is another prediction tool for classifying AMPs, which attempts to predict the antibacterial, antifungal or antiviral activity of peptides based on RF and SVM. For this purpose, AAC, dipeptide composition (DPC), tripeptide composition (TPC) and several physicochemical properties were used as features, and three one-against-all classifiers were built. CS-AMPPred [65] is an updated SVM-based approach built on Porto et al.'s model just mentioned; compared with that model, CS-AMPPred used nine structural/physicochemical properties as features and applied principal component analysis to redundant information to improve the predictive performance. C-PAmP [66] is a database containing high-scoring, computationally predicted AMPs for a number of plant species. It used PseAAC [67–69] and five quantitative descriptors transformed from 237 physicochemical descriptors of amino acids, and it employed an SVM to identify AMPs; the SVM with the radial basis function (RBF) kernel was chosen after experimental comparison with all built-in kernels. Rondón-Villarreal et al. [70] also used the SVM algorithm to predict AMPs. However, the p-spectrum kernel used for their SVM differs from the kernels used in other models. It is defined as follows:
$$k_p(s,t)=\sum_{i=1}^{|s|-p+1}\sum_{j=1}^{|t|-p+1}\mathrm{isEqual}\big(s(i:i+p-1),\,t(j:j+p-1)\big) \qquad (1)$$

where $k_p(s,t)$ is the p-spectrum kernel for sequences $s$ and $t$, $|s|$ is the length of the sequence $s$, $s(i:i+p-1)$ is the subsequence of $s$ that starts at position $i$ and ends at position $i+p-1$ and $\mathrm{isEqual}(a,b)$ is defined as:

$$\mathrm{isEqual}(a,b)=\begin{cases}1, & a=b\\ 0, & \text{otherwise}\end{cases} \qquad (2)$$

This model only considered the order information of the amino acids in peptide sequences without any physicochemical information.
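A direct Python transcription of Eqs (1) and (2) might look as follows (the 1-based sequence positions of the formula map to 0-based Python slices):

```python
def p_spectrum(s: str, t: str, p: int) -> int:
    """Count matching length-p subsequence pairs between s and t (Eq. 1)."""
    return sum(s[i:i + p] == t[j:j + p]
               for i in range(len(s) - p + 1)
               for j in range(len(t) - p + 1))

k = p_spectrum("GIGKFLHSAK", "KFLHS", 3)  # toy sequences for illustration
```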

ADAM [26] is another public AMP database built to systematically establish comprehensive associations between AMP sequences and structures and to provide easy access to view their relationships. It also provides two computational tools based on SVM and HMM to predict the AMPs.

The issue of unbalanced data sets can also affect AMP prediction. To address this, Camacho et al. [71] extracted 10 groups of features from peptide sequences: nine groups of physicochemical properties computed with Propy [72] plus the composition moment vector (CMV) [73]. Using these 10 feature groups, they constructed 10 data sets for training, built models based on the SVM, RF and NB algorithms on the unbalanced data sets and compared their AMP identification performance.

Ng et al. [74] proposed an AMP prediction approach that, as a first step, classifies a peptide by comparing the maximum of the high-scoring segment pair (HSP) scores obtained with BLASTP. To classify peptides that cannot be classified by BLASTP [60], Ng et al. instead employed the SVM-LZ complexity pairwise algorithm [75–77]. Because peptide sequences are drawn from a fixed alphabet of letters, LZ complexity is well suited to calculating the distance between AMPs.

iAMPpred [20] is a tool for predicting antibacterial, antiviral and antifungal peptides using three different categories of features: compositional, structural and physicochemical properties. It was developed based on three SVM models with the RBF kernel. To quantify the importance of each feature for predicting antibacterial, antiviral or antifungal peptides, the information gain was computed for all features, and the differences in predictive performance among different feature combinations were discussed as well.

MOEA-FW [50] is a method that models molecular descriptor selection as a multi-objective (MO) optimization problem: it searches for a set of weight vectors that simultaneously optimizes the distances among the AMPs and the distances between the AMPs and non-AMPs, thereby obtaining a good peptide representation for classification tasks. The MO feature weighting optimization problem can be defined as:
(3)
(4)
(5)
(6)
(7)
where $x_p$ is the feature vector of a peptide, $D$ is a descriptor matrix, $D_{\mathrm{intra}}(w,D)$ is the intra-class distance for the class of interest and $D_{\mathrm{inter}}(w,D)$ is the inter-class distance. The MO evolutionary algorithm based on decomposition (MOEA/D-DE) [78, 79] was employed to solve the MO feature weighting problem. With the weighted descriptor matrix, RF, SVM with the linear kernel, ANN and k-NN were used as classifiers.

Additionally, Ampir [80] is an R package that employs two SVM classifiers with the RBF kernel for AMP prediction, with two built-in models: one trained on peptide data (10–50 amino acids) and another trained on full-length precursor protein sequences. IAMPE [81] is a recent AMP prediction web server that utilizes clusters of the CNMR spectra of amino acids together with several physicochemical properties to identify AMPs based on SVM, k-NN, NB, RF and XGBoost.

In addition, some approaches not only identify whether a peptide is an AMP but also determine which functional type it belongs to. AMAP [82] is such a tool, with two prediction levels: the first level uses SVM to predict AMPs, and the second level uses one-versus-rest classifier fusion with SVM to predict the type of biological activity of a peptide and the effect of mutations on that activity.

k-NN-based predictors

Song et al.'s approach employed BLASTP [60] and the k-NN algorithm for AMP prediction [83]. This approach consists of two stages. First, BLASTP is employed to identify AMPs based on HSP scores. However, BLASTP is unable to deal with sequences that have no overlap with the training sequences; for those sequences, Song et al. used the k-NN algorithm. To this end, 270 features were calculated from each sequence, and the mRMR feature selection algorithm [46] was employed. The value of k was set to 1, so that unknown samples were assigned to the class of their nearest neighbor.

Predicting not only whether a peptide is an AMP but also its function is highly important. According to our survey, iAMP-2L [19] was the first attempt to predict AMPs and their functional types. iAMP-2L used two prediction levels of the fuzzy k-NN algorithm [84], a variant of the k-NN classifier, to construct the model. Instead of assigning a label by voting among the nearest neighbors, fuzzy k-NN calculates the contribution weight of each nearest neighbor to determine the final prediction label, using the following formula:
$$u_i(P)=\frac{\displaystyle\sum_{j=1}^{K} u_i\big(P_j^{\ast}\big)\,d\big(P,P_j^{\ast}\big)^{-2/(q-1)}}{\displaystyle\sum_{j=1}^{K} d\big(P,P_j^{\ast}\big)^{-2/(q-1)}} \qquad (8)$$

where $K$ is the number of nearest neighbors counted for the query peptide $P$, $u_i(P_j^{\ast})$ is the fuzzy membership value of the training sample $P_j^{\ast}$ to the $i$-th class, $d(P,P_j^{\ast})$ is the Euclidean distance between $P$ and its $j$-th nearest peptide and $q$ is the fuzzy coefficient that determines how heavily the distance is weighted when calculating each nearest neighbor's contribution to the membership value.
The fuzzy membership value of a training sample for the first-level prediction is the crisp indicator

$$u_i\big(P_j^{\ast}\big)=\begin{cases}1, & P_j^{\ast}\in C_i\\ 0, & \text{otherwise}\end{cases} \qquad (9)$$

with the two classes $C_1$ (AMP) and $C_2$ (non-AMP). For the second-level prediction, it takes the same indicator form over the functional types:

$$u_i\big(P_j^{\ast}\big)=\begin{cases}1, & P_j^{\ast}\in C_i\\ 0, & \text{otherwise}\end{cases} \qquad (10)$$

where $C_i$ is the $i$-th class.
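A compact NumPy sketch of the membership computation in Eq. (8) (illustrative only; the array shapes and toy neighbors are assumptions, not values from iAMP-2L):

```python
import numpy as np

def fuzzy_knn_membership(dists, memberships, q=2.0):
    """Fuzzy k-NN membership of a query to each class (Eq. 8).

    dists        -- shape (K,), nonzero distances to the K nearest neighbors
    memberships  -- shape (K, C), u_i(P_j*) for each neighbor and class
    q            -- fuzzy coefficient controlling the distance weighting
    """
    w = dists ** (-2.0 / (q - 1.0))  # distance-based weights
    return (w[:, None] * memberships).sum(axis=0) / w.sum()

# Three neighbors, two classes (AMP, non-AMP), crisp training memberships.
d = np.array([0.5, 1.0, 2.0])
u = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
print(fuzzy_knn_membership(d, u))  # higher first entry -> predicted AMP
```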
Figure 4. ROC curves and the corresponding AUC values of tests of reviewed computational approaches on the independent test data set.

ANN-based predictors

Torrent et al. constructed an ANN to classify AMPs [85]. In this work, secondary structure prediction features and physicochemical features were calculated with the TANGO software [43], AGGRESCAN [86], the ExPASy reference values [87] and the GRAVY scale [88]. For AMP prediction, a two-layer feed-forward network with a sigmoid transfer function and 50 nodes in the hidden layer was used.

Fernandes et al. predicted AMPs using an adaptive neuro-fuzzy inference system (ANFIS) model [49], a fuzzy reasoning architecture that combines fuzzy logic with an ANN. The ANFIS was created using two trapezoidal membership functions. The trapezoidal membership function is defined as follows:
$$f(x;a,b,c,d)=\max\!\left(\min\!\left(\frac{x-a}{b-a},\,1,\,\frac{d-x}{d-c}\right),\,0\right) \qquad (11)$$
where a, b, c and d denote four parameters, which were set manually. In addition, Fernandes et al. applied the semi-supervised k-means clustering as a pre-processing step to improve the performance. Compared with traditional ANNs, ANFIS achieved a better AMP prediction performance.
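Equation (11) translates into a one-line function; a small sketch with manually chosen parameters:

```python
def trapmf(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership function of Eq. (11), assuming a < b <= c < d."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

# Membership rises on [a, b], equals 1 on [b, c] and falls on [c, d].
print(trapmf(2.5, a=1, b=2, c=3, d=4))  # 1.0 (inside the plateau)
print(trapmf(3.5, a=1, b=2, c=3, d=4))  # 0.5 (on the falling edge)
```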

LR-based predictors

Considering that thousands of sequence features are irrelevant or redundant and that some might even mislead the classification of peptide activity, some studies focus instead on feature analysis and feature selection. Randou et al. [89] used TANGO [43], AGGRESCAN [86], ExPASy [87] and the GRAVY scale [88] to obtain eight features for prediction. The statistical significance of each feature was measured by two-sample randomization tests, after which seven were retained. Features were further selected if their estimated coefficient in the LR model was significantly different from 0 at α = 0.5. After that, the Akaike information criterion (AIC) [90] was employed to choose the best model, and the Brier score [91] was used to measure the overall prediction accuracy.

RF-based predictors

In addition to the approaches described above, several approaches employ RF algorithms. AmPEP [92] is an AMP prediction approach based on an RF with 100 trees, which converts amino acid sequences into numerical descriptors characterizing different peptide properties [93]. AmPEP employed the synthetic minority over-sampling technique (SMOTE) [94] to rebalance the proportion of positive and negative samples and improve prediction performance. The descriptors were additionally split into three subsets based on Pearson correlation coefficient (PCC) analysis of the AMP/non-AMP distributions to further compare their individual importance for prediction. dbAMP [28] is an AMP database that includes a tool for AMP prediction, which employs the RF algorithm with AAC and physicochemical properties to identify AMPs. amPEPpy [95] is the latest publicly available Python toolkit, which uses an RF classifier and the same features as AmPEP [92] for predicting AMPs.
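The rebalancing-plus-RF recipe used by AmPEP can be sketched with scikit-learn and imbalanced-learn (synthetic data stand in for real peptide descriptors; this is not AmPEP's own code):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for a peptide descriptor matrix with a 1:9 AMP/non-AMP imbalance.
rng = np.random.default_rng(0)
X = rng.random((500, 30))
y = np.array([1] * 50 + [0] * 450)

# The imblearn pipeline applies SMOTE only to the training folds, which
# avoids leaking synthetic samples into the evaluation folds.
model = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
```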

MLAMP [96] is a two-level AMP predictor that uses ML-SMOTE and gray PseAAC for predicting AMPs and their functions from imbalanced data sets. A peptide sequence is translated into a vector using PseAAC [67–69] with the gray model (GM) [97] to reflect the correlation between the peptide sequence and the prediction labels. GM(1,1), a type of GM, is defined by:
$$X^{0}=\big(x^{0}(1),\,x^{0}(2),\,\ldots,\,x^{0}(n)\big) \qquad (12)$$

$$X^{1}=\big(x^{1}(1),\,x^{1}(2),\,\ldots,\,x^{1}(n)\big) \qquad (13)$$

$$x^{1}(k)=\sum_{i=1}^{k}x^{0}(i),\quad k=1,2,\ldots,n \qquad (14)$$

$$\frac{\mathrm{d}x^{1}(t)}{\mathrm{d}t}+a\,x^{1}(t)=b \qquad (15)$$

where $X^0$ is a non-negative original series of real numbers with an irregular distribution, $X^1$ is the first-order accumulative generation operation (1-AGO) series of $X^0$, $a$ is the developing coefficient and $b$ is the influence coefficient.

By using the numerical values of five physicochemical properties for each of the 20 amino acids, in the same manner as in [19], a 30-dimensional feature vector representing a peptide was obtained. For the first prediction level, the RF algorithm was employed to predict whether a peptide is an AMP or not. For the second level, a multi-label classifier based on the RF algorithm was used; to account for the unbalanced functional types of AMPs, ML-SMOTE, a novel oversampling model, was applied.

MAMPs-Pred [98] is also a two-level model that predicts AMPs and their functional types based on the RF algorithm. For AMP prediction, 188 features based on AAC and eight types of physicochemical properties were used, calculated using SVM-Prot [99, 100]. The random under-sampling method [101] and weighted random sampling [102] were used for data balancing.

AMPfun [103] is a recent web server for identifying AMPs and their functional activities. It is a two-stage framework, and each stage consists of three steps: feature calculation, feature selection and application of classification algorithms. The sequential forward selection algorithm was employed for feature selection, and RF was used as the prediction method.

Instead of predicting an AMP's functional activities, Chung et al. proposed a method for predicting AMPs from seven categories of organisms: amphibians, humans, fish, insects, plants, bacteria and mammals [104]. AAC, amino acid pair composition (AAPC) [105] and seven physicochemical properties were used as features representing each peptide. After feature selection based on OneR [106], RF, SVM, k-NN and DT models were constructed, of which the RF model performed best and was selected.

AmpGram [107] is a recent AMP predictor that employs two RF models with n-grams to represent the information hidden in amino acid sequences. First, each sequence in the training data set was scanned with a sliding window of 10 amino acids and divided into overlapping 10-amino-acid subsequences (10-mers). All 10-mers from AMPs were treated as positive samples, whereas all 10-mers from non-AMPs were labeled as negative samples. N-grams, which are continuous or discontinuous sequences of n elements, were extracted from each 10-mer in the positive and negative data sets as features: if an n-gram was present in the sequence, the corresponding feature was set to 1, and otherwise to 0. With these features, an RF was trained to predict the antimicrobial property of 10-mers. To predict a full peptide, several statistics were calculated from the predictions for its 10-mers, and these statistics were used to train a second RF model with 500 trees that decides whether the peptide is an AMP or not.
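A simplified sketch of AmpGram's first stage, restricted to contiguous bigrams for brevity (AmpGram itself also mines discontinuous n-grams and selects them statistically):

```python
def ten_mers(seq: str):
    """Slide a 10-residue window over the peptide (AmpGram's first step)."""
    return [seq[i:i + 10] for i in range(len(seq) - 9)]

def bigram_presence(kmers, vocabulary):
    """Binary presence/absence of each vocabulary n-gram in each 10-mer."""
    rows = []
    for kmer in kmers:
        present = {kmer[i:i + 2] for i in range(len(kmer) - 1)}
        rows.append([1 if g in present else 0 for g in vocabulary])
    return rows

vocab = ["KK", "GK", "FL", "AK"]  # toy vocabulary; AmpGram mines its own n-grams
X = bigram_presence(ten_mers("GIGKFLHSAKKFGKAFVGEIMNS"), vocab)
```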

ADA-based predictors

Recently, Fu et al. proposed a computational approach for identifying anuran AMPs [108]. They designed an autocorrelation function to improve the traditional PSSM profile [109] and used ADA as the classifier. The PSSM is a scoring matrix that captures information about the amino acid residues at each position in a sequence. Although the PSSM profile contains composition, position and evolution information, the traditional PSSM profile cannot express the correlation between residues of the sequence. Therefore, Fu et al. designed an autocorrelation function based on the traditional PSSM profile according to the theory of spatial autocorrelation models:
(16)
(17)
(18)
(19)
(20)
where $n$ indicates the interval between residues, $\mathrm{PSSM}_{i,j}$ indicates the value of the $i$-th row and the $j$-th column in the matrix after processing and $\mathrm{Mean}_i$ is the mean value of the $i$-th row.

Deep learning-based AMP predictors

Deep learning architectures are derived from the simpler ANN. Convolutional neural networks (CNNs) [110] and recurrent neural networks (RNNs) [111] are the deep learning frameworks that have previously been used to identify AMPs [33, 112–114]. Deep learning-based predictors frequently use the original sequences as the input (which are seldom used to train machine learning-based methods), occasionally supplemented with sequence information extracted by third-party software; they then extract features through several neural network layers and finally output classification labels. The recently developed methods based on deep learning architectures are listed in Table 1.

AMP Scanner v.2 [33], a web server based on a long short-term memory (LSTM) architecture [115], a kind of RNN, was the first deep neural network (DNN) model proposed for AMP classification. First, peptide sequences are converted into zero-padded numerical vectors of length 200, with each of the 20 standard amino acids assigned a number from 1 to 20. These vectors are fed into an embedding layer that converts them into fixed-size vector representations. A CNN layer, a max pooling layer and an LSTM layer are then used to further extract features, and finally the label prediction probability is obtained by a fully connected layer with the sigmoid function.
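In Keras, this layer stack can be sketched as follows; the layer sizes below are illustrative assumptions rather than the published hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, VOCAB = 200, 21  # 20 amino acids plus 0 reserved for padding

# Embedding -> Conv1D -> max pooling -> LSTM -> sigmoid output, mirroring
# the layer order described above.
model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(input_dim=VOCAB, output_dim=128),
    layers.Conv1D(64, kernel_size=16, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=5),
    layers.LSTM(100),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```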

APIN [112] is an AMP identification model mainly based on multiple CNNs. First, peptide sequences are converted into numerical vectors of fixed length through an embedding layer, in the same way as in [33]. These vectors are then fed into multiple convolutional layers with different filter lengths to extract sequence features at several scales. After two max-pooling layers, the most important feature vector is selected, and the label prediction probability is obtained by a fully connected layer with the sigmoid function. Of note, APIN integrates two kinds of sequence features (AAC and DPC) to further improve the prediction performance.

Deep-AmPEP30 [113] is another DNN tool, designed for short-length AMP prediction, that uses a deep CNN framework consisting of two convolutional layers, two max-pooling layers and a hidden fully connected layer. Different from [33, 112], Deep-AmPEP30 uses AAC, composition/transition/distribution (CTD) [116], PseAAC [67–69] and pseudo K-tuple reduced amino acid composition (PseKRAAC) [117] as features fed into the DNN structure. Deep-AmPEP30 is publicly available as a web server, which also provides another prediction tool based on RF.

AMPlify [114] is a recently proposed DNN approach for AMP identification. Its architecture includes a bidirectional long short-term memory (Bi-LSTM) [118] layer, a multi-head scaled dot-product attention (MHSDPA) [118] layer and a context attention (CA) [119] layer, and it uses one-hot encoding as the input. According to our review, AMPlify appears to be the first DNN tool for AMP prediction to use the attention mechanism, which helps the model assign different weights to different parts of the input, extract the most critical information and thereby improve accuracy.

Performance evaluation strategies and metrics

Performance evaluation strategies, including the K-fold cross-validation test, the jack-knife validation test and the independent test, are commonly used for evaluating AMP predictors and optimizing parameters. In K-fold cross-validation, the original data are randomly divided into K parts; each time, K − 1 parts are used as the training set while the remaining part is used as the test set. This is repeated K times, and the average of the K accuracies is taken as the performance metric of the model. The jack-knife validation test (also known as leave-one-out cross-validation) is commonly used for models trained on smaller data sets where available data are scarce: for a data set with N samples, the predictor is trained on N − 1 samples and tested on the remaining one, the procedure is repeated N times and the final performance is the average over all N tests. In the independent test, the test data set does not overlap with the training data of the tools; it therefore serves as a uniform validation method for comparing the performance of different methods.
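A minimal scikit-learn sketch of the K-fold procedure on synthetic stand-in data (jack-knife validation is the special case where K equals the sample size):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-ins for peptide feature vectors and AMP/non-AMP labels.
rng = np.random.default_rng(0)
X, y = rng.random((100, 20)), rng.integers(0, 2, 100)

# K-fold cross-validation: every sample is tested exactly once across folds.
accs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(accs))  # average accuracy over the K folds
# Jack-knife (leave-one-out) validation corresponds to n_splits=len(X).
```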

To benchmark and compare the predictive performance of different tools and web servers, six performance metrics are commonly used: sensitivity (Sn), specificity (Sp), precision (Pr), accuracy (Acc), Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic (ROC) curve (AUROC/AUC). Sn, Sp, Pr, Acc and MCC are defined as follows:
$$
\begin{aligned}
\mathrm{Sn} &= \frac{TP}{TP+FN}, \qquad
\mathrm{Sp} = \frac{TN}{TN+FP}, \qquad
\mathrm{Pr} = \frac{TP}{TP+FP}, \\[4pt]
\mathrm{Acc} &= \frac{TP+TN}{TP+TN+FP+FN}, \\[4pt]
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{aligned}
\tag{21}
$$
where TP and TN denote the numbers of correctly predicted positive and negative samples, respectively, while FP and FN denote the numbers of negative and positive samples that are incorrectly predicted, respectively. The MCC value ranges from −1 to 1 and the Acc value from 0 to 1; in both cases, higher values indicate better performance, with 1 corresponding to perfect prediction. The AUC value, calculated from the ROC curve, also ranges from 0 to 1, and the higher the AUC value, the better the prediction performance.
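For reference, the following minimal sketch computes the five metrics of Equation (21) directly from the four confusion-matrix counts; the counts used in the example call are arbitrary toy values.

```python
import math

def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the metrics of Equation (21) from confusion-matrix counts."""
    return {
        "Sn":  tp / (tp + fn),                   # sensitivity
        "Sp":  tn / (tn + fp),                   # specificity
        "Pr":  tp / (tp + fp),                   # precision
        "Acc": (tp + tn) / (tp + tn + fp + fn),  # accuracy
        "MCC": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

print(binary_metrics(tp=120, tn=110, fp=26, fn=16))  # toy counts
```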

Software availability and usability

Alongside the publication of an AMP predictor, a user-friendly web server and/or locally executable software with proper documentation promotes broader research on AMPs and their applications. As part of this survey, we examined the availability and usability of the AMP predictors (Table 1) and found that 21 of the 34 approaches are available as web servers and/or stand-alone software. However, five of the provided server links were no longer accessible. The general functionalities of the currently available tools are discussed below.

Web servers

For users, a primary practical consideration is how sequences are submitted to a web server. While all web servers allow users to predict the AMP status of multiple peptide sequences at a time, some impose different limits on the maximum number of sequences: for instance, the limit for MLAMP is 5, for AmpGram it is 50, for iAMP-2L it is 500 and for AMP Scanner v.2 it is 50 000.

In addition, only a few servers, including CAMP, ClassAMP, MLAMP, AmPEP, Deep-AmPEP30, RF-AmPEP30, AmpGram and AMP Scanner v.2, provide a button to upload a sequence file in FASTA format. Among these servers, AMP Scanner v.2 accepts uploaded files of up to 50 MB.

Another crucial aspect of web server design is how prediction results are delivered back to the user. Some web servers, including MLAMP, can send prediction results to an email address provided by the user. A further important functionality is the possibility to revisit past prediction results via a job ID; however, none of the reviewed web servers provides this feature. All web servers allow the prediction results to be checked online, but only AMP Scanner v.2 allows users to download the results for follow-up analysis.

Stand-alone tools

There are also several stand-alone AMP prediction tools that users can install and run locally, including CS-AMPPred, AmPEP, ampir, APIN, AmpGram and AMPlify. Among them, ampir and AmpGram are user-friendly R packages that come with additional instructions.

Comparison based on the same training data set

Since different tools were trained with different feature descriptors, different feature selection methods and different training sets drawn from different databases, it is difficult to fairly compare machine learning and deep learning methods on the basis of independent data sets alone. Therefore, to further compare and evaluate the performance of different machine learning methods for AMP prediction, we trained the different machine learning and deep learning methods on the same data set.

Data sets

We integrated all AMPs from a set of comprehensive AMP databases and removed peptide sequences longer than 100 or shorter than 10 residues, as well as those containing the non-standard residues 'B', 'J', 'O', 'U', 'X' or 'Z'. We then used the program CD-HIT to filter out sequences sharing 70% or more sequence identity with retained samples, obtaining 10 019 positive samples in total. For the negative data set, we proceeded as described in the Construction of the independent test data set section, except that the CD-HIT identity threshold was set to 70%, and randomly selected 10 019 negative samples.
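As a sketch of the length and residue filtering step, the following snippet uses Biopython with hypothetical file names; the subsequent 70% redundancy reduction is performed externally with CD-HIT.

```python
# Keep peptides of 10-100 residues composed only of the 20 standard amino
# acids; file names are hypothetical, and Biopython is assumed to be installed.
from Bio import SeqIO

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

records = [
    rec for rec in SeqIO.parse("amps_raw.fasta", "fasta")
    if 10 <= len(rec.seq) <= 100 and set(str(rec.seq)) <= STANDARD
]
SeqIO.write(records, "amps_filtered.fasta", "fasta")
```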

Sequence feature generation and machine learning methods

Since peptide sequences cannot be used directly as inputs to machine learning classifiers, they must first be mapped onto numeric feature vectors. iFeature [120] and iLearn [121] are comprehensive Python-based toolkits that calculate and analyze a wide variety of such features, construct machine learning models and evaluate their performance for classification problems involving DNA, RNA or protein sequences. Here, we used the iLearn platform to generate several peptide sequence feature sets: AAC, CKSAAP, DPC, TPC, PAAC, CTD, GDPC, GAAC, GTPC, CKSAAGP, Moran, Geary, NMBroto, CTriad, KSCTriad, SOCNumber, QSOrder, APAAC and PseKRAAC.
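To make the nature of these descriptors concrete, the following minimal sketch computes the simplest of them, AAC, i.e. the frequency of each of the 20 standard residues in a peptide; the other descriptors listed above extend this idea to residue pairs, physicochemical groupings and sequence-order information.

```python
# Minimal sketch of the AAC descriptor: a 20-dimensional frequency vector.
AA = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Return the amino acid composition of a peptide sequence."""
    return [seq.count(a) / len(seq) for a in AA]

print(aac("GIGKFLHSAKKFGKAFVGEIMNS"))  # magainin 2 as an example AMP
```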

Considering the very large number of sequence features and the resulting computational complexity, we used the feature selection methods provided in iLearn (CHI-Square, information gain, F-score, mutual information, recursive feature elimination based on RF (RFE-RF) and Max-Relevance, Min-Redundancy (mRMR)) to reduce the number of features to 100.
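As an illustration of this reduction step, scikit-learn equivalents of two of the six filters are sketched below on stand-in data; this is not the iLearn implementation itself.

```python
# Reduce a feature matrix to the 100 highest-scoring features, shown for the
# CHI-Square and mutual information filters; X and y are stand-in data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=400, random_state=0)
X = X - X.min()  # chi2 requires non-negative values, as composition features are

X_chi = SelectKBest(chi2, k=100).fit_transform(X, y)                # CHI-Square
X_mi = SelectKBest(mutual_info_classif, k=100).fit_transform(X, y)  # mutual information
print(X_chi.shape, X_mi.shape)
```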

Subsequently, 11 machine learning methods were used to build AMP predictors, and their performances were evaluated and compared. These comprised SVM, RF, LR, k-NN, DT, ANN, NB, ADA and XGBoost, all of which had previously been used by the tools reviewed here, plus a gradient boosting decision tree (GBDT) model and an extremely randomized trees (ET) model. The performance of every model was assessed using 5-fold cross-validation, as sketched below.
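A minimal sketch of such a benchmark is given below, using scikit-learn implementations of the listed classifiers (and the xgboost package for XGBoost) on stand-in data; all hyperparameters are left at their defaults purely for illustration.

```python
# Benchmark the listed classifiers under the same 5-fold protocol.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)  # stand-in data
models = {
    "SVM": SVC(), "RF": RandomForestClassifier(), "LR": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(), "DT": DecisionTreeClassifier(),
    "ANN": MLPClassifier(max_iter=500), "NB": GaussianNB(), "ADA": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(), "GBDT": GradientBoostingClassifier(), "ET": ExtraTreesClassifier(),
}
for name, clf in models.items():
    print(f"{name}: {cross_val_score(clf, X, y, cv=5).mean():.3f}")
```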

Results and Discussion

Performance evaluation based on an independent data set

Considering that the training data sets of some reviewed tools are not available, and that several other tools have been updated with expanded training data sets since their first release, there might be some overlap between the data sets used to develop these tools and our independent data set. We therefore downloaded the available training data sets of these tools and removed from our independent data set any peptides they contain, to avoid possible overlaps. We then submitted the independent data set to these tools and collected the prediction results, which are shown in Figure 3. For this evaluation, the parameters of each tool were set to the configuration recommended in the corresponding publication, or to the default values where no recommendation was given. Six metrics, namely AUC, Sn, Sp, Pr, Acc and F-score, have commonly been used to assess such tools; we therefore used the same metrics to evaluate the tools on our independent data set. To illustrate the prediction performance of each tool, ROC curves and the remaining metrics are shown in Figure 4 and Supplementary Table S1 available online at http://bib.oxfordjournals.org/. The results show that amPEPpy achieved the best AUC value (74.2%) among all tools, while ADAM–HMM achieved the highest accuracy and the highest F-score. On the one hand, amPEPpy drew on a range of AMP databases, including APD3 [17], CAMPR3 [18] and LAMP [14], to construct its positive training data set, which was larger than those used by several other approaches; the use of abundant training data is likely one reason why amPEPpy predicted AMPs more accurately than other approaches. In addition, amPEPpy randomly subsampled sequences across five different sequence-length distributions to match the proportions of the AMP data, obtaining a more balanced training set. All of these pre-processing steps contributed to the performance improvement. On the other hand, amPEPpy used the out-of-bag (OOB) error to optimize the parameters of its RF model, further improving predictive performance. ADAM–HMM performs a kind of sequence alignment analysis and thus has a clear advantage in recognizing non-AMPs, which may explain its much higher Sp score compared with amPEPpy. In contrast, ADAM–HMM could not identify AMPs as accurately as some other AMP predictors, such as amPEPpy, achieving a lower Pr score.

We collected AMPs from multiple AMP databases to construct the independent test data set, while removing, as far as possible, samples that might appear in the training sets of the evaluated approaches. However, most current approaches were trained on only one or a few public AMP databases and therefore showed worse AMP prediction performance on our independent data set. Some approaches, such as IAMPE and AmpGram, had high false-positive rates, while others, such as ADAM–HMM, had high false-negative rates.

Performance evaluation and comparison for AMPs with different functional activities

It is well known that AMPs have specific functional activities, and AMPs with specific functions have attracted particular interest and attention from biologists. On the one hand, owing to the limitations of experimental techniques, a large number of AMPs cannot be annotated with specific functions. On the other hand, from a computational perspective, the numbers of AMPs with different functions are highly unbalanced. Predicting the specific functional activities of AMPs is therefore a difficult and challenging task.

Currently, most computational methods are designed to predict whether a peptide is an AMP. Because AMPs with different functions differ in sequence, secondary structure and physicochemical properties, they also differ in how difficult they are to predict. We therefore evaluated the predictive performance of the reviewed approaches and compared their differences across functional categories.

First, we counted the number of AMPs with different functions in the independent data set and summarized the 12 common functional activities in Supplementary Table S2 available online at http://bib.oxfordjournals.org/. For these 12 functional groups, we calculated the Acc values of the different computational approaches and generated bar plots, as shown in Figure 5. For AMPs with anti-Gram-positive, antibacterial, anticancer, antifungal, antitumor or antiviral activities, IAMPE–SVM was the most accurate approach at predicting whether such peptides are AMPs. For AMPs with anti-Gram-negative or antibiofilm activities, IAMPE–XGBoost achieved the best predictive performance. Meanwhile, IAMPE, with all four of its machine learning back-ends, performed well at identifying AMPs with antiparasitic or antiproliferative functions; CAMP–DA, CAMP–SVM and CAMP–RF performed equally well for AMPs with antiproliferative activities. IAMPE–SVM and IAMPE–XGBoost also achieved the best performance for AMPs with insecticidal activities, while AMPfun achieved the highest Acc score for AMPs of the antiangiogenic type.

Figure 5. Bar plots comparing the accuracy of the reviewed computational approaches for AMP identification on 12 common functional types in the independent test data set: (A) anti-Gram-positive, (B) anti-Gram-negative, (C) antiangiogenic, (D) antibacterial, (E) antibiofilm, (F) anticancer, (G) antifungal, (H) antiparasitic, (I) antiproliferative, (J) antitumor, (K) antiviral and (L) insecticidal.

Furthermore, we evaluated the approaches that can predict an AMP's specific functional activity. As before, we calculated their accuracy values for the different functional groups; the results are provided in Supplementary Table S3 available online at http://bib.oxfordjournals.org/. ClassAMP and iAMPpred performed better than the other compared approaches in this task, partly because both predict an AMP's function under the assumption that the input peptide is an AMP. However, although the two approaches predict functional activities accurately (Supplementary Table S3), they do not predict AMPs themselves very well, as reflected in their high false-positive rates and low AUC and Acc values on our independent data set.

Performance evaluation based on six validation data sets of different databases

Since our independent test set was integrated from many AMP databases, it could not reveal how individual methods perform on different databases. To investigate such differences, we tested the approaches on six data sets constructed from six commonly used AMP databases, measuring predictive performance with ROC curves together with AUC, Sn, Sp, Pr, Acc, MCC and F-score. The results are shown in Figure 6 and Supplementary Tables S4–S9 available online at http://bib.oxfordjournals.org/.

Figure 6. ROC curves and the corresponding AUC values of the reviewed computational approaches on the six validation data sets.

Figure 7. Boxplots comparing the accuracy of 11 traditional machine learning methods based on 6 feature selection methods in the 5-fold cross-validation test.

When tested on the validation data set constructed from APD3, AmPEP achieved the largest AUC, while amPEPpy achieved the largest Acc; AMPfun and AMPlify also achieved high AUC values. On the validation data set constructed from CAMP, both RF-AmPEP30 and Deep-AmPEP30 achieved the best performance. On the validation data set constructed from dbAMP, amPEPpy outperformed all other methods, achieving both the highest AUC value and the highest Acc value. On the validation data set built from DRAMP, AMPfun performed best, and it was the only tool with an AUC value above 80%. On the validation data set built from LAMP, ADAM–HMM achieved the highest AUC value and was accordingly the best AMP predictor. Lastly, on the validation data set built from YADAMP, AMPfun again performed better than all other methods.

Performance evaluation for different machine learning methods based on 5-fold cross-validation

Because the reviewed methods are all trained on different training sets, part of the difference in their predictions stems from the training data. We therefore first performed 5-fold cross-validation to compare the traditional machine learning methods on the same data set. The positive data set was collected from multiple AMP databases, with redundant sequences identified and removed using CD-HIT. We then tested the 11 machine learning methods in combination with the six feature selection methods. Boxplots of the accuracy values are shown in Figure 7, and boxplots of the other metrics in Supplementary Figures S2–S6 available online at http://bib.oxfordjournals.org/. With feature selection based on CHI-Square or mutual information, SVM achieved the highest accuracy score. With feature selection based on information gain, mRMR or RFE-RF, RF achieved the highest accuracy score, and with F-score feature selection, XGBoost performed best in terms of accuracy. Regardless of the feature selection method used, RF and XGBoost usually achieved competitive, if not the best, performance for AMP prediction.

Figure 8. Heatmaps for different metrics of the 11 traditional machine learning methods based on the CHI-Square feature selection method. NB, naive Bayes; XGB, XGBoost; GBDT, gradient boosting decision tree.

In addition, we clustered the machine learning methods based on the accuracy, precision, sensitivity, specificity, MCC and F-score of the 5-fold cross-validation test to compare their similarity. The clustering results are shown in Figure 8 and Supplementary Figures S7–S11 available online at http://bib.oxfordjournals.org/. According to these heatmaps, the prediction performances of RF and XGBoost are similar.

Performance evaluation for feature importance

Based on the 5-fold cross-validation results of the previous section, we further compared the contributions of the different features. First, we compared which feature types appeared among the 100 features selected by each feature selection method, and how many of each type were selected. Then, because RF and XGBoost yielded the best performances, we compared the contributions of the selected features to these two methods; the results are shown in Figure 9. For feature selection based on CHI-Square, F-score, information gain and mutual information, PseKRAAC was the most informative feature type, contributing significantly to the predictions of both RF and XGBoost. For mRMR, CKSAAP features were selected most often, but in terms of feature importance for RF and XGBoost, the most informative feature type was PAAC. For RFE-RF, CTDD contributed most to both RF and XGBoost.

Figure 9. A heatmap for the contributions of features based on different feature selection methods, and two contribution distributions of the selected features based on the RF and XGBoost methods.

Furthermore, we selected 10 AMPs from the UniProt database that were manually annotated in 2019 and 2020 and tested them with the different tools (Supplementary Table S10 available online at http://bib.oxfordjournals.org/). IAMPE–RF, IAMPE–SVM, IAMPE–XGBoost and AmpGram predicted these AMPs correctly with only one mistake. Notably, all predictions of ADAM–HMM were wrong, implying that ADAM–HMM struggles in particular with previously uncharacterized AMPs.

How to design effective feature extraction and enhance the representation of the learning samples remains a challenging but important question. In this regard, combining feature engineering and representation learning techniques might provide a useful strategy for AMP representation and identification [11, 12, 24, 36, 39–42]. Based on common sequence-derived features, this paper has provided a preliminary comparison of the prediction results of different traditional machine learning and feature selection methods, together with a preliminary evaluation of the importance of different features. In the future, more in-depth feature extraction, construction and selection, as well as more sophisticated model construction, will be needed for this prediction problem.

Future prospects

Owing to their importance in the immune system, significant progress has been made in the characterization of AMPs in recent years. Simultaneously, with the increasing availability of annotated AMPs, a variety of computational methods for AMP prediction have emerged, employing different traditional machine learning algorithms as well as different feature calculation and selection strategies. However, most current approaches still suffer from high false-positive rates, and there is an urgent need for new and improved methods that address this problem.

Several potentially useful strategies may help improve the predictive performance of machine learning-based methods. First, deep learning has recently emerged as a powerful machine learning technique and has gained popularity in bioinformatics and computational biomedicine. Although several approaches already use deep learning frameworks for AMP identification, their architectures are relatively simple; more comprehensive and more accurate deep learning architectures should be developed in future work. Second, ensemble machine learning has seldom been used in AMP identification. Integrating different machine learning methods can combine their respective advantages and may help to develop more powerful AMP predictors. In addition, predicting the functional activities of AMPs is also an important task. However, only a few approaches are currently available for predicting AMPs and their functional activities, and they cover a limited number of activities with low accuracy. New methods for improved prediction of AMPs' functional activities are therefore expected to be developed in the future.

Conclusions

AMPs, a vital part of the immune system of many organisms, have generated considerable interest, and tremendous research effort has been put into developing new approaches to predict them accurately. In this study, we provided a comprehensive review of these approaches across a wide range of aspects, including the algorithms selected, the feature selection techniques employed, the performance evaluation strategies and web server/tool availability. To obtain a more objective performance evaluation, we constructed an independent test data set to benchmark all tools; based on the assessment results, amPEPpy achieved the best predictive performance. We also constructed six validation test data sets based on commonly used AMP databases to further evaluate these approaches; based on the AUC metric, AMPfun performed well on most validation data sets. In addition, we tested 11 traditional machine learning methods and 6 commonly used feature selection algorithms using 5-fold cross-validation and, based on several metrics, concluded that SVM, RF and XGBoost are particularly effective machine learning methods for AMP prediction.

This study should provide useful guidance to researchers who are interested in developing innovative and even more powerful AMP prediction models. We expect that, in the near future, more accurate AMP prediction methods will be developed and made available to facilitate community-wide efforts for characterizing and discovering new AMPs.

Key Points
  • We provide a comprehensive survey and assessment of tools for the prediction of AMPs.

  • We systematically review and benchmark 34 existing tools with respect to the algorithms employed, chosen feature selection methods, calculated features, performance evaluation strategies and web server/software functionality.

  • All predictors underwent a comprehensive performance assessment based on an independent data set and six validation data sets from six commonly used databases.

  • Eleven traditional machine learning methods, combined with six feature selection strategies, were assessed using the 5-fold cross-validation test on the same data set.

  • This paper provides useful guidance to researchers who are interested in developing innovative and even more powerful AMP prediction models.

Data Availability

The data are available from the authors upon reasonable request.

Conflict of Interest

The authors declare that they have no competing interests.

Funding

National Health and Medical Research Council of Australia (NHMRC) (1144652, 1127948); the Young Scientists Fund of the National Natural Science Foundation of China (31701142); the Australian Research Council (ARC) (LP110200333, DP120104460); a Major Inter-Disciplinary Research (IDR) project awarded by Monash University; the Collaborative Research Program of the Institute for Chemical Research, Kyoto University (2019-32 and 2018-28); the National Natural Science Foundation of China (62072243, 61772273).

Jing Xu received her BSc and MSc degrees from Nankai University, China. She is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her research interests are bioinformatics, computational oncology, machine learning and pattern recognition.

Fuyi Li is currently a research fellow in the Department of Microbiology and Immunology, the Peter Doherty Institute for Infection and Immunity, the University of Melbourne, Australia. His research interests are bioinformatics, computational biology, machine learning and data mining.

André Leier is currently an assistant professor in the Department of Genetics, UAB School of Medicine, USA. He is also an associate scientist in the UAB’s O’Neal Comprehensive Cancer Center and the Gregory Fleming James Cystic Fibrosis Research Center. His research interests are in computational biomedicine, bioengineering, bioinformatics and machine learning.

Dongxu Xiang received his BEng from Northwest A&F University, China. He is currently a research assistant in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. His research interests are bioinformatics, computational biology, machine learning and data mining.

Hsin-Hui Shen received her PhD in physical chemistry from the University of Oxford, UK. She is an NHMRC Career Development Fellow and group leader in the Department of Biochemistry & Molecular Biology and Department of Materials Science & Engineering, Monash University. Her research interests are biophysical chemistry, chemical biology, antibiotics and biotechnology.

Tatiana T. Marquez Lago is currently an associate professor in the Departments of Genetics and Microbiology, UAB School of Medicine, USA. She is also a scientist in the UAB Gregory Fleming James Cystic Fibrosis Research Center. Her research interests are in host–microbiota interactions, evolution of antibiotic resistance, computational systems biology, bioengineering and artificial intelligence.

Jian Li is a professor and group leader in the Monash Biomedicine Discovery Institute and Department of Microbiology, Monash University, Australia. He is a Web of Science 2015–2017 Highly Cited Researcher in Pharmacology & Toxicology. He is currently an NHMRC Principal Research Fellow. His research interests include the pharmacology of polymyxins and the discovery of novel, safer polymyxins.

Dong-Jun Yu received his PhD degree from the Nanjing University of Science and Technology on the subject of pattern recognition and intelligence systems in 2003. He is currently a full professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include pattern recognition, machine learning and bioinformatics.

Jiangning Song is an associate professor in the Monash Biomedicine Discovery Institute, Monash University. He is also affiliated with the Monash Data Futures Institute, Monash University. His research interests include bioinformatics, computational biology, machine learning, data mining and pattern recognition.

References

1. Brahmachary M, Krishnan SP, Koh JL, et al. ANTIMIC: a database of antimicrobial sequences. Nucleic Acids Res 2004;32:D586–9.
2. Seebah S, Suresh A, Zhuo S, et al. Defensins knowledgebase: a manually curated database and information source focused on the defensins family of antimicrobial peptides. Nucleic Acids Res 2007;35:D265–8.
3. Thomas S, Karnik S, Barai RS, et al. CAMP: a useful resource for research on antimicrobial peptides. Nucleic Acids Res 2010;38:D774–80.
4. Nannette YY, Michael RY. Immunocontinuum: perspectives in antimicrobial peptide mechanisms of action and resistance. Protein Pept Lett 2005;12:49–67.
5. Andersson DI, Hughes D, Kubicek-Sutherland JZ. Mechanisms and consequences of bacterial resistance to antimicrobial peptides. Drug Resist Updat 2016;26:43–57.
6. Piotto SP, Sessa L, Concilio S, et al. YADAMP: yet another database of antimicrobial peptides. Int J Antimicrob Agents 2012;39:346–51.
7. Brogden KA. Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat Rev Microbiol 2005;3:238–50.
8. Zasloff M. Antimicrobial peptides of multicellular organisms. Nature 2002;415:389–95.
9. Epand RM, Vogel HJ. Diversity of antimicrobial peptides and their mechanisms of action. Biochim Biophys Acta 1999;1462:11–28.
10. Shai Y, Oren Z. From 'carpet' mechanism to de-novo designed diastereomeric cell-selective antimicrobial peptides. Peptides 2001;22:1629–41.
11. Wei L, Hu J, Li F, et al. Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms. Brief Bioinform 2018;21:106–19.
12. Wei L, Chen Z, Chen H, et al. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018;34:4007–16.
13. Aguilera-Mendoza L, Marrero-Ponce Y, Tellez-Ibarra R, et al. Overlap and diversity in antimicrobial peptide databases: compiling a non-redundant set of sequences. Bioinformatics 2015;31:2553–9.
14. Zhao X, Wu H, Lu H, et al. LAMP: a database linking antimicrobial peptides. PLoS One 2013;8:e66557.
15. Khusro A, Aarti C, Agastian P. Anti-tubercular peptides: a quest of future therapeutic weapon to combat tuberculosis. Asian Pac J Trop Med 2016;9:1023–34.
16. Lande R, Gregorio J, Facchinetti V, et al. Plasmacytoid dendritic cells sense self-DNA coupled with antimicrobial peptide. Nature 2007;449:564–9.
17. Guangshun W, Xia L, Zhe W. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 2016;44:D1087–93.
18. Waghu FH, Barai RS, Gurung P, et al. CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res 2016;44:D1094–7.
19. Xiao X, Wang P, Lin WZ, et al. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 2013;436:168–77.
20. Meher PK, Sahu TK, Saini V, et al. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC. Sci Rep 2017;7:42362.
21. Muller K, Mika S, Ratsch G, et al. An introduction to kernel-based learning algorithms. IEEE Trans Neural Netw 2001;12:181–201.
22. Breiman L. Random forests. Mach Learn 2001;45:5–32.
23. Lv H, Dao F-Y, Guan Z-X, et al. Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method. Brief Bioinform 2020. doi: .
24. Shao L, Gao H, Liu Z, et al. Identification of antioxidant proteins with deep learning from sequence information. Front Pharmacol 2018;9:1036.
25. Nur GM, Stafford NW. Empirical comparison of web-based antimicrobial peptide prediction tools. Bioinformatics 2017;33:1921–9.
26. Lee HT, Lee CC, Yang JR, et al. A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015;2015:475062.
27. Ramos-Martin F, Annaval T, Buchoux S, et al. ADAPTABLE: a comprehensive web platform of antimicrobial peptides tailored to the user's research. Life Sci Alliance 2019;2:e201900512.
28. Jhong JH, Chi YH, Li WC, et al. dbAMP: an integrated resource for exploring antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data. Nucleic Acids Res 2019;47:D285–97.
29. Kang X, Dong F, Shi C, et al. DRAMP 2.0, an updated data repository of antimicrobial peptides. Sci Data 2019;6:148.
30. Fan L, Sun J, Zhou M, et al. DRAMP: a comprehensive data repository of antimicrobial peptides. Sci Rep 2016;6:24482.
31. Ye G, Wu H, Huang J, et al. LAMP2: a major update of the database linking antimicrobial peptides. Database 2020;2020:baaa061.
32. Théolier J, Fliss I, Jean J, et al. MilkAMP: a comprehensive database of antimicrobial peptides of dairy origin. Dairy Sci Technol 2014;94:181–93.
33. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 2018;34:2740–7.
34. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658.
35. Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.
36. Huang Q, Zhang J, Wei L, et al. 6mA-RicePred: a method for identifying DNA N6-methyladenine sites in the rice genome based on feature fusion. Front Plant Sci 2020;11:4.
37. Yu CY, Li XX, Yang H, et al. Assessing the performances of protein function prediction algorithms from the perspectives of identification accuracy and false discovery rate. Int J Mol Sci 2018;19:183.
38. Shen C, Wang Z, Yao X, et al. Comprehensive assessment of nine docking programs on type II kinase inhibitors: prediction accuracy of sampling power, scoring power and screening power. Brief Bioinform 2020;21:282–97.
39. Zhang D, Xu Z-C, Su W, et al. iCarPS: a computational tool for identifying protein carbonylation sites by novel encoded features. Bioinformatics 2020. doi: .
40. Wei L, Su R, Luan S, et al. Iterative feature representations improve N4-methylcytosine site prediction. Bioinformatics 2019;35:4930–7.
41. Wei L, Chen Z, Su R, et al. PEPred-Suite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics 2019;35:4272–80.
42. Liu M-L, Su W, Wang J-S, et al. Predicting preference of transcription factors for methylated DNA using sequence information. Mol Ther Nucleic Acids 2020;22:1043–50.
43. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, et al. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol 2004;22:1302–6.
44. Huang T, Shi XH, Wang P, et al. Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks. PLoS One 2010;5:e10972.
45. Huang T, Cui W, Hu L, et al. Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles. PLoS One 2009;4:e8126.
46. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005;27:1226–38.
47. Yu L, Liu H. Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003). Washington, DC, USA: AAAI Press, 2003, 856–63.
48. Liaw A, Wiener M. Classification and regression by randomForest. R News 2002;2(3):18–22.
49. Fernandes FC, Rigden DJ, Franco OL. Prediction of antimicrobial peptides based on the adaptive neuro-fuzzy inference system application. Biopolymers 2012;98:280–7.
50. Beltran JA, Aguilera-Mendoza L, Brizuela CA. Feature weighting for antimicrobial peptides classification: a multi-objective evolutionary approach. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Kansas City, MO, USA: IEEE, 2017, 276–83.
51. Vishnepolsky B, Pirtskhalava M. Prediction of linear cationic antimicrobial peptides based on characteristics responsible for their interaction with the membranes. J Chem Inf Model 2014;54:1512–23.
52. Fjell CD, Hancock RE, Cherkasov A. AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 2007;23:1148–55.
53. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM, 2016, 785–94.
54. Reimann C, Filzmoser P, Garrett RG, et al. Discriminant analysis (DA) and other knowledge-based classification methods. Stat Data Anal Explain 2008;17:269–80.
55. Quinlan JR. Induction of decision trees. Mach Learn 1986;1:81–106.
56. Friedman N, Dan G, Goldszmidt M. Bayesian network classifiers. Mach Learn 1997;29:131–63.
57. Cabello D, Barro S, Salceda JM, et al. Fuzzy K-nearest neighbor classifiers for ventricular arrhythmia detection. Int J Biomed Comput 1991;27:77–93.
58. Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 2002;35:352–9.
59. Cao Y, Miao QG, Liu JC, et al. Advance and prospects of AdaBoost algorithm. Acta Automatica Sinica 2013;39:745–58.
60. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
61. Thompson JD, Gibson TJ, Higgins DG. Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics 2002;00:2.3.1–22.
62. Eddy S. HMMER: profile HMMs for protein sequence analysis. Bioinformatics 1998;14:755–63.
63. Porto WF, Fernandes FC, Franco OL. An SVM model based on physicochemical properties to predict antimicrobial activity from protein sequences with cysteine knot motifs. In: Advances in Bioinformatics and Computational Biology. Berlin, Heidelberg: Springer, 2010, 59–62.
64. Joseph S, Karnik S, Nilawe P, et al. ClassAMP: a prediction tool for classification of antimicrobial peptides. IEEE/ACM Trans Comput Biol Bioinform 2012;9:1535–8.
65. Porto WF, Pires AS, Franco OL. CS-AMPPred: an updated SVM model for antimicrobial activity prediction in cysteine-stabilized peptides. PLoS One 2012;7:e51444.
66. Niarchou A, Alexandridou A, Athanasiadis E, et al. C-PAmP: large scale analysis and database construction containing high scoring computationally predicted antimicrobial peptides for all the available plant species. PLoS One 2013;8:e79728.
67. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001;43:246–55.
68. Liu B, Liu F, Wang X, et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 2015;43:W65–71.
69. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005;21:10–9.
70. Rondón-Villarreal P, Sierra DA, Torres R. Classification of antimicrobial peptides by using the p-spectrum kernel and support vector machines. In: Advances in Intelligent Systems and Computing. Cham, Switzerland: Springer International Publishing, 2014.
71. Camacho FL, Torres R, Pollán RR. Classification of antimicrobial peptides with imbalanced datasets. In: International Symposium on Medical Information Processing & Analysis. Cuenca, Ecuador: SPIE, 2015, 96810T.
72. Dong-Sheng C, Qing-Song X, Yi-Zeng L. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics 2013;29:960–2.
73. Ruan J, Wang K, Yang J, et al. Highly accurate and consistent method for prediction of helix and strand content from primary protein sequences. Artif Intell Med 2005;35:19–35.
74. Ng XY, Rosdi BA, Shahrudin S. Prediction of antimicrobial peptides based on sequence alignment and support vector machine-pairwise algorithm utilizing LZ-complexity. Biomed Res Int 2015;2015:212715.
75. Muh HC, Tong JC, Tammi MT. AllerHunter: a SVM-pairwise system for assessment of allergenicity and allergic cross-reactivity in proteins. PLoS One 2009;4:e5861.
76. Liao L, Noble WS. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J Comput Biol 2003;10:857–68.
77. Lempel A, Ziv J. On the complexity of finite sequences. IEEE Trans Inf Theory 1976;22:75–81.
78. Zhang Q, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans Evol Comput 2007;11:712–31.
79. Li H, Zhang Q. Multiobjective optimization problems with complicated Pareto sets, MOEA/D and NSGA-II. IEEE Trans Evol Comput 2009;13:284–302.
80. Fingerhut LCHW, Miller DJ, Strugnell JM, et al. ampir: an R package for fast genome-wide prediction of antimicrobial peptides. Bioinformatics 2020;36:5262–3.
81. Kavousi K, Bagheri M, Behrouzi S, et al. IAMPE: NMR-assisted computational prediction of antimicrobial peptides. J Chem Inf Model 2020;60:4691–701.
82. Gull S, Shamim N, Minhas F. AMAP: hierarchical multi-label prediction of biologically active and antimicrobial peptides. Comput Biol Med 2019;107:172–81.
83. Wang P, Hu L, Liu G, et al. Prediction of antimicrobial peptides based on sequence alignment and feature selection methods. PLoS One 2011;6:e18476.
84. Keller JM, Gray MR, Givens JA. A fuzzy K-nearest neighbor algorithm. IEEE Trans Syst Man Cybern 2012;SMC-15:580–5.
85. Torrent M, Andreu D, Nogués VM, et al. Connecting peptide physicochemical and antimicrobial properties by a rational prediction model. PLoS One 2011;6:e16968.
86. Conchillo-Solé O, De Groot N, Avilés F, et al. AGGRESCAN: a server for the prediction of 'hot spots' of aggregation in polypeptides. BMC Bioinform 2007;8:65.
87. Artimo P, Jonnalagedda M, Arnold K, et al. ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res 2012;40:W597–603.
88. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982;157:105–32.
89. Randou EG, Veltri D, Shehu A. Systematic analysis of global features and model building for recognition of antimicrobial peptides. In: 2013 IEEE 3rd International Conference on Computational Advances in Bio and Medical Sciences (ICCABS). New Orleans, LA, USA: IEEE, 2013, 1–6.
90. Akaike H. A new look at the statistical model identification. IEEE Trans Autom Control 1974;19:716–23.
91. Steyerberg EW, Harrell FE Jr, Borsboom GJJM, et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774–81.
92. Bhadra P, Yan J, Li J, et al. AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 2018;8:1697.
93. Dubchak I, Muchnik I, Holbrook SR, et al. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci 1995;92:8700–4.
94. Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2011;16:321–57.
95. Lawrence TJ, Carper DL, Spangler MK, et al. amPEPpy 1.0: a portable and accurate antimicrobial peptide prediction tool. Bioinformatics 2020. doi: .
96. Lin W, Xu D. Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types. Bioinformatics 2016;32:3745–52.
97. Deng JL. Introduction to grey system theory. J Grey Syst 1989;1:1–24.
98. Lin Y, Cai Y, Liu J, et al. An advanced approach to identify antimicrobial peptides and their function types for penaeus through machine learning strategies. BMC Bioinform 2019;20:291.
99. Li YH, Xu JY, Tao L, et al. SVM-Prot 2016: a web-server for machine learning prediction of protein functional families from sequence irrespective of similarity. PLoS One 2016;11:e0155290.
100. Cai CZ, Han LY, Ji ZL, et al. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 2003;31:3692–7.
101. Lijuan G, Ziwei NI, Yi J, et al. Research on imbalanced data classification based on ensemble and under-sampling. J Front Comput Sci Technol 2013;7:630–8.
102. Tsoumakas G, Katakis I. Multi-label classification: an overview. Int J Data Warehous Min 2007;3:1–13.
103. Chung CR, Kuo TR, Wu LC, et al. Characterization and identification of antimicrobial peptides with different functional activities. Brief Bioinform 2019;21:1098–114.
104. Chung CR, Jhong JH, Wang Z, et al. Characterization and identification of natural antimicrobial peptides on different organisms. Int J Mol Sci 2020;21:986.
105. Chou KC. Using pair-coupled amino acid composition to predict protein secondary structure content. J Protein Chem 1999;18:473–80.
106. Pfahringer B, Reutemann P, Witten IH, et al. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009;11:10–8.
107. Burdukiewicz M, Sidorczuk K, Rafacz D, et al. Proteomic screening for prediction and design of antimicrobial peptides with AmpGram. Int J Mol Sci 2020;21:4310.
108. Fu H, Cao Z, Li M, et al. Prediction of anuran antimicrobial peptides using AdaBoost and improved PSSM profiles. In: Proceedings of the Fourth International Conference on Biological Information and Biomedical Engineering. Chengdu, China: Association for Computing Machinery, 2020, 1–6.
109. Zou L, Nan C, Hu F. Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles. Bioinformatics 2013;29:3135–42.
110. Lawrence S, Giles CL. Face recognition: a convolutional neural-network approach. IEEE Trans Neural Netw 1997;8:98–113.
111. Graves A, Mohamed AR, Hinton G. Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, BC, Canada: IEEE, 2013, 6645–9.
112. Su X, Xu J, Yin Y, et al. Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinform 2019;20:730.
113. Yan J, Bhadra P, Li A, et al. Deep-AmPEP30: improve short antimicrobial peptides prediction with deep learning. Mol Ther Nucleic Acids 2020;20:882–94.
114. Li C, Sutherland D, Hammond SA, et al. AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. bioRxiv 2020. doi: .
115. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.
116. Govindan N. Composition, transition and distribution (CTD): a dynamic feature for predictions based on hierarchical structure of cellular sorting. In: 2011 Annual IEEE India Conference. Hyderabad, India: IEEE, 2011, 1–6.
117. Zuo Y, et al. PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics 2016;33:122–4.
118. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process 2002;45:2673–81.
119. Yang Z, Yang D, Dyer C, et al. Hierarchical attention networks for document classification. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, CA, USA: Association for Computational Linguistics, 2017, 1480–9.
120. Chen Z, Zhao P, Li F, et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34:2499–502.
121. Chen Z, Pei Z, Fuyi L, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2020;21:1047–57.
122. Tang B, Iyer A, Rao V, et al. Group-representative functional network estimation from multi-subject fMRI data via MRF-based image segmentation. Comput Methods Programs Biomed 2019;179:104976.
