Chia-Ru Chung, Ting-Rung Kuo, Li-Ching Wu, Tzong-Yi Lee, Jorng-Tzong Horng, Characterization and identification of antimicrobial peptides with different functional activities, Briefings in Bioinformatics, Volume 21, Issue 3, May 2020, Pages 1098–1114, https://doi.org/10.1093/bib/bbz043
Abstract
In recent years, antimicrobial peptides (AMPs) have become an emerging area of focus when developing therapeutics against infections. Importantly, AMPs are produced by virtually all known living organisms and are able to target a wide range of pathogenic microorganisms, including viruses, parasites, bacteria and fungi. Although several studies have proposed different machine learning methods to predict peptides as being AMPs, most do not consider the diversity of AMP activities. On this basis, we specifically investigated the sequence features of AMPs with a range of functional activities, including anti-parasitic, anti-viral, anti-cancer and anti-fungal activities and those that target mammals, Gram-positive and Gram-negative bacteria. A new scheme is proposed to systematically characterize and identify AMPs and their functional activities. The 1st stage of the proposed approach is to identify the AMPs, while the 2nd involves further characterization of their functional activities. Sequential forward selection was employed to extract potentially informative features that are possibly associated with the functional activities of the AMPs. These features include hydrophobicity, the normalized van der Waals volume, polarity, charge and solvent accessibility, all of which are essential attributes in classifying between AMPs and non-AMPs. The results revealed that the 1st stage AMP classifier was able to achieve an area under the receiver operating characteristic curve (AUC) value of 0.9894. During the 2nd stage, we found pseudo amino acid composition to be an informative attribute when differentiating between AMPs in terms of their functional activities.
The independent testing results demonstrated that the AUCs of the multi-class models were 0.7773, 0.9404, 0.8231, 0.8578, 0.8648, 0.8745 and 0.8672 for anti-parasitic, anti-viral, anti-cancer, anti-fungal AMPs and those that target mammals, Gram-positive and Gram-negative bacteria, respectively. The proposed scheme helps facilitate biological experiments related to the functional analysis of AMPs. Additionally, it was implemented as a user-friendly web server (AMPfun, http://fdblab.csie.ncu.edu.tw/AMPfun/index.html) that allows individuals to explore the antimicrobial functions of peptides of interest.
Introduction
The overuse of antibiotics has led to microbial pathogens developing resistance to chemical antibiotics. As a result, there is an urgent need to develop new therapeutics to treat infections. This seemingly intractable problem means that the remarkable benefits of antibiotics are now seriously threatened by the rapid development of antibiotic-resistant bacteria, an issue that has developed into a global crisis [1]. Antimicrobial peptides (AMPs) are a particularly functional group of protein molecules. Most AMPs contain positively charged residues on their hydrophilic side for attaching to the membrane surfaces of microbes [2]. Compared to conventional antibiotics, the short time frame of the interaction between AMPs and microbial membranes promotes rapid microbe death and decreases the probability of resistance development [3].
Reference | Prediction target | Data source (#Pos : #Neg) | CD-HIT parameter | Sequence length | Training features | Classifier | Web server |
---|---|---|---|---|---|---|---|
Bhadra et al. (2018) [6] | AMP | APD3 + CAMP + LAMP/UniProt (3268:166791) | – | – | AAC, PseAAC, CTDD | RF | – |
Veltri et al. (2018) [37] | AMP | APD3/UniProt (1779:1778) | 90% on Pos; 40% on Neg | Larger than 10 | Peptide sequences | DNN | 1 |
Agrawal et al. (2018) [19] | Anti-fungal peptides | DRAMP/SwissProt (1459:1459) | – | Keep same in both Pos and Neg | AAC, DPC, binary profile, mass, charge and pI value of peptide | SVM, RF, DT, Naïve Bayes | 2 |
Wang et al. (2017) [38] | Anti-bacterial, anti-viral, anti-fungal, anti-parasitic, anti-cancer | APD (2222) | – | Between 10 and 60 | AAC, DPC | WKnn, MLR | – |
Meher et al. (2017) [11] | Anti-bacterial, anti-viral, anti-fungal | CAMP + APD3 + AntiBP2 + LAMP + AVPpred/AntiBP2 + AVPpred (3417/739/1496: 984/893/1384) | – | Larger than 10 | AAC, PseAAC, structural, physicochemical | SVM | 3 |
Gabere et al. (2017) [39] | AMP | DRMPD + APD3/UniProt (2260:11300) | 90% | Match 1:5 with Pos | – | – | – |
Manavalan (2017) [21] | ACP | (735:1025) | 90% | Less than 50 | AAC, DPC, PCP, ATC(C,H,O,N,S) | SVM, RF | 4 |
Lin and Xu (2016) [29] | AMP, anti-bacterial, anti-cancer, anti-fungal, anti-HIV, anti-viral | APD/UniProt (879:2405) | – | Between 5 and 100 | PseAAC | RF | 5 |
Chen et al. (2016) [24] | Anti-cancer | APD2/UniProt (138:206) | 90% | – | PseAAC, g-gap DPC | SVM | 6 |
Zare et al. (2015) [23] | Anti-viral | AVPpred (342:312) | 90% | Between 10 and 100 | PseAAC | AdaBoost, RBF, NB, DT, decision stump, REPTree | – |
Tyagi et al. (2013) [22] | Anti-cancer | APD + CAMP + DADP/SwissProt (225:2250) | – | – | AAC, DPC, binary profiles | SVM | 7 |
Xiao et al. (2013) [28] | AMP, anti-bacterial, anti-cancer, anti-viral, anti-fungal, anti-HIV | APD/UniProt (1486:2405) | 40% | Between 5 and 100 | PseAAC | FKNN, multi-label FKNN classifier | 8 |
Khosravian et al. (2013) [26] | Anti-bacterial | APD/AMP (1086:8860) | 90% | – | PseAAC | SVM | – |
Thakur et al. (2012) [34] | Anti-viral | Pubmed and Patent Lens (604:452) | – | – | AAC, motif, physicochemical properties, sequence alignment | SVM | 9 |
Wang et al. (2011) [40] | AMP | CAMP/UniProt (870:9731) | 70% | Same as Pos | AAC, PseAAC, sequence alignment | NNA, mRMR | 10 |
Lata et al. (2010) [27] | Anti-bacterial | APD/SwissProt (999:999) | – | Same as Pos | AAC, binary patterns | SVM | – |
Note: Pos, positive dataset; Neg, negative dataset; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, composition–transition–distribution D descriptor; DPC, dipeptide composition; NB, Naïve Bayes; FKNN, fuzzy K-nearest neighbor; WKnn, weighted K-nearest neighbor algorithm; MLR, multiple linear regression; DNN, deep neural network; REPTree, Reduced-Error Pruning Tree; NNA, nearest neighbor algorithm; mRMR, maximum relevance, minimum redundancy; – means that the information was not provided.
1, https://www.dveltri.com/ascan/v2/ascan.html; 2, https://webs.iiitd.edu.in/raghava/antifp/; 3, http://cabgrid.res.in:8080/amppred/; 4, http://www.thegleelab.org/MLACP.html; 5, http://www.jci-bioinfo.cn/MLAMP; 6, http://lin.uestc.edu.cn/server/iACP; 7, http://crdd.osdd.net/raghava/anticp/; 8, http://www.jci-bioinfo.cn/iAMP-2L; 9, http://crdd.osdd.net/servers/avppred/; 10, http://amp.biosino.org/
AMPs are produced by virtually all organisms on earth as essential components of the innate immune system [4]. As such, they are the 1st line of defense against microbes in many organisms [5–11]. Specifically, AMPs are able to disrupt either the cell membrane of microbes or intracellular functioning to kill the target organism [11]. Consequently, they are able to defend an organism against a wide range of pathogenic microorganisms, including viruses, parasites, bacteria and fungi [5, 12, 13]. Generally, AMPs consist of 10–50 amino acids and show little sequence homology with one another [5, 12, 14]. As a result, identifying AMPs and their activity is quite challenging. With the rapid growth of resistance to chemical antibiotics among microbial pathogens, an urgent need has developed to identify novel therapeutics that are able to treat infections. As such, AMPs are now being taken more seriously and are gaining popularity with respect to their clinical application. Investigations targeting AMPs are emerging to help us gain a full understanding of their mechanisms for new drug development. This means that new AMP-based agents against infectious and inflammatory diseases, as well as anti-tumorigenic drugs, should be within reach [15].
Due to the urgent need to be able to identify AMPs and their functional activities, numerous studies have constructed various databases that contain experimentally validated AMPs and annotations of their functional activities. The 3rd version of the antimicrobial peptide database (APD3) not only includes AMPs and their different functional activities (including anti-bacterial, anti-fungal, anti-viral and anti-cancer) but also has a glossary, specific nomenclature, peptide classification section, search functionality and AMP prediction system [8]. Furthermore, A Database of Antimicrobial Peptides (ADAM) is also available and provides a comprehensive analysis of experimentally validated AMPs linked to their known anti-microbial activities, including activity against Gram-positive bacteria, activity against Gram-negative bacteria and anti-fungal and anti-viral activity [10]. Additionally, the Data Repository of Antimicrobial Peptides (DRAMP) is an open access database containing a diverse set of AMP annotations and their different functional activities, such as activities against Gram-positive and Gram-negative bacteria [16]. In contrast to the aforementioned databases, several others have focused on AMPs of a specific functional activity. ParaPep is one such database; it consists of 519 unique anti-parasitic peptides and provides very useful information related to experimentally validated anti-parasitic peptide sequences and their structures [17]. There is also AVPdb, which is the 1st comprehensive database of anti-viral peptides (AVPs). It provides a dedicated resource of 2683 experimentally verified AVPs that are able to target over 60 medically important viruses [18]. Additionally, there is AntiFP, which has not only released several anti-fungal datasets, but also provided a number of prediction tools [19].
Several databases—such as CancerPPD [20], MLACP [21] and AntiCP [22]—have concentrated on specifically collecting anti-cancer peptides (ACPs). Currently, an integrated resource, dbAMP [2], has been developed for exploring AMPs with functional activities and physicochemical properties on high-throughput data.
To deal with the lack of sequence similarity among AMPs, it is necessary to develop an effective method for identifying AMPs and their functional activity. As listed in Table 1, many studies have developed various automatic tools using a variety of modeling techniques. As part of these studies, the decision tree (DT) [19, 23], random forest (RF) [6, 9, 19, 21, 23], support vector machine (SVM) [9, 11, 19, 21–27], artificial neural network [9, 26] and discriminant analysis [9] approaches have been used to construct classifiers that are capable of differentiating AMPs from non-AMPs. Besides the ability to classify peptides as AMPs and non-AMPs, the ability to predict each AMP’s functional activities has gained increased attention in recent years. For example, iAMP-2L, a two-level multi-label classifier, was developed using the fuzzy K-nearest neighbor (FKNN) approach; its aim was to identify both AMPs and their functional activities, including anti-bacterial, anti-cancer, anti-viral, anti-fungal and anti-human immunodeficiency virus (HIV) [28]. In comparison, Meher et al. [11] adopted an SVM-based method to discriminate between anti-bacterial, anti-fungal and anti-viral peptides and subsequently performed further important feature analysis to pinpoint the critical factors associated with these three AMP functional activities. Moreover, to deal with the unbalanced multi-label problem, Lin and Xu [29] applied the synthetic minority over-sampling technique with the aim of distinguishing some functional activities of AMPs, including anti-bacterial, anti-cancer, anti-viral, anti-fungal and anti-HIV. In addition to the aforementioned machine learning methods, the composition of the feature set is also a critical factor in making precise predictions [6]. It is important to have the proper feature set as this will help identify differences between the various groups.
Among the research published in this field, compositional and physicochemical data, structural properties, sequence order and terminal residue patterns of the AMPs have been effective features for making AMP predictions. For example, amino acid composition (AAC) and dipeptide composition (DPC) are compositional features, and several previous studies have used them when constructing their classifiers [11, 22, 27–30]. As global protein sequence descriptors, the composition, transition and distribution (CTD) patterns of amino acid properties are able to encode the distribution patterns of physical–chemical properties [6, 31]. Furthermore, numerous studies have considered physicochemical properties, including CTD, pseudo amino acid composition (PseAAC) and the AAINDEX [6, 11, 19, 21, 24, 27–29], in AMP prediction.

Table 1 shows that only a limited number of methods have been devoted to constructing multiple classifiers for AMPs with a range of functional activities. Most studies have focused on the identification of AMPs with a specific function, e.g. ACPs. Although several studies have investigated various types of AMPs, they have not yet developed freely accessible tools that allow users to submit their peptides of interest. Therefore, the primary purpose of this study is to design a two-stage framework for identifying AMPs along with their functional activities. Additionally, feature selection analysis is performed to obtain a better understanding of the sequential characteristics of AMPs with respect to seven target activities.
Materials and methods
To help create a useful identification system for AMPs and their functional activities, we developed a flow chart for this study (Figure 1). Detailed explanations of each process described in the chart are provided in the following sections.
Dataset preparation
The positive dataset used to construct the AMP classifier was collected from a range of AMP databases, namely APD3 [8], ADAM [10], ParaPep [17], AVPdb [18], CancerPPD [20], MLACP [21], AntiCP [22], AntiFP [19] and DRAMP [16]. A total of 6766 experimentally validated AMP sequences were obtained from these resources. Additionally, non-AMP data that formed the negative dataset were obtained from AmPEP [6] and consist of a collection of sequences from UniProt [30] with lengths ranging from 5 to 255 amino acid residues. Sequences containing the non-standard residue codes B, J, O, U, X and Z were then filtered out. To reduce any homology bias and redundancy, the CD-HIT program [32] was used to screen the positive dataset with a 50% threshold. Homologous sequences were also removed from the negative dataset according to the same criterion. Subsequently, CD-HIT-2D [32] was used on the two datasets at a threshold of 50%. Finally, 70% of both the positive and negative datasets (18,114 sequences in total) was used to create two training sets, while the remaining 30% of each (7764 sequences in total) was used to create two testing sets. The positive training set contained 1686 AMP sequences, while the positive testing set contained 723.
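The residue filtering and the 70%/30% split described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the random seed and the toy sequences are assumptions made for illustration, and the redundancy removal performed with CD-HIT and CD-HIT-2D is left out.

```python
import random

# Residue codes that the text excludes from the datasets.
EXCLUDED_CODES = set("BJOUXZ")


def filter_sequences(seqs, min_len=5, max_len=255):
    """Keep sequences inside the length range that contain no excluded code."""
    kept = []
    for seq in seqs:
        seq = seq.strip().upper()
        if min_len <= len(seq) <= max_len and not (set(seq) & EXCLUDED_CODES):
            kept.append(seq)
    return kept


def split_70_30(seqs, train_frac=0.7, seed=0):
    """Shuffle a copy of the sequences and cut it into training and testing parts."""
    shuffled = list(seqs)
    random.Random(seed).shuffle(shuffled)  # the seed is an arbitrary choice
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy input: three usable peptides, one containing X, one that is too short.
candidates = [
    "GIGKFLHSAKKFGKAFVGEIMNS",
    "KWKLFKKIGAVLKVL",
    "ACDEFXGHIK",
    "MKT",
    "FLPLIGRVLSGIL",
]
clean = filter_sequences(candidates)
train, test = split_70_30(clean)
```

With 2409 input sequences this cut yields 1686 training and 723 testing sequences, which is consistent with the sizes of the positive sets quoted above.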
Data source and the number of sequences for training set, testing set and independent testing set based on seven AMP functional activities
Activities | Databases | Training set (positive) | Training set (negative) | Testing set (positive) | Testing set (negative) | Independent testing set (positive) | Independent testing set (negative) |
---|---|---|---|---|---|---|---|
Anti-parasitic | APD3 [8], ADAM [10], ParaPep [17] | 140 | 700 | 60 | 1914 | – | – |
Anti-viral | APD3 [8], ADAM [10], AVPdb [18] | 1400 | 2451 | 601 | 1374 | 601 | 1374 |
Anti-cancer | APD3 [8], CancerPPD [20], MLACP [21], AntiCP [22] | 219 | 1095 | 94 | 1881 | 94 | 1881 |
Targeting mammals | APD3 [8] | 215 | 1075 | 93 | 1882 | – | – |
Anti-fungal | APD3 [8], ADAM [10], AntiFP [19] | 1912 | 1261 | 820 | 1155 | 820+; 734# | 1155+; 825# |
Targeting Gram-positive bacteria | APD3 [8], ADAM [10], DRAMP [16] | 1930 | 1624 | 828 | 1147 | 828 | 1147 |
Targeting Gram-negative bacteria | APD3 [8], ADAM [10], DRAMP [16] | 1931 | 1635 | 828 | 1147 | 828 | 1147 |
Note: ‘+’ marks the testing set for the iAMPpred tool; ‘#’ marks the testing set for the AntiFP tool; and ‘–’ means that there is no existing tool and hence no separate independent testing set was needed.
Category | Feature name | Abbreviation | # descriptors |
---|---|---|---|
Binary profiling of position | N-gram binary profiling of position found by counting | NCB | 3^a |
Binary profiling of position | N-gram binary profiling of position found by t-test | NTB | 3^a |
Binary profiling of position | Motifs binary profiling of position | MB | 3^a |
Composition | N-gram composition found by counting | NCC | 1^a |
Composition | N-gram composition found by t-test | NTC | 1^a |
Composition | Motifs composition | MC | 1^a |
Composition | Amino acid composition | AAC | 20 |
Physical–chemical properties | Pseudo amino acid composition | PseAAC | 22 |
Physical–chemical properties | Distribution descriptor of composition transition distribution | CTDD | 105 |
Note: the superscript a means that the number depends on the choice of the number of binary profiling of position or composition found by either counting or t-test.
The construction of class-specific datasets involved a subdivision of the AMP databases. For each functional activity, Table 2 shows the number of sequences forming the positive and negative testing and training datasets, as well as the data source. It should be noted that once a sequence has been annotated with an activity, that sequence is assigned to the positive dataset for that specific activity; otherwise, it is regarded as negative data. For instance, an AMP having activity against parasites and cancer is assigned to the positive anti-parasitic dataset (for building the anti-parasitic classifier) and to the positive anti-cancer dataset; however, it is also assigned to the negative datasets of the remaining activities. On this basis, seven positive and negative datasets were arranged for generation of the class-specific classifiers. Similar to the process described above, these seven positive and negative datasets were randomly divided into training and testing sets made up of 70% and 30% of the sequences, respectively. Furthermore, CD-HIT-2D [32] was applied to the training and testing datasets with a 50% threshold during the development of the class-specific classifiers.
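The assignment rule above amounts to a one-vs-rest labelling of the AMP collection. The following sketch illustrates it under the assumption that a peptide annotated with several activities is positive for each of them; the peptide identifiers and annotations are hypothetical placeholders, not records from the databases used in this study.

```python
ACTIVITIES = [
    "anti-parasitic",
    "anti-viral",
    "anti-cancer",
    "targeting mammals",
    "anti-fungal",
    "targeting Gram-positive bacteria",
    "targeting Gram-negative bacteria",
]


def build_class_datasets(annotations, activities=ACTIVITIES):
    """Return {activity: (positives, negatives)} using a one-vs-rest rule.

    `annotations` maps each peptide to the set of activities reported for it.
    """
    datasets = {}
    for activity in activities:
        positives = [p for p, acts in annotations.items() if activity in acts]
        negatives = [p for p, acts in annotations.items() if activity not in acts]
        datasets[activity] = (positives, negatives)
    return datasets


# Hypothetical annotations for three placeholder peptides.
toy_annotations = {
    "peptide_1": {"anti-parasitic", "anti-cancer"},
    "peptide_2": {"anti-viral"},
    "peptide_3": {"anti-fungal", "targeting Gram-positive bacteria"},
}
class_datasets = build_class_datasets(toy_annotations)
```

Here `peptide_1` appears among the positives for the anti-parasitic and anti-cancer classifiers and among the negatives for the other five.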
Generation of training features
It was extremely important to transform these sequences, which consisted of strings of amino acids, into reasonable and representative numerical values before building the prediction model. As presented in Table 3, we divided the features into three types: binary profiling of amino acid position, composition and physical–chemical properties. The binary profiling of amino acid position included n-gram binary profiling of position as determined by counting (NCB), the same as determined by t-test (NTB) and motif-based binary profiling of position (MB). The n-gram composition by counting (NCC), n-gram composition by t-test (NTC), motifs composition (MC) and AAC were features of the composition category, while PseAAC and the distribution descriptor of CTD (CTDD) were physical–chemical properties.
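As an illustration of how a sequence becomes a numerical vector, the sketch below computes the 20-dimensional AAC feature, i.e. the fraction of each standard amino acid in a peptide. The function name and the alphabetical residue ordering are assumptions made for this example; the other descriptors in Table 3 are not covered.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = seq.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Dividing by the full length means the fractions sum to 1 only when the
    # sequence contains standard residues exclusively.
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


vector = amino_acid_composition("AAKK")
```

For the toy peptide `AAKK`, the entries for A and K are each 0.5 and all other entries are 0.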
In this study, we considered not only uni-grams and bi-grams, but also tri-grams, four-grams and five-grams. Because the number of possible combinations grows rapidly when extending the analysis from uni-grams to five-grams, we adopted two approaches, counting and Welch’s t-test, to remove rare patterns. Counting uses the number of occurrences: n-grams whose frequency in the negative training dataset did not exceed half of the maximum frequency in the positive training dataset were used as features. Welch’s t-test was used to identify n-grams whose occurrences were significantly different between the positive and negative training datasets; such n-grams were retained when they reached a P-value of less than 0.001. The patterns that occurred often in the positive training dataset, but not in the negative training dataset, were retained for this study. The MERCI program [31] was used to extract motifs that were present in the positive training dataset at least k times and appeared in the negative training dataset at most m times. For this study, k = 50 and m = 0. The number of motifs found by MERCI is shown in Table 4.
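The t-test-based filtering of n-grams can be sketched as below, using `scipy.stats.ttest_ind` with `equal_var=False` (Welch's variant) and the 0.001 threshold quoted above. The helper names are hypothetical, the per-sequence occurrence count is assumed to be the tested quantity, and keeping only n-grams that are more frequent in the positive set is one reading of the retention rule rather than a confirmed detail.

```python
from collections import Counter

import numpy as np
from scipy.stats import ttest_ind


def ngram_counts(sequence, n):
    """Count overlapping n-grams in one sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def select_ngrams_by_welch(pos_seqs, neg_seqs, n, alpha=0.001):
    """Keep n-grams whose per-sequence counts differ between the two sets
    (Welch's t-test, P < alpha) and are higher in the positive set."""
    pos_counts = [ngram_counts(s, n) for s in pos_seqs]
    neg_counts = [ngram_counts(s, n) for s in neg_seqs]
    vocabulary = sorted(set().union(*pos_counts, *neg_counts))
    selected = []
    for gram in vocabulary:
        a = np.array([c[gram] for c in pos_counts], dtype=float)
        b = np.array([c[gram] for c in neg_counts], dtype=float)
        if a.std() == 0 and b.std() == 0:
            continue  # t statistic is undefined when both groups are constant
        _, p_value = ttest_ind(a, b, equal_var=False)  # Welch's t-test
        if p_value < alpha and a.mean() > b.mean():
            selected.append(gram)
    return selected


# Toy data (made-up sequences, for illustration only).
pos_seqs = ["KCKCKC", "KCKC"] * 5
neg_seqs = ["GAGA", "GAGAGA"] * 5
selected_bigrams = select_ngrams_by_welch(pos_seqs, neg_seqs, n=2)
```

N-grams that are constant in both groups are skipped in this sketch because the test cannot be computed for them; a real pipeline would need a separate rule for that case.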
Activity | 2-gram | 3-gram | 4-gram | 5-gram | >6-gram | Total |
---|---|---|---|---|---|---|
Anti-parasitic | 0 | 1 | 15 | 11 | 8 | 35 |
Anti-viral | 0 | 0 | 10 | 11 | 41 | 62 |
Anti-cancer | 0 | 2 | 16 | 23 | 40 | 81 |
Targeting mammals | 0 | 1 | 20 | 13 | 30 | 64 |
Anti-fungal | 0 | 0 | 8 | 11 | 31 | 50 |
Targeting Gram-positive bacteria | 0 | 3 | 36 | 15 | 7 | 61 |
Targeting Gram-negative bacteria | 0 | 2 | 31 | 18 | 5 | 56 |
Activity | 2-gram | 3-gram | 4-gram | 5-gram | Total |
---|---|---|---|---|---|
(a) Found by counting | |||||
AMP | 0 | 0 | 0 | 0 | 0 |
Anti-parasitic | 0 | 0 | 22 | 56 | 78 |
Anti-viral | 0 | 0 | 0 | 2 | 2 |
Anti-cancer | 0 | 1 | 6 | 19 | 26 |
Targeting mammals | 0 | 0 | 9 | 5 | 14 |
Anti-fungal | 0 | 0 | 19 | 26 | 45 |
Targeting Gram-positive bacteria | 0 | 0 | 0 | 11 | 11 |
Targeting Gram-negative bacteria | 0 | 0 | 21 | 42 | 63 |
(b) Found by t-test | |||||
AMP | 392 | 816 | 991 | 106 | 2305 |
Anti-parasitic | 22 | 7 | 6 | 3 | 38 |
Anti-viral | 2 | 319 | 444 | 127 | 918 |
Anti-cancer | 9 | 12 | 9 | 1 | 31 |
Targeting mammals | 13 | 13 | 4 | 0 | 30 |
Anti-fungal | 82 | 142 | 101 | 39 | 364 |
Targeting Gram-positive bacteria | 143 | 285 | 175 | 60 | 663 |
Targeting Gram-negative bacteria | 135 | 279 | 168 | 54 | 636 |
Note: AMP, anti-microbial peptides.
Model construction and feature selection
The decision tree (DT) is a commonly used classification method. It has a flow chart-like tree structure in which each internal node represents a test on a feature, each branch represents an outcome of the test and each leaf node represents a class label. In our study, we used Classification and Regression Trees (CART) to build binary DTs for classifying peptides as AMPs or non-AMPs, and to identify their functional activities. CART applies a greedy approach wherein the tree is constructed by top-down recursion. During construction, the algorithm chooses, at each node, the ‘best’ feature with which to split the training tuples. CART uses the Gini index as its splitting criterion. Using the scikit-learn package [33], the ‘DecisionTreeClassifier’ function was adopted to build our DT.
The random forest (RF) method, an ensemble learning method for classification, was implemented to build the prediction model. As this approach combines the results of multiple models, its performance is usually better than that of an individual model. Moreover, the RF is a non-parametric approach that does not make any assumptions about the underlying distribution of the training dataset. It is constructed from multiple DTs, so that the collection of classifiers forms a ‘forest’. In the present study, CART was adopted to build the forest, and bootstrap aggregation (bagging) was applied when sampling the data. The bootstrap method samples the training tuples uniformly with replacement, which means that a selected tuple may be drawn again for the same training set. The ‘RandomForestClassifier’ function, which is available in the scikit-learn package [33] for Python, was adopted to construct the RF model. Additionally, the scikit-learn package [33] provides a function to calculate the importance of input features. More specifically, Gini importance is the decrease in impurity attributable to each feature, averaged across all trees; impurity here measures the randomness of the class labels in the given data.
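As a rough illustration of the two tree-based models, the sketch below fits a `DecisionTreeClassifier` and a `RandomForestClassifier` with the Gini criterion and reads the impurity-based importances from `feature_importances_`. The synthetic data, the number of trees and the random seeds are assumptions made for the example, not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a feature matrix: only feature 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Single CART-style tree using the Gini index as splitting criterion.
dt = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Forest of Gini trees, each grown on a bootstrap sample (bagging).
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            bootstrap=True, random_state=0).fit(X, y)

# Impurity-based (Gini) importance, normalized to sum to 1 over features.
importances = rf.feature_importances_
top_feature = int(np.argmax(importances))
```

On this toy data the informative feature receives by far the largest importance, which is the behaviour that the box plots of feature importance rely on.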
In addition to the aforementioned non-parametric methods, SVM was also used. The SVM is a supervised learning method for classification and regression of linear and nonlinear data. It utilizes nonlinear mapping to transform the training data into a higher dimensional space in which it searches for a linear optimal separating hyper-plane. The svm.SVC function in the scikit-learn package [33] was also used to construct the classifiers.
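A minimal sketch of an SVM classifier built with `sklearn.svm.SVC` is shown below. The RBF kernel and the cost value of 5 follow the setting reported later for AMP identification; the synthetic data, the feature standardization step and `probability=True` are assumptions added for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with a circular class boundary, which is not linearly
# separable in the input space but is handled well by an RBF kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = ((X ** 2).sum(axis=1) < 1.0).astype(int)

# Scaling is an assumption of this sketch; probability=True enables
# predict_proba so that class probabilities can be reported.
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=5, probability=True, random_state=0),
)
svm.fit(X, y)

train_accuracy = svm.score(X, y)
probabilities = svm.predict_proba(X[:3])
```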
A sequential forward selection algorithm was employed to extract the useful features for improving prediction performance. First, one type of feature was used. Subsequently, at every round, a new feature type was added; the process stopped when the performance of the model no longer improved over the previous round or when all features had been used. The result was a well-trained prediction model.
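The greedy procedure can be sketched generically as follows. The function and the scoring callback are hypothetical: in practice `score_fn` would train a classifier on the chosen feature types and return, for example, its cross-validated AUC, whereas the toy scorer here simply adds made-up gains.

```python
def forward_select(groups, score_fn):
    """Greedy sequential forward selection over feature groups.

    groups   : iterable of feature-group names (e.g. feature categories)
    score_fn : callable taking a list of group names and returning a score
    Stops when adding any remaining group no longer improves the score,
    or when every group has been used.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(groups)
    while remaining:
        # Score every candidate extension of the current selection.
        scored = [(score_fn(selected + [g]), g) for g in remaining]
        round_score, round_group = max(scored)
        if round_score <= best_score:
            break  # no improvement over the previous round
        selected.append(round_group)
        remaining.remove(round_group)
        best_score = round_score
    return selected, best_score


# Toy scorer with made-up additive gains (illustration only).
gains = {"A": 0.3, "B": 0.2, "C": -0.1}
chosen, score = forward_select(gains, lambda gs: sum(gains[g] for g in gs))
```

With these gains the procedure picks "A", then "B", and stops because adding "C" would lower the score.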
Metrics of model performance
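The tables in this paper report sensitivity (SN), specificity (SP), accuracy (ACC), the Matthews correlation coefficient (MCC) and the AUC. As a minimal sketch, assuming the standard confusion-matrix definitions (the function name is illustrative), the first four can be computed from the numbers of true positives, true negatives, false positives and false negatives:

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes at least one positive and one negative example, so the
    denominators of SN and SP are non-zero.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


metrics = binary_metrics(tp=8, tn=9, fp=1, fn=2)
```

Unlike ACC, the MCC uses all four cells of the confusion matrix, which makes it more informative on imbalanced datasets.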

Distribution of the peptide sequence lengths of the AMPs and non-AMPs.
10-fold cross validation and independent testing sets
To confirm the robustness of the prediction model, 10-fold cross-validation was performed on the training set. Furthermore, to compare the proposed model with existing tools, various independent testing sets were created for use with iAMPpred [28], AVPpred [34] and AntiFP [19]. These three systems include the latest research on predicting specific functional activities of AMPs, and each provides a web server for researchers. iAMPpred is a predictor that identifies the anti-bacterial, anti-viral and anti-fungal functional activities of AMPs. As iAMPpred only considers a general anti-bacterial functional activity, rather than distinguishing between Gram-positive and Gram-negative targets, it was necessary to filter the Gram-positive sequences from the Gram-negative testing set and vice versa. As the AntiFP web server only accepts sequences longer than 15 amino acids, any peptide sequences of 15 amino acids or fewer were eliminated from the independent testing set used to assess AntiFP. It should be noted that no additional procedures were needed when creating the independent testing sets for iAMPpred, MLACP and AVPpred when comparing the anti-fungal, anti-cancer and anti-viral classifiers. The number of sequences in each class-specific dataset is shown in Table 2.
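A 10-fold cross-validation of this kind can be sketched with scikit-learn as below. The synthetic data, the stratified and shuffled folds, the number of trees and the seeds are assumptions of the example; only the use of 10 folds and the AUC as the score follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training set: feature 0 is a noisy predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Ten stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="roc_auc",
)
mean_auc = auc_scores.mean()
```

The independent testing set is kept out of this loop entirely and is only scored once, with the model refitted on the full training set.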
Development of the web-based prediction tool
A web-based prediction tool was constructed using PHP (Hypertext Preprocessor); predictions are run at the backend upon submission of peptide sequences. When users submit their peptides of interest, they decide whether to predict the functional activities of the submitted peptides. If so, they must choose which functional activities they are interested in; otherwise, the prediction tool only provides the AMP prediction results. The tool accepts both single and multiple peptides. Depending on the user’s selection, it lists the predicted probabilities for one or more types of functional activity: AMPs that target mammals, Gram-positive bacteria and Gram-negative bacteria, as well as anti-parasitic, anti-fungal, anti-cancer and anti-viral AMPs.
Results
Characterization of the sequence-based features of AMPs
As part of this study, a total of 1686 AMP sequences were used to construct the AMP classifier. Based on these sequences, Figure 2 shows the length distribution of the AMPs and non-AMPs. From this analysis, it is clear that AMPs tended to be shorter than non-AMPs. Furthermore, Table 6 shows that the difference in length between these two groups was significant. Bar charts of the AAC of the training set are shown in Figure 3. The amino acids ‘C’, ‘D’, ‘E’, ‘G’, ‘K’ and ‘L’ were present at noticeably different levels in AMPs and non-AMPs. Specifically, the average AACs of ‘C’, ‘G’ and ‘K’ were relatively higher in the AMPs than in the non-AMPs, whereas ‘D’, ‘E’ and ‘L’ occurred relatively less often in the AMPs than in the non-AMPs. Furthermore, Table S1 indicates that the amino acid ‘C’ was correlated with the peptide being an AMP, which means the presence of ‘C’ may be a good indicator for identifying AMPs. The ‘CK’ n-gram was abundant in AMPs and attained the highest correlation in the NTC category. For the PseAAC values, ‘D’ was found to show a relatively strong correlation with AMP identification. A neutral charge (Class 2) on the 1st residue (Charge_C2_001) showed the highest correlation with AMP identification among all CTDD features. Additionally, the Pearson correlation coefficients (PCCs) shown in Table S1 indicate that features in the CTDD category tended to show high correlation values. Thus, the peptides’ physicochemical properties might be important for AMP identification.
Comparisons of means and standard deviations (shown in the parentheses) of peptide lengths in the positive and negative datasets based on different AMP functional activities
Classifier | Positive | Negative | P-value |
---|---|---|---|
Stage 1 | |||
AMP | 44.34 (35.67) | 173.07 (35.28) | <0.0001* |
Stage 2 | |||
Anti-parasitic | 29.53 (21.64) | 29.36 (17.70) | 0.5861 |
Anti-viral | 20.80 (14.57) | 47.33 (31.39) | <0.0001* |
Anti-cancer | 24.43 (16.05) | 30.21 (17.99) | <0.0001* |
Targeting mammals | 26.21 (12.95) | 29.98 (17.39) | 0.2388 |
Anti-fungal | 45.82 (31.17) | 37.76 (31.31) | <0.0001* |
Targeting Gram-positive bacteria | 33.95 (28.70) | 47.64 (30.62) | <0.0001* |
Targeting Gram-negative bacteria | 34.15 (28.19) | 47.15 (30.19) | <0.0001* |
Note: * indicates that the test result is significant.

Comparisons of the AAC between the AMP and non-AMP peptides based on seven types of functional activities.
Performances of three different AMP classifiers based on 10-fold cross-validation
Method | SN | SP | ACC | MCC | AUC |
---|---|---|---|---|---|
RF | 0.9488 | 0.9511 | 0.9509 | 0.7706 | 0.9915 |
SVM | 0.9433 | 0.9429 | 0.9430 | 0.7447 | 0.9847 |
DT | 0.8340 | 0.9826 | 0.9687 | 0.8147 | 0.9083 |
Note: RF, random forest; SVM, support vector machine; DT, decision tree; SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
AUC values of RF classifiers trained with the features according to the process of forward feature selection. The number of extracted features in each category is shown in the parentheses
Round | NCB (0) | NTB (6915) | NCC (0) | NTC (2305) | AAC (20) | PseAAC (22) | CTDD (105) |
---|---|---|---|---|---|---|---|
1 | N/A | 0.5000 | N/A | 0.9869 | 0.9792 | 0.9787 | 0.9882 |
2 | N/A | 0.9860 | N/A | 0.9918 | 0.9910 | 0.9914 | – |
3 | N/A | 0.9911 | N/A | – | 0.99215 | 0.99212 | – |
4 | N/A | 0.9914 | N/A | – | – | 0.99225 | – |
5 | N/A | 0.9915 | N/A | – | – | – | – |
Note: N/A indicates that no features were available to build the classifier, and ‘–’ means that the features had already been selected in a previous round. NCB, n-gram binary profiling of position found by counting; NTB, n-gram binary profiling of position found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of composition transition distribution.

Box plots of feature importance (upper) and PCC values (lower) of the selected features.

Distribution of peptide sequence lengths for various AMP functional activities.

Comparisons of the amino acids present in the different AMP functional activities forming the training set.
Performance when classifying AMPs and non-AMPs
Without feature selection, Table 7 shows that the RF model achieved the best AUC among the three methods in 10-fold cross-validation. Optimizations of the parameters used in generating the RF and SVM models are presented in Figures S1 and S2, respectively. For the number of trees in the RF model, we tried different values ranging from 100 to 1000 to determine the value with the best performance. In Figure S1, the optimal number of trees for each functional activity is marked with a red point. Similarly, Figure S2 shows the best cost values for generating the SVM models for the different AMP activities. In addition to the cost-value optimization, the selection of kernel functions for the SVM models, based on 10-fold cross-validation, is presented in Table S2. The radial basis function (RBF) kernel, combined with a cost value of 5, generated the best SVM classifier for AMP identification. It should be noted that the RF model that considered only the n-gram features NTB and NTC yielded an AUC value of 0.9859. Table 8 demonstrates that when the CTDD, NTC, PseAAC and AAC features were incorporated into the model sequentially by the forward selection strategy, the AUC value increased from 0.9915 to 0.9922. Additionally, the MCC increased from 0.7706 to 0.7924, while the number of features in the training set decreased from 9367 to 2452. In comparison, the AUC for the independent testing set using an RF model trained with the 2452 features was 0.9899.
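The search over the number of trees could be reproduced along the following lines with `GridSearchCV`. This is a sketch under assumptions: the step of 100 between 100 and 1000 trees, the fold construction and the helper name `tune_n_trees` are not taken from the paper, and the demonstration call uses a much smaller grid and fewer folds so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def tune_n_trees(X, y, tree_grid=range(100, 1001, 100), folds=10):
    """Pick the number of trees with the best cross-validated AUC."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": list(tree_grid)},
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=folds, shuffle=True, random_state=0),
    )
    search.fit(X, y)
    return search.best_params_["n_estimators"], search.best_score_


# Small synthetic demonstration (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)
best_n, best_auc = tune_n_trees(X, y, tree_grid=(10, 20), folds=3)
```

The same pattern applies to the SVM cost value by swapping the estimator and searching over `C` instead of `n_estimators`.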
Furthermore, based on the highest performing model with the 2452 selected features, we computed the RF feature importance via the scikit-learn package [33]. The feature importance was determined by the Gini index, and the resulting box plots are presented in Figure 4. Among all the AACs, ‘M’, ‘D’, ‘E’ and ‘C’ occurred at relatively high frequencies. The importance of amino acids ‘D’, ‘E’ and ‘C’ was found to be very different between the AMPs and non-AMPs based on the training set. Additionally, the importance of ‘E’ and ‘M’ was relatively high among all PseAACs. Furthermore, several features in the CTDD category, including charge, hydrophobicity, normalized van der Waals volume, secondary structure, solvent accessibility, polarizability and polarity, tended to have higher importance than others. The PCCs computed on the training set were also analyzed to examine the feature importance. As shown in Figure 4, the PCCs of the CTDD category were higher than those of other feature categories. In a previous study, AmPEP [6] calculated the PCCs but used only 23 features, which resulted in the correlations being lower than 0.5 when training the model. Most of the top 10 CTDD features in our study were also among the 23 features used in [6]. Additionally, the PCCs of the top 10 CTDD features were relatively higher than those of other features. The PCC of amino acid ‘C’ was the highest in the AAC category. The PCC values of ‘CK’ and ‘KC’ were the highest in the NTC category. Among the PseAACs, the PCCs of ‘C’ and ‘E’ were the highest. Notably, these amino acids differed more significantly between AMPs and non-AMPs, based on our training set observations. They were also among the top 10 most important features in their feature category. Table S1 presents the top 10 features in each category and additional information concerning these important features.
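The PCC between a single feature and the binary class label can be computed with the textbook formula, as in the sketch below. The function name and the toy values are illustrative; returning 0.0 for a constant feature is a convention chosen here, since the coefficient is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # undefined for a constant vector; 0.0 by convention here
    return cov / (sd_x * sd_y)


# Toy example: fraction of one residue in four peptides versus their labels
# (1 = AMP, 0 = non-AMP). The values are made up.
feature = [0.1, 0.2, 0.0, 0.0]
labels = [1, 1, 0, 0]
r = pearson(feature, labels)
```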
Investigation of the sequence-based features of AMPs that have different functional activities
Figure 5 shows the length distributions of the seven AMP functional activities. Based on the results, the sequence lengths of the anti-viral, anti-cancer and anti-fungal AMPs, and of the AMPs that target Gram-positive and Gram-negative bacteria, differed between the positive and negative datasets. More specifically, the anti-viral and anti-cancer AMPs, and those that target Gram-positive and Gram-negative bacteria, were shorter in the positive dataset than in the negative dataset. In contrast, the anti-fungal AMPs in the positive dataset were longer than those in the negative dataset. Table 6 confirms that these length differences are statistically significant. Figure 6 shows that the average AACs for ‘K’ and ‘L’ were different between the positive and negative training datasets. This difference existed in the anti-parasitic, anti-viral and anti-cancer AMPs, the AMPs that target mammals, and the AMPs that target Gram-positive and Gram-negative bacteria. It is possible to list the amino acids that were relatively different between the positive and negative datasets for each of the seven different targets. Specifically, ‘A’ and ‘S’ were different for the anti-parasitic AMPs; ‘G’ and ‘V’ were different for the anti-viral AMPs; ‘I’ and ‘R’ were different for the anti-cancer AMPs; ‘D’, ‘E’, ‘R’ and ‘Y’ were different for the AMPs that target mammals; ‘C’, ‘D’, ‘E’, ‘W’ and ‘G’ were different for the anti-fungal AMPs; and ‘G’, ‘D’ and ‘E’ were different for AMPs that target Gram-positive and Gram-negative bacteria. Table S3 shows that several features seemed to be relevant to identifying the functional activities of an AMP. Unsurprisingly, the physicochemical property-related features, particularly CTDD and PseAAC, were significant in identifying the AMPs with functional activities; most of the important features involved the physicochemical properties of the peptides.
However, ‘GL’ and ‘GLF’ were both important n-grams for identifying anti-cancer peptides (ACPs), while ‘LP’ was a key feature for identifying AMPs that target mammals.
Activities (# features) | Method | SN | SP | ACC | MCC | AUC |
---|---|---|---|---|---|---|
Anti-parasitic (751) | RF | 0.8030 | 0.8037 | 0.8036 | 0.4935 | 0.8812 |
Anti-parasitic (751) | SVM | 0.6306 | 0.7516 | 0.7298 | 0.3066 | 0.7596 |
Anti-parasitic (751) | DT | 0.5735 | 0.8831 | 0.8321 | 0.4322 | 0.7283 |
Anti-viral (4075) | RF | 0.9028 | 0.9004 | 0.9013 | 0.7915 | 0.9617 |
Anti-viral (4075) | SVM | 0.8609 | 0.8587 | 0.8595 | 0.7053 | 0.9307 |
Anti-viral (4075) | DT | 0.7897 | 0.8749 | 0.8442 | 0.6637 | 0.8323 |
Anti-cancer (699) | RF | 0.7786 | 0.7679 | 0.7695 | 0.4420 | 0.8527 |
Anti-cancer (699) | SVM | 0.7572 | 0.7583 | 0.7580 | 0.4116 | 0.8186 |
Anti-cancer (699) | DT | 0.5319 | 0.8814 | 0.8229 | 0.3962 | 0.7067 |
Targeting mammals (579) | RF | 0.8729 | 0.8736 | 0.8736 | 0.6408 | 0.9265 |
Targeting mammals (579) | SVM | 0.7959 | 0.8029 | 0.8016 | 0.4900 | 0.8888 |
Targeting mammals (579) | DT | 0.6603 | 0.9205 | 0.8767 | 0.5721 | 0.7904 |
Anti-fungal (1983) | RF | 0.8499 | 0.8467 | 0.8486 | 0.6888 | 0.9358 |
Anti-fungal (1983) | SVM | 0.7900 | 0.7882 | 0.7893 | 0.5698 | 0.8719 |
Anti-fungal (1983) | DT | 0.8125 | 0.7253 | 0.7785 | 0.5375 | 0.7689 |
Targeting Gram-positive bacteria (3087) | RF | 0.8853 | 0.8860 | 0.8856 | 0.7698 | 0.9567 |
Targeting Gram-positive bacteria (3087) | SVM | 0.8404 | 0.8389 | 0.8397 | 0.6776 | 0.9122 |
Targeting Gram-positive bacteria (3087) | DT | 0.8176 | 0.7892 | 0.8045 | 0.6063 | 0.8034 |
Targeting Gram-negative bacteria (3167) | RF | 0.8799 | 0.8809 | 0.8803 | 0.7592 | 0.9546 |
Targeting Gram-negative bacteria (3167) | SVM | 0.8388 | 0.8405 | 0.8396 | 0.6775 | 0.9201 |
Targeting Gram-negative bacteria (3167) | DT | 0.8177 | 0.7802 | 0.8003 | 0.5984 | 0.7990 |
Note: RF, random forest; SVM, support vector machine; DT, decision tree; SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
Activities (# features) . | Method . | SN . | SP . | ACC . | MCC . | AUC . |
---|---|---|---|---|---|---|
Anti-parasitic (751) | RF | 0.8030 | 0.8037 | 0.8036 | 0.4935 | 0.8812 |
SVM | 0.6306 | 0.7516 | 0.7298 | 0.3066 | 0.7596 | |
DT | 0.5735 | 0.8831 | 0.8321 | 0.4322 | 0.7283 | |
Anti-viral (4075) | RF | 0.9028 | 0.9004 | 0.9013 | 0.7915 | 0.9617 |
SVM | 0.8609 | 0.8587 | 0.8595 | 0.7053 | 0.9307 | |
DT | 0.7897 | 0.8749 | 0.8442 | 0.6637 | 0.8323 | |
Anti-cancer (699) | RF | 0.7786 | 0.7679 | 0.7695 | 0.4420 | 0.8527 |
SVM | 0.7572 | 0.7583 | 0.7580 | 0.4116 | 0.8186 | |
DT | 0.5319 | 0.8814 | 0.8229 | 0.3962 | 0.7067 | |
Targeting mammals (579) | RF | 0.8729 | 0.8736 | 0.8736 | 0.6408 | 0.9265 |
SVM | 0.7959 | 0.8029 | 0.8016 | 0.4900 | 0.8888 | |
DT | 0.6603 | 0.9205 | 0.8767 | 0.5721 | 0.7904 | |
Anti-fungal (1983) | RF | 0.8499 | 0.8467 | 0.8486 | 0.6888 | 0.9358 |
SVM | 0.7900 | 0.7882 | 0.7893 | 0.5698 | 0.8719 | |
DT | 0.8125 | 0.7253 | 0.7785 | 0.5375 | 0.7689 | |
Targeting Gram-positive bacteria (3087) | RF | 0.8853 | 0.8860 | 0.8856 | 0.7698 | 0.9567 |
SVM | 0.8404 | 0.8389 | 0.8397 | 0.6776 | 0.9122 | |
DT | 0.8176 | 0.7892 | 0.8045 | 0.6063 | 0.8034 | |
Targeting Gram-negative bacteria (3167) | RF | 0.8799 | 0.8809 | 0.8803 | 0.7592 | 0.9546 |
SVM | 0.8388 | 0.8405 | 0.8396 | 0.6775 | 0.9201 | |
DT | 0.8177 | 0.7802 | 0.8003 | 0.5984 | 0.7990 |
Note: RF, random forest; SVM, support vector machine; DT, decision tree; SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
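The threshold-based metrics abbreviated in the note above can be computed from the four confusion-matrix counts. The following is a minimal sketch rather than the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return SN, SP, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on the positive class
    sp = tn / (tn + fp)                    # specificity: recall on the negative class
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, chosen only to illustrate the calculation.
print(binary_metrics(tp=80, fn=20, tn=90, fp=10))
```

Because MCC uses all four counts, it can stay low on a strongly imbalanced dataset even when SN and SP both look reasonable, which is worth keeping in mind when reading the smaller activity classes.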
Performance when identifying AMPs together with their functional activities
Table 9 shows that the RF achieved the best results in the 2nd stage. The performance under different parameter settings is given in Table S4. We adopted the same stopping rule as in Stage 1. Smaller datasets tended to perform well with fewer trees, whereas the larger datasets, with the exception of anti-fungal, needed more trees. Similarly, SVM kernel selection was related to dataset size: the RBF kernel outperformed the polynomial kernel on the larger datasets, whereas the smaller datasets favored the polynomial kernel. Because the datasets for the anti-parasitic and anti-cancer AMPs and for the AMPs targeting mammals were smaller and more imbalanced, their MCCs and AUCs were lower than those of the other groups. By comparison, the AUCs for the anti-viral and anti-fungal AMPs and for those targeting Gram-positive and Gram-negative bacteria were 0.9617, 0.9358, 0.9567 and 0.9546, respectively.
To evaluate the relative importance of the features, we used the RF feature importance calculation provided by the scikit-learn package [33]. The results are shown in Figure 7. All of the binary profiling positional feature importance values were zero. This is because the compositional feature information was already included in the binary profiling positional features. Furthermore, most of the compositional feature importance values tended to be low; this was particularly true for the AAC, which had the lowest values for all seven AMP classifiers. Nevertheless, several compositional features had higher values, which implies that they are likely to be of greater importance in identifying the various functional activities. In comparison, if outliers are disregarded, the physical–chemical property features had the highest importance values of all feature categories, and their importance values clustered at relatively high levels. We also computed the PCC between each feature and the class label; the corresponding box plots are also shown in Figure 7. The correlation between the binary profiling positional features and the AMP label is very weak, which is consistent with their low importance values compared with those of the other features. For the anti-parasitic and anti-cancer datasets, the compositional features showed the highest correlations, especially the NCC and MC features. For the remaining datasets, the physical–chemical property features showed the highest correlations.
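The per-feature correlation described above can be illustrated with a short sketch, assuming PCC denotes the Pearson correlation coefficient between one feature column and the binary class label. The function name `pearson` and the toy values are hypothetical, and returning 0.0 for a constant column is a convention adopted here, since the coefficient is undefined in that case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: coefficient undefined, reported as 0 here
    return cov / math.sqrt(var_x * var_y)


# Toy feature column and labels (1 = positive class, 0 = negative class).
feature = [1, 2, 3, 4]
labels = [0, 0, 1, 1]
print(round(pearson(feature, labels), 4))
```

A feature that is zero for almost every peptide, as sparse positional indicators tend to be, has little variance to correlate with the label, which fits the weak correlations reported for the binary profiling features.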

Box plots of feature importance (upper) and PCC values (lower) for the seven classifiers based on all investigated features.
Table 10. Features selected using the sequential forward selection strategy and the AUC values of the seven classes based on the RF modeling scheme. The number of features used in each category is shown in parentheses.
| Activity | Selected features | AUC |
|---|---|---|
| Anti-parasitic | AAC (20) + NCC (78) + PseAAC (22) | 0.8783 |
| Anti-viral | NTC (918) + AAC (20) + MB (186) + NCB (6) | 0.9692 |
| Anti-cancer | CTDD (105) + NTC (31) + PseAAC (22) + AAC (20) + MC (81) + MB (243) | 0.8518 |
| Targeting mammals | AAC (20) + NTC (30) + NCB (42) + MC (64) + MB (192) + NCC (14) | 0.9392 |
| Anti-fungal | AAC (20) + CTDD (105) + NTC (364) + PseAAC (22) + NCC (45) + MC (50) | 0.9382 |
| Targeting Gram-positive bacteria | AAC (20) + CTDD (105) + NTC (663) + PseAAC (22) + NCB (33) | 0.9583 |
| Targeting Gram-negative bacteria | AAC (20) + CTDD (105) + PseAAC (22) + NTC (636) + MC (56) | 0.9560 |
Note: AUC, area under the receiver operating characteristic curve; NCB, n-gram binary profiling of position found by counting; NTB, n-gram binary profiling of position found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of composition transition distribution.
Our analysis showed that the RF approach performed the best among the three methods used. We therefore applied a forward feature selection algorithm to the RF models; the selected features and their AUCs are shown in Table 10, and the final performance of this approach is shown in Table 11. For the anti-viral classifier, the AUC increased from 0.9617 to 0.9692, which implies that the n-gram-related features might differ between the positive and negative datasets. For the classifier of AMPs targeting mammals, excluding CTDD while including the n-gram- and motif-related features increased the AUC from 0.9265 to 0.9392, which indicates that the n-gram- and motif-related features are key factors in prediction. For the anti-fungal AMP classifier, almost all feature categories were selected, but no large improvement was detected under any specific combination. Finally, for the classifiers of AMPs targeting Gram-positive and Gram-negative bacteria, selecting the n-gram- and motif-related features resulted in better performance.
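The greedy procedure behind sequential forward selection over feature categories can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the stopping rule (stop when no remaining category improves the score by more than `min_gain`) and the scoring values are hypothetical. In practice, `score` would train an RF on the chosen categories and return a cross-validated AUC.

```python
def forward_select(groups, score, min_gain=0.0):
    """Greedily add the feature group that most improves score(selected)."""
    selected = []
    best = float("-inf")
    remaining = list(groups)
    while remaining:
        # Score every candidate extension of the current selection.
        candidates = [(score(selected + [g]), g) for g in remaining]
        top_score, top_group = max(candidates, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break  # assumed stopping rule: no candidate improves the score
        selected.append(top_group)
        remaining.remove(top_group)
        best = top_score
    return selected, best


# Hypothetical additive contributions standing in for a cross-validated AUC.
TOY_GAIN = {"AAC": 0.80, "NCC": 0.05, "PseAAC": 0.03, "CTDD": -0.01}


def toy_score(chosen):
    return sum(TOY_GAIN[g] for g in chosen)


print(forward_select(["CTDD", "AAC", "PseAAC", "NCC"], toy_score))
```

With these toy values, the categories are added in order of their marginal benefit and the one that lowers the score is left out, mirroring how a category such as CTDD can be dropped for some activities.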
Table 11. Performances of seven class-specific classifiers using RF with the selected features
| Activity (# features) | Dataset | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Anti-parasitic (120) | 10-fold CV | 0.7526 | 0.8366 | 0.8202 | 0.4955 | 0.8783 |
| | Testing | 0.6167 | 0.7732 | 0.7685 | 0.1570 | 0.7773 |
| Anti-viral (1130) | 10-fold CV | 0.9109 | 0.9324 | 0.9247 | 0.8382 | 0.9692 |
| | Testing | 0.9085 | 0.8406 | 0.8613 | 0.7075 | 0.9404 |
| Anti-cancer (263) | 10-fold CV | 0.7673 | 0.7888 | 0.7855 | 0.4507 | 0.8518 |
| | Testing | 0.7766 | 0.7060 | 0.7094 | 0.2208 | 0.8231 |
| Targeting mammals (362) | 10-fold CV | 0.8677 | 0.8893 | 0.8853 | 0.6620 | 0.9392 |
| | Testing | 0.7849 | 0.8045 | 0.8035 | 0.2998 | 0.8648 |
| Anti-fungal (606) | 10-fold CV | 0.8573 | 0.8553 | 0.8565 | 0.7050 | 0.9382 |
| | Testing | 0.8561 | 0.6675 | 0.7458 | 0.5186 | 0.8578 |
| Targeting Gram-positive bacteria (843) | 10-fold CV | 0.8852 | 0.8848 | 0.8851 | 0.7687 | 0.9583 |
| | Testing | 0.8877 | 0.6373 | 0.7423 | 0.5254 | 0.8745 |
| Targeting Gram-negative bacteria (839) | 10-fold CV | 0.8805 | 0.8815 | 0.8809 | 0.7606 | 0.9560 |
| | Testing | 0.8575 | 0.6574 | 0.7413 | 0.5116 | 0.8672 |
Note: SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
Table 12. Performances of AMPfun and existing AMP prediction tools based on the independent testing set
| Activity | Tool | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Anti-viral | AMPfun | 0.9085 | 0.8406 | 0.8613 | 0.7075 | 0.9404 |
| | iAMPpred [28] | 0.3128 | 0.3959 | 0.3706 | −0.2682 | 0.3158 |
| | AVPpred [34] | 0.2409 | 0.8857 | 0.6901 | 0.1643 | N/A |
| Anti-cancer | AMPfun | 0.7766 | 0.7060 | 0.7094 | 0.2208 | 0.8231 |
| | MLACP [21] | 0.7234 | 0.7512 | 0.7499 | 0.2272 | 0.8320 |
| Anti-fungal | AMPfun | 0.8561 | 0.6675 | 0.7458 | 0.5186 | 0.8578 |
| | iAMPpred [28] | 0.6610 | 0.7212 | 0.6962 | 0.3796 | 0.7412 |
| Anti-fungal | AMPfun | 0.7929 | 0.7455 | 0.7678 | 0.5375 | 0.8448 |
| | AntiFP [19] | 0.6699 | 0.7039 | 0.6879 | 0.3737 | N/A |
| Targeting Gram-positive bacteria | AMPfun | 0.8829 | 0.6282 | 0.7385 | 0.5155 | 0.8653 |
| | iAMPpred [28] | 0.6872 | 0.6541 | 0.6684 | 0.3382 | 0.6950 |
| Targeting Gram-negative bacteria | AMPfun | 0.8563 | 0.6522 | 0.7406 | 0.5086 | 0.8590 |
| | iAMPpred [28] | 0.6932 | 0.6568 | 0.6726 | 0.3469 | 0.6958 |
Note: N/A means that we could not derive the values from the provided results. SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
Comparison with existing AMP prediction tools in terms of predictive performance
AMPfun was further compared with related prediction tools, namely iAMPpred, AVPpred, MLACP and AntiFP. The performance for the various functional activities on the independent testing set is given in Table 12, and the corresponding ROC curves are presented in Figure 8. The accuracy of AMPfun on the anti-viral class was higher than that of iAMPpred and AVPpred. Overall, AMPfun achieved much higher MCC values for most functional activities; the exception was the anti-cancer class, where its MCC was only 0.0064 lower than that of MLACP. Based on the AUC values, AMPfun again performed better for the anti-viral functional activity, exceeding iAMPpred by 0.6246. Similarly, for the anti-fungal AMPs and those that target Gram-positive and Gram-negative bacteria, our AUCs were 0.1166, 0.1703 and 0.1632 higher, respectively, than those of iAMPpred. However, for the anti-cancer model, the AUC of AMPfun was 0.0089 lower than that of MLACP.
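For readers comparing the AUC columns, the quantity can be understood as the probability that a randomly chosen positive peptide receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; it is illustrative only (the function name `roc_auc` and the toy scores are hypothetical), and a library routine would be preferable for datasets of realistic size.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy labels and prediction scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))                # ranking as given
print(roc_auc(labels, [-s for s in scores]))  # same scores with the ranking reversed
```

Under this definition, an AUC below 0.5 means that a tool ranks negatives above positives more often than not on that test set, so reversing its ranking would give the complementary value.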
AMPfun web interface
Based on our method, an online prediction server—AMPfun (http://fdblab.csie.ncu.edu.tw/AMPfun/index.html)—has been developed to predict the possibility that a peptide sequence might be an AMP with a particular functional activity. The best prediction models developed for anti-parasite, anti-viral, anti-cancer, anti-fungal AMPs, as well as those targeting mammals, and Gram-positive and Gram-negative bacteria were applied here. Screenshots of the website are shown in Figure S3.

The ROC curves of AMPfun, MLACP and iAMPpred when predicting anti-viral AMPs, anti-cancer AMPs, anti-fungal AMPs, AMPs targeting Gram-positive bacteria and AMPs targeting Gram-negative bacteria.
Discussions and conclusion
Multidrug antibiotic resistance is growing rapidly throughout the world. Consequently, many common antibiotics can no longer be used to fight pathogenic bacteria. As such, the identification of AMPs and their functional activities has begun to attract increasing attention. In this study, we constructed a two-stage framework based on machine learning techniques that is able to identify AMPs and determine their functional activities. We adopted a forward feature selection algorithm to illustrate the importance of crucial features when developing the classifiers for different AMP activities. To extract more informative patterns from the peptide sequences, we adopted the n-gram concept to encode the binary profiling of the positional and compositional features. AAC, PseAAC and CTDD were also used as features. Among these, both AAC and PseAAC are commonly used features when separating AMPs from non-AMPs and identifying their functional activities [7, 11, 19, 21, 22, 24, 27–30].
Most previous studies only consider the composition or binary profile of single, double or triple amino acids [11, 22, 27–30]. In our study, we considered amino acid n-grams as a means of encoding the composition and binary profile features of AMPs. We incorporated two approaches, namely counting and t-tests, to investigate the influence of the n-grams on AMPs and their functional activities. However, no n-gram was identified by the counting approach during the 1st stage because of significant differences in sequence length between AMPs and non-AMPs. Put differently, the non-AMPs in the negative dataset were much longer, which seems to have resulted in a more diverse n-gram arrangement. Whatever the frequency of an n-gram, once it appeared in a non-AMP sequence, it was filtered out of the positive dataset. During the feature selection process for the AMP classifier, CTDD was selected first. Specifically, the AMP classifier was able to perform well even when only CTDD was considered. Table S1 shows that all of the properties CTDD considers regarding the 1st residue are important patterns for separating AMPs from non-AMPs. Previous studies have indicated that AMPs are generally positively charged and amphiphilic, but have a large proportion of hydrophobic residues [5, 14]. This suggests that charge is a key factor in separating AMPs from non-AMPs, with terminal residue properties being significantly different between AMPs and non-AMPs [6].
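The counting approach described above, in which an n-gram is discarded as soon as it occurs in any negative sequence, can be sketched as follows. This is a simplified illustration under that reading of the text; the function names and the toy sequences are hypothetical, and the authors' pipeline may apply further frequency thresholds not shown here.

```python
from collections import Counter


def ngrams(seq, n):
    """All overlapping n-grams of a peptide sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def positive_only_ngrams(pos_seqs, neg_seqs, n):
    """Count n-grams in the positive set, dropping any seen in the negative set."""
    seen_in_negatives = {g for s in neg_seqs for g in ngrams(s, n)}
    counts = Counter(g for s in pos_seqs for g in ngrams(s, n))
    return {g: c for g, c in counts.items() if g not in seen_in_negatives}


# Toy sequences, not taken from any dataset.
positives = ["GLFK", "GLFA"]
negatives = ["AGLK"]
print(positive_only_ngrams(positives, negatives, 2))
```

The sketch also shows why long negative sequences are a problem for this filter: the more n-grams the negative set covers, the fewer candidates survive, and with enough coverage none do.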
In the 2nd stage, we used seven class-specific classifiers. In addition to the 1st stage features, we added motifs-based binary profiling positional features and compositional features. As shown in Figure 7, we found that the motifs-related features were relatively less important than other features, but still showed a strong correlation with the anti-parasitic and anti-cancer AMPs, and those targeting mammals. One possible reason for this is that these features need to be combined with others to improve classifier performance. One previous study indicated that charge also plays an important role in the identification of anti-bacterial and anti-fungal peptides [11].
Figures 6 and 7 and Table S3 all show that the features selected may depend on the functional activity of the AMP. For the anti-parasitic classifier, pseudo amino acids ‘A’ and ‘K’, having hydrophobicity of the neutral groups in 25% of the residues, showed high feature importance. For the anti-viral classifier, the charge of neutral groups, secondary structure of the helix groups, solvent accessibility of the buried groups and polarizability and normalized van der Waals volume of the 3rd group had high feature importance. These important features were all associated with the 1st residue, which showed a major difference in position. Separately, the results for the anti-cancer classifier showed that the compositions ‘GL’ and ‘GLF’ and pseudo amino acids ‘C’ and ‘I’ were important. An interesting relation is that, for the classifier of AMPs targeting mammals, the top 10 features by importance also had higher PCCs. The results for the anti-fungal classifier indicated that the pseudo amino acids ‘W’, ‘K’ and ‘λ2’ seemed to be essential factors. For the classifier of AMPs targeting Gram-positive bacteria, a negative charge at 75% and 100% of the residues, a neutral charge on the 1st residue, helix secondary structure associated with the 1st residue and the presence of pseudo amino acids ‘E’ and ‘G’ were the key features. Finally, for the classifier of AMPs targeting Gram-negative bacteria, a negative charge at 50% and 100% of the residues, polar hydrophobicity of the 1st residue and the presence of pseudo amino acids ‘E’ and ‘G’ were important. The distribution of these important pseudo amino acids between the positive and negative training sets was also quite different, which implies that they are important factors in identifying AMP functional activities.
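Phrases such as "the 1st residue" or "75% of the residues" refer to the five positions that a distribution descriptor records for each property group. The sketch below shows one common way to compute them for the charge property. It rests on several assumptions that are not stated in this section: the grouping (K and R as positive, D and E as negative), the floor-based rounding of the quantile index and the toy sequence are all illustrative, and the authors' implementation may differ in these details.

```python
import math

# Assumed charge grouping; all other residues would fall in the neutral group.
CHARGE_GROUPS = {"positive": set("KR"), "negative": set("DE")}


def distribution(seq, group):
    """Relative positions (% of sequence length) of the first, 25%, 50%,
    75% and 100% occurrences of residues belonging to `group`."""
    positions = [i + 1 for i, aa in enumerate(seq) if aa in group]
    if not positions:
        return [0.0] * 5  # group absent from the sequence
    values = []
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Assumed rounding: floor of the fractional count, at least the 1st hit.
        k = max(1, math.floor(frac * len(positions)))
        values.append(100.0 * positions[k - 1] / len(seq))
    return values


peptide = "KAAKDAEK"  # toy sequence
for name, group in CHARGE_GROUPS.items():
    print(name, distribution(peptide, group))
```

Repeating this for three groups of each of seven physicochemical properties would give 7 × 3 × 5 = 105 values, which matches the CTDD feature count listed in Table 10.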
Previous research has indicated that amino acids ‘K’, ‘E’, ‘G’, ‘P’, ‘C’ and ‘I’ are important in classifying anti-bacterial and anti-fungal peptides, and that amino acids ‘R’, ‘K’, ‘W’, ‘S’, ‘T’, ‘P’, ‘H’, ‘C’ and ‘I’ are essential for identifying AVPs [11]. As we collected data from a variety of different databases, and the definition of the negative dataset is rigorous, the above findings are highly convincing and solid.
AMPfun was further compared with several AMP prediction models using our class-specific testing set. AMPfun achieved a higher AUC than the other tools. However, for ACPs, the AUC obtained by AMPfun was slightly lower than that of MLACP [21]. Three possible reasons for the better performance of MLACP are as follows: first, we collected data from nine resources containing a range of different functional activities to increase the size and range of our dataset. Second, some tools did not use the PseAAC and CTDD features, which seem to be important when classifying AMPs into certain functional activities. Finally, we used a strict AMP definition when compiling the negative dataset.
This work presents a new scheme for the identification of AMPs and, beyond this, their classification into a wide variety of functional activities. With this scheme, a web server, AMPfun, was developed to screen unknown peptides for AMPs and identify their potential functional activities. Moreover, we have pinpointed some helpful features that assist the classification of short peptides into AMPs and non-AMPs, such as the physical–chemical properties of the peptide’s 1st residue. In conclusion, this study not only constructed classifiers to distinguish the activities of AMPs but also identified key features with respect to different AMP functions. In the future, post-translational modifications [35, 36] should be considered in further investigations of AMP functions.
Although several studies have proposed different machine learning methods to perform antimicrobial peptide (AMP) prediction, most did not consider the diversity of antimicrobial activities. Thus, we specifically investigated the sequential features of AMPs with various functions including anti-parasitic, anti-viral, anti-cancer and anti-fungal AMPs, as well as those that target mammals, Gram-positive and Gram-negative bacteria.
We proposed a two-staged scheme to first identify AMPs, and second, characterize their functional activities. Features—including hydrophobicity, normalized van der Waals volume, polarity, charge and solvent accessibility—were essential to classify peptides as AMPs and non-AMPs; their use enabled the 1st stage AMP classifier to achieve an area under the receiver operating characteristic curve (AUC) value of 0.9894. In the 2nd stage, we found the pseudo amino acid composition was an informative attribute when differentiating between AMPs in terms of their functional activities.
The independent testing results demonstrated that the AUCs of the multi-class models were 0.7773, 0.9404, 0.8231, 0.8578, 0.8648, 0.8745 and 0.8672 with respect to anti-parasitic, anti-viral, anti-cancer, anti-fungal AMPs and those that target mammals, Gram-positive and Gram-negative bacteria, respectively. The proposed scheme was implemented as a web-based tool—AMPfun—which is now free to access at http://fdblab.csie.ncu.edu.tw/AMPfun/index.html.
Author contributions
C.R.C. and T.R.K. carried out the data collection and curation, participated in the bioinformatics analyses and drafted the manuscript. T.R.K. carried out the web tool implementation. C.R.C., L.C.W. and T.Y.L. participated in the design of the study and performed the draft revision. J.T.H. and T.Y.L. conceived of the study, participated in its design and coordination and helped revise the manuscript. All authors read and approved the final manuscript.
Chia-Ru Chung is a PhD candidate in the Department of Computer Science and Information Engineering, National Central University. Her research interests include bioinformatics, statistical inference, big data analytics, data mining and machine learning.
Ting-Rung Kuo obtained her master's degree in the Department of Computer Science and Information Engineering, National Central University. Her research interests include bioinformatics, data mining and machine learning.
Prof. Li-Ching Wu has a PhD in computer science and is a bioinformatics specialist. He joined the Institute of Systems Biology and Bioinformatics, National Central University, as an Assistant Professor in the fall of 2006 and was promoted to Associate Professor in 2011. He is the creator of the biology database PGTdb and also participated in the construction of several databases and tools such as ProSplicer, VirusProbeDB and RgsMiner. His research interests include bioinformatics, systems biology, big data, disease network analysis and machine learning.
Dr. Tzong-Yi Lee is now an associate professor in the School of Life and Health Sciences, Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen. His research interests include bioinformatics, computational biology, systems biology, big data analytics, data mining and machine learning.
Dr. Jorng-Tzong Horng is now a Distinguished Professor in the Department of Computer Science and Information Engineering, National Central University, Taiwan. His research interests include database system, bioinformatics, computational biology, big data analytics, data mining and machine learning.