Abstract

In recent years, antimicrobial peptides (AMPs) have become an emerging area of focus when developing therapeutics against infections. Importantly, AMPs are produced by virtually all known living organisms and are able to target a wide range of pathogenic microorganisms, including viruses, parasites, bacteria and fungi. Although several studies have proposed different machine learning methods to predict peptides as being AMPs, most do not consider the diversity of AMP activities. On this basis, we specifically investigated the sequence features of AMPs with a range of functional activities, including anti-parasitic, anti-viral, anti-cancer and anti-fungal activities and those that target mammals, Gram-positive and Gram-negative bacteria. A new scheme is proposed to systematically characterize and identify AMPs and their functional activities. The 1st stage of the proposed approach is to identify the AMPs, while the 2nd involves further characterization of their functional activities. Sequential forward selection was employed to extract potentially informative features that are possibly associated with the functional activities of the AMPs. These features include hydrophobicity, the normalized van der Waals volume, polarity, charge and solvent accessibility, all of which are essential attributes in distinguishing AMPs from non-AMPs. The results revealed that the 1st stage AMP classifier was able to achieve an area under the receiver operating characteristic curve (AUC) value of 0.9894. During the 2nd stage, we found pseudo amino acid composition to be an informative attribute when differentiating between AMPs in terms of their functional activities. 
The independent testing results demonstrated that the AUCs of the multi-class models were 0.7773, 0.9404, 0.8231, 0.8578, 0.8648, 0.8745 and 0.8672 for anti-parasitic, anti-viral, anti-cancer, anti-fungal AMPs and those that target mammals, Gram-positive and Gram-negative bacteria, respectively. The proposed scheme helps facilitate biological experiments related to the functional analysis of AMPs. Additionally, it was implemented as a user-friendly web server (AMPfun, http://fdblab.csie.ncu.edu.tw/AMPfun/index.html) that allows individuals to explore the antimicrobial functions of peptides of interest.

Introduction

The overuse of antibiotics has led to microbial pathogens developing resistance to chemical antibiotics. As a result, there is an urgent need to develop new therapeutics to treat infections. This seemingly intractable problem means that the remarkable benefits of antibiotics are now seriously threatened by the rapid development of antibiotic-resistant bacteria—an issue that has developed into a global crisis [1]. Antimicrobial peptides (AMPs) are a particularly functional group of protein molecules. Most AMPs contain positively charged residues on their hydrophilic side for attaching to the membrane surfaces of microbes [2]. Compared to conventional antibiotics, the short time frame interaction between AMPs and microbial membranes promotes rapid microbe death and decreases the probability of resistance development [3].

Table 1

Related works of antimicrobial peptide predictions

Reference | Prediction target | Data source (#Pos : #Neg) | CD-HIT parameter | Sequence length | Training features | Classifier | Web server
Bhadra et al. (2018) [6] | AMP | APD3 + CAMP + LAMP/UniProt (3268:166791) | − | − | AAC, PseAAC, CTDD | RF | −
Veltri et al. (2018) [37] | AMP | APD3/UniProt (1779:1778) | 90% on Pos; 40% on Neg | Larger than 10 | Peptide sequences | DNN | 1
Agrawal et al. (2018) [19] | Anti-fungal peptides | DRAMP/SwissProt (1459:1459) | − | Keep same in both Pos and Neg | AAC, DPC, binary profile, mass, charge and pI value of peptide | SVM, RF, DT, Naïve Bayes | 2
Wang et al. (2017) [38] | Anti-bacterial, anti-viral, anti-fungal, anti-parasitic, anti-cancer | APD (2222) | − | Between 10 and 60 | AAC, DPC | WKnn, MLR | −
Meher et al. (2017) [11] | Anti-bacterial, anti-viral, anti-fungal | CAMP + APD3 + AntiBP2 + LAMP + AVPpred/AntiBP2 + AVPpred (3417/739/1496 : 984/893/1384) | − | Larger than 10 | AAC, PseAAC, structural, physicochemical | SVM | 3
Gabere et al. (2017) [39] | AMP | DRMPD + APD3/UniProt (2260:11300) | 90% | Match 1:5 with Pos | − | − | −
Manavalan (2017) [21] | ACP | (735:1025) | 90% | Less than 50 | AAC, DPC, PCP, ATC(C,H,O,N,S) | SVM, RF | 4
Lin and Xu (2016) [29] | AMP, anti-bacterial, anti-cancer, anti-fungal, anti-HIV, anti-viral | APD/UniProt (879:2405) | − | Between 5 and 100 | PseAAC | RF | 5
Chen et al. (2016) [24] | Anti-cancer | APD2/UniProt (138:206) | 90% | − | PseAAC, g-gap DPC | SVM | 6
Zare et al. (2015) [23] | Anti-viral | AVPpred (342:312) | 90% | Between 10 and 100 | PseAAC | AdaBoost, RBF, NB, DT, decision stump, REPTree | −
Tyagi et al. (2013) [22] | Anti-cancer | APD + CAMP + DADP/SwissProt (225:2250) | − | − | AAC, DPC, binary profiles | SVM | 7
Xiao et al. (2013) [28] | AMP, anti-bacterial, anti-cancer, anti-viral, anti-fungal, anti-HIV | APD/UniProt (1486:2405) | 40% | Between 5 and 100 | PseAAC | FKNN, multi-label FKNN classifier | 8
Khosravian et al. (2013) [26] | Anti-bacterial | APD/AMP (1086:8860) | 90% | − | PseAAC | SVM | −
Thakur et al. (2012) [34] | Anti-viral | PubMed and Patent Lens (604:452) | − | − | AAC, motif, physicochemical properties, sequence alignment | SVM | 9
Wang et al. (2011) [40] | AMP | CAMP/UniProt (870:9731) | 70% | Same as Pos | AAC, PseAAC, sequence alignment | NNA, mRMR | 10
Lata et al. (2010) [27] | Anti-bacterial | APD/SwissProt (999:999) | − | Same as Pos | AAC, binary patterns | SVM | −

Note: Pos, positive dataset; Neg, negative dataset; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, composition–transition–distribution D descriptor; DPC, dipeptide composition; NB, Naïve Bayes; FKNN, fuzzy K-nearest neighbor; WKnn, weighted K-nearest neighbor algorithm; MLR, multiple linear regression; DNN, deep neural network; REPTree, Reduced-Error Pruning Tree; NNA, nearest neighbor algorithm; mRMR, maximum relevance, minimum redundancy; −, not provided.


AMPs are produced by virtually all organisms on earth as essential components of the innate immune system [4]. As such, they are the 1st line of defense against microbes in many organisms [5–11]. Specifically, AMPs are able to disrupt either the cell membrane of microbes or intracellular functioning to kill the target organism [11]. Consequently, they are able to defend an organism against a wide range of pathogenic microorganisms, including viruses, parasites, bacteria and fungi [5, 12, 13]. Generally, AMPs consist of 10–50 amino acids and show little sequence homology with one another [5, 12, 14]. As a result, identifying AMPs and their activity is quite challenging. With the rapid growth of resistance to chemical antibiotics among microbial pathogens, there is an urgent need to identify novel therapeutics that are able to treat infections. As such, AMPs are now being taken more seriously and are gaining popularity with respect to their clinical application. Investigations targeting AMPs are emerging to help us gain a full understanding of their mechanisms for new drug development. This means that new AMP-based agents against infectious and inflammatory diseases, as well as anti-tumorigenic drugs, should be within reach [15].

Due to the urgent need to identify AMPs and their functional activities, numerous studies have constructed databases that contain experimentally validated AMPs and annotations of their functional activities. The 3rd version of the antimicrobial peptide database (APD3) not only includes AMPs and their different functional activities, including anti-bacterial, anti-fungal, anti-viral and anti-cancer, but also has a glossary, specific nomenclature, peptide classification section, search functionality and AMP prediction system [8]. Furthermore, A Database of Antimicrobial Peptides (ADAM) provides a comprehensive analysis of experimentally validated AMPs linked to their known anti-microbial activities, including activity against Gram-positive bacteria, activity against Gram-negative bacteria and anti-fungal and anti-viral activity [10]. Additionally, the Data Repository of Antimicrobial Peptides (DRAMP) is an open access database containing a diverse set of AMP annotations and their different functional activities, such as activities against Gram-positive and Gram-negative bacteria [16]. In contrast to the aforementioned databases, several others have focused on AMPs of a specific functional activity. ParaPep is one such database; it consists of 519 unique anti-parasitic peptides and provides useful information related to experimentally validated anti-parasitic peptide sequences and their structures [17]. There is also AVPdb, the 1st comprehensive database of anti-viral peptides (AVPs), a dedicated resource of 2683 experimentally verified AVPs that are able to target over 60 medically important viruses [18]. Additionally, there is AntiFP, which has not only released several anti-fungal datasets, but also provided a number of prediction tools [19]. 
Several databases—such as CancerPPD [20], MLACP [21] and AntiCP [22]—have concentrated on specifically collecting anti-cancer peptides (ACPs). Currently, an integrated resource, dbAMP [2], has been developed for exploring AMPs with functional activities and physicochemical properties on high-throughput data.

To deal with the lack of sequence similarity among AMPs, there is a necessity to develop an effective method for identifying AMPs and their functional activity. As listed in Table 1, many studies have developed various automatic tools using a variety of modeling techniques. As part of these studies, the decision tree (DT) [19, 23], random forest (RF) [6, 9, 19, 21, 23], support vector machine (SVM) [9, 11, 19, 21–27], artificial neural network [9, 26] and discriminant analysis [9] approaches have been used to construct classifiers that are capable of differentiating AMPs from non-AMPs. Besides the ability to classify peptides as AMPs and non-AMPs, the ability to predict each AMP’s functional activities has gained increased attention in recent years. For example, iAMP-2L—a two-level multi-label classifier—was developed using the fuzzy K-nearest neighbor (FKNN) approach; its aim was to identify both AMPs and their functional activities, including anti-bacterial, anti-cancer, anti-viral, anti-fungal and anti-human immunodeficiency virus (HIV) [28]. In comparison, Meher et al. [11] adopted an SVM-based method to discriminate between anti-bacterial, anti-fungal and AVPs and subsequently performed further important feature analysis to pinpoint the critical factors associated with these three AMP functional activities. Moreover, to deal with the unbalanced multi-label problem, Lin and Xu [29] applied the synthetic minority over-sampling technique with the aim of distinguishing some functional activities of AMPs—including anti-bacterial, anti-cancer, anti-viral, anti-fungal and anti-HIV. In addition to the aforementioned machine learning methods, the composition of the feature set is also a critical factor in making precise predictions [6]. It is important to have the proper feature set as this will help identify differences between the various groups. 
Among the research published in this field, compositional and physicochemical data, structural properties, sequence order and terminal residue patterns of the AMPs have been effective features for making AMP predictions. For example, amino acid composition (AAC) and dipeptide composition (DPC) are compositional features and several previous studies have used them as features when constructing their classifiers [11, 22, 27–30]. As global protein sequence descriptors, the composition, transition and distribution (CTD) patterns of amino acid properties are able to encode the distribution patterns of physical–chemical properties [6, 31]. Furthermore, numerous studies have considered physicochemical properties, including CTD, pseudo amino acid composition (PseAAC) and the AAINDEX [6, 11, 19, 21, 24, 27–29], in the AMP prediction.

Figure 1

Flow chart of this study.

Table 1 shows that only a limited number of methods have been devoted to constructing multiple classifiers for AMPs with a range of functional activities. Most studies have focused on the identification of AMPs with a specific function, e.g. ACPs. Although several studies have investigated various types of AMPs, they have not yet developed freely accessible tools that allow users to submit their peptides of interest. Therefore, the primary purpose of this study is to design a two-stage framework for identifying AMPs along with their functional activities. Additionally, feature selection analysis is performed to obtain a better understanding of the sequential characteristics of AMPs with respect to seven target activities.

Materials and methods

To help create a useful identification system for AMPs and their functional activities, we developed a flow chart for this study (Figure 1). Detailed explanations of each process described in the chart are provided in the following sections.

Dataset preparation

The positive dataset used to construct the AMP classifier was collected from a range of AMP databases, namely APD3 [8], ADAM [10], ParaPep [17], AVPdb [18], CancerPPD [20], MLACP [21], AntiCP [22], AntiFP [19] and DRAMP [16]. A total of 6766 experimentally validated AMP sequences were obtained from these resources. Additionally, non-AMP data that formed the negative dataset were obtained from AmPEP [6] and consist of a collection of sequences from UniProt [30] with lengths ranging from 5 to 255 amino acid residues. Sequences containing the non-standard amino acid codes B, J, O, U, X and Z were then filtered out. To reduce homology bias and redundancy, the CD-HIT program [32] was used to screen the positive dataset with a 50% threshold. Homologous sequences were also removed from the negative dataset according to the same criterion. Subsequently, CD-HIT-2D [32] was used on the two datasets at a threshold of 50%. Finally, 70% of both the positive and negative datasets (18,114 sequences in total) was used to create two training sets, while the remaining 30% of each (7764 sequences in total) was used to create two testing sets. The positive training set contained 1686 AMP sequences, while the positive testing set contained 723.
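
The filtering and 70/30 partition described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the random seed and the toy sequences are hypothetical, the length window of 5 to 255 residues (stated in the text for the negative data) is applied to every sequence for simplicity, and the CD-HIT redundancy steps are not reproduced.

```python
import random

# The 20 standard amino acids; any other letter (e.g. B, J, O, U, X, Z) is rejected.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def keep_sequence(seq, min_len=5, max_len=255):
    """Return True if seq has an allowed length and only standard residues."""
    return min_len <= len(seq) <= max_len and set(seq.upper()) <= STANDARD_RESIDUES


def split_train_test(sequences, train_fraction=0.7, seed=0):
    """Shuffle a copy of the sequences and split it into training and testing lists."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)  # hypothetical fixed seed, for repeatability
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy, made-up sequences for illustration only.
candidates = [
    "KKWKIVVIKWKK", "GIGKFLHSAK", "ACDXEFGH", "AUCDEFGH", "ACD", "LLGDFFRKSKEK",
    "FLPIIAKLLG", "GLFDIVKKVV", "RWKIFKKIEK", "KWKLFKKIGA", "ILPWKWPWWPWRR",
    "GWGSFFKKAAHV",
]
clean = [s for s in candidates if keep_sequence(s)]
train, test = split_train_test(clean)
print(len(clean), len(train), len(test))  # 9 6 3
```

In this sketch the split is applied to one list at a time, so it would be called once for the positive data and once for the negative data to obtain the two training sets and two testing sets.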

Table 2

Data source and the number of sequences for training set, testing set and independent testing set based on seven AMP functional activities

Activity | Databases | Training set (positive) | Training set (negative) | Testing set (positive) | Testing set (negative) | Independent testing set (positive) | Independent testing set (negative)
Anti-parasitic | APD3 [8], ADAM [10], ParaPep [17] | 140 | 700 | 60 | 1914 | – | –
Anti-viral | APD3 [8], ADAM [10], AVPdb [18] | 1400 | 2451 | 601 | 1374 | 601 | 1374
Anti-cancer | APD3 [8], CancerPPD [20], MLACP [21], AntiCP [22] | 219 | 1095 | 94 | 1881 | 94 | 1881
Targeting mammals | APD3 [8] | 215 | 1075 | 93 | 1882 | – | –
Anti-fungal | APD3 [8], ADAM [10], AntiFP [19] | 1912 | 1261 | 820 | 1155 | 820+; 734# | 1155+; 825#
Targeting Gram-positive bacteria | APD3 [8], ADAM [10], DRAMP [16] | 1930 | 1624 | 828 | 1147 | 828 | 1147
Targeting Gram-negative bacteria | APD3 [8], ADAM [10], DRAMP [16] | 1931 | 1635 | 828 | 1147 | 828 | 1147

Note: ‘+’ marks the testing set used for the iAMPpred tool; ‘#’ marks the testing set used for the AntiFP tool; and ‘–’ means that there is no existing tool, so no separate independent testing set was needed.


Table 3

Overview of the features analyzed in this investigation

Category | Feature name | Abbreviation | # descriptor
Binary profiling of position | N-gram binary profiling of position found by counting | NCB | 3a
Binary profiling of position | N-gram binary profiling of position found by t-test | NTB | 3a
Binary profiling of position | Motifs binary profiling of position | MB | 3a
Composition | N-gram composition found by counting | NCC | 1a
Composition | N-gram composition found by t-test | NTC | 1a
Composition | Motifs composition | MC | 1a
Composition | Amino acid composition | AAC | 20
Physical–chemical properties | Pseudo amino acid composition | PseAAC | 22
Physical–chemical properties | Distribution descriptor of composition transition distribution | CTDD | 105

Note: a means that the number depends on the choice of the number of binary profiling of position or composition found by either counting or t-test.


The construction of class-specific datasets involved a subdivision of the AMP databases. For each functional activity, Table 2 shows the number of sequences forming the positive and negative testing and training datasets, as well as the data source. It should be noted that once a sequence has been assigned an activity, that sequence is assigned to the positive dataset for that specific activity; otherwise, it is regarded as negative data. For instance, an AMP peptide having activity against parasites and cancer is assigned to the positive anti-parasite dataset (for building the anti-parasitic classifier); however, it is also assigned to the negative datasets of the remaining activities. On this basis, seven positive and negative datasets are arranged for generation of the class-specific classifiers. Similar to the process described above, these seven positive and negative datasets were randomly divided into training and testing sets made up of 70% and 30% of the sequences, respectively. Furthermore, the CD-HIT-2D [32] was applied on the training and testing datasets with a 50% threshold during the development of the class-specific classifiers.
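
The assignment rule above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the function name and the toy annotations are hypothetical, and the sketch assumes that a peptide with several annotated activities is positive for each of them and negative for all the others. The sub-sampling of negatives visible in Table 2 and the CD-HIT-2D step are not modelled.

```python
ACTIVITIES = [
    "anti-parasitic", "anti-viral", "anti-cancer", "anti-fungal",
    "targeting mammals", "targeting Gram-positive bacteria",
    "targeting Gram-negative bacteria",
]


def build_class_specific_sets(annotations):
    """Map each activity to a (positives, negatives) pair of sequence lists.

    annotations maps a peptide sequence to the set of activities reported for it.
    A peptide is positive for every activity it is annotated with and negative
    for all remaining activities (the reading assumed in this sketch).
    """
    datasets = {activity: ([], []) for activity in ACTIVITIES}
    for seq, activities in annotations.items():
        for activity in ACTIVITIES:
            positives, negatives = datasets[activity]
            (positives if activity in activities else negatives).append(seq)
    return datasets


# Toy, made-up annotations for illustration only.
toy = {
    "KKWKIVVIKWKK": {"anti-parasitic", "anti-cancer"},
    "GIGKFLHSAK": {"anti-viral"},
}
datasets = build_class_specific_sets(toy)
print(datasets["anti-parasitic"])  # (['KKWKIVVIKWKK'], ['GIGKFLHSAK'])
```

Each of the seven resulting pairs would then be divided into training and testing sets in the same 70/30 manner as the AMP data.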

Generation of training features

Before building the prediction model, it was essential to transform the sequences, which consist of strings of amino acids, into representative numerical values. As presented in Table 3, we divided the features into three types: binary profiling of amino acid position, composition and physical–chemical properties. The binary profiling of amino acid position included n-gram binary profiling of position as determined by counting (NCB), the same as determined by t-test (NTB) and motif-based binary profiling of position (MB). The n-gram composition by counting (NCC), n-gram composition by t-test (NTC), motif composition (MC) and AAC were features of the composition category, while PseAAC and the distribution descriptor of CTD (CTDD) were physical–chemical properties.
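
As a concrete example of turning a sequence into numbers, the 20-descriptor AAC feature listed in Table 3 can be sketched as follows. The function name and residue ordering are illustrative assumptions, and the values are expressed as fractions of the sequence length (some tools report percentages instead).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed fixed order of the 20 standard residues


def amino_acid_composition(seq):
    """Return the 20 residue fractions of a non-empty seq, ordered as in AMINO_ACIDS."""
    counts = Counter(seq.upper())
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


aac = amino_acid_composition("KKWKIVVIKWKK")
print(dict(zip(AMINO_ACIDS, aac))["K"])  # 0.5
```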

In this study, we considered not only uni-grams and bi-grams, but also tri-grams, four-grams and five-grams. Because the number of possible combinations grows rapidly when extending the analysis from uni-grams to five-grams, we adopted two approaches, counting and Welch’s t-test, to remove rare patterns. Counting uses the number of occurrences and mainly involves identifying n-grams that do not appear in the negative training dataset but occur in the positive training dataset at a frequency greater than half of the maximum frequency; these were then used as our features. Welch’s t-test was implemented to identify n-grams whose occurrences were significantly different between the positive and negative training datasets; such n-grams were retained when they reached a P-value of less than 0.001. Patterns that occurred often in the positive training dataset but not in the negative training dataset were retained for this study. The MERCI program [31] was used to extract motifs that were present in the positive training dataset at least k times and appeared in the negative training dataset at most m times. For this study, k = 50 and m = 0. The number of motifs found by MERCI is shown in Table 4.
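
The two selection approaches can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names and toy sequences are hypothetical; the counting rule is read as "absent from the negative set and more frequent in the positive set than half of the maximum positive count"; and the t-test is assumed to compare per-sequence occurrence counts. Only the Welch statistic and its degrees of freedom are computed here, because converting them to a P-value needs a t-distribution (for example, `scipy.stats.ttest_ind` with `equal_var=False` performs Welch's test directly).

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def ngrams(seq, n):
    """All overlapping n-grams of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def select_by_counting(positives, negatives, n):
    """Keep n-grams absent from the negatives whose positive count exceeds half the maximum."""
    pos_counts = Counter(g for s in positives for g in ngrams(s, n))
    neg_seen = {g for s in negatives for g in ngrams(s, n)}
    if not pos_counts:
        return []
    threshold = max(pos_counts.values()) / 2
    return sorted(g for g, c in pos_counts.items() if c > threshold and g not in neg_seen)


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Each sample needs at least two values, and at least one sample must have
    non-zero variance.
    """
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df


# Toy, made-up sequences for illustration only.
positives = ["KKWKK", "AKWKA", "WKWKW"]
negatives = ["AAKKA", "GGAKW"]
selected = select_by_counting(positives, negatives, 2)
print(selected)  # ['WK']

# Per-sequence counts of 'WK' in each set, compared with Welch's statistic.
pos_occ = [ngrams(s, 2).count("WK") for s in positives]
neg_occ = [ngrams(s, 2).count("WK") for s in negatives]
t_stat, dof = welch_t(pos_occ, neg_occ)
```

In the toy data, 'KW' is as frequent as 'WK' among the positives but is dropped because it also occurs in a negative sequence.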

Table 4

Number of motifs detected by MERCI in seven AMP functional activities

Activity2-gram3-gram4-gram5-gram>6-gramTotal
Anti-parasitic011511835
Anti-viral0010114162
Anti-cancer0216234081
Targeting mammals0120133064
Anti-fungal008113150
Targeting Gram-positive bacteria033615761
Targeting Gram-negative bacteria023118556
Binary profiles of the N-terminal and C-terminal residues have been used in previous research [29, 30]. Rather than splitting the sequence into N-terminal and C-terminal regions, we used a three-dimensional binary vector to record whether each n-gram found by counting or t-test, and each motif, appears in the first, second or last third of the sequence. For instance, the sequence ‘KKWKIVVIKWKK’ would get the feature vector (1, 0, 1) for the n-gram ‘WK’, because ‘WK’ appears in both the first third (‘KKWK’) and the last third (‘KWKK’) of the residues. Numerous studies have used AACs and DPCs as features [11, 22, 27–30]. Therefore, we also considered the bi-, tri-, four- and five-gram compositions found by counting, t-test and motifs. More specifically, the composition of n-gram i, Com(N-gram_i), is defined in terms of C_i, the number of occurrences of n-gram i in a sequence, and L, the length of that sequence. The number of n-gram related features that formed part of this study is presented in Table 5. Furthermore, the propy 1.0 package [31] was used to determine the values for PseAAC and CTDD, which were then used to describe the physical–chemical properties of the peptides, such as hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure and solvent accessibility. Note that the PseAAC parameters λ = 2 and w = 0.5 were used.
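
The positional binary profile and the n-gram composition can be sketched as follows. The function names are hypothetical. Two details are assumptions of this sketch rather than statements from the text: the thirds are cut at floor(L/3) and floor(2L/3), and the composition is normalized by the number of n-gram windows, L - n + 1, which is one common choice consistent with the symbols C_i and L but is not confirmed by the source.

```python
def count_occurrences(seq, pattern):
    """Number of (possibly overlapping) occurrences of pattern in seq."""
    n = len(pattern)
    return sum(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))


def thirds(seq):
    """Split a sequence into first, middle and last thirds (assumed boundary rounding)."""
    a, b = len(seq) // 3, (2 * len(seq)) // 3
    return seq[:a], seq[a:b], seq[b:]


def binary_position_profile(seq, pattern):
    """Three binary components: does pattern occur in the first, second, last third?"""
    return tuple(int(pattern in part) for part in thirds(seq))


def ngram_composition(seq, pattern):
    """C_i divided by the number of n-gram windows, L - n + 1 (assumed normalization)."""
    windows = len(seq) - len(pattern) + 1
    return count_occurrences(seq, pattern) / windows if windows > 0 else 0.0


profile = binary_position_profile("KKWKIVVIKWKK", "WK")
print(profile)  # (1, 0, 1)
composition = ngram_composition("KKWKIVVIKWKK", "WK")  # 2 occurrences over 11 windows
```

The first call reproduces the (1, 0, 1) example given in the text. Note that a pattern straddling the boundary between two thirds is not detected by this simple split.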
Table 5

Number of n-gram related features

| Activity | 2-gram | 3-gram | 4-gram | 5-gram | Total |
|---|---|---|---|---|---|
| (a) Found by counting | | | | | |
| AMP | 0 | 0 | 0 | 0 | 0 |
| Anti-parasitic | 0 | 0 | 22 | 56 | 78 |
| Anti-viral | 0 | 0 | 0 | 2 | 2 |
| Anti-cancer | 0 | 1 | 6 | 19 | 26 |
| Targeting mammals | 0 | 0 | 9 | 5 | 14 |
| Anti-fungal | 0 | 0 | 19 | 26 | 45 |
| Targeting Gram-positive bacteria | 0 | 0 | 0 | 11 | 11 |
| Targeting Gram-negative bacteria | 0 | 0 | 21 | 42 | 63 |
| (b) Found by t-test | | | | | |
| AMP | 392 | 816 | 991 | 106 | 2305 |
| Anti-parasitic | 2 | 27 | 6 | 3 | 38 |
| Anti-viral | 2319444127† | | | | 918 |
| Anti-cancer | 9 | 12 | 9 | 1 | 31 |
| Targeting mammals | 13 | 13 | 4 | 0 | 30 |
| Anti-fungal | 82 | 142 | 101 | 39 | 364 |
| Targeting Gram-positive bacteria | 143 | 285 | 175 | 60 | 663 |
| Targeting Gram-negative bacteria | 135 | 279 | 168 | 54 | 636 |

Note: AMP, anti-microbial peptides. † The 2-gram to 5-gram counts in this row are run together in the source and cannot be separated unambiguously; only the total is certain.

Model construction and feature selection

The DT is a commonly used classification method. It has a flow-chart-like tree structure in which each internal node represents a test on a feature, each branch represents an outcome of that test and each leaf node represents a class label. In our study, we used Classification and Regression Trees (CART) to build binary DTs for classifying peptides as AMPs or non-AMPs, and for identifying their functional activities. CART applies a greedy approach in which the tree is constructed by top-down recursion. At each step of the construction, the algorithm chooses the ‘best’ feature on which to split the training tuples at the current node, using the Gini index as its selection criterion. The ‘DecisionTreeClassifier’ class of the scikit-learn package [33] was adopted to build our DT.
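As an illustration only, a CART-style tree with the Gini criterion might be built in scikit-learn as sketched below. The toy data, the feature count and the `random_state` value are assumptions made for this example and are not taken from the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical toy data: 40 "peptides" x 5 numeric features, label 1 = AMP.
X = rng.normal(size=(40, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# criterion="gini" makes every binary split minimise the Gini impurity.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

predicted = tree.predict(X[:3])            # class labels for three peptides
probabilities = tree.predict_proba(X[:3])  # per-class probabilities
```

An unpruned tree fits its training data perfectly, which is one reason performance should be judged by cross-validation rather than on the training set.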

The RF method, an ensemble learning method for classification, was also implemented to build the prediction model. As this approach combines the results of multiple models, its performance is usually better than that of an individual model. Moreover, the RF is a non-parametric approach that makes no assumptions about the underlying distribution of the training dataset. It is constructed from multiple DTs, so that the collection of classifiers forms a ‘forest’. In the present study, CART was adopted to build the forest, and bootstrap aggregation (bagging) was applied when sampling the data. The bootstrap method samples the training tuples uniformly with replacement, which means that a tuple that has already been selected may be selected again for the same training set. The ‘RandomForestClassifier’ class, which is available in the scikit-learn package [33] for Python, was adopted to construct the RF model. Additionally, the scikit-learn package [33] provides a way to calculate the importance of input features. More specifically, the Gini importance of a feature is the decrease in impurity attributable to that feature, averaged across all trees, where impurity measures how mixed the class labels are within a node.
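A minimal sketch of a bagged forest and its impurity-based (Gini) importances follows. The toy data are hypothetical and constructed so that only the first feature carries signal; the number of trees and the seed are likewise assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical toy data: only feature 0 determines the label.
X = rng.normal(size=(200, 6))
y = (X[:, 0] > 0).astype(int)

# bootstrap=True draws each tree's training sample with replacement (bagging).
forest = RandomForestClassifier(
    n_estimators=100, criterion="gini", bootstrap=True, random_state=0
)
forest.fit(X, y)

# Impurity-based importances, averaged over trees and normalised to sum to 1.
importances = forest.feature_importances_
top_feature = int(np.argmax(importances))
```

On these data the informative feature should receive by far the largest share of the importance, while the noise features share the small remainder.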

In addition to the aforementioned non-parametric methods, SVM was also used. The SVM is a supervised learning method for the classification and regression of linear and nonlinear data. It uses a nonlinear mapping to transform the training data into a higher-dimensional space, in which it searches for the optimal linear separating hyperplane. The ‘svm.SVC’ class in the scikit-learn package [33] was used to construct the classifiers.
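The sketch below shows how an RBF-kernel classifier could be fitted with `svm.SVC` on data that are not linearly separable in the input space. The toy data and `gamma="scale"` are assumptions; the cost value of 5 mirrors the setting reported in the Results for AMP identification.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical toy data: points inside vs. outside the unit circle.
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# The RBF kernel implicitly maps the data to a higher-dimensional space,
# where a linear separating hyperplane is sought; C is the cost parameter.
svm = SVC(kernel="rbf", C=5.0, gamma="scale")
svm.fit(X, y)

scores = svm.decision_function(X)  # signed distance to the hyperplane
train_accuracy = svm.score(X, y)
```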

A sequential forward selection algorithm was employed to extract the features that were useful for improving prediction performance. The procedure started with a single feature category. At each subsequent round, one further category was added, and the process continued until the performance of the model no longer improved over the previous round or until all categories had been used. The result was a well-trained prediction model.
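The selection loop described above can be sketched as a greedy search over whole feature categories. This is an illustrative reading of the procedure, not the authors' code: the function name, the use of mean cross-validated AUC as the criterion, the 5-fold setting and the toy feature groups are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def forward_select_groups(groups, y, make_model, cv=5):
    """Greedily add whole feature groups while the mean CV AUC keeps improving."""
    selected, best_auc = [], -np.inf
    remaining = list(groups)
    while remaining:
        scores = {}
        for name in remaining:
            # Evaluate the already selected groups plus one candidate group.
            X = np.hstack([groups[g] for g in selected + [name]])
            scores[name] = cross_val_score(
                make_model(), X, y, cv=cv, scoring="roc_auc"
            ).mean()
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_auc:
            break  # no improvement over the previous round
        selected.append(candidate)
        best_auc = scores[candidate]
        remaining.remove(candidate)
    return selected, best_auc


# Hypothetical toy groups: one informative, one pure noise.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 60)
groups = {
    "informative": y[:, None] + rng.normal(scale=0.5, size=(120, 3)),
    "noise": rng.normal(size=(120, 3)),
}
selected, auc = forward_select_groups(
    groups, y, lambda: RandomForestClassifier(n_estimators=50, random_state=0)
)
```

Selecting by category rather than by individual feature keeps the number of model fits small, since each round trains at most one model per remaining category.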

Metrics of model performance

We compared our approach to previous studies according to a number of evaluation metrics, namely sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC). These metrics are defined as follows:

$$\mathrm{SN} = \frac{TP}{TP + FN}$$

$$\mathrm{SP} = \frac{TN}{TN + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP (true positives) is the number of positive samples correctly predicted as positive, TN (true negatives) is the number of negative samples correctly predicted as negative, FP (false positives) is the number of negative samples incorrectly predicted as positive and FN (false negatives) is the number of positive samples incorrectly predicted as negative. In addition to these metrics, the area under the receiver operating characteristic curve (AUC) was also considered. A receiver operating characteristic (ROC) curve is a visual tool for comparing predictive performances; it shows the relationship between the true positive rate and the false positive rate at various threshold settings. The AUC summarizes this curve in a single number and has therefore been widely used to evaluate the overall predictive performance of a variety of prediction tools.
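For concreteness, the four threshold-based metrics can be computed from confusion-matrix counts using their standard definitions, as in the sketch below. The function name and the example counts are hypothetical; returning an MCC of 0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute SN, SP, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (true positive rate)
    sp = tn / (tn + fp)                    # specificity (true negative rate)
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 100 positives and 100 negatives.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)  # SN = 0.90, SP = 0.80, ACC = 0.85, MCC is about 0.70
```

Unlike ACC, the MCC uses all four cells of the confusion matrix symmetrically, which makes it more informative on imbalanced datasets.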
Figure 2

Distribution of the peptide sequence lengths of the AMPs and non-AMPs.

10-fold cross validation and independent testing sets

To confirm the robustness of the prediction model, 10-fold cross validation was performed on the training set. Furthermore, to compare the proposed model with existing tools, separate independent testing sets were created for use with iAMPpred [28], AVPpred [34] and AntiFP [19]. These three systems represent the latest research on predicting specific functional activities of AMPs, and each provides a web server for researchers. iAMPpred is a predictor that identifies the anti-bacterial, anti-viral and anti-fungal functional activities of AMPs. As iAMPpred considers only a general anti-bacterial functional activity, rather than distinguishing between activity against Gram-positive and Gram-negative bacteria, it was necessary to filter the Gram-positive sequences from the Gram-negative testing set and vice versa. As the AntiFP web server accepts only sequences longer than 15 amino acids, any peptide sequences of 15 amino acids or fewer were eliminated from the independent testing set used to assess AntiFP. It should be noted that no additional procedures were needed when creating the independent testing sets for iAMPpred, MLACP and AVPpred when comparing the anti-fungal, anti-cancer and anti-viral classifiers. The number of sequences in each class-specific dataset is shown in Table 2.
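The evaluation protocol can be sketched as follows: a stratified 10-fold cross-validation scored by AUC, and the length filter applied before testing against AntiFP. The toy data, the shuffling seed, the choice of a stratified splitter and the second example peptide are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
# Hypothetical toy training set: 100 samples, 4 weakly informative features.
y = np.array([0, 1] * 50)
X = y[:, None] + rng.normal(size=(100, 4))

# 10-fold cross-validation; stratification keeps class ratios equal per fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_aucs = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="roc_auc",
)
mean_auc = fold_aucs.mean()

# AntiFP accepts only sequences longer than 15 residues, so shorter
# peptides are dropped from that tool's independent testing set.
peptides = ["KKWKIVVIKWKK", "GLFDIIKKIAESFGLFDIIKK"]  # second one is made up
antifp_test = [p for p in peptides if len(p) > 15]
```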

Development of the web-based prediction tool

A web-based prediction tool was built using PHP (Hypertext Preprocessor), which runs at the back end when peptide sequences are submitted. When users submit their peptides of interest, they must decide whether they want to predict the functional activities of those peptides. If so, they choose the functional activities of interest; otherwise, the tool provides only the AMP prediction results. The tool accepts either a single peptide or multiple peptides. Depending on the user’s selection, it lists the predicted probabilities for one or more types of functional activity, including AMPs that target mammals, Gram-positive and Gram-negative bacteria, and anti-parasitic, anti-fungal, anti-cancer and anti-viral AMPs.

Results

Characterization of the sequence-based features of AMPs

As part of this study, a total of 1686 AMP sequences were used to construct the AMP classifier. Based on these sequences, Figure 2 shows the length distribution of the AMPs and non-AMPs. From this analysis, it is clear that AMPs tended to be shorter than non-AMPs. Furthermore, Table 6 shows that the difference in length between these two groups was significant. Bar charts of the AAC of the training set are shown in Figure 3. The amino acids ‘C’, ‘D’, ‘E’, ‘G’, ‘K’ and ‘L’ were present at noticeably different levels in AMPs and non-AMPs. Specifically, the average AACs of ‘C’, ‘G’ and ‘K’ were relatively higher among the AMPs than among the non-AMPs, whereas ‘D’, ‘E’ and ‘L’ occurred relatively less often in the AMPs. Furthermore, Table S1 indicates that the amino acid ‘C’ was correlated with the peptide being an AMP, which means that the presence of ‘C’ may be a good indicator for identifying AMPs. The ‘CK’ n-gram was identified by its abundance in AMPs and attained the highest correlation in the NTC category. For the PseAAC values, ‘D’ was found to show a relatively strong correlation with AMP identification. A neutral charge (Class 2) on the 1st residue (Charge_C2_001) showed the highest correlation with AMP identification among all CTDD features. Additionally, the Pearson correlation coefficients (PCCs) shown in Table S1 indicate that features in the CTDD category tended to show high correlation values. Thus, the peptides’ physicochemical properties might be important for AMP identification.
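The kind of analysis described here, an amino acid composition per peptide followed by a Pearson correlation between one composition feature and the class label, can be sketched as below. The four peptides are invented toy sequences, not data from the study, and the helper name is hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> np.ndarray:
    """Fraction of each of the 20 standard amino acids in the sequence."""
    return np.array([sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS])


# Hypothetical toy peptides: label 1 = AMP, label 0 = non-AMP.
peptides = ["CKCKGGKC", "GKCCKKGL", "DELLDEAL", "LLEDDLEA"]
labels = np.array([1, 1, 0, 0])

aac = np.vstack([amino_acid_composition(p) for p in peptides])

# Pearson correlation between the 'C' composition and the class label.
c_index = AMINO_ACIDS.index("C")
pcc_c = np.corrcoef(aac[:, c_index], labels)[0, 1]
```

In this toy set cysteine occurs only in the positive peptides, so its PCC with the label is strongly positive; on real data the coefficient would be computed per feature across the whole training set.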

Table 6

Comparisons of means and standard deviations (shown in parentheses) of peptide lengths in the positive and negative datasets based on different AMP functional activities

| Classifier | Positive | Negative | P-value |
|---|---|---|---|
| Stage 1 | | | |
| AMP | 44.34 (35.67) | 173.07 (35.28) | <0.0001* |
| Stage 2 | | | |
| Anti-parasitic | 29.53 (21.64) | 29.36 (17.70) | 0.5861 |
| Anti-viral | 20.80 (14.57) | 47.33 (31.39) | <0.0001* |
| Anti-cancer | 24.43 (16.05) | 30.21 (17.99) | <0.0001* |
| Targeting mammals | 26.21 (12.95) | 29.98 (17.39) | 0.2388 |
| Anti-fungal | 45.82 (31.17) | 37.76 (31.31) | <0.0001* |
| Targeting Gram-positive bacteria | 33.95 (28.70) | 47.64 (30.62) | <0.0001* |
| Targeting Gram-negative bacteria | 34.15 (28.19) | 47.15 (30.19) | <0.0001* |

Note: * indicates that the test result is significant.

Figure 3

Comparisons of the AAC between the AMP and non-AMP peptides based on seven types of functional activities.

Table 7

Performances of three different AMP classifiers based on 10-fold cross-validation

| Method | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| RF | 0.9488 | 0.9511 | 0.9509 | 0.7706 | 0.9915 |
| SVM | 0.9433 | 0.9429 | 0.9430 | 0.7447 | 0.9847 |
| DT | 0.8340 | 0.9826 | 0.9687 | 0.8147 | 0.9083 |

Note: RF, random forest; SVM, support vector machine; DT, decision tree; SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.

Table 8

AUC values of RF classifiers trained with the features according to the process of forward feature selection. The number of extracted features in each category is shown in parentheses

| Round | NCB (0) | NTB (6915) | NCC (0) | NTC (2305) | AAC (20) | PseAAC (22) | CTDD (105) |
|---|---|---|---|---|---|---|---|
| 1 | N/A | 0.5000 | N/A | 0.9869 | 0.9792 | 0.9787 | 0.9882 |
| 2 | N/A | 0.9860 | N/A | 0.9918 | 0.9910 | 0.9914 | – |
| 3 | N/A | 0.9911 | N/A | – | 0.99215 | 0.99212 | – |
| 4 | N/A | 0.9914 | N/A | – | – | 0.99225 | – |
| 5 | N/A | 0.9915 | N/A | – | – | – | – |

Note: N/A indicates that the category contained no features with which to build the classifier; ‘–’ indicates that the features had already been selected in a previous round. NCB, n-gram binary profiling of position found by counting; NTB, n-gram binary profiling of position found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of composition transition distribution.

Figure 4

Box plots of feature importance (upper) and PCC values (lower) of the selected features.

Figure 5

Distribution of peptide sequence lengths for various AMP functional activities.

Figure 6

Comparisons of the amino acids present in the different AMP functional activities forming the training set.

Performance when classifying AMPs and non-AMPs

Table 7 shows that, without feature selection, the RF model achieved the highest AUC among the three methods under 10-fold cross validation. Optimizations of the parameters used in generating the RF and SVM models are presented in Figures S1 and S2, respectively. For the number of trees in the RF model, we tried values ranging from 100 to 1000 to determine the setting with the best performance. In Figure S1, the optimal number of trees for each functional activity is marked with a red point. Similarly, Figure S2 shows the best cost values for generating the SVM models for the different AMP activities. In addition to the cost-value optimization, the selection of kernel functions for the SVM models, based on 10-fold cross-validation, is presented in Table S2. The radial basis function (RBF) kernel, combined with a cost value of 5, generated the best SVM classifier for AMP identification. It should be noted that the RF model that considered only the n-gram features NTB and NTC yielded an AUC value of 0.9859. Table 8 demonstrates that when the CTDD, NTC, PseAAC and AAC features were incorporated into the model sequentially by the forward selection strategy, the AUC value increased from 0.9915 to 0.9922. Additionally, the MCC increased from 0.7706 to 0.7924, while the number of features in the training set decreased from 9367 to 2452. In comparison, the AUC for the independent testing set using an RF model trained with the 2452 selected features was 0.9899.
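A parameter search of this kind could be sketched with scikit-learn's `GridSearchCV`, as below. This is an illustration under stated assumptions rather than the authors' procedure: the toy data, the shortened grids (the study scanned 100 to 1000 trees), the candidate kernels and cost values, and the 5-fold setting are all chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical toy data: 100 samples, 4 weakly informative features.
y = np.array([0, 1] * 50)
X = y[:, None] + rng.normal(size=(100, 4))

# Number of trees for the RF, scored by cross-validated AUC.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200, 300]},
    scoring="roc_auc", cv=5,
)
rf_search.fit(X, y)

# Kernel function and cost value for the SVM, scored the same way.
svm_search = GridSearchCV(
    SVC(),
    {"kernel": ["rbf", "poly", "linear"], "C": [1, 5, 10]},
    scoring="roc_auc", cv=5,
)
svm_search.fit(X, y)

best_trees = rf_search.best_params_["n_estimators"]
best_svm = svm_search.best_params_
```

The reported feature counts are also internally consistent: CTDD (105), NTC (2305), PseAAC (22) and AAC (20) together give the 2452 selected features, and adding NTB (6915) gives the full set of 9367.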

Furthermore, based on the highest-performing model with the 2452 selected features, we computed the RF feature importance via the scikit-learn package [33]. The feature importance was determined by the Gini index, and the resulting box plots are presented in Figure 4. Among all the AACs, ‘M’, ‘D’, ‘E’ and ‘C’ occurred at relatively high frequencies. The importance of amino acids ‘D’, ‘E’ and ‘C’ was found to be very different between the AMPs and non-AMPs based on the training set. Additionally, the importance of ‘E’ and ‘M’ was relatively high among all PseAACs. Furthermore, several features in the CTDD category, including charge, hydrophobicity, normalized van der Waals volume, secondary structure, solvent accessibility, polarizability and polarity, tended to have higher importance than others. The PCCs computed on the training set were also analyzed to examine the feature importance. As shown in Figure 4, the PCCs of the CTDD category were higher than those of the other feature categories. In a previous study, AmPEP [6] calculated the PCCs but used only 23 features, which resulted in the correlations being lower than 0.5 when training the model. Most of the top 10 CTDD features in our study were also among the 23 features used in [6]. Additionally, the PCCs of the top 10 CTDD features were higher than those of other features. The PCC of amino acid ‘C’ was the highest in the AAC category. The PCC values of ‘CK’ and ‘KC’ were the highest in the NTC category. Among the PseAACs, the PCCs of ‘C’ and ‘E’ were the highest. Notably, these amino acids also differed most significantly between AMPs and non-AMPs in our training set, and they were among the top 10 most important features in their feature categories. Table S1 presents the top 10 features in each category and additional information concerning these important features.

Investigation of the sequence-based features of AMPs that have different functional activities

Figure 5 shows the length distributions of the seven AMP functional activities. Based on the results, the sequence lengths of the anti-viral, anti-cancer and anti-fungal AMPs, and of the AMPs that target Gram-positive and Gram-negative bacteria, differed between the positive and negative datasets. More specifically, the anti-viral and anti-cancer AMPs, and those that target Gram-positive and Gram-negative bacteria, were shorter in the positive dataset than in the negative dataset. In contrast, the anti-fungal AMPs in the positive dataset were longer than those in the negative dataset. Table 6 confirms that these five length differences were statistically significant (P < 0.0001), whereas those for the anti-parasitic AMPs and the AMPs targeting mammals were not. Figure 6 shows that the average AACs for ‘K’ and ‘L’ were different between the positive and negative training datasets. This difference existed in the anti-parasitic, anti-viral and anti-cancer AMPs, the AMPs that target mammals and the AMPs that target Gram-positive and Gram-negative bacteria. It is also possible to list the amino acids that were relatively different between the positive and negative datasets for each of the seven targets. Specifically, ‘A’ and ‘S’ were different for the anti-parasitic AMPs; ‘G’ and ‘V’ were different for the anti-viral AMPs; ‘I’ and ‘R’ were different for the anti-cancer AMPs; ‘D’, ‘E’, ‘R’ and ‘Y’ were different for the AMPs that target mammals; ‘C’, ‘D’, ‘E’, ‘W’ and ‘G’ were different for the anti-fungal AMPs; and ‘G’, ‘D’ and ‘E’ were different for the AMPs that target Gram-positive and Gram-negative bacteria. Table S3 shows that several features seemed to be relevant to identifying the functional activities of an AMP. Unsurprisingly, the physicochemical property-related features, particularly CTDD and PseAAC, were significant in identifying the AMPs with functional activities; most of the important features involved the physicochemical properties of the peptides. However, ‘GL’ and ‘GLF’ were both important n-grams for identifying anti-cancer peptides (ACPs), while ‘LP’ was a key feature for identifying AMPs that target mammals.

Table 9

Performances of the seven classifiers based on 10-fold cross-validation

| Activities (# features) | Method | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Anti-parasitic (751) | RF | 0.8030 | 0.8037 | 0.8036 | 0.4935 | 0.8812 |
| | SVM | 0.6306 | 0.7516 | 0.7298 | 0.3066 | 0.7596 |
| | DT | 0.5735 | 0.8831 | 0.8321 | 0.4322 | 0.7283 |
| Anti-viral (4075) | RF | 0.9028 | 0.9004 | 0.9013 | 0.7915 | 0.9617 |
| | SVM | 0.8609 | 0.8587 | 0.8595 | 0.7053 | 0.9307 |
| | DT | 0.7897 | 0.8749 | 0.8442 | 0.6637 | 0.8323 |
| Anti-cancer (699) | RF | 0.7786 | 0.7679 | 0.7695 | 0.4420 | 0.8527 |
| | SVM | 0.7572 | 0.7583 | 0.7580 | 0.4116 | 0.8186 |
| | DT | 0.5319 | 0.8814 | 0.8229 | 0.3962 | 0.7067 |
| Targeting mammals (579) | RF | 0.8729 | 0.8736 | 0.8736 | 0.6408 | 0.9265 |
| | SVM | 0.7959 | 0.8029 | 0.8016 | 0.4900 | 0.8888 |
| | DT | 0.6603 | 0.9205 | 0.8767 | 0.5721 | 0.7904 |
| Anti-fungal (1983) | RF | 0.8499 | 0.8467 | 0.8486 | 0.6888 | 0.9358 |
| | SVM | 0.7900 | 0.7882 | 0.7893 | 0.5698 | 0.8719 |
| | DT | 0.8125 | 0.7253 | 0.7785 | 0.5375 | 0.7689 |
| Targeting Gram-positive bacteria (3087) | RF | 0.8853 | 0.8860 | 0.8856 | 0.7698 | 0.9567 |
| | SVM | 0.8404 | 0.8389 | 0.8397 | 0.6776 | 0.9122 |
| | DT | 0.8176 | 0.7892 | 0.8045 | 0.6063 | 0.8034 |
| Targeting Gram-negative bacteria (3167) | RF | 0.8799 | 0.8809 | 0.8803 | 0.7592 | 0.9546 |
| | SVM | 0.8388 | 0.8405 | 0.8396 | 0.6775 | 0.9201 |
| | DT | 0.8177 | 0.7802 | 0.8003 | 0.5984 | 0.7990 |

Note: RF, random forest; SVM, support vector machine; DT, decision tree; SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.

Performance when identifying AMPs together with their functional activities

Table 9 demonstrates that the RF achieved the highest AUC and MCC for all seven activities during the 2nd stage. The performances obtained with the different parameters are shown in Table S4. We adopted the same stopping rule as that of Stage 1. The smaller datasets tended to perform well with a lower number of trees, whereas the larger datasets, except for the anti-fungal dataset, needed a greater number of trees. Similarly, the SVM kernel selection was related to the scale of the dataset: the RBF kernel outperformed the polynomial kernel on the larger datasets, whereas the smaller datasets required the polynomial kernel. As the datasets for the anti-parasitic and anti-cancer AMPs and for the AMPs targeting mammals were smaller and more imbalanced, their MCCs and AUCs were lower than those of the other groups. Nevertheless, the AUCs of the anti-viral and anti-fungal AMPs and of those targeting Gram-positive and Gram-negative bacteria were 0.9617, 0.9358, 0.9567 and 0.9546, respectively.

To evaluate the relative importance of features, the RF importance calculation (determined using the scikit-learn package [33]) was implemented. The results are shown in Figure 7; all of the binary profiling positional feature importance values were zero. This is because the compositional feature information was already included in the binary profiling positional features. Furthermore, most of the compositional feature importance values tended to be low; this was particularly true for the AAC, which had the lowest values for all seven AMP classifiers. Nevertheless, several compositional features had higher values, which implies that they are likely to be of greater importance in identifying the various functional activities. In comparison, the physical–chemical property features had the highest importance values among the feature categories if outliers are not considered, and their importance values cluster at relatively high levels. We also computed the PCCs for each feature; their box plots are shown in Figure 7. The correlation between the binary profiling positional features and the class label is very weak, which is consistent with their low importance values compared with those of the other features. For the anti-parasitic and anti-cancer datasets, the compositional feature correlations were the highest, especially for the NCC and MC features. For the remaining datasets, the physical–chemical property features showed the highest correlations.

Figure 7

Box plots of feature importance (upper) and PCC values (lower) for the seven classifiers based on all investigated features.

Table 10

The features selected using the sequential forward selection strategy and the AUC values of the seven classes based on the RF modeling scheme. The number of features used in each category is shown in parentheses

| Activities | Selected features | AUC |
|---|---|---|
| Anti-parasitic | AAC (20) + NCC (78) + PseAAC (22) | 0.8783 |
| Anti-viral | NTC (918) + AAC (20) + MB (186) + NCB (6) | 0.9692 |
| Anti-cancer | CTDD (105) + NTC (31) + PseAAC (22) + AAC (20) + MC (81) + MB (243) | 0.8518 |
| Targeting mammals | AAC (20) + NTC (30) + NCB (42) + MC (64) + MB (192) + NCC (14) | 0.9392 |
| Anti-fungal | AAC (20) + CTDD (105) + NTC (364) + PseAAC (22) + NCC (45) + MC (50) | 0.9382 |
| Targeting Gram-positive bacteria | AAC (20) + CTDD (105) + NTC (663) + PseAAC (22) + NCB (33) | 0.9583 |
| Targeting Gram-negative bacteria | AAC (20) + CTDD (105) + PseAAC (22) + NTC (636) + MC (56) | 0.9560 |

Note: AUC, area under the receiver operating characteristic curve; NCB, n-gram binary profiling of position found by counting; NTB, n-gram binary profiling of position found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of composition transition distribution.

Table 10

The features selected using sequential forward selection strategy and the AUC values of seven classes based on RF modeling scheme. The number of features used in each category is shown in the parentheses

| Activities | Selected features | AUC |
|---|---|---|
| Anti-parasitic | AAC (20) + NCC (78) + PseAAC (22) | 0.8783 |
| Anti-viral | NTC (918) + AAC (20) + MB (186) + NCB (6) | 0.9692 |
| Anti-cancer | CTDD (105) + NTC (31) + PseAAC (22) + AAC (20) + MC (81) + MB (243) | 0.8518 |
| Targeting mammals | AAC (20) + NTC (30) + NCB (42) + MC (64) + MB (192) + NCC (14) | 0.9392 |
| Anti-fungal | AAC (20) + CTDD (105) + NTC (364) + PseAAC (22) + NCC (45) + MC (50) | 0.9382 |
| Targeting Gram-positive bacteria | AAC (20) + CTDD (105) + NTC (663) + PseAAC (22) + NCB (33) | 0.9583 |
| Targeting Gram-negative bacteria | AAC (20) + CTDD (105) + PseAAC (22) + NTC (636) + MC (56) | 0.9560 |

Note: AUC, area under the receiver operating characteristic curve; NCB, n-gram binary profiling of position found by counting; NTB, n-gram binary profiling of position found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of the composition, transition and distribution (CTD) features.

Our analysis showed that RF performed best among the three methods used. A forward feature selection algorithm was therefore applied with the RF classifier; the selected features and their AUCs are shown in Table 10, and the final performance of this approach is shown in Table 11. For the anti-viral classifier, the AUC increased from 0.9617 to 0.9692, which implies that the n-gram-related features might differ between the positive and negative datasets. For the classifier of AMPs targeting mammals, excluding CTDD while including the n-gram- and motif-related features increased the AUC from 0.9265 to 0.9392, which indicates that the n-gram- and motif-related features are key factors in prediction. For the anti-fungal AMP classifier, almost all features were selected, but no large improvement was detected under any specific circumstances. Finally, for the classifiers of AMPs targeting Gram-positive and Gram-negative bacteria, selecting the n-gram- and motif-related features resulted in better performance.
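The group-wise forward selection described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: `forward_select`, the `score` callback and the toy AUC values are hypothetical. In practice, `score` would train an RF on the chosen feature groups and return a cross-validated AUC.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_select(
    groups: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy sequential forward selection over named feature groups.

    `score` is a hypothetical callback that returns a validation metric
    (for example, a 10-fold CV AUC) for a given set of feature groups.
    """
    selected: List[str] = []
    best = float("-inf")
    remaining = list(groups)
    while remaining:
        # Try adding each remaining group and keep the best candidate.
        trials = [(score(frozenset(selected + [g])), g) for g in remaining]
        top_score, top_group = max(trials)
        if top_score - best <= min_gain:
            break  # no candidate improves the metric, so stop
        selected.append(top_group)
        remaining.remove(top_group)
        best = top_score
    return selected, best


# Toy scorer with made-up values (NOT results from the paper).
_TOY_AUC = {
    frozenset({"AAC"}): 0.80,
    frozenset({"PseAAC"}): 0.78,
    frozenset({"NCC"}): 0.70,
    frozenset({"AAC", "PseAAC"}): 0.82,
    frozenset({"AAC", "NCC"}): 0.85,
    frozenset({"AAC", "NCC", "PseAAC"}): 0.87,
}


def toy_score(subset: FrozenSet[str]) -> float:
    return _TOY_AUC.get(subset, 0.0)


chosen, auc = forward_select(["AAC", "PseAAC", "NCC"], toy_score)
print(chosen, auc)
```

Selecting whole feature groups rather than single columns keeps the search small, which matters when one group (such as an n-gram set) contains hundreds of features.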

Table 11

Performances of seven class-specific classifiers using RF with the selected features

| Activity (# features) | Dataset | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Anti-parasitic (120) | 10-fold CV | 0.7526 | 0.8366 | 0.8202 | 0.4955 | 0.8783 |
| | Testing | 0.6167 | 0.7732 | 0.7685 | 0.1570 | 0.7773 |
| Anti-viral (1130) | 10-fold CV | 0.9109 | 0.9324 | 0.9247 | 0.8382 | 0.9692 |
| | Testing | 0.9085 | 0.8406 | 0.8613 | 0.7075 | 0.9404 |
| Anti-cancer (263) | 10-fold CV | 0.7673 | 0.7888 | 0.7855 | 0.4507 | 0.8518 |
| | Testing | 0.7766 | 0.7060 | 0.7094 | 0.2208 | 0.8231 |
| Targeting mammals (362) | 10-fold CV | 0.8677 | 0.8893 | 0.8853 | 0.6620 | 0.9392 |
| | Testing | 0.7849 | 0.8045 | 0.8035 | 0.2998 | 0.8648 |
| Anti-fungal (606) | 10-fold CV | 0.8573 | 0.8553 | 0.8565 | 0.7050 | 0.9382 |
| | Testing | 0.8561 | 0.6675 | 0.7458 | 0.5186 | 0.8578 |
| Targeting Gram-positive bacteria (843) | 10-fold CV | 0.8852 | 0.8848 | 0.8851 | 0.7687 | 0.9583 |
| | Testing | 0.8877 | 0.6373 | 0.7423 | 0.5254 | 0.8745 |
| Targeting Gram-negative bacteria (839) | 10-fold CV | 0.8805 | 0.8815 | 0.8809 | 0.7606 | 0.9560 |
| | Testing | 0.8575 | 0.6574 | 0.7413 | 0.5116 | 0.8672 |

Note: SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
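For readers who want to recompute the metrics named in the note, the standard definitions can be written out as below. The confusion-matrix counts are hypothetical and are not taken from the paper's datasets; they were chosen only to show that MCC can stay low on an imbalanced test set even when SN and SP look moderate.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute SN, SP, ACC and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative sample.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical, imbalanced test set: 60 positives and 2000 negatives.
m = binary_metrics(tp=37, fn=23, tn=1546, fp=454)
print({k: round(v, 4) for k, v in m.items()})
```

Because MCC uses all four cells of the confusion matrix, it penalizes the many false positives that a large negative class produces, which ACC and SP alone can hide.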

Table 12

Performances of AMPfun and existing AMP prediction tools based on the independent testing set

| Activities | Tool | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Anti-viral | AMPfun | 0.9085 | 0.8406 | 0.8613 | 0.7075 | 0.9404 |
| | iAMPpred [28] | 0.3128 | 0.3959 | 0.3706 | −0.2682 | 0.3158 |
| | AVPpred [34] | 0.2409 | 0.8857 | 0.6901 | 0.1643 | N/A |
| Anti-cancer | AMPfun | 0.7766 | 0.7060 | 0.7094 | 0.2208 | 0.8231 |
| | MLACP [21] | 0.7234 | 0.7512 | 0.7499 | 0.2272 | 0.8320 |
| Anti-fungal | AMPfun | 0.8561 | 0.6675 | 0.7458 | 0.5186 | 0.8578 |
| | iAMPpred [28] | 0.6610 | 0.7212 | 0.6962 | 0.3796 | 0.7412 |
| Anti-fungal | AMPfun | 0.7929 | 0.7455 | 0.7678 | 0.5375 | 0.8448 |
| | AntiFP [19] | 0.6699 | 0.7039 | 0.6879 | 0.3737 | N/A |
| Targeting Gram-positive bacteria | AMPfun | 0.8829 | 0.6282 | 0.7385 | 0.5155 | 0.8653 |
| | iAMPpred [28] | 0.6872 | 0.6541 | 0.6684 | 0.3382 | 0.6950 |
| Targeting Gram-negative bacteria | AMPfun | 0.8563 | 0.6522 | 0.7406 | 0.5086 | 0.8590 |
| | iAMPpred [28] | 0.6932 | 0.6568 | 0.6726 | 0.3469 | 0.6958 |

Note: N/A means that we could not derive the values from the provided results. SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.

Comparison with existing AMP prediction tools in terms of predictive performance

AMPfun was further compared with related prediction tools, namely iAMPpred, AVPpred, MLACP and AntiFP. The performance for the various functional activities on the independent testing set is given in Table 12, and the corresponding ROC curves are presented in Figure 8. The accuracy of AMPfun on the anti-viral class was higher than that of iAMPpred and AVPpred. Overall, AMPfun achieved much higher MCC values for most functional activities; the exception was the anti-cancer class, where its MCC was only 0.0064 lower than that of MLACP. Based on the AUC values, AMPfun was again higher for the anti-viral functional activity, exceeding iAMPpred by 0.6246. Similarly, for the anti-fungal AMPs and those that target Gram-positive and Gram-negative bacteria, our AUCs were 0.1166, 0.1703 and 0.1632 higher, respectively, than those of iAMPpred. However, for the anti-cancer model, the AUC of AMPfun was 0.0089 lower than that of MLACP.
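The AUC differences quoted in this paragraph follow directly from Table 12. The short check below recomputes them; the dictionary layout and variable names are only illustrative, while the values are copied from the table.

```python
# AUC values copied from Table 12 (independent testing set).
auc = {
    "Anti-viral": {"AMPfun": 0.9404, "iAMPpred": 0.3158},
    "Anti-fungal": {"AMPfun": 0.8578, "iAMPpred": 0.7412},
    "Gram-positive": {"AMPfun": 0.8653, "iAMPpred": 0.6950},
    "Gram-negative": {"AMPfun": 0.8590, "iAMPpred": 0.6958},
    "Anti-cancer": {"AMPfun": 0.8231, "MLACP": 0.8320},
}

diffs = {}
for activity, tools in auc.items():
    # Each entry holds AMPfun plus exactly one competing tool.
    other = next(t for t in tools if t != "AMPfun")
    diffs[activity] = round(tools["AMPfun"] - tools[other], 4)
    print(f"{activity}: AMPfun - {other} = {diffs[activity]:+.4f}")
```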

AMPfun web interface

Based on our method, an online prediction server, AMPfun (http://fdblab.csie.ncu.edu.tw/AMPfun/index.html), has been developed to predict whether a peptide sequence is likely to be an AMP with a particular functional activity. The best prediction models developed for anti-parasitic, anti-viral, anti-cancer and anti-fungal AMPs, as well as those targeting mammals, Gram-positive bacteria and Gram-negative bacteria, were applied here. Screenshots of the website are shown in Figure S3.

Figure 8

The ROC curves of AMPfun, MLACP and iAMPpred when predicting anti-viral AMPs, anti-cancer AMPs, anti-fungal AMPs, AMPs targeting Gram-positive bacteria and AMPs targeting Gram-negative bacteria.
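As a reminder of what the area under an ROC curve measures, the sketch below computes AUC through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The function name, labels and scores are made up for illustration and are unrelated to the data behind Figure 8.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative (ties count as one half). O(P*N), fine for small examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
y = [1, 1, 1, 0, 0, 0]
s = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(roc_auc(y, s))
```

Under this interpretation, an AUC below 0.5 means that negatives tended to be scored above positives on that test set.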

Discussions and conclusion

Multidrug antibiotic resistance is growing rapidly throughout the world. Consequently, many common antibiotics can no longer be used to fight pathogenic bacteria. As such, the identification of AMPs and their functional activities has begun to attract increasing attention. In this study, we constructed a two-stage framework based on machine learning techniques that is able to identify AMPs and determine their functional activities. We adopted a forward feature selection algorithm to illustrate the importance of crucial features when developing the classifiers for different AMP activities. To extract more informative patterns from the peptide sequences, we adopted the n-gram concept to encode positional binary profile features and composition features. AAC, PseAAC and CTDD were also used as features. Among these, both AAC and PseAAC are commonly used features when separating AMPs from non-AMPs and identifying their functional activities [7, 11, 19, 21, 22, 24, 27–30].
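Of the features listed above, AAC is the simplest: it is the fraction of each of the 20 standard amino acids in a peptide, which gives the 20 values shown as AAC (20) in Table 10. A minimal sketch follows; the function name and the handling of non-standard characters are assumptions of this example, and the peptide string is arbitrary.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence: str) -> list:
    """Amino acid composition: fraction of each standard residue.

    Non-standard characters are ignored (an assumption of this sketch).
    """
    seq = [c for c in sequence.upper() if c in AMINO_ACIDS]
    if not seq:
        raise ValueError("sequence contains no standard residues")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


# Arbitrary peptide string used only as an example input.
vec = aac("GLFKKLLKKG")
print(dict(zip(AMINO_ACIDS, vec))["K"])
```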

Most previous studies only considered the composition or binary profile of single, double or triple amino acids [11, 22, 27–30]. In our study, we considered amino acid n-grams as a means of encoding the composition and binary profile features of AMPs. We used two approaches, namely counting and t-tests, to investigate the influence of n-grams on AMPs and their functional activities. However, no n-gram was identified by the counting approach during the 1st stage, owing to the large difference in sequence length between AMPs and non-AMPs. Put differently, the non-AMPs in the negative dataset were much longer, which seems to have resulted in a more diverse set of n-grams. Whatever its frequency, an n-gram was filtered out of the positive dataset once it appeared in a non-AMP sequence. During feature selection for the AMP classifier, CTDD was selected first; indeed, the AMP classifier performed well even when CTDD was the only feature set considered. Table S1 shows that all of the properties that CTDD considers for the 1st residue are important patterns for separating AMPs from non-AMPs. Previous studies have indicated that AMPs are generally positively charged and amphiphilic, with a large proportion of hydrophobic residues [5, 14]. This suggests that charge is a key factor in separating AMPs from non-AMPs, with terminal residue properties being significantly different between AMPs and non-AMPs [6].
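One reading of the counting approach described above is sketched below: collect every n-gram seen in the negative set, then keep only the positive-set n-grams that never appear there. The function names, the once-per-sequence counting rule and the toy sequences are assumptions of this example, not details confirmed by the text.

```python
from collections import Counter
from typing import Iterable, Set


def ngrams(sequence: str, n: int) -> Iterable[str]:
    """Yield every contiguous n-residue substring of `sequence`."""
    for i in range(len(sequence) - n + 1):
        yield sequence[i:i + n]


def exclusive_ngrams(positives, negatives, n: int) -> Counter:
    """Count n-grams in `positives` that never occur in `negatives`."""
    seen_in_negatives: Set[str] = set()
    for seq in negatives:
        seen_in_negatives.update(ngrams(seq, n))
    counts: Counter = Counter()
    for seq in positives:
        # Count each n-gram once per sequence (an assumption of this sketch).
        counts.update(set(ngrams(seq, n)) - seen_in_negatives)
    return counts


# Made-up sequences for illustration only.
pos = ["GLFKK", "GLFDI"]
neg = ["MKKLLPTAA"]
print(exclusive_ngrams(pos, neg, 2))
```

The longer and more numerous the negative sequences, the larger the set of n-grams seen in negatives becomes, which is consistent with the explanation given for why no n-gram survived this filter in the 1st stage.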

In the 2nd stage, we used seven class-specific classifiers. In addition to the 1st stage features, we added motif-based positional binary profile features and compositional features. As shown in Figure 7, the motif-related features were less important than other features, but still showed a strong correlation with the anti-parasitic and anti-cancer AMPs and those targeting mammals. One possible reason is that these features need to be combined with others to improve classifier performance. One previous study indicated that charge also plays an important role in the identification of anti-bacterial and anti-fungal peptides [11].
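The two-stage arrangement, an AMP screen followed by one classifier per functional activity, can be sketched as a small dispatch function. Everything here is hypothetical scaffolding: the `Scorer` type, the 0.5 threshold and the stand-in scorers are assumptions, and real models would be trained classifiers.

```python
from typing import Callable, Dict

Scorer = Callable[[str], float]  # peptide sequence -> probability-like score


def two_stage_predict(
    sequence: str,
    amp_model: Scorer,
    activity_models: Dict[str, Scorer],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Stage 1 screens for AMPs; stage 2 scores each functional activity.

    Returns an empty dict when stage 1 rejects the peptide. The 0.5
    threshold is an assumption of this sketch.
    """
    if amp_model(sequence) < threshold:
        return {}
    # One independent binary classifier per activity, so a peptide can
    # score highly for several activities at once.
    return {name: model(sequence) for name, model in activity_models.items()}


# Stand-in scorers (hypothetical; real models would be trained classifiers).
def toy_amp_model(seq: str) -> float:
    return 0.9 if "K" in seq else 0.1


toy_activities = {
    "anti-viral": lambda seq: 0.2,
    "anti-fungal": lambda seq: 0.7,
}

print(two_stage_predict("GLFKKLLKKG", toy_amp_model, toy_activities))
print(two_stage_predict("GGGG", toy_amp_model, toy_activities))
```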

Figures 6 and 7 and Table S3 all show that the features selected may depend on the functional activity of the AMP. For the anti-parasitic classifier, pseudo amino acids ‘A’ and ‘K’, together with the hydrophobicity of the neutral group at 25% of the residues, showed high feature importance. For the anti-viral classifier, the charge of the neutral group, the secondary structure of the helix group, the solvent accessibility of the buried group and the polarizability and normalized van der Waals volume of the 3rd group showed high feature importance. These important features all relate to the 1st residue, which indicates a major positional difference. Separately, the results for the anti-cancer classifier showed that the compositions ‘GL’ and ‘GLF’ and pseudo amino acids ‘C’ and ‘I’ were important. Interestingly, for the classifier of AMPs targeting mammals, the top 10 features by feature importance also reached higher PCCs. The results for the anti-fungal classifier indicated that the pseudo amino acids ‘W’, ‘K’ and ‘λ2’ seemed to be essential factors. For the classifier of AMPs targeting Gram-positive bacteria, the key features were a negative charge at 75% and 100% of the residues, a neutral charge at the 1st residue, the helix secondary structure at the 1st residue and pseudo amino acids ‘E’ and ‘G’. Finally, for the classifier of AMPs targeting Gram-negative bacteria, a negative charge at 50% and 100% of the residues, polar hydrophobicity of the 1st residue and pseudo amino acids ‘E’ and ‘G’ were important. The distribution of these important pseudo amino acids between the positive and negative training sets was also quite different, which implies that they are important factors in identifying AMP functional activities. Previous research has indicated that amino acids ‘K’, ‘E’, ‘G’, ‘P’, ‘C’ and ‘I’ are important in classifying anti-bacterial and anti-fungal peptides, and that amino acids ‘R’, ‘K’, ‘W’, ‘S’, ‘T’, ‘P’, ‘H’, ‘C’ and ‘I’ are essential for identifying AVPs [11]. Because we collected data from a variety of databases and defined the negative dataset rigorously, we consider the above findings to be well supported.
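Phrases such as "a negative charge at 75% and 100% of the residues" refer to the CTD distribution descriptor (CTDD), which records where the first, 25%, 50%, 75% and 100% occurrences of a residue group fall along the sequence. The sketch below computes it for the charge property only, using the standard three-class charge grouping. The function name, the rounding convention and the example peptide are assumptions, and library implementations may differ in how they pick the occurrence index.

```python
import math

# Standard three-class charge grouping used by CTD descriptors.
CHARGE_GROUPS = {
    "positive": set("KR"),
    "neutral": set("ANCQGHILMFPSTWYV"),
    "negative": set("DE"),
}


def ctdd(sequence: str, groups: dict) -> dict:
    """Distribution descriptor: where the first, 25%, 50%, 75% and 100%
    occurrences of each group fall, as a percentage of sequence length."""
    length = len(sequence)
    out = {}
    for name, members in groups.items():
        # 1-based positions of residues belonging to this group.
        pos = [i + 1 for i, aa in enumerate(sequence) if aa in members]
        for label, frac in (("first", 0.0), ("25%", 0.25), ("50%", 0.5),
                            ("75%", 0.75), ("100%", 1.0)):
            if not pos:
                out[(name, label)] = 0.0
                continue
            # Index of the occurrence marking this fraction; rounding down
            # with a floor of 1 is one convention (an assumption here).
            k = max(1, math.floor(frac * len(pos)))
            out[(name, label)] = 100.0 * pos[k - 1] / length
    return out


# Arbitrary example peptide.
d = ctdd("KDAEKLDE", CHARGE_GROUPS)
print(d[("negative", "first")], d[("negative", "100%")])
```

In the standard CTD formulation, seven physicochemical properties with three groups each and five positions per group give 7 × 3 × 5 = 105 values, which matches the CTDD (105) entries in Table 10.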

AMPfun was further compared with several AMP prediction models using our class-specific testing sets. AMPfun achieved a higher AUC than the other tools, except for ACPs, where its AUC was slightly lower than that of MLACP [21]. Three possible reasons for the generally better performance of AMPfun are as follows. First, we collected data from nine resources containing a range of different functional activities to increase the size and range of our dataset. Second, some tools did not use the PseAAC and CTDD features, which seem to be important when classifying AMPs into certain functional activities. Finally, we used a strict AMP definition when compiling the negative dataset.

This work presents a new scheme for the identification of AMPs and, beyond this, their classification into a wide variety of functional activities. With this scheme, a web server, AMPfun, was developed to screen unknown peptides for AMPs and identify their potential functional activities. Moreover, we have pinpointed some helpful features that assist the classification of short peptides into AMPs and non-AMPs, such as the physicochemical properties of the peptide’s 1st residue. In conclusion, this study not only constructed classifiers to distinguish the activities of AMPs but also identified key features with respect to different AMP functions. In the future, post-translational modifications [35, 36] should be considered in further investigations of AMP functions.

Key Points

  • Although several studies have proposed different machine learning methods to perform antimicrobial peptide (AMP) prediction, most did not consider the diversity of antimicrobial activities. Thus, we specifically investigated the sequence features of AMPs with various functions, including anti-parasitic, anti-viral, anti-cancer and anti-fungal AMPs, as well as those that target mammals, Gram-positive and Gram-negative bacteria.

  • We proposed a two-stage scheme that first identifies AMPs and then characterizes their functional activities. Features including hydrophobicity, normalized van der Waals volume, polarity, charge and solvent accessibility were essential for classifying peptides as AMPs or non-AMPs; their use enabled the 1st stage AMP classifier to achieve an area under the receiver operating characteristic curve (AUC) value of 0.9894. In the 2nd stage, we found that pseudo amino acid composition was an informative attribute when differentiating between AMPs in terms of their functional activities.

  • The independent testing results demonstrated that the AUCs of the multi-class models were 0.7773, 0.9404, 0.8231, 0.8578, 0.8648, 0.8745 and 0.8672 with respect to anti-parasitic, anti-viral, anti-cancer, anti-fungal AMPs and those that target mammals, Gram-positive and Gram-negative bacteria, respectively. The proposed scheme was implemented as a web-based tool, AMPfun, which is freely accessible at http://fdblab.csie.ncu.edu.tw/AMPfun/index.html.

Author contributions

C.R.C. and T.R.K. carried out the data collection and curation, participated in the bioinformatics analyses and drafted the manuscript. T.R.K. carried out the web tool implementation. C.R.C., L.C.W. and T.Y.L. participated in the design of the study and performed the draft revision. J.T.H. and T.Y.L. conceived of the study, participated in its design and coordination and helped revise the manuscript. All authors read and approved the final manuscript.

Chia-Ru Chung is a PhD candidate in the Department of Computer Science and Information Engineering, National Central University. Her research interests include bioinformatics, statistical inference, big data analytics, data mining and machine learning.

Ting-Rung Kuo obtained her master's degree from the Department of Computer Science and Information Engineering, National Central University. Her research interests include bioinformatics, data mining and machine learning.

Prof. Li-Ching Wu has a PhD in computer science and is a bioinformatics specialist. He joined the Institute of Systems Biology and Bioinformatics, National Central University, as an Assistant Professor in the fall of 2006 and was promoted to Associate Professor in 2011. He is the creator of the biology database PGTdb and also participated in the construction of several databases and tools such as ProSplicer, VirusProbeDB and RgsMiner. His research interests include bioinformatics, systems biology, big data, disease network analysis and machine learning.

Dr. Tzong-Yi Lee is now an associate professor in the School of Life and Health Sciences, Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen. His research interests include bioinformatics, computational biology, systems biology, big data analytics, data mining and machine learning.

Dr. Jorng-Tzong Horng is now a Distinguished Professor in the Department of Computer Science and Information Engineering, National Central University, Taiwan. His research interests include database systems, bioinformatics, computational biology, big data analytics, data mining and machine learning.

References

1. Ventola CL. The antibiotic resistance crisis: part 1: causes and threats. P T 2015;40:277.
2. Jhong JH, Chi YH, Li WC, et al. dbAMP: an integrated resource for exploring antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data. Nucleic Acids Res 2019;47:D285–97.
3. Gaspar D, Veiga AS, Castanho MA. From antimicrobial to anticancer peptides. A review. Front Microbiol 2013;4:294.
4. Huang KY, Chang TH, Jhong JH, et al. Identification of natural antimicrobial peptides from bacteria through metagenomic and metatranscriptomic analysis of high-throughput transcriptome data of Taiwanese oolong teas. BMC Syst Biol 2017;11:131.
5. Bahar AA, Ren D. Antimicrobial peptides. Pharmaceuticals 2013;6:1543–75.
6. Bhadra P, Yan J, Li J, et al. AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 2018;8:1697.
7. Chang KY, T-p L, Shih L-Y, et al. Analysis and prediction of the critical regions of antimicrobial peptides based on conditional random fields. PLoS One 2015;10:e0119490.
8. Wang G, Li X, Wang Z. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 2015;44:D1087–93.
9. Waghu FH, Barai RS, Gurung P, et al. CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res 2015;44:D1094–7.
10. Lee H-T, Lee C-C, Yang J-R, et al. A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015;2015:4–5.
11. Meher PK, Sahu TK, Saini V, et al. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 2017;7:42362.
12. Jenssen H, Hamill P, Hancock RE. Peptide antimicrobial agents. Clin Microbiol Rev 2006;19:491–511.
13. Silva O, De La Fuente-núñez C, Haney E, et al. An anti-infective synthetic peptide with dual antimicrobial and immunomodulatory activities. Sci Rep 2016;6:35465.
14. Zhang L-j, Gallo RL. Antimicrobial peptides. Curr Biol 2016;26:R14–9.
15. de la Fuente-núñez C, Silva ON, Lu TK, et al. Antimicrobial peptides: role in human disease and potential as immunotherapies. Pharmacol Ther 2017;178:132–40.
16. Fan L, Sun J, Zhou M, et al. DRAMP: a comprehensive data repository of antimicrobial peptides. Sci Rep 2016;6:24482.
17. Mehta D, Anand P, Kumar V, et al. ParaPep: a web resource for experimentally validated antiparasitic peptide sequences and their structures. Database 2014;2014:4–5.
18. Qureshi A, Thakur N, Tandon H, et al. AVPdb: a database of experimentally validated antiviral peptides targeting medically important viruses. Nucleic Acids Res 2013;42:D1147–53.
19. Agrawal P, Bhalla S, Chaudhary K, et al. In silico approach for prediction of antifungal peptides. Front Microbiol 2018;9:323.
20. Tyagi A, Tuknait A, Anand P, et al. CancerPPD: a database of anticancer peptides and proteins. Nucleic Acids Res 2014;43:D837–43.
21. Manavalan B, Basith S, Shin TH, et al. MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget 2017;8:77121.
22. Tyagi A, Kapoor P, Kumar R, et al. In silico models for designing and discovering novel anticancer peptides. Sci Rep 2013;3:2984.
23. Zare M, Mohabatkar H, Faramarzi FK, et al. Using Chou’s pseudo amino acid composition and machine learning method to predict the antiviral peptides. Open Bioinforma J 2015;9:3.
24. Chen W, Ding H, Feng P, et al. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2016;7:16895.
25. Hajisharifi Z, Piryaiee M, Beigi MM, et al. Predicting anticancer peptides with Chou’s pseudo amino acid composition and investigating their mutagenicity via Ames test. J Theor Biol 2014;341:34–40.
26. Khosravian M, Kazemi Faramarzi F, Mohammad Beigi M, et al. Predicting antibacterial peptides by the concept of Chou’s pseudo-amino acid composition and machine learning methods. Protein Pept Lett 2013;20:180–6.
27. Lata S, Mishra NK, Raghava GP. AntiBP2: improved version of antibacterial peptide prediction. BMC Bioinformatics 2010;11:S19.
28. Xiao X, Wang P, Lin W-Z, et al. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 2013;436:168–77.
29. Lin W, Xu D. Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types. Bioinformatics 2016;32:3745–52.
30. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45:D158–69.
31. Cao D-S, Xu Q-S, Liang Y-Z. propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics 2013;29:960–2.
32. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.
33. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.
34. Thakur N, Qureshi A, Kumar M. AVPpred: collection and prediction of highly effective antiviral peptides. Nucleic Acids Res 2012;40:W199–204.
35. Huang KY, Lee TY, Kao HJ, et al. dbPTM in 2019: exploring disease association and cross-talk of post-translational modifications. Nucleic Acids Res 2019;47:D298–308.
36. Huang KY, Su MG, Kao HJ, et al. dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins. Nucleic Acids Res 2016;44:D435–46.
37. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 2018;34:2740–2747.
38. Wang P, Ge R, Liu L, et al. Multi-label learning for predicting the activities of antimicrobial peptides. Sci Rep 2017;7:2202.
39. Gabere MN, Noble WS. Empirical comparison of web-based antimicrobial peptide prediction tools. Bioinformatics 2017;33:1921–1929.
40. Wang P, Hu L, Liu G, et al. Prediction of antimicrobial peptides based on sequence alignment and feature selection methods. PLoS One 2011;6:e18476.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)