Characterization and identification of antimicrobial peptides with different functional activities

Table 1

Related works of antimicrobial peptide predictions

Reference	Prediction target	Data source (#Pos : #Neg)	CD-HIT parameter	Sequence length	Training features	Classifier	Web server
Bhadra et al. (2018) [6]	AMP	APD3 + CAMP+LAMP/UniProt (3268:166791)	–	–	AAC, PseAAC, CTDD	RF	–
Veltri et al. (2018) [37]	AMP	APD3/UniProt (1779:1778)	90% on Pos; 40% on Neg	Larger than 10	Peptide sequences	DNN	1
Agrawal et al. (2018) [19]	Anti-fungal peptides	DRAMP/SwissProt (1459:1459)	–	Keep same in both Pos and Neg	AAC, DPC, binary profile, mass, charge and pI value of peptide	SVM, RF, DT, Naïve Bayes	2
Wang et al. (2017) [38]	Anti-bacterial, anti-viral, anti-fungal, anti-parasitic, anti-cancer	APD (2222)	–	Between 10 and 60	AAC, DPC	WKnn, MLR	–
Meher et al. (2017) [11]	Anti-bacterial, anti-viral, anti-fungal	CAMP+APD3 + AntiBP2 + LAMP+ AVPpred/AntiBP2 + AVPpred (3417/739/1496: 984/893/1384)	–	Larger than 10	AAC, PseAAC, structural, physicochemical	SVM	3
Gabere et al. (2017) [39]	AMP	DRMPD+APD3/UniProt (2260:11300)	90%	Match 1:5 with Pos	–	–	–
Manavalan (2017) [21]	ACP	(735:1025)	90%	Less than 50	AAC, DPC, PCP, ATC(C,H,O,N,S)	SVM, RF	4
Lin and Xu (2016) [29]	AMP, anti-bacterial, anti-cancer, anti-fungal, anti-HIV, anti-viral	APD/UniProt (879:2405)	–	Between 5 and 100	PseAAC	RF	5
Chen et al. (2016) [24]	Anti-cancer	APD2/UniProt (138:206)	90%	–	PseAAC, g-gap DPC	SVM	6
Zare et al. (2015) [23]	Anti-viral	AVPpred (342:312)	90%	Between 10 and 100	PseAAC	AdaBoost, RBF, NB, DT, decision stump, REPTree	–
Tyagi et al. (2013) [22]	Anti-cancer	APD + CAMP+DADP/SwissProt (225:2250–)	–	–	AAC, DPC, binary profiles	SVM	7
Xiao et al. (2013) [28]	AMP, anti-bacterial, anti-cancer, anti-viral, anti-fungal, anti-HIV	APD/UniProt (1486:2405)	40%	Between 5 and 100	PseAAC	FKNN, multi-label FKNN classifier	8
Khosravian et al. (2013) [26]	Anti-bacterial	APD/AMP (1086:8860)	90%	–	PseAAC	SVM	–
Thakur et al. (2012) [34]	Anti-viral	Pubmed and Patent Lens (604:452)	–	–	AAC, motif, physicochemical properties, sequence alignment	SVM	9
Wang et al. (2011) [40]	AMP	CAMP/UniProt (870:9731)	70%	Same as Pos	AAC, PseAAC, sequence alignment	NNA, mRMR	10
Lata et al. (2010) [27]	Anti-bacterial	APD/SwissProt (999:999)	–	Same as Pos	AAC, binary patterns	SVM

Note: Pos, positive dataset; Neg, negative dataset; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, composition–transition–distribution D descriptor; DPC, dipeptide composition; NB, Naïve Bayes; FKNN, Fuzzy K-nearest neighbor; WKnn, weighted K-nearest neighbor algorithm; MLR, multiple linear regression; DNN, deep neural network; REPTree, Reduced-Error Pruning Tree; NNA, nearest neighbor algorithm; mRMR, maximum relevance, minimum redundancy; – means that the information was not provided.

1, https://www.dveltri.com/ascan/v2/ascan.html; 2, https://webs.iiitd.edu.in/raghava/antifp/; 3, http://cabgrid.res.in:8080/amppred/; 4, http://www.thegleelab.org/MLACP.html; 5, http://www.jci-bioinfo.cn/MLAMP; 6, http://lin.uestc.edu.cn/server/iACP; 7, http://crdd.osdd.net/raghava/anticp/; 8, http://www.jci-bioinfo.cn/iAMP-2L; 9, http://crdd.osdd.net/servers/avppred/; 10, http://amp.biosino.org/

AMPs are produced by virtually all organisms as essential components of the innate immune system [4]. As such, they are the first line of defense against microbes in many organisms [5–11]. Specifically, AMPs kill the target organism by disrupting either the microbial cell membrane or intracellular functions [11]. Consequently, they can defend an organism against a wide range of pathogenic microorganisms, including viruses, parasites, bacteria and fungi [5, 12, 13]. AMPs generally consist of 10–50 amino acids and show little sequence homology with one another [5, 12, 14]. As a result, identifying AMPs and their activities is challenging. With the rapid growth of resistance to chemical antibiotics among microbial pathogens, there is an urgent need to identify novel therapeutics that can treat infections. AMPs are therefore attracting growing interest for clinical application, and investigations targeting AMPs are emerging to clarify their mechanisms for new drug development. New agents against infectious and inflammatory diseases, as well as anti-tumorigenic AMP-based drugs, should therefore be attainable [15].

Due to the urgent need to identify AMPs and their functional activities, numerous studies have constructed databases of experimentally validated AMPs annotated with their functional activities. The third version of the antimicrobial peptide database (APD3) not only includes AMPs and their functional activities (including anti-bacterial, anti-fungal, anti-viral and anti-cancer activity), but also provides a glossary, specific nomenclature, a peptide classification section, search functionality and an AMP prediction system [8]. A Database of Antimicrobial Peptides (ADAM) is also available and provides a comprehensive analysis of experimentally validated AMPs linked to their known antimicrobial activities, including activity against Gram-positive bacteria, activity against Gram-negative bacteria, and anti-fungal and anti-viral activity [10]. Additionally, the Data Repository of Antimicrobial Peptides (DRAMP) is an open access database containing a diverse set of AMP annotations and functional activities, such as activity against Gram-positive and Gram-negative bacteria [16]. In contrast to the aforementioned databases, several others focus on AMPs with a specific functional activity. ParaPep is one such database; it consists of 519 unique anti-parasitic peptides and provides information on experimentally validated anti-parasitic peptide sequences and their structures [17]. AVPdb is the first comprehensive database of anti-viral peptides (AVPs); it is a dedicated resource of 2683 experimentally verified AVPs that target over 60 medically important viruses [18]. AntiFP has not only released several anti-fungal datasets, but also provides a number of prediction tools [19]. Several resources, such as CancerPPD [20], MLACP [21] and AntiCP [22], have concentrated specifically on collecting anti-cancer peptides (ACPs). More recently, an integrated resource, dbAMP [2], has been developed for exploring AMPs with functional activities and physicochemical properties on high-throughput data.

Because AMPs lack sequence similarity, an effective method for identifying AMPs and their functional activities is needed. As listed in Table 1, many studies have developed automatic tools using a variety of modeling techniques. In these studies, the decision tree (DT) [19, 23], random forest (RF) [6, 9, 19, 21, 23], support vector machine (SVM) [9, 11, 19, 21–27], artificial neural network [9, 26] and discriminant analysis [9] approaches have been used to construct classifiers capable of differentiating AMPs from non-AMPs. Beyond classifying peptides as AMPs or non-AMPs, predicting the functional activities of each AMP has gained increased attention in recent years. For example, iAMP-2L, a two-level multi-label classifier, was developed using the fuzzy K-nearest neighbor (FKNN) approach; its aim was to identify both AMPs and their functional activities, including anti-bacterial, anti-cancer, anti-viral, anti-fungal and anti-human immunodeficiency virus (HIV) activity [28]. In comparison, Meher et al. [11] adopted an SVM-based method to discriminate between anti-bacterial, anti-fungal and anti-viral peptides and subsequently performed feature analysis to pinpoint the critical factors associated with these three AMP functional activities. Moreover, to deal with the unbalanced multi-label problem, Lin and Xu [29] applied the synthetic minority over-sampling technique to distinguish several functional activities of AMPs, including anti-bacterial, anti-cancer, anti-viral, anti-fungal and anti-HIV activity. In addition to the choice of machine learning method, the composition of the feature set is a critical factor in making precise predictions [6], because a proper feature set helps to identify the differences between the groups. Among the research published in this field, compositional and physicochemical data, structural properties, sequence order and terminal residue patterns of AMPs have been effective features for AMP prediction. For example, amino acid composition (AAC) and dipeptide composition (DPC) are compositional features that several previous studies have used when constructing their classifiers [11, 22, 27–30]. As global protein sequence descriptors, the composition, transition and distribution (CTD) patterns of amino acid properties encode the distribution patterns of physical–chemical properties [6, 31]. Furthermore, numerous studies have considered physicochemical properties, including CTD, pseudo amino acid composition (PseAAC) and the AAINDEX [6, 11, 19, 21, 24, 27–29], in AMP prediction.

Figure 1

Flow chart of this study.

Table 1 shows that only a limited number of methods are devoted to constructing multiple classifiers for AMPs with a range of functional activities. Most studies have focused on identifying AMPs with a specific function, e.g. ACPs. Although several studies have investigated various types of AMPs, they have not yet provided freely accessible tools that allow users to submit their peptides of interest. Therefore, the primary purpose of this study is to design a two-stage framework for identifying AMPs along with their functional activities. Additionally, feature selection analysis is performed to better understand the sequence characteristics of AMPs with respect to the seven target activities.

Materials and methods

To help create a useful identification system for AMPs and their functional activities, we developed a flow chart for this study (Figure 1). Detailed explanations of each process described in the chart are provided in the following sections.

Dataset preparation

The positive dataset used to construct the AMP classifier was collected from a range of AMP databases, namely APD3 [8], ADAM [10], ParaPep [17], AVPdb [18], CancerPPD [20], MLACP [21], AntiCP [22], AntiFP [19] and DRAMP [16]. A total of 6766 experimentally validated AMP sequences were obtained from these resources. The non-AMP data that formed the negative dataset were obtained from AmPEP [6] and consist of a collection of sequences from UniProt [30] with lengths ranging from 5 to 255 amino acid residues. Sequences containing the non-standard residue codes B, J, O, U, X or Z were then removed. To reduce homology bias and redundancy, the CD-HIT program [32] was used to screen the positive dataset with a 50% threshold. Homologous sequences were also removed from the negative dataset according to the same criterion. Subsequently, CD-HIT-2D [32] was applied across the two datasets at a threshold of 50%. Finally, 70% of both the positive and negative datasets (18,114 sequences in total) was used to create the positive and negative training sets, while the remaining 30% of each (7764 sequences in total) was used to create the positive and negative testing sets. The positive training set contained 1686 AMP sequences, while the positive testing set contained 723.
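The residue filter and the 70/30 split described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the fixed random seed and the rounding rule are assumptions, and the CD-HIT and CD-HIT-2D steps are external programs that are not reproduced here.

```python
import random

# Residue codes excluded in the text (ambiguous or non-standard amino acids).
NON_STANDARD = set("BJOUXZ")


def filter_sequences(seqs, min_len=5, max_len=255):
    """Keep sequences in the stated length range that contain no excluded code."""
    return [
        s for s in seqs
        if min_len <= len(s) <= max_len and not (set(s) & NON_STANDARD)
    ]


def split_70_30(seqs, seed=0):
    """Shuffle a copy of the list and split it 70/30 (training, testing).

    The seed and the use of round() are assumptions made for this sketch.
    """
    shuffled = list(seqs)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

Under this rounding rule, a set of 2409 sequences is split into 1686 and 723, which agrees with the sizes of the positive training and testing sets reported above.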

Table 2

Data source and the number of sequences for training set, testing set and independent testing set based on seven AMP functional activities

Activities	Databases	Training set		Testing set		Independent testing set
		Positive	Negative	Positive	Negative	Positive	Negative
Anti-parasitic	APD3 [8], ADAM [10], ParaPep [17]	140	700	60	1914	–	–
Anti-viral	APD3 [8], ADAM [10], AVPdb [18]	1400	2451	601	1374	601	1374
Anti-cancer	APD3 [8], CancerPPD [20], MLACP [21], AntiCP [22]	219	1095	94	1881	94	1881
Targeting mammals	APD3 [8]	215	1075	93	1882	–	–
Anti-fungal	APD3 [8], ADAM [10], AntiFP [19]	1912	1261	820	1155	820^+	1155^+
						734^#	825^#
Targeting Gram-positive bacteria	APD3 [8], ADAM [10], DRAMP [16]	1930	1624	828	1147	828	1147
Targeting Gram-negative bacteria	APD3 [8], ADAM [10], DRAMP [16]	1931	1635	828	1147	828	1147

Note: ‘+’ denotes the testing set for the iAMPpred tool; ‘#’ denotes the testing set for the AntiFP tool; and ‘–’ means that no existing tool is available, so no independent testing set was separated.

Table 3

Overview of the features analyzed in this investigation

Category	Feature name	Abbreviation	# descriptor
Binary profiling of position	N-gram binary profiling of position found by counting	NCB	3^a
	N-gram binary profiling of position found by t-test	NTB	3^a
	Motifs binary profiling of position	MB	3^a
Composition	N-gram composition found by counting	NCC	1^a
	N-gram composition found by t-test	NTC	1^a
	Motifs composition	MC	1^a
	Amino acid composition	AAC	20
Physical–chemical properties	Pseudo amino acid composition	PseAAC	22
	Distribution descriptor of composition transition distribution	CTDD	105

Note: ^a means that the total number of descriptors depends on the number of n-grams or motifs selected for binary profiling of position or composition, found by either counting or t-test.

The construction of class-specific datasets involved a subdivision of the AMP databases. For each functional activity, Table 2 shows the number of sequences forming the positive and negative training and testing datasets, as well as the data sources. Once a sequence has been annotated with an activity, it is assigned to the positive dataset for that specific activity; otherwise, it is regarded as negative data for that activity. For instance, an AMP with activity against parasites and cancer is assigned to the positive anti-parasitic dataset (for building the anti-parasitic classifier), whereas it is assigned to the negative datasets of the activities it does not have. On this basis, seven positive and seven negative datasets were arranged for generating the class-specific classifiers. As in the process described above, these datasets were randomly divided into training and testing sets made up of 70% and 30% of the sequences, respectively. Furthermore, CD-HIT-2D [32] was applied to the training and testing datasets with a 50% threshold during the development of the class-specific classifiers.
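The assignment rule above amounts to a one-versus-rest labelling of multi-label data. The sketch below illustrates it under stated assumptions: the activity names, the `annotations` mapping and the function name are hypothetical, and any subsampling of negatives or CD-HIT-2D filtering is left out.

```python
# Hypothetical activity labels standing in for the seven target activities.
ACTIVITIES = [
    "anti-parasitic", "anti-viral", "anti-cancer", "targeting mammals",
    "anti-fungal", "anti-Gram-positive", "anti-Gram-negative",
]


def build_class_specific_sets(annotations):
    """Build one (positives, negatives) pair per activity.

    annotations: dict mapping a sequence to the set of activities it is
    annotated with. A sequence is positive for every activity it has and
    negative for every activity it lacks.
    """
    datasets = {}
    for activity in ACTIVITIES:
        pos = sorted(s for s, acts in annotations.items() if activity in acts)
        neg = sorted(s for s, acts in annotations.items() if activity not in acts)
        datasets[activity] = (pos, neg)
    return datasets
```

With this rule a peptide annotated as both anti-parasitic and anti-cancer appears in two positive sets and in the negative sets of the other five activities.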

Generation of training features

Before building the prediction model, the peptide sequences, which are strings of amino acids, must be transformed into representative numerical values. As presented in Table 3, we divided the features into three categories: binary profiling of amino acid position, composition and physical–chemical properties. The binary profiling category includes n-gram binary profiling of position determined by counting (NCB), n-gram binary profiling of position determined by t-test (NTB) and motif-based binary profiling of position (MB). The composition category includes n-gram composition by counting (NCC), n-gram composition by t-test (NTC), motif composition (MC) and AAC, while PseAAC and the distribution descriptor of CTD (CTDD) describe physical–chemical properties.

In this study, we considered not only uni-grams and bi-grams, but also tri-grams, four-grams and five-grams. Because the number of possible combinations grows rapidly when extending the analysis from uni-grams to five-grams, we adopted two approaches, counting and Welch’s t-test, to remove rare patterns. Counting uses the number of occurrences: n-grams that do not appear in the negative training dataset and whose frequency in the positive training dataset is greater than half of the maximum frequency were used as features. Welch’s t-test was used to identify n-grams whose occurrences differed significantly between the positive and negative training datasets; such n-grams were retained when they reached a P-value of less than 0.001. Patterns that appeared often in the positive training dataset, but not in the negative training dataset, were retained for this study. The MERCI program [31] was used to extract motifs that were present in the positive training dataset at least k times and in the negative training dataset at most m times. For this study, k = 50 and m = 0. The number of motifs found by MERCI is shown in Table 4.
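The two n-gram filters can be sketched in plain Python as follows. Several points are assumptions made for illustration: the function names are hypothetical, the counting rule follows one reading of the description (absent from the negatives, and more frequent in the positives than half of the maximum count), and the quantity fed to Welch's test is taken to be any per-sequence value such as an occurrence count. The sketch returns only the t statistic and the Welch–Satterthwaite degrees of freedom; converting them into a P-value requires the t-distribution, for example from a statistics library.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count overlapping n-grams in one sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def select_by_counting(pos_seqs, neg_seqs, n):
    """Keep n-grams that never occur in the negatives and whose total count
    in the positives exceeds half of the maximum count (assumed reading)."""
    pos_total, neg_total = Counter(), Counter()
    for s in pos_seqs:
        pos_total.update(ngram_counts(s, n))
    for s in neg_seqs:
        neg_total.update(ngram_counts(s, n))
    if not pos_total:
        return []
    threshold = max(pos_total.values()) / 2
    return sorted(g for g, c in pos_total.items()
                  if c > threshold and neg_total[g] == 0)


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Requires at least two values per group and non-zero variance in at
    least one group.
    """
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # unbiased sample variance
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df
```

Welch's variant is used here, as in the text, because it does not assume equal variances in the positive and negative groups.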

Table 4

Number of motifs detected by MERCI in seven AMP functional activities

Activity	3-gram	4-gram	5-gram	>6-gram	Total
Anti-parasitic	1	15	11	8	35
Anti-viral	0	10	11	41	62
Anti-cancer	2	16	23	40	81
Targeting mammals	1	20	13	30	64
Anti-fungal	0	8	11	31	50
Targeting Gram-positive bacteria	3	36	15	7	61
Targeting Gram-negative bacteria	2	31	18	5	56

Binary profiles of the N-terminal and C-terminal residues have been used in previous research [29, 30]. In this study, rather than splitting the sequence into N-terminal and C-terminal regions, a three-dimensional binary vector was used to indicate whether each n-gram found by counting or t-test, and each motif, appears in the first, second or last third of the sequence. For instance, the sequence ‘KKWKIVVIKWKK’ has the feature vector (1, 0, 1) for the n-gram ‘WK’ because ‘WK’ appears in both the first third (‘KKWK’) and the last third (‘KWKK’) of the residues. Numerous studies have used AAC and DPC as features [11, 22, 27–30]. Therefore, we also considered the bi-, tri-, four- and five-gram compositions of the patterns found by counting, t-test and motifs. More specifically, the composition of n-gram_i, Com(N-gram_i), is defined as follows:

$$\mathrm{Com}\left(N\text{-}\mathrm{gram}_i\right)=\frac{C_i}{L-N+1},$$

where C_i is the number of occurrences of n-gram_i in a sequence of length L and N is the length of the n-gram, so that L − N + 1 is the number of windows of length N in the sequence. The number of n-gram related features used in this study is presented in Table 5. Furthermore, the propy 1.0 package [31] was used to compute PseAAC and CTDD, which describe physical–chemical properties of the peptides such as hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure and solvent accessibility. For PseAAC, the parameters λ = 2 and w = 0.5 were used, which is consistent with the 22 (20 + λ) PseAAC descriptors listed in Table 3.
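The position profile and the composition formula can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions not stated in the text: the thirds are cut at floor(L/3) and floor(2L/3), and an n-gram that straddles a boundary is not counted in either third.

```python
def count_occurrences(seq, gram):
    """Number of (possibly overlapping) occurrences of gram in seq."""
    n = len(gram)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == gram)


def composition(seq, gram):
    """Com(N-gram_i) = C_i / (L - N + 1); 0.0 if the sequence is too short."""
    windows = len(seq) - len(gram) + 1
    return count_occurrences(seq, gram) / windows if windows > 0 else 0.0


def position_profile(seq, gram):
    """Binary 3-vector: does gram occur in the first, second and last third?"""
    length = len(seq)
    # Assumption: boundaries at floor(L/3) and floor(2L/3).
    cuts = [0, length // 3, (2 * length) // 3, length]
    return tuple(int(gram in seq[cuts[k]:cuts[k + 1]]) for k in range(3))
```

For the example sequence ‘KKWKIVVIKWKK’ and the n-gram ‘WK’, `position_profile` returns (1, 0, 1), as in the text, and `composition` returns 2/11 because ‘WK’ occurs twice among the 11 windows of length 2.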

Table 5

Number of n-gram related features

Activity	2-gram	3-gram	4-gram	5-gram	Total
	(a) Found by counting
AMP	0	0	0	0	0
Anti-parasitic	0	0	22	56	78
Anti-viral	0	0	0	2	2
Anti-cancer	0	1	6	19	26
Targeting mammals	0	0	9	5	14
Anti-fungal	0	0	19	26	45
Targeting Gram-positive bacteria	0	0	0	11	11
Targeting Gram-negative bacteria	0	0	21	42	63
	(b) Found by t-test
AMP	392	816	991	106	2305
Anti-parasitic	22	7	6	3	38
Anti-viral	2	319	444	127	918
Anti-cancer	9	12	9	1	31
Targeting mammals	13	13	4	0	30
Anti-fungal	82	142	101	39	364
Targeting Gram-positive bacteria	143	285	175	60	663
Targeting Gram-negative bacteria	135	279	168	54	636

Note: AMP, anti-microbial peptides.

Model construction and feature selection

The DT is a commonly used classification method. It has a tree structure similar to a flow chart, such that each internal node represents a test of a feature, each branch represents a result of that test and each leaf node represents a class label. In our study, we used Classification and Regression Trees (CART) to build binary DTs for classifying peptides as AMPs or non-AMPs, and to identify their functional activities. The CART applies a greedy approach wherein the tree is constructed by top-down recursion. During the construction process, the algorithm chooses, at each node, the ‘best’ feature with which to split the training tuples that reach that node. The CART uses the Gini index as its splitting criterion. Using the scikit-learn package [33], the ‘DecisionTreeClassifier’ function was adopted to build our DT.
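To make the Gini criterion concrete, the following sketch computes the Gini impurity of a set of labels and searches greedily for the best threshold on a single feature. It is a simplified illustration, not the scikit-learn implementation: candidate thresholds are taken from the observed values, and the feature values (peptide lengths) are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(left: Sequence[int], right: Sequence[int]) -> float:
    """Size-weighted Gini impurity of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values: Sequence[float],
                   labels: Sequence[int]) -> Tuple[Optional[float], float]:
    """Greedy search: the threshold t (value <= t goes left) with the lowest weighted Gini."""
    best: Tuple[Optional[float], float] = (None, gini_impurity(labels))
    for t in sorted(set(values))[:-1]:
        left: List[int] = [y for x, y in zip(values, labels) if x <= t]
        right: List[int] = [y for x, y in zip(values, labels) if x > t]
        score = split_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical peptide lengths; 1 = AMP, 0 = non-AMP.
lengths = [20, 25, 30, 150, 170, 180]
classes = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_threshold(lengths, classes)
```

In this toy case the split at a length of 30 separates the two classes perfectly, so the weighted impurity drops from 0.5 to 0.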

The RF method, an ensemble learning method for classification, was implemented to build the prediction model. As this approach combines the results of multiple models, its performance is usually better than that of an individual model. Moreover, the RF is a non-parametric approach that does not make any assumptions about the underlying distribution of the training dataset. It is constructed from multiple DTs, so that the collection of classifiers forms a ‘forest’. In the present study, the CART was adopted to build the forest, and bootstrap aggregation (bagging) was applied when sampling the data. The bootstrap method samples the training tuples uniformly with replacement, which means that a tuple that has already been selected can be drawn again for the same training set. The ‘RandomForestClassifier’ function, which is available in the scikit-learn package [33] for Python, was adopted to construct the RF model. Additionally, the scikit-learn package [33] provides a function to calculate the importance of input features. More specifically, the Gini importance of a feature is the decrease in impurity attributable to that feature, averaged across all trees, where impurity measures how mixed the class labels are in the data reaching a node.
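Sampling with replacement can be sketched in a few lines. The helper name, the sample size and the seed below are illustrative assumptions; the sketch only shows how a bootstrap sample for one tree might be drawn and which tuples are left out of it.

```python
import random
from typing import List, Tuple


def bootstrap_sample(n: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n indices uniformly with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)  # arbitrary seed, fixed only for reproducibility
in_bag, out_of_bag = bootstrap_sample(1000, rng)
unique_fraction = len(set(in_bag)) / 1000
```

Because tuples can be drawn repeatedly, each bootstrap sample is expected to contain about 63% of the distinct training tuples (1 - 1/e); the remaining tuples are out-of-bag for that tree.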

In addition to the aforementioned non-parametric methods, SVM was also used. The SVM is a supervised learning method for classification and regression of linear and nonlinear data. It utilizes nonlinear mapping to transform the training data into a higher dimensional space in which it searches for a linear optimal separating hyper-plane. The svm.SVC function in the scikit-learn package [33] was also used to construct the classifiers.

A sequential forward selection algorithm was employed to extract the useful features for improving prediction performance. First, each type of feature was evaluated on its own and the best-performing type was kept. Subsequently, at every round, one new type of feature was added, and the process stopped either when the performance of the model no longer improved over the previous round or when all features had been used. The result was the final prediction model.
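The selection procedure can be sketched as a greedy loop over feature categories. The function below is a generic illustration with a pluggable scoring function, not the code used in this study. For the usage example, the scores are the AUC values transcribed from Table 8, under the assumption that each row of that table reports the AUC obtained when a candidate category is added to the categories already selected; in practice the scoring function would train and cross-validate an RF model.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple


def forward_select(groups: Sequence[str],
                   score_fn: Callable[[FrozenSet[str]], float]) -> Tuple[List[str], float]:
    """Greedily add the feature group that maximizes the score; stop when no gain."""
    selected: List[str] = []
    remaining = list(groups)
    best_score = float("-inf")
    while remaining:
        # Score every remaining group when added to the current selection.
        scored = [(score_fn(frozenset(selected + [g])), g) for g in remaining]
        round_score, round_group = max(scored)
        if round_score <= best_score:
            break  # no improvement over the previous round
        selected.append(round_group)
        remaining.remove(round_group)
        best_score = round_score
    return selected, best_score


# AUC values transcribed from Table 8 (NCB and NCC contain no features and are omitted).
TABLE8_AUC: Dict[FrozenSet[str], float] = {
    frozenset({"NTB"}): 0.5000,
    frozenset({"NTC"}): 0.9869,
    frozenset({"AAC"}): 0.9792,
    frozenset({"PseAAC"}): 0.9787,
    frozenset({"CTDD"}): 0.9882,
    frozenset({"CTDD", "NTB"}): 0.9860,
    frozenset({"CTDD", "NTC"}): 0.9918,
    frozenset({"CTDD", "AAC"}): 0.9910,
    frozenset({"CTDD", "PseAAC"}): 0.9914,
    frozenset({"CTDD", "NTC", "NTB"}): 0.9911,
    frozenset({"CTDD", "NTC", "AAC"}): 0.99215,
    frozenset({"CTDD", "NTC", "PseAAC"}): 0.99212,
    frozenset({"CTDD", "NTC", "AAC", "NTB"}): 0.9914,
    frozenset({"CTDD", "NTC", "AAC", "PseAAC"}): 0.99225,
    frozenset({"CTDD", "NTC", "AAC", "PseAAC", "NTB"}): 0.9915,
}

selected, final_auc = forward_select(
    ["NTB", "NTC", "AAC", "PseAAC", "CTDD"], TABLE8_AUC.__getitem__
)
```

With these scores the loop selects CTDD, NTC, AAC and PseAAC in that order and stops in the fifth round, because adding NTB would lower the AUC from 0.99225 to 0.9915.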

Metrics of model performance

We compared our approach to previous studies according to a number of evaluation metrics—namely sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC). The detailed definitions of these metrics are as follows:

$$SN=\frac{TP}{TP+ FN}$$

$$SP=\frac{TN}{FP+ TN}$$

$$ACC=\frac{TP+ TN}{TP+ TN+ FP+ FN}$$

$$MCC=\frac{\left( TP\times TN\right)-\left( FP\times FN\right)}{\sqrt{\left( TP+ FP\right)\times \left( TP+ FN\right)\times \left( FP+ TN\right)\times \left( TN+ FN\right)}},$$

where TP (true positives) refers to the number of positive samples that were correctly predicted by the classifier, TN (true negatives) refers to the number of negative samples that were correctly predicted by the classifier, FP (false positives) refers to the number of negative samples that were incorrectly predicted as positive and FN (false negatives) refers to the number of positive samples that were incorrectly predicted as negative. In addition to these metrics, the area under the receiver operating characteristic curve (AUC) was also considered. A receiver operating characteristic (ROC) curve is a visual tool for comparing predictive performances; it shows the relationship between the true positive rate and the false positive rate at various threshold settings. The AUC summarizes this curve in a single value and has therefore been widely used for evaluating the overall predictive performance of a variety of prediction tools.
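The four threshold-based metrics follow directly from the confusion-matrix counts, and the AUC can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is illustrative only: the function names and the example counts and scores are hypothetical, and both classes are assumed to be non-empty.

```python
import math
from typing import Dict, Sequence


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """SN, SP, ACC and MCC as defined above (both classes assumed non-empty)."""
    sn = tp / (tp + fn)
    sp = tn / (fp + tn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (fp + tn) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


def auc_from_scores(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and predicted probabilities, for illustration only.
metrics = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

For these hypothetical counts, SN is 0.90, SP is 0.80, ACC is 0.85 and MCC is approximately 0.70; the example scores give an AUC of 8/9, since eight of the nine positive/negative pairs are ordered correctly.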

Figure 2

Distribution of the peptide sequence lengths of the AMPs and non-AMPs.

10-fold cross validation and independent testing sets

To confirm the robustness of the prediction model, the 10-fold cross validation technique was implemented using the training set. Furthermore, to compare the proposed model with existing tools, various independent testing sets were created for use with iAMPpred [28], AVPpred [34] and AntiFP [19]. These three systems include the latest research on predicting specific functional activities of AMPs, and each provides a web server for researchers. iAMPpred is a predictor that identifies the anti-bacterial, anti-viral and anti-fungal functional activities of AMPs. As iAMPpred only considers AMPs to have a general anti-bacterial functional activity, rather than distinguishing between activity against Gram-positive and Gram-negative bacteria, it was necessary to filter the Gram-positive sequences from the Gram-negative testing set and vice versa. As the AntiFP web server only accepts sequences longer than 15 amino acids, any peptide sequences of 15 amino acids or fewer were eliminated from the independent testing set used to assess AntiFP. It should be noted that no additional procedures were needed when creating the independent testing sets for iAMPpred, MLACP and AVPpred when comparing the anti-fungal, anti-cancer and anti-viral classifiers. The number of sequences in each class-specific dataset is shown in Table 2.
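The partitioning behind k-fold cross validation can be sketched as follows. This is a plain (unstratified) split with hypothetical names and an arbitrary seed; whether the folds in this study were stratified by class is not stated, so that aspect is left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # arbitrary seed, for reproducibility
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each of the 10 rounds trains on nine folds and evaluates on the held-out fold, and the reported metrics are then aggregated over the rounds.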

Development of the web-based prediction tool

A web-based prediction tool was constructed using PHP (Hypertext Preprocessor); the prediction is run at the backend upon submission of peptide sequences. When users submit their peptides of interest, they need to decide whether they want to predict the functional activities of the submitted peptides. If so, they must choose which functional activities they are interested in; otherwise, the prediction tool will only provide the AMP prediction results. The tool accepts both single and multiple peptides. Depending on the user’s selection, it lists the predicted probabilities for one or more types of functional activity, including AMPs that target mammals, AMPs that target Gram-positive and Gram-negative bacteria, and anti-parasitic, anti-fungal, anti-cancer and anti-viral AMPs.

Results

Characterization of the sequence-based features of AMPs

As part of this study, a total of 1686 AMP sequences were used to construct the AMP classifier. Based on these sequences, Figure 2 shows the length distribution of the AMPs and non-AMPs. From this analysis, it is clear that the AMPs tended to be shorter than the non-AMPs. Furthermore, Table 6 shows that there was a significant difference between these two groups in terms of length. Bar charts of the AAC of the training set are shown in Figure 3. It is clear that the amino acids ‘C’, ‘D’, ‘E’, ‘G’, ‘K’ and ‘L’ were present at noticeably different levels in AMPs and non-AMPs. Specifically, the average ‘C’, ‘G’ and ‘K’ AACs among the AMPs were relatively higher than those in the non-AMPs, whereas ‘D’, ‘E’ and ‘L’ occurred relatively less often in the AMPs than in the non-AMPs. Furthermore, Table S1 indicates that the amino acid ‘C’ was correlated with the peptide being an AMP, which means the presence of ‘C’ may be a good indicator for identifying AMPs. The ‘CK’ n-gram was identified by its abundance in AMPs and attained the highest correlation in the NTC category. For the PseAAC values, ‘D’ was found to show a relatively strong correlation with AMP identification. A neutral charge (Class 2) on the 1st residue (Charge_C2_001) showed the highest correlation with AMP identification among all CTDD features. Additionally, the Pearson correlation coefficients (PCCs) shown in Table S1 indicate that features in the CTDD category tended to show high correlation values. Thus, the peptides’ physicochemical properties might be important for AMP identification.

Table 6

Comparisons of means and standard deviations (shown in the parentheses) of peptide lengths in the positive and negative datasets based on different AMP functional activities

Classifier	Positive	Negative	P-value
Stage 1
AMP	44.34 (35.67)	173.07 (35.28)	<0.0001^*
Stage 2
Anti-parasitic	29.53 (21.64)	29.36 (17.70)	0.5861
Anti-viral	20.80 (14.57)	47.33 (31.39)	<0.0001^*
Anti-cancer	24.43 (16.05)	30.21 (17.99)	<0.0001^*
Targeting mammals	26.21 (12.95)	29.98 (17.39)	0.2388
Anti-fungal	45.82 (31.17)	37.76 (31.31)	<0.0001^*
Targeting Gram-positive bacteria	33.95 (28.70)	47.64 (30.62)	<0.0001^*
Targeting Gram-negative bacteria	34.15 (28.19)	47.15 (30.19)	<0.0001^*

Note: ^* indicates that the test result is significant.

Figure 3

Comparisons of the AAC between the AMP and non-AMP peptides based on seven types of functional activities.

Table 7

Performances of three different AMP classifiers based on 10-fold cross-validation

Method	SN	SP	ACC	MCC	AUC
RF	0.9488	0.9511	0.9509	0.7706	0.9915
SVM	0.9433	0.9429	0.9430	0.7447	0.9847
DT	0.8340	0.9826	0.9687	0.8147	0.9083

Note: RF, random forest; SVM, support vector machine; DT, decision tree; SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.

Table 8

AUC values of RF classifiers trained with the features according to the process of forward feature selection. The number of extracted features in each category is shown in the parentheses

Round	NCB (0)	NTB (6915)	NCC (0)	NTC (2305)	AAC (20)	PseAAC (22)	CTDD (105)
1	N/A	0.5000	N/A	0.9869	0.9792	0.9787	0.9882
2	N/A	0.9860	N/A	0.9918	0.9910	0.9914	–
3	N/A	0.9911	N/A	–	0.99215	0.99212	–
4	N/A	0.9914	N/A	–	–	0.99225	–
5	N/A	0.9915	N/A	–	–	–	–

Note: N/A indicates that the category contained no features with which to build the classifier, and ‘–’ indicates that the category had already been selected in an earlier round. NCB, n-gram positional binary profiles found by counting; NTB, n-gram positional binary profiles found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of composition, transition and distribution.

Figure 4

Box plots of feature importance (upper) and PCC values (lower) of the selected features.

Figure 5

Distribution of peptide sequence lengths for various AMP functional activities.

Figure 6

Comparisons of the amino acids present in the different AMP functional activities forming the training set.

Performance when classifying AMPs and non-AMPs

Without feature selection, Table 7 shows that the RF model achieved the best AUC among the three methods under 10-fold cross validation. Optimizations of the parameters used in generating the RF and SVM models are presented in Figures S1 and S2, respectively. For the number of trees in the RF model, we tried different values ranging from 100 to 1000 to determine the parameter with the best performance. In Figure S1, the optimal number of trees for each functional activity is marked with a red point. Similarly, Figure S2 shows the best cost values for generating the SVM models for the different AMP activities. In addition to the cost-value optimization of the SVM models, the selection of kernel functions for the SVM models, based on 10-fold cross-validation, is presented in Table S2. The radial basis function (RBF) kernel, combined with a cost value of 5, generated the best SVM classifier for AMP identification. It should be noted that the RF model that considered only the n-gram features NTB and NTC yielded an AUC value of 0.9859. Table 8 demonstrates that when the CTDD, NTC, AAC and PseAAC features were incorporated into the model sequentially by the forward selection strategy, the AUC value increased from 0.9915 (all features) to 0.9922. Additionally, the MCC increased from 0.7706 to 0.7924, while the number of features in the training set decreased from 9367 to 2452. In comparison, the AUC for the independent testing set using an RF model trained with the 2452 selected features was 0.9899.
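The feature counts quoted above can be checked against the category sizes listed in the header of Table 8. The short sketch below only reproduces that arithmetic; the variable names are illustrative.

```python
# Feature counts per category, transcribed from the header of Table 8.
feature_counts = {
    "NCB": 0, "NTB": 6915, "NCC": 0, "NTC": 2305,
    "AAC": 20, "PseAAC": 22, "CTDD": 105,
}
selected_categories = ["CTDD", "NTC", "AAC", "PseAAC"]

total_all = sum(feature_counts.values())
total_selected = sum(feature_counts[name] for name in selected_categories)
reduction = 1 - total_selected / total_all  # fraction of features removed
```

Here `total_all` is 9367 and `total_selected` is 2452, so forward selection removed roughly three quarters of the features, essentially by dropping the 6915 NTB features.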

Furthermore, based on the highest performing model with the 2452 selected features, we computed the RF feature importance via the scikit-learn package [33]. The feature importance was determined by the Gini index, and the resulting box plots are presented in Figure 4. Among all the AACs, ‘M’, ‘D’, ‘E’ and ‘C’ occurred at relatively high frequencies. The importance of amino acids ‘D’, ‘E’ and ‘C’ was found to be very different between the AMPs and non-AMPs based on the training set. Additionally, the importance of ‘E’ and ‘M’ was relatively high among all PseAACs. Furthermore, several features in the CTDD category (including charge, hydrophobicity, normalized van der Waals volume, secondary structure, solvent accessibility, polarizability and polarity) tended to have higher importance than others. The PCCs computed on the training set were also analyzed to examine the feature importance. As shown in Figure 4, the PCCs of the CTDD category were higher than those of other feature categories. In a previous study, AmPEP [6] calculated the PCCs but used only 23 features, which resulted in the correlations being lower than 0.5 when training the model. Most of the top 10 CTDD features in our study were also among the 23 features used in [6]. Additionally, the PCCs of the top 10 CTDD features were relatively higher than those of other features. The PCC of amino acid ‘C’ was the highest in the AAC category. The PCC values of ‘CK’ and ‘KC’ were higher than those of the other n-grams when NTC was considered. Among the PseAACs, the PCCs of ‘C’ and ‘E’ were the highest. Notably, these amino acids also differed more markedly between AMPs and non-AMPs, based on our training set observations. They were also among the top 10 most important features in their feature category. Table S1 presents the top 10 features in each category and additional information concerning these important features.
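For reference, the PCC between a single feature and the binary class label can be computed as in the sketch below. The function name and the example values are hypothetical, and returning 0 for a constant feature is a convention adopted here rather than something specified in this study.

```python
import math
from typing import Sequence


def pearson_cc(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention used here: a constant vector has no correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 'C' compositions for four peptides; label 1 = AMP, 0 = non-AMP.
cys_fraction = [0.12, 0.20, 0.00, 0.02]
labels = [1, 1, 0, 0]
pcc = pearson_cc(cys_fraction, labels)
```

When one of the two variables is a binary label, as here, the PCC coincides with the point-biserial correlation, so a positive value indicates that the feature tends to be larger in the positive class.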

Investigation of the sequence-based features of AMPs that have different functional activities

Figure 5 shows the length distributions of the seven AMP functional activities. Based on the results, the sequence lengths of the anti-viral, anti-cancer and anti-fungal AMPs, and of the AMPs that target Gram-positive and Gram-negative bacteria, differed between the positive and negative datasets. More specifically, the anti-viral and anti-cancer AMPs, and those that target Gram-positive and Gram-negative bacteria, were shorter in the positive dataset than in the negative dataset. In contrast, the anti-fungal AMPs in the positive dataset were longer than those in the negative dataset. Table 6 shows that these length differences were significant (P < 0.0001), whereas those for the anti-parasitic AMPs and the AMPs that target mammals were not (P = 0.5861 and 0.2388, respectively). Figure 6 shows that the average AACs for ‘K’ and ‘L’ were different between the positive and negative training datasets. This difference existed in the anti-parasitic, anti-viral and anti-cancer AMPs, the AMPs that target mammals and the AMPs that target Gram-positive and Gram-negative bacteria. It is also possible to list the amino acids that were relatively different between the positive and negative datasets for each of the seven different targets. Specifically, ‘A’ and ‘S’ were different for the anti-parasitic AMPs; ‘G’ and ‘V’ were different for the anti-viral AMPs; ‘I’ and ‘R’ were different for the anti-cancer AMPs; ‘D’, ‘E’, ‘R’ and ‘Y’ were different for the AMPs that target mammals; ‘C’, ‘D’, ‘E’, ‘W’ and ‘G’ were different for the anti-fungal AMPs; and ‘G’, ‘D’ and ‘E’ were different for AMPs that target Gram-positive and Gram-negative bacteria. Table S3 shows that several features seemed to be relevant to identifying the functional activities of an AMP. Unsurprisingly, the physicochemical property-related features, particularly CTDD and PseAAC, were significant in identifying the AMPs with functional activities; most of the important features involved the physicochemical properties of the peptides. However, ‘GL’ and ‘GLF’ were both important n-grams for identifying ACPs, while ‘LP’ was a key feature for identifying AMPs that target mammals.

Table 9

Performances of the seven classifiers based on 10-fold cross-validation

Activities (# features)	Method	SN	SP	ACC	MCC	AUC
Anti-parasitic (751)	RF	0.8030	0.8037	0.8036	0.4935	0.8812
	SVM	0.6306	0.7516	0.7298	0.3066	0.7596
	DT	0.5735	0.8831	0.8321	0.4322	0.7283
Anti-viral (4075)	RF	0.9028	0.9004	0.9013	0.7915	0.9617
	SVM	0.8609	0.8587	0.8595	0.7053	0.9307
	DT	0.7897	0.8749	0.8442	0.6637	0.8323
Anti-cancer (699)	RF	0.7786	0.7679	0.7695	0.4420	0.8527
	SVM	0.7572	0.7583	0.7580	0.4116	0.8186
	DT	0.5319	0.8814	0.8229	0.3962	0.7067
Targeting mammals (579)	RF	0.8729	0.8736	0.8736	0.6408	0.9265
	SVM	0.7959	0.8029	0.8016	0.4900	0.8888
	DT	0.6603	0.9205	0.8767	0.5721	0.7904
Anti-fungal (1983)	RF	0.8499	0.8467	0.8486	0.6888	0.9358
	SVM	0.7900	0.7882	0.7893	0.5698	0.8719
	DT	0.8125	0.7253	0.7785	0.5375	0.7689
Targeting Gram-positive bacteria (3087)	RF	0.8853	0.8860	0.8856	0.7698	0.9567
	SVM	0.8404	0.8389	0.8397	0.6776	0.9122
	DT	0.8176	0.7892	0.8045	0.6063	0.8034
Targeting Gram-negative bacteria (3167)	RF	0.8799	0.8809	0.8803	0.7592	0.9546
	SVM	0.8388	0.8405	0.8396	0.6775	0.9201
	DT	0.8177	0.7802	0.8003	0.5984	0.7990

Note: RF, random forest; SVM, support vector machine; DT, decision tree; SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.

Performance when identifying AMPs together with their functional activities

Table 9 demonstrates that the RF achieved the best results during the second stage. The performances regarding the different parameters are shown in Table S4. We adopted the same stopping rule as that of Stage 1. The smaller datasets tended to perform well with a lower number of trees, whereas the larger datasets, except for the anti-fungal dataset, needed a greater number of trees. Similarly, the SVM kernel selection was related to the scale of the dataset: the RBF kernel outperformed the polynomial kernel on the larger datasets, whereas the smaller datasets required the polynomial kernel. As the datasets for the anti-parasitic and anti-cancer AMPs and that of the AMPs targeting mammals were smaller and more imbalanced, their MCCs and AUCs were lower than those of the other groups. By comparison, the AUCs of the anti-viral and anti-fungal AMPs and of those targeting Gram-positive and Gram-negative bacteria were 0.9617, 0.9358, 0.9567 and 0.9546, respectively.

To evaluate the relative importance of features, the RF importance calculation (determined using the scikit-learn package [33]) was implemented. The results are shown in Figure 7; all of the binary profiling positional feature importance values were zero. This is likely because the information carried by the binary profiling positional features overlaps with that of the compositional features. Furthermore, most of the compositional feature importance values tended to be low; this was particularly true for the AAC, which had the lowest values for all seven AMP classifiers. Nevertheless, several compositional features had higher values, which implies that they are likely to be of greater importance in identifying the various functional activities. In comparison, the physical–chemical property features had the highest importance values among the feature categories if outliers are not considered. Their feature importance values form a group with relatively high values. We also computed the PCCs for each feature; their box plots are shown in Figure 7. The correlation between the binary profiling positional feature category and the class label is very weak, which is consistent with the low importance values of these features compared with those of the other features. When the anti-parasitic and anti-cancer datasets were considered, the compositional feature correlations were the highest, especially for the NCC and MC features. The physical–chemical property features showed the highest correlations among the remaining datasets.

Figure 7

Box plots of feature importance (upper) and PCC values (lower) for the seven classifiers based on all investigated features.

Table 10

The features selected using sequential forward selection strategy and the AUC values of seven classes based on RF modeling scheme. The number of features used in each category is shown in the parentheses

Activities	Selected features	AUC
Anti-parasitic	AAC (20) + NCC (78) + PseAAC (22)	0.8783
Anti-viral	NTC (918) + AAC (20) + MB (186) + NCB (6)	0.9692
Anti-cancer	CTDD (105) + NTC (31) + PseAAC (22) + AAC (20) + MC (81) + MB (243)	0.8518
Targeting mammals	AAC (20) + NTC (30) + NCB (42) + MC (64) + MB (192) + NCC (14)	0.9392
Anti-fungal	AAC (20) + CTDD (105) + NTC (364) + PseAAC (22) + NCC (45) + MC (50)	0.9382
Targeting Gram-positive bacteria	AAC (20) + CTDD (105) + NTC(663) + PseAAC (22) + NCB (33)	0.9583
Targeting Gram-negative bacteria	AAC (20) + CTDD (105) + PseAAC (22) + NTC(636) + MC (56)	0.9560

Note: AUC, area under the receiver operating characteristic curve; NCB, n-gram positional binary profiles found by counting; NTB, n-gram positional binary profiles found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of composition, transition and distribution.

Table 10

The features selected using sequential forward selection strategy and the AUC values of seven classes based on RF modeling scheme. The number of features used in each category is shown in the parentheses

Activities	Selected features	AUC
Anti-parasitic	AAC (20) + NCC (78) + PseAAC (22)	0.8783
Anti-viral	NTC (918) + AAC (20) + MB (186) + NCB (6)	0.9692
Anti-cancer	CTDD (105) + NTC (31) + PseAAC (22) + AAC (20) + MC (81) + MB (243)	0.8518
Targeting mammals	AAC (20) + NTC (30) + NCB (42) + MC (64) + MB (192) + NCC (14)	0.9392
Anti-fungal	AAC (20) + CTDD (105) + NTC (364) + PseAAC (22) + NCC (45) + MC (50)	0.9382
Targeting Gram-positive bacteria	AAC (20) + CTDD (105) + NTC (663) + PseAAC (22) + NCB (33)	0.9583
Targeting Gram-negative bacteria	AAC (20) + CTDD (105) + PseAAC (22) + NTC (636) + MC (56)	0.9560

Note: AUC, area under the receiver operating characteristic curve; NCB, n-gram binary profile of positions found by counting; NTB, n-gram binary profile of positions found by t-test; NCC, n-gram composition found by counting; NTC, n-gram composition found by t-test; AAC, amino acid composition; PseAAC, pseudo amino acid composition; CTDD, distribution descriptor of the composition, transition and distribution features.

Our analysis showed that RF performed best among the three methods used. A forward feature selection algorithm was therefore applied with the RF classifiers; the selected features and the resulting AUCs are shown in Table 10, and the final performance is shown in Table 11. For the anti-viral classifier, the AUC increased from 0.9617 to 0.9692, which suggests that the n-gram-related features differ between the positive and negative datasets. For the classifier of AMPs targeting mammals, excluding CTDD while including the n-gram- and motif-related features raised the AUC from 0.9265 to 0.9392, indicating that n-gram- and motif-related features are key factors in this prediction. For the anti-fungal classifier, almost all features were selected, but no large improvement was observed for any specific feature combination. Finally, for the classifiers of AMPs targeting Gram-positive and Gram-negative bacteria, selecting the n-gram- and motif-related features resulted in better performance.
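The group-wise forward selection described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `forward_select_groups`, the `min_gain` stopping rule and the toy scores are hypothetical, and `score` stands in for whatever evaluation is used (for example, the 10-fold cross-validated AUC of an RF trained on the chosen feature groups).

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def forward_select_groups(
    groups: List[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy sequential forward selection over named feature groups.

    `score` is supplied by the caller and returns a quality measure for a
    set of groups (larger is better), e.g. a cross-validated AUC.
    """
    selected: List[str] = []
    best = float("-inf")
    remaining = list(groups)
    while remaining:
        # Score every candidate set obtained by adding one more group.
        trials = [(score(frozenset(selected + [g])), g) for g in remaining]
        top_score, top_group = max(trials)
        # Stop when the best addition no longer improves the score.
        if top_score - best <= min_gain:
            break
        selected.append(top_group)
        remaining.remove(top_group)
        best = top_score
    return selected, best


if __name__ == "__main__":
    # Toy scores standing in for cross-validated AUCs; these numbers are
    # invented for illustration and are not taken from the paper.
    toy_auc: Dict[FrozenSet[str], float] = {
        frozenset({"AAC"}): 0.80,
        frozenset({"CTDD"}): 0.78,
        frozenset({"NTC"}): 0.75,
        frozenset({"AAC", "CTDD"}): 0.80,
        frozenset({"AAC", "NTC"}): 0.85,
        frozenset({"AAC", "NTC", "CTDD"}): 0.84,
    }
    print(forward_select_groups(["AAC", "CTDD", "NTC"], toy_auc.__getitem__))
```

With the toy scores, the loop picks AAC first, then NTC, and stops because adding CTDD would lower the score. This mirrors how a feature group such as CTDD can be left out of a class-specific classifier even though it is informative on its own.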

Table 11

Performances of seven class-specific classifiers using RF with the selected features

Activity (# features)	Dataset	SN	SP	ACC	MCC	AUC
Anti-parasitic (120)	10-fold CV	0.7526	0.8366	0.8202	0.4955	0.8783
Anti-parasitic (120)	Testing	0.6167	0.7732	0.7685	0.1570	0.7773
Anti-viral (1130)	10-fold CV	0.9109	0.9324	0.9247	0.8382	0.9692
Anti-viral (1130)	Testing	0.9085	0.8406	0.8613	0.7075	0.9404
Anti-cancer (263)	10-fold CV	0.7673	0.7888	0.7855	0.4507	0.8518
Anti-cancer (263)	Testing	0.7766	0.7060	0.7094	0.2208	0.8231
Targeting mammals (362)	10-fold CV	0.8677	0.8893	0.8853	0.6620	0.9392
Targeting mammals (362)	Testing	0.7849	0.8045	0.8035	0.2998	0.8648
Anti-fungal (606)	10-fold CV	0.8573	0.8553	0.8565	0.7050	0.9382
Anti-fungal (606)	Testing	0.8561	0.6675	0.7458	0.5186	0.8578
Targeting Gram-positive bacteria (843)	10-fold CV	0.8852	0.8848	0.8851	0.7687	0.9583
Targeting Gram-positive bacteria (843)	Testing	0.8877	0.6373	0.7423	0.5254	0.8745
Targeting Gram-negative bacteria (839)	10-fold CV	0.8805	0.8815	0.8809	0.7606	0.9560
Targeting Gram-negative bacteria (839)	Testing	0.8575	0.6574	0.7413	0.5116	0.8672

Note: SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
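For readers who want to reproduce the measures reported in Table 11, the standard definitions of SN, SP, ACC and MCC can be computed from the four confusion-matrix counts as sketched below. The function name `binary_metrics` and the example counts are hypothetical; the counts are not taken from the paper's datasets.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Return SN, SP, ACC and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative sample, so that the
    SN and SP denominators are non-zero.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical, strongly imbalanced test set: 30 positives, 970 negatives.
    for name, value in binary_metrics(tp=18, fn=12, tn=750, fp=220).items():
        print(f"{name}: {value:.4f}")
```

Because MCC uses all four counts, it can stay low on an imbalanced test set even when ACC looks acceptable, which is one way to read rows of Table 11 in which the testing ACC is moderate but the MCC is small.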

Table 12

Performances of AMPfun and existing AMP prediction tools based on the independent testing set

Activities	Tool	SN	SP	ACC	MCC	AUC
Anti-viral	AMPfun	0.9085	0.8406	0.8613	0.7075	0.9404
Anti-viral	iAMPpred [28]	0.3128	0.3959	0.3706	−0.2682	0.3158
Anti-viral	AVPpred [34]	0.2409	0.8857	0.6901	0.1643	N/A
Anti-cancer	AMPfun	0.7766	0.7060	0.7094	0.2208	0.8231
Anti-cancer	MLACP [21]	0.7234	0.7512	0.7499	0.2272	0.8320
Anti-fungal	AMPfun	0.8561	0.6675	0.7458	0.5186	0.8578
Anti-fungal	iAMPpred [28]	0.6610	0.7212	0.6962	0.3796	0.7412
Anti-fungal	AMPfun	0.7929	0.7455	0.7678	0.5375	0.8448
Anti-fungal	AntiFP [19]	0.6699	0.7039	0.6879	0.3737	N/A
Targeting Gram-positive bacteria	AMPfun	0.8829	0.6282	0.7385	0.5155	0.8653
Targeting Gram-positive bacteria	iAMPpred [28]	0.6872	0.6541	0.6684	0.3382	0.6950
Targeting Gram-negative bacteria	AMPfun	0.8563	0.6522	0.7406	0.5086	0.8590
Targeting Gram-negative bacteria	iAMPpred [28]	0.6932	0.6568	0.6726	0.3469	0.6958

Note: N/A means that we could not derive the values from the provided results. SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.

Comparison with existing AMP prediction tools in terms of predictive performance

AMPfun was further compared with related prediction tools, namely iAMPpred, AVPpred, MLACP and AntiFP. Performance on the independent testing set for each functional activity is given in Table 12, and the corresponding ROC curves are presented in Figure 8. For the anti-viral class, the accuracy of AMPfun was higher than that of both iAMPpred and AVPpred. AMPfun also achieved substantially higher MCC values for most functional activities; the exception was the anti-cancer class, where its MCC was 0.0064 lower than that of MLACP. In terms of AUC, AMPfun exceeded iAMPpred by 0.6246 for the anti-viral class and by 0.1166, 0.1703 and 0.1632 for anti-fungal AMPs and AMPs targeting Gram-positive and Gram-negative bacteria, respectively. For the anti-cancer model, however, the AUC of AMPfun was 0.0089 lower than that of MLACP.
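To make the AUC comparisons above concrete, the AUC can be read as the probability that a randomly chosen positive peptide receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `roc_auc` and the example scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are 1 for positives and 0 for negatives; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Invented scores for four peptides (two positives, two negatives).
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, scores))
    # Reversing the ranking mirrors the AUC around 0.5.
    print(roc_auc(labels, [1.0 - s for s in scores]))
```

Under this reading, an AUC below 0.5 on a given testing set means that negatives tended to receive higher scores than positives on that set.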

AMPfun web interface

Based on our method, an online prediction server, AMPfun (http://fdblab.csie.ncu.edu.tw/AMPfun/index.html), has been developed to predict the likelihood that a peptide sequence is an AMP with a particular functional activity. The best prediction models developed for anti-parasitic, anti-viral, anti-cancer and anti-fungal AMPs, as well as for AMPs targeting mammals, Gram-positive bacteria and Gram-negative bacteria, are deployed on the server. Screenshots of the website are shown in Figure S3.

Figure 8

The ROC curves of AMPfun, MLACP and iAMPpred when predicting anti-viral AMPs, anti-cancer AMPs, anti-fungal AMPs, AMPs targeting Gram-positive bacteria and AMPs targeting Gram-negative bacteria.

Discussions and conclusion

Multidrug antibiotic resistance is growing rapidly throughout the world. Consequently, many common antibiotics can no longer be used against pathogenic bacteria, and the identification of AMPs and their functional activities has attracted increasing attention. In this study, we constructed a two-stage framework based on machine learning techniques that identifies AMPs and determines their functional activities. We adopted a forward feature selection algorithm to illustrate the importance of crucial features when developing the classifiers for different AMP activities. To extract more informative patterns from the peptide sequences, we used the n-gram concept to encode binary profiles of positional features and composition features. AAC, PseAAC and CTDD were also used as features. Of these, AAC and PseAAC are commonly used to separate AMPs from non-AMPs and to identify their functional activities [7, 11, 19, 21, 22, 24, 27–30].
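The two-stage idea can be sketched as a small dispatch function. This is only an illustration under assumptions: the names `two_stage_predict`, `amp_scorer` and `activity_scorers`, the 0.5 threshold, and the choice to skip the 2nd stage for peptides scored below the threshold are all hypothetical and are not confirmed details of the AMPfun server.

```python
from typing import Callable, Dict

Scorer = Callable[[str], float]


def two_stage_predict(
    seq: str,
    amp_scorer: Scorer,
    activity_scorers: Dict[str, Scorer],
    amp_threshold: float = 0.5,
) -> Dict[str, object]:
    """Score a peptide as AMP or non-AMP, then score each functional activity.

    Stage 1 uses a single AMP classifier; stage 2 uses one class-specific
    classifier per activity. Stage 2 is run here only when the stage 1
    score reaches `amp_threshold` (an assumption made for this sketch).
    """
    amp_score = amp_scorer(seq)
    activities: Dict[str, float] = {}
    if amp_score >= amp_threshold:
        activities = {name: scorer(seq) for name, scorer in activity_scorers.items()}
    return {"AMP": amp_score, "activities": activities}


if __name__ == "__main__":
    # Stand-in scorers returning fixed values; real scorers would wrap
    # trained classifiers and a feature-encoding step.
    result = two_stage_predict(
        "GLFKAAK",
        amp_scorer=lambda s: 0.9,
        activity_scorers={"anti-viral": lambda s: 0.7, "anti-fungal": lambda s: 0.2},
    )
    print(result)
```

Keeping the class-specific scorers in a dictionary lets each activity use its own selected feature set, in line with the per-class feature selection reported in Table 10.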

Most previous studies considered only the composition or binary profile of single, double or triple amino acids [11, 22, 27–30]. In our study, we used amino acid n-grams to encode the composition and binary profile features of AMPs. We applied two approaches, counting and t-tests, to investigate the influence of n-grams on AMPs and their functional activities. However, no n-gram was identified by the counting approach in the 1st stage, owing to the large difference in length between AMPs and non-AMPs. The non-AMPs in the negative dataset were much longer, which seems to have produced a more diverse set of n-grams. Because an n-gram was filtered out of the positive dataset as soon as it appeared in any non-AMP sequence, regardless of its frequency, none survived. During feature selection for the AMP classifier, CTDD was selected first, and the classifier performed well even when CTDD was the only feature considered. Table S1 shows that all of the properties that CTDD considers for the 1st residue are important patterns for separating AMPs from non-AMPs. Previous studies have indicated that AMPs are generally positively charged and amphiphilic, with a large proportion of hydrophobic residues [5, 14]. This suggests that charge is a key factor in separating AMPs from non-AMPs, with terminal residue properties differing significantly between the two [6].
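The counting approach and the n-gram composition encoding discussed above can be sketched as follows. This is a simplified illustration, not the authors' code: the function names, the `min_count` parameter and the normalisation by the number of n-gram windows are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set


def ngrams(seq: str, n: int) -> List[str]:
    """All overlapping n-grams of a peptide sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def counting_ngrams(
    positives: Iterable[str],
    negatives: Iterable[str],
    n: int,
    min_count: int = 1,
) -> Set[str]:
    """n-grams found in at least `min_count` positive sequences and in no
    negative sequence, however often they occur among the positives."""
    pos_counts: Counter = Counter()
    for seq in positives:
        pos_counts.update(set(ngrams(seq, n)))  # count each sequence once
    neg_seen: Set[str] = set()
    for seq in negatives:
        neg_seen.update(ngrams(seq, n))
    return {g for g, c in pos_counts.items() if c >= min_count and g not in neg_seen}


def ngram_composition(seq: str, selected: Sequence[str], n: int) -> List[float]:
    """Frequency of each selected n-gram among all n-gram windows of `seq`."""
    grams = ngrams(seq, n)
    counts = Counter(grams)
    total = len(grams)
    return [counts[g] / total if total else 0.0 for g in selected]


if __name__ == "__main__":
    # Toy sequences invented for illustration.
    positives = ["GLFK", "KGLF"]
    negatives = ["FKAA"]
    kept = sorted(counting_ngrams(positives, negatives, n=2))
    print(kept)
    print(ngram_composition("GLGL", kept, n=2))
```

Because a single occurrence in any negative sequence removes an n-gram, a negative set made of long and diverse sequences can eliminate every candidate, which is consistent with the empty result reported for the 1st stage.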

In the 2nd stage, we used seven class-specific classifiers. In addition to the 1st stage features, we added motif-based binary profile positional features and compositional features. As shown in Figure 7, the motif-related features were less important than the other features overall, but still showed a strong correlation with the anti-parasitic and anti-cancer AMPs and with those targeting mammals. One possible reason is that these features need to be combined with others to improve classifier performance. A previous study indicated that charge also plays an important role in the identification of anti-bacterial and anti-fungal peptides [11].

Figures 6 and 7 and Table S3 all show that the features selected may depend on the functional activity of the AMP. For the anti-parasitic classifier, the pseudo amino acids ‘A’ and ‘K’, together with the hydrophobicity of the neutral group at 25% of the residues, showed high feature importance. For the anti-viral classifier, the charge of the neutral group, the secondary structure of the helix group, the solvent accessibility of the buried group, and the polarizability and normalized van der Waals volume of the 3rd group had high feature importance. These important features all concerned the 1st residue, which indicates a major positional difference. Separately, the results for the anti-cancer classifier showed that the compositions ‘GL’ and ‘GLF’ and the pseudo amino acids ‘C’ and ‘I’ were important. Interestingly, for the classifier of AMPs targeting mammals, the top 10 features by feature importance also had higher PCCs. The results for the anti-fungal classifier indicated that the pseudo amino acids ‘W’, ‘K’ and ‘λ2’ seemed to be essential factors. For the classifier of AMPs targeting Gram-positive bacteria, negative charge at 75% and 100% of the residues, neutral charge at the 1st residue, helix secondary structure at the 1st residue and the pseudo amino acids ‘E’ and ‘G’ were the key features. Finally, for the classifier of AMPs targeting Gram-negative bacteria, negative charge at 50% and 100% of the residues, polar hydrophobicity at the 1st residue and the pseudo amino acids ‘E’ and ‘G’ were important. The distribution of these important pseudo amino acids also differed considerably between the positive and negative training sets, which implies that they are important factors in identifying AMP functional activities. Previous research has indicated that amino acids ‘K’, ‘E’, ‘G’, ‘P’, ‘C’ and ‘I’ are important in classifying anti-bacterial and anti-fungal peptides, and that amino acids ‘R’, ‘K’, ‘W’, ‘S’, ‘T’, ‘P’, ‘H’, ‘C’ and ‘I’ are essential for identifying AVPs [11]. As we collected data from a variety of databases and defined the negative dataset rigorously, these findings are well supported.
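Phrases such as "the 1st residue" and "75% and 100% of the residues" refer to the positions encoded by the CTDD distribution descriptor. The sketch below computes these five positions for one property, charge, using the grouping commonly used for CTD descriptors (positive: K, R; negative: D, E; neutral: the rest). The function name `ctdd_one_property` and the rounding rule for the 25%, 50% and 75% occurrences are assumptions; published implementations differ in such details.

```python
import math
from typing import Dict, List, Set

# Charge grouping commonly used for CTD descriptors.
CHARGE_GROUPS: Dict[str, Set[str]] = {
    "positive": set("KR"),
    "neutral": set("ANCQGHILMFPSTWYV"),
    "negative": set("DE"),
}


def ctdd_one_property(seq: str, groups: Dict[str, Set[str]]) -> Dict[str, List[float]]:
    """For each group, return the sequence positions (as % of length) of the
    first residue and of the residues at 25%, 50%, 75% and 100% of that
    group's occurrences. Groups that are absent give five zeros."""
    length = len(seq)
    out: Dict[str, List[float]] = {}
    for name, members in groups.items():
        # 1-based positions of the residues belonging to this group.
        positions = [i + 1 for i, aa in enumerate(seq) if aa in members]
        if not positions:
            out[name] = [0.0] * 5
            continue
        k = len(positions)
        # Occurrence ranks: first, then floor(k * fraction), at least 1.
        ranks = [1] + [max(1, math.floor(k * f)) for f in (0.25, 0.5, 0.75, 1.0)]
        out[name] = [100.0 * positions[r - 1] / length for r in ranks]
    return out


if __name__ == "__main__":
    # Toy peptide invented for illustration.
    for group, values in ctdd_one_property("GLFKKAADEK", CHARGE_GROUPS).items():
        print(group, values)
```

Each property contributes 3 groups × 5 positions = 15 values, so seven physicochemical properties give the 105 CTDD features listed in Table 10; the count of seven is inferred here from the properties named in the text.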

AMPfun was further compared with several AMP prediction models using our class-specific testing sets. AMPfun achieved a higher AUC than the other tools in most cases; for ACPs, however, its AUC was slightly lower than that of MLACP [21]. Three factors may explain these differences in performance. First, we collected data from nine resources covering a range of functional activities, which increased the size and diversity of our dataset. Second, some tools did not use the PseAAC and CTDD features, which seem to be important when classifying AMPs by functional activity. Finally, we used a strict AMP definition when compiling the negative dataset.

This work presents a new scheme for identifying AMPs and classifying them into a wide variety of functional activities. Based on this scheme, a web server, AMPfun, was developed to screen unknown peptides for AMPs and identify their potential functional activities. Moreover, we have pinpointed features that help classify short peptides as AMPs or non-AMPs, such as the physicochemical properties of the peptide’s 1st residue. In conclusion, this study not only constructed classifiers to distinguish the activities of AMPs but also identified key features associated with different AMP functions. In the future, post-translational modifications [35, 36] should be considered in further investigations of AMP functions.

Key Points

Although several studies have proposed different machine learning methods to perform antimicrobial peptide (AMP) prediction, most did not consider the diversity of antimicrobial activities. Thus, we specifically investigated the sequential features of AMPs with various functions including anti-parasitic, anti-viral, anti-cancer and anti-fungal AMPs, as well as those that target mammals, Gram-positive and Gram-negative bacteria.
We proposed a two-stage scheme that first identifies AMPs and then characterizes their functional activities. Features including hydrophobicity, normalized van der Waals volume, polarity, charge and solvent accessibility were essential for classifying peptides as AMPs or non-AMPs; their use enabled the 1st stage AMP classifier to achieve an area under the receiver operating characteristic curve (AUC) value of 0.9894. In the 2nd stage, we found that the pseudo amino acid composition was an informative attribute for differentiating AMPs by functional activity.
The independent testing results demonstrated that the AUCs of the multi-class models were 0.7773, 0.9404, 0.8231, 0.8578, 0.8648, 0.8745 and 0.8672 with respect to anti-parasitic, anti-viral, anti-cancer, anti-fungal AMPs and those that target mammals, Gram-positive and Gram-negative bacteria, respectively. The proposed scheme was implemented as a web-based tool—AMPfun—which is now free to access at http://fdblab.csie.ncu.edu.tw/AMPfun/index.html.

Author contributions

C.R.C. and T.R.K. carried out the data collection and curation, participated in the bioinformatics analyses and drafted the manuscript. T.R.K. carried out the web tool implementation. C.R.C., L.C.W. and T.Y.L. participated in the design of the study and performed the draft revision. J.T.H. and T.Y.L. conceived of the study, participated in its design and coordination and helped revise the manuscript. All authors read and approved the final manuscript.

Chia-Ru Chung is a PhD candidate in the Department of Computer Science and Information Engineering, National Central University. Her research interests include bioinformatics, statistical inference, big data analytics, data mining and machine learning.

Ting-Rung Kuo obtained her master's degree in the Department of Computer Science and Information Engineering, National Central University. Her research interests include bioinformatics, data mining and machine learning.

Prof. Li-Ching Wu has a PhD in computer science and is a bioinformatics specialist. He joined the Institute of Systems Biology and Bioinformatics, National Central University, as an Assistant Professor in the fall of 2006 and was promoted to Associate Professor in 2011. He is the creator of the biology database PGTdb and also participated in the construction of several databases and tools such as ProSplicer, VirusProbeDB and RgsMiner. His research interests include bioinformatics, systems biology, big data, disease network analysis and machine learning.

Dr. Tzong-Yi Lee is now an associate professor in the School of Life and Health Sciences, Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen. His research interests include bioinformatics, computational biology, systems biology, big data analytics, data mining and machine learning.

Dr. Jorng-Tzong Horng is now a Distinguished Professor in the Department of Computer Science and Information Engineering, National Central University, Taiwan. His research interests include database system, bioinformatics, computational biology, big data analytics, data mining and machine learning.

References

1. Ventola CL. The antibiotic resistance crisis: part 1: causes and threats. P T 2015;40:277.
2. Jhong JH, Chi YH, Li WC, et al. dbAMP: an integrated resource for exploring antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data. Nucleic Acids Res 2019;47:D285–97.
3. Gaspar D, Veiga AS, Castanho MA. From antimicrobial to anticancer peptides. A review. Front Microbiol 2013;4:294.
4. Huang KY, Chang TH, Jhong JH, et al. Identification of natural antimicrobial peptides from bacteria through metagenomic and metatranscriptomic analysis of high-throughput transcriptome data of Taiwanese oolong teas. BMC Syst Biol 2017;11:131.
5. Bahar AA, Ren D. Antimicrobial peptides. Pharmaceuticals 2013;6:1543–75.
6. Bhadra P, Yan J, Li J, et al. AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 2018;8:1697.
7. Chang KY, T-p L, Shih L-Y, et al. Analysis and prediction of the critical regions of antimicrobial peptides based on conditional random fields. PLoS One 2015;10:e0119490.
8. Wang G, Li X, Wang Z. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 2015;44:D1087–93.
9. Waghu FH, Barai RS, Gurung P, et al. CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res 2015;44:D1094–7.
10. Lee H-T, Lee C-C, Yang J-R, et al. A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015;2015:4–5.
11. Meher PK, Sahu TK, Saini V, et al. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 2017;7:42362.
12. Jenssen H, Hamill P, Hancock RE. Peptide antimicrobial agents. Clin Microbiol Rev 2006;19:491–511.
13. Silva O, De La Fuente-núñez C, Haney E, et al. An anti-infective synthetic peptide with dual antimicrobial and immunomodulatory activities. Sci Rep 2016;6:35465.
14. Zhang L-j, Gallo RL. Antimicrobial peptides. Curr Biol 2016;26:R14–9.
15. de la Fuente-núñez C, Silva ON, Lu TK, et al. Antimicrobial peptides: role in human disease and potential as immunotherapies. Pharmacol Ther 2017;178:132–40.
16. Fan L, Sun J, Zhou M, et al. DRAMP: a comprehensive data repository of antimicrobial peptides. Sci Rep 2016;6:24482.
17. Mehta D, Anand P, Kumar V, et al. ParaPep: a web resource for experimentally validated antiparasitic peptide sequences and their structures. Database 2014;2014:4–5.
18. Qureshi A, Thakur N, Tandon H, et al. AVPdb: a database of experimentally validated antiviral peptides targeting medically important viruses. Nucleic Acids Res 2013;42:D1147–53.
19. Agrawal P, Bhalla S, Chaudhary K, et al. In silico approach for prediction of antifungal peptides. Front Microbiol 2018;9:323.
20. Tyagi A, Tuknait A, Anand P, et al. CancerPPD: a database of anticancer peptides and proteins. Nucleic Acids Res 2014;43:D837–43.
21. Manavalan B, Basith S, Shin TH, et al. MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget 2017;8:77121.
22. Tyagi A, Kapoor P, Kumar R, et al. In silico models for designing and discovering novel anticancer peptides. Sci Rep 2013;3:2984.
23. Zare M, Mohabatkar H, Faramarzi FK, et al. Using Chou’s pseudo amino acid composition and machine learning method to predict the antiviral peptides. Open Bioinforma J 2015;9:3.
24. Chen W, Ding H, Feng P, et al. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2016;7:16895.
25. Hajisharifi Z, Piryaiee M, Beigi MM, et al. Predicting anticancer peptides with Chou’s pseudo amino acid composition and investigating their mutagenicity via Ames test. J Theor Biol 2014;341:34–40.
26. Khosravian M, Kazemi Faramarzi F, Mohammad Beigi M, et al. Predicting antibacterial peptides by the concept of Chou’s pseudo-amino acid composition and machine learning methods. Protein Pept Lett 2013;20:180–6.
27. Lata S, Mishra NK, Raghava GP. AntiBP2: improved version of antibacterial peptide prediction. BMC Bioinformatics 2010;11:S19.
28. Xiao X, Wang P, Lin W-Z, et al. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 2013;436:168–77.
29. Lin W, Xu D. Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types. Bioinformatics 2016;32:3745–52.
30. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45:D158–69.
31. Cao D-S, Xu Q-S, Liang Y-Z. propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics 2013;29:960–2.
32. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.
33. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.