Abstract

Antimicrobial peptides (AMPs) have received a great deal of attention given their potential to become a plausible option to fight multi-drug-resistant bacteria as well as other pathogens. Quantitative sequence-activity models (QSAMs) have been helpful to discover new AMPs because they allow exploring a large universe of peptide sequences and help reduce the number of wet-lab experiments. A main aspect in the building of QSAMs based on shallow learning is to determine an optimal set of protein descriptors (features) required to discriminate between sequences with different antimicrobial activities. These features are generally handcrafted from peptide sequence datasets that are labeled with specific antimicrobial activities. However, recent developments have shown that unsupervised approaches can be used to determine features that outperform human-engineered (handcrafted) features. Thus, knowing which of these two approaches contributes to a better classification of AMPs is a fundamental question for designing more accurate models. Here, we present a systematic and rigorous study to compare both types of features. Experimental outcomes show that non-handcrafted features lead to better performances than handcrafted features. However, the experiments also show that performance improves further when both types of features are merged. A relevance analysis reveals that non-handcrafted features have higher information content than handcrafted features, while an interaction-based importance analysis reveals that handcrafted features are more important. These findings suggest that the two types of features are complementary. Comparisons with state-of-the-art deep models show that shallow models yield better performances, both when fed with non-handcrafted features alone and when fed with non-handcrafted and handcrafted features together.

Introduction

Antimicrobial resistance (AMR) [1, 2] has emerged as a major threat to human health [3–6], mainly due to the misuse and overuse of conventional drugs [7]. AMR arises when microorganisms (i.e. bacteria, viruses, fungi and parasites) develop the ability to defeat the drugs that once affected them. For instance, according to reports submitted to the Global Antimicrobial Resistance and Use Surveillance System [8], the rate of resistance to ciprofloxacin (an antibiotic often used to treat common infections such as those of the urinary tract) ranges between 8.4 and 92.9% for Escherichia coli and between 4.1 and 79.4% for Klebsiella pneumoniae. This growing AMR makes infections harder to treat, increasing the risk of disease spread, severe illness and death [3–6]. Indeed, according to the United States Centers for Disease Control and Prevention, more than 35 000 deaths each year are due to antibiotic-resistant infections [9].

One of the current efforts to tackle AMR is aimed at designing novel antibiotics and therapies based on antimicrobial peptides (AMPs). AMPs [10] are a vital part of the innate immune system, and the interest in them as potential drug candidates stems from their capabilities to kill a wide spectrum of bacteria [11, 12], parasites [13–15], fungi [16, 17] and viruses [18, 19], as well as to treat non-communicable diseases such as diabetes [20] and cancer [21, 22]. However, the translation of promising AMPs into clinical therapeutics constitutes a challenge because of their inherent bio- and physicochemical properties [23, 24]. That is, AMPs are water-soluble and thus generally show limited ability to diffuse across bio-membranes. They are also biologically unstable, as they are easily hydrolyzed by proteolytic enzymes, which gives them short plasma half-lives [23, 24]. Thus, comprehensive studies of bioactivity, pharmacodynamic, pharmacokinetic and toxicological profiles need to be integrated into the different phases of AMP-based drug discovery (PDD) [25–27]. We use the term AMP (or general-AMP) to refer to peptide sequences with some antimicrobial activity (e.g. antibacterial, antifungal, antiparasitic, antiviral, among others [10]). Otherwise, a specific notation according to the activity of the peptide will be used.

In PDD workflows, quantitative sequence-activity models (QSAMs) play a vital role [28–52] because they can be applied to screen large amounts of peptide sequences to identify likely AMPs. QSAMs are built with machine learning methods to achieve good mathematical correlations between the sequential and compositional features of known AMP sequence datasets and their activity. Thus, QSAMs can be divided into two groups according to the learning method used: (i) those based on shallow learning and (ii) those based on deep learning. Hereafter, we use the term shallow learning to refer to all non-deep learning approaches; that is, shallow learning encompasses both traditional learning (e.g. Support Vector Machine, k-Nearest Neighbors, Decision Trees, among others) and ensemble learning (e.g. AdaBoost, Random Forest (RF), among others).

A crucial step in the building of QSAMs based on shallow learning (henceforth also termed shallow models) is the calculation and selection of the ‘best’ set of protein descriptors (PDs). To date, tens of thousands of PDs [53–59] can be calculated with different software [60–65]. For instance, more than 250 000 different PDs can be obtained from the ProtDCal [62] and starPep [64] software. These PDs, also known as handcrafted PDs (HPDs), are derived from human-engineered algorithms that extract chemical information contained in the protein sequences themselves. The large number of extant HPDs exists because no single HPD family is able to codify all the chemical information for different datasets, since their relevance depends on the protein sequences to be studied [66]. Consequently, thousands of HPDs must first be calculated so that the most important ones can subsequently be selected [67, 68] to model the endpoint of interest.

Although there is a wide spectrum of HPDs and feature selectors [67, 68], we think that the sole use of HPDs limits the building of shallow models with better performances because HPDs are derived only from each protein sequence itself and thus do not contain relevant information related to the variation (long-term dependencies) of protein sequences selected across evolution [69–72]. To tackle this issue, position-specific scoring matrix (PSSM)-derived features can be used [73]. However, this leads to alignment-dependent models, which are not the most suitable for processing data coming from sequencing technologies because they are memory- and time-consuming [74]. Given this scenario, the use of deep learning methods has emerged as an alternative to create alignment-free models to identify AMPs [39–41, 45–49, 52], mainly because they have the ability to autonomously learn features that encode the semantics of the input protein sequences, obviating the feature engineering process required by shallow models. However, we recently demonstrated [75] that deep learning does not contribute to obtaining better performances than shallow learning in the context of the supervised classification of AMPs. In addition to bad modeling practices, one of the main reasons for this outcome is the limited number of AMPs (<14 k) used to train the deep models (see Table 1 in [75]). That is, because the universe of currently known AMP sequences is small [76, 77], the size of the training datasets is not large enough [78–80] to autonomously obtain learned (non-handcrafted) features with better modeling abilities than the handcrafted features that can be calculated on the same datasets.

However, there are currently several public databases [81, 82] containing millions of protein sequences spanning evolutionary diversity [69–72], and self-supervised methods are the most suitable ones to learn from that diversity. Self-supervision is a learning paradigm that, unlike supervised learning, does not require manual annotation of input data and thus can leverage much larger amounts of data [83]. This approach automatically learns useful hierarchical representations from unlabeled datasets, which can then be fine-tuned on small labeled datasets for supervised downstream tasks, such as the identification of AMPs via shallow learning. Rives et al. [83] built a large Transformer-based model (termed ESM-1b Transformer) that was trained on 86 billion amino acids across 250 million sequences obtained from UniParc [84]. They proved that the learned representations were useful in the supervised prediction of mutational effects and secondary structure, and that they improved over state-of-the-art features (e.g. the ones derived from PSSMs [73]) for long-range residue–residue contact prediction. Moreover, Zhang et al. [50] also introduced a model based on self-supervision that was trained with 556 603 sequences obtained from UniProt [81]. However, their results were inferior to the ones obtained by shallow models fed with ProtDCal HPDs (see Table 3A and E in [75]). The latter confirms that (i) shallow classifiers are well suited to identifying AMPs and (ii) large amounts of data are required to autonomously determine good learned (non-handcrafted) representations.

All in all, this work is aimed at examining the use of non-handcrafted protein features (nHPDs) obtained via self-supervision in the supervised classification of AMPs using shallow learning, as well as at assessing the modeling abilities of nHPDs when combined with handcrafted protein features (HPDs). In this way, we analyze whether both types of features are complementary for building good shallow models. To this end, we built RF models fed with nHPDs derived from the ESM-1b Transformer model [83], as well as RF models fed with state-of-the-art HPDs. We also built RF models by joining the nHPDs and HPDs accounted for. The models were built and evaluated across five benchmarking datasets. We also analyzed both types of features via chemometric and model-agnostic explainability methods. Finally, we performed comparisons with respect to state-of-the-art deep neural network (DNN) architectures. The RF classifier was the only shallow learner used in this work to guarantee fair analyses and comparability of results regarding the baseline models considered [51]. All the shallow and deep models were built for benchmarking purposes only and not to introduce new models.

Material and methods

AMP datasets

We used five benchmarking datasets (see Table 1 for a description) recently proposed in [51] to develop RF models to identify general-AMPs as well as antibacterial (ABP), antifungal (AFP), antiparasitic (APP) and antiviral (AVP) peptide sequences. A detailed explanation of how these datasets were created can be found in Section 2.1 in [51]. Each of these benchmarking datasets is comprised of training, test and external datasets. Unlike the external datasets, the training and test datasets were created from a rational division based on clustering (see Section 2.1 in [51]) and thus present similar distributions. Consequently, the external datasets used in this work are more suitable than the test datasets to assess the generalization ability of the models to be built. Both the datasets and RF models proposed in [51] are considered as baselines in this work to guarantee fair analyses and comparisons.

Table 1

Datasets used to build, evaluate and compare the models in this work.

Endpoint      Aim        Total    Positive   Negative
General-AMP   Training   19 548   9781       9767
              Test       5125     2564       2561
              External   15 685   4914       10 771
ABP           Training   13 166   6583       6583
              Test       3390     1695       1695
              External   15 528   4757       10 771
AFP           Training   1556     778        778
              Test       430      215        215
              External   15 528   4757       10 771
APP           Training   198      99         99
              Test       62       31         31
              External   11 182   411        10 771
AVP           Training   4642     2321       2321
              Test       1246     623        623
              External   12 001   1230       10 771

Extraction of handcrafted protein descriptors (HPDs)

We used HPDs calculated with the ProtDCal [62] and iLearn [85] software. On the one hand, it is important to mention that the calculation and selection of the ProtDCal HPDs was performed in [51] to build the baseline RF models. This process was performed on each training dataset from a pool initially comprised of 96 026 ProtDCal HPDs (see Section 2.2 in [51]). The importance of the ProtDCal HPDs finally included in the baseline RF models was widely discussed in Section 3.4 in [51]. On the other hand, a total of 3774 HPDs were additionally calculated in this work with the iLearn tool [85]. The iLearn HPDs calculated on each training dataset were grouped according to their definitions as follows: Group 1 (G1) is comprised of 2420 HPDs, of which 1600 are Composition of K-Spaced Amino Acid Pairs (CKSAAP), 20 are Amino Acid Composition (AAC), and 400 each are Dipeptide Composition (DPC) and Dipeptide Deviation from Expected Mean (DDE); Group 2 (G2) is comprised of 255 HPDs, of which 100 are Composition of K-Spaced Amino Acid Group Pairs (CKSAAGP), 5 are Grouped Amino Acid Composition (GAAC), 25 are Grouped Dipeptide Composition (GDPC) and 125 are Grouped Tripeptide Composition (GTPC); Group 3 (G3) is comprised of 32 Normalized Moreau–Broto Autocorrelation (NMBroto) HPDs; Group 4 (G4) is comprised of 273 Composition/Transition/Distribution (CTD) HPDs; Group 5 (G5) is comprised of 686 HPDs, of which 343 each are Conjoint Triad (CTriad) and K-Spaced Conjoint Triad (KSCTriad); Group 6 (G6) contains 56 HPDs, of which 8 are Sequence-Order-Coupling Number (SOCNumber) and 48 are Quasi-Sequence-Order (QSOrder); and Group 7 (G7) is comprised of 52 HPDs, of which 24 are Pseudo-Amino Acid Composition (PAAC) and 28 are Amphiphilic Pseudo-Amino Acid Composition (APAAC). The lambda parameter used to compute the PAAC and APAAC HPDs was set to 4.
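To make the composition-based definitions concrete, the following Python sketch re-implements two of the G1 descriptor families (AAC and DPC) from their standard definitions. It is an illustrative re-implementation for a single sequence, not the iLearn code itself, and the example peptide (magainin 2) is used only for demonstration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Amino Acid Composition: frequency of each of the 20 residues (20 HPDs)."""
    n = len(sequence)
    return {aa: sequence.count(aa) / n for aa in AMINO_ACIDS}

def dpc(sequence):
    """Dipeptide Composition: frequency of each of the 400 residue pairs (400 HPDs)."""
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    total = max(len(sequence) - 1, 1)
    return {pair: c / total for pair, c in counts.items()}

# 20 AAC + 400 DPC values for magainin 2, a well-known AMP
features = {**aac("GIGKFLHSAKKFGKAFVGEIMNS"), **dpc("GIGKFLHSAKKFGKAFVGEIMNS")}
```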

Extraction of non-handcrafted protein descriptors (nHPDs)

A total of 1280 nHPDs were calculated for the datasets described in Table 1 using the ESM-1b Transformer model proposed by Rives et al. [83]. This 33-layer (∼650 M parameters) model was trained on 86 billion amino acids across 250 million protein sequences downloaded from the UniParc database [84]. Using the script provided by the authors (https://github.com/facebookresearch/esm), we extracted the average embedding (dim: 1 × 1280; script parameter: --include mean) over the per-amino-acid embeddings of the final layer (dim: sequence_length × 1280; script parameter: --repr_layers 33) for each peptide sequence. This average embedding is the vector of nHPDs obtained for each sequence. The average embedding was used because it yielded the best results in the studies of secondary structure prediction (see Table 6 in [83]) and long-range residue–residue contact prediction (see Table 7 in [83]) performed by Rives et al. [83].
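The same mean embeddings can also be obtained programmatically. The following minimal sketch follows the API documented in the facebookresearch/esm repository (including its BOS/EOS token handling) and is equivalent to the extract.py invocation described above; the peptide name and sequence are illustrative.

```python
import torch
import esm  # pip install fair-esm

# Load the pretrained 33-layer ESM-1b model and its tokenizer
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("peptide_1", "GIGKFLHSAKKFGKAFVGEIMNS")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33]  # shape: (1, seq_len + 2, 1280)

# Average over the real residues (positions 1..L, skipping BOS/EOS tokens)
# to obtain the 1 x 1280 vector of nHPDs for the peptide
seq_len = len(data[0][1])
mean_embedding = per_residue[0, 1:seq_len + 1].mean(dim=0)  # shape: (1280,)
```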

Feature selection and modeling methodology

On each descriptor matrix (7 holding iLearn HPDs and 1 holding ESM-1b nHPDs) obtained for each training dataset corresponding to each endpoint (i.e. general-AMPs, ABPs, AFPs, APPs and AVPs; see Table 1), a Shannon Entropy (SE)-based unsupervised filter [86] was applied to filter out those PDs (hereafter, this notation will be used to refer to protein descriptors in general) with an information content (SE value) inferior to 10% of the maximum entropy that a PD can achieve on each training dataset. Relevant PDs for modeling studies should present high SE values [86] as a measure of their good ability to discriminate among proteins (peptides) with different sequences. The number of discretization bins was set equal to the number of peptides in each training dataset. Thus, the maximum entropy that each PD can achieve in the general-AMP, antibacterial, antifungal, antiparasitic and antiviral training datasets is equal to 14.25, 13.68, 10.6, 7.63 and 12.18 bits, respectively. The retained PDs were sorted in ascending order according to their SE values, and then the Spearman correlation coefficient between each pair of PDs was calculated to remove the redundant ones: if a pair of PDs was correlated above 0.95, the PD with the lower SE value was filtered out.
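A minimal sketch of one reasonable implementation of this unsupervised filtering step is given below (NumPy/SciPy assumed). Iterating from the highest to the lowest SE value ensures that the lower-SE member of each highly correlated pair is the one discarded, as described above.

```python
import numpy as np
from scipy.stats import spearmanr

def shannon_entropy(values, n_bins):
    """SE (in bits) of one descriptor, discretized into n_bins bins
    (n_bins = number of peptides, so the maximum entropy is log2(n_bins))."""
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def se_spearman_filter(X, se_cut=0.10, rho_cut=0.95):
    """Drop descriptors below se_cut * max entropy, then drop the lower-SE
    member of every pair with |Spearman rho| > rho_cut."""
    n_peptides, n_desc = X.shape
    max_se = np.log2(n_peptides)
    se = np.array([shannon_entropy(X[:, j], n_peptides) for j in range(n_desc)])
    candidates = [j for j in range(n_desc) if se[j] >= se_cut * max_se]
    candidates.sort(key=lambda j: se[j], reverse=True)  # highest SE first
    selected = []
    for j in candidates:
        if all(abs(spearmanr(X[:, j], X[:, k]).correlation) <= rho_cut
               for k in selected):
            selected.append(j)
    return selected  # column indices of the retained descriptors
```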

Afterward, a supervised selection using the Correlation Subset method [87] was applied to each reduced descriptor matrix to determine a subset of PDs highly correlated with the activity and lowly correlated among themselves. Five ranking-type feature selection techniques were also applied to each reduced descriptor matrix, with the number of top-ranked PDs retained by these methods equal to the size of the subset obtained with the Correlation Subset method. One of the ranking-type procedures is the SE-based unsupervised method explained above, whereas the others are Relief-F [88], Information Gain [89], Symmetrical Uncertainty [89] and Chi Squared [89], which are supervised feature selection approaches. With the Relief-F method, the PDs that best discriminate between peptide sequences that are near each other are retained, whereas with the Symmetrical Uncertainty and Information Gain methods, the PDs with the highest information content regarding the activity are selected. Finally, the PDs most dependent on the activity are retained with the Chi Squared method. The SE-based selector was applied using the IMMAN tool [90], whereas the other ones were applied with their default settings in the WEKA tool v3.8 [89].
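All ranking-type selectors follow the same pattern: score every descriptor against the activity labels and keep the k top-ranked ones. As a hedged illustration, the sketch below uses scikit-learn's mutual_info_classif as a stand-in for WEKA's Information Gain ranker (the paper itself used the WEKA and IMMAN implementations).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_k_by_information_gain(X, y, k):
    """Rank descriptors by mutual information with the activity labels and
    keep the k top-ranked ones, with k set to the Correlation Subset size."""
    scores = mutual_info_classif(X, y, random_state=0)
    return np.argsort(scores)[::-1][:k]
```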

Additionally, because no single feature selector can guarantee an optimal feature subset, an additional descriptor subset (ensemble) was created for the iLearn HPDs and the ESM-1b nHPDs, respectively, by joining the descriptor subsets determined with the six selection methods explained above. The importance of merging different feature selectors has been explained elsewhere [67, 68]. So far, for each training dataset, a total of 43 iLearn HPD subsets and 7 ESM-1b nHPD subsets have been created. In addition, another 43 subsets containing HPDs calculated with the ProtDCal software [62] were also created for each training dataset. These subsets were built by combining each of the iLearn HPD subsets with the corresponding ProtDCal HPD (baseline) subset proposed in [51]. All these subsets containing a diversity of HPDs were created to build RF models as robust as possible in order to ensure rigorous comparisons with the RF models based on the ESM-1b nHPDs.

On all the built descriptor subsets (93 per training dataset), a wrapper-type selector [91] was applied using the Genetic Algorithm (GA) metaheuristic as the search method, accuracy as the fitness measure and RF [92] as the learning procedure. In this way, a total of 93 subsets comprised of the ‘most suitable’ features for the RF learner were additionally obtained. Therefore, a total of 186 descriptor subsets were finally taken into account to model each endpoint. Then, a model based on the RF classifier was created on each descriptor subset. RF was used to guarantee comparability of results with the baseline RF models introduced in [51], which are based on HPDs calculated with the ProtDCal software [62]. All the built RF models were assessed on their corresponding test and external datasets. The wrapper-type selector, RF classifier and GA metaheuristic were applied with their default settings in the WEKA tool v3.8. Schema 1 shows the workflow explained above; a sketch of the GA wrapper step is given after the schema.

Schema 1. Workflow applied on each training dataset to build the RF models.
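For the wrapper step of the workflow, the sketch below shows a minimal GA-based wrapper with RF accuracy as the fitness measure. It is a simplified stand-in for WEKA's wrapper: the population size, operators and stopping criterion are illustrative choices, not WEKA's defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of an RF trained on the subset encoded by mask."""
    if not mask.any():
        return 0.0
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(rf, X[:, mask], y, cv=5).mean()

def ga_wrapper(X, y, pop_size=20, generations=20, p_mut=0.01):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5            # random initial subsets
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        elite = pop[np.argsort(scores)[::-1][:pop_size // 2]]  # keep best half
        children = []
        for _ in range(pop_size - len(elite)):
            a, b = elite[rng.integers(len(elite), size=2)]
            cut = int(rng.integers(1, n))            # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child = child ^ (rng.random(n) < p_mut)  # bit-flip mutation
            children.append(child)
        pop = np.vstack([elite, *children])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[int(scores.argmax())]                 # best boolean descriptor mask
```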

Results and discussion

Do nHPDs have better modeling abilities than HPDs?

This section is aimed at comparing the modeling ability of HPDs and nHPDs. Figure 1 shows the boxplots corresponding to the training Matthews correlation coefficients (MCCtrain) obtained via 10-fold cross-validation by the best RF models built on each family (i.e. iLearn G1, …, iLearn G7 and ESM-1b) and ensemble (i.e. iLearn, and iLearn & ProtDCal) of PDs accounted for. The best RF model selected for each family (and ensemble) was the one with the highest MCCtrain value (see Data S1 for the PD sets, trained models and WEKA software outputs). The MCCtrain values obtained by the baseline models (last boxplot in Figure 1) are also shown. As can be noted in Figure 1, models with suitable variance (MCCtrain ≥ 0.65) were built from the AAC (G1), quasi-sequence-order (G6) and pseudo-amino acid composition (G7) HPD families, as well as from the iLearn HPD ensemble, the iLearn & ProtDCal HPD ensemble and the ESM-1b nHPD family. Consequently, these results indicate that models with suitable training metrics can always be built independently of the type of feature (i.e. HPDs or nHPDs) accounted for. Supplementary Table S1 shows the MCCtrain, training sensitivity (SNtrain), training specificity (SPtrain) and training accuracy (ACCtrain) values of all the models.

Figure 1. Training Matthews correlation coefficients obtained by the models built from each HPD family calculated with the iLearn software, from each ensemble of HPDs (i.e. iLearn, and iLearn & ProtDCal) and from the set of nHPDs obtained with the ESM-1b Transformer model. The last boxplot corresponds to the baseline RF models, which were created with ProtDCal HPDs.

Moreover, Table 2 shows the performance [i.e. sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC)] on the test and external datasets (see Table 1) of the best RF models built. The performance of the RF models based on the ProtDCal HPDs (baseline) [51] is shown as well. Supplementary Table S2 shows the performance metrics of the worst models. On the one hand, it can first be noted that the best built models achieved good performances according to the MCC values on the test datasets (MCCtest), which are generally comparable (or superior) to the MCCtest values obtained by the baseline RF models (i.e. those based only on ProtDCal HPDs). This suggests that the best built models are not overfitted according to their MCCtrain (see Supplementary Table S1) and MCCtest values, and consequently, they are suitable for further analyses. On the other hand, it can be observed that the best created models based on HPDs mostly presented better performances than the RF models based on nHPDs (i.e. those obtained with the ESM-1b model) according to the MCCtest values. Specifically, the highest MCCtest values achieved by the models based on HPDs were better than the ones based on nHPDs by 3.54, 2.46, 0.11, 9.15 and 5.16% on the general-AMP, ABP, AFP, APP and AVP test datasets, respectively. However, these better performances were mostly achieved using a greater number of HPDs than nHPDs. In general, these results on the test datasets could suggest that our premise is uncertain, i.e. that the sole use of HPDs does not limit the building of better models even though HPDs do not contain the evolutionary information present in protein sequences [69–72].
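For reference, the four reported metrics can be computed from a binary confusion matrix as in the following sketch (scikit-learn assumed):

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def classification_metrics(y_true, y_pred):
    """SN, SP, ACC and MCC as reported in Tables 2 and 3."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "SN": tp / (tp + fn),                    # sensitivity (true positive rate)
        "SP": tn / (tn + fp),                    # specificity (true negative rate)
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # accuracy
        "MCC": matthews_corrcoef(y_true, y_pred),
    }
```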

Table 2

Performance metrics on the test and external datasets (see Table 1) obtained by the best RF models built with different types of handcrafted descriptors as well as with ESM-1b non-handcrafted descriptors. The column ‘Size’ refers to the number of descriptors included in the models.

Endpoint     Type of descriptors             Size  SNtest  SPtest  ACCtest  MCCtest  SNext   SPext   ACCext  MCCext
General-AMP  ProtDCal (see Table 2 in [51])  207   0.862   0.945   0.904    0.810    0.911   0.909   0.910   0.799
             iLearn G1                       37    0.856   0.938   0.897    0.796    0.875   0.899   0.892   0.756
             iLearn G6                       39    0.851   0.942   0.897    0.797    0.904   0.908   0.906   0.791
             iLearn G7                       32    0.850   0.941   0.895    0.794    0.893   0.912   0.906   0.788
             iLearn Ensemble                 87    0.849   0.945   0.897    0.797    0.896   0.917   0.911   0.797
             iLearn ProtDCal Ensemble        144   0.866   0.952   0.909    0.820    0.910   0.910   0.910   0.799
             ESM-1b                          85    0.840   0.947   0.894    0.792    0.904   0.945   0.932   0.843
ABP          ProtDCal (see Table 2 in [51])  93    0.912   0.957   0.935    0.870    0.897   0.923   0.915   0.805
             iLearn G1                       44    0.899   0.950   0.925    0.850    0.873   0.909   0.898   0.766
             iLearn G6                       40    0.896   0.961   0.928    0.858    0.886   0.926   0.914   0.801
             iLearn G7                       32    0.903   0.949   0.926    0.853    0.880   0.932   0.916   0.805
             iLearn Ensemble                 111   0.901   0.966   0.934    0.870    0.890   0.934   0.921   0.816
             iLearn ProtDCal Ensemble        86    0.910   0.961   0.936    0.873    0.896   0.934   0.922   0.820
             ESM-1b                          47    0.894   0.956   0.925    0.852    0.906   0.953   0.939   0.856
AFP          ProtDCal (see Table 2 in [51])  144   0.921   0.963   0.942    0.884    0.747   0.908   0.859   0.664
             iLearn G1                       53    0.902   0.977   0.940    0.882    0.686   0.912   0.842   0.619
             iLearn G6                       49    0.907   0.963   0.935    0.871    0.709   0.922   0.857   0.654
             iLearn G7                       32    0.907   0.963   0.935    0.871    0.640   0.932   0.843   0.615
             iLearn Ensemble                 111   0.907   0.963   0.935    0.871    0.702   0.910   0.846   0.630
             iLearn ProtDCal Ensemble        35    0.930   0.963   0.947    0.893    0.739   0.914   0.860   0.666
             ESM-1b                          58    0.902   0.986   0.944    0.892    0.831   0.967   0.925   0.821
APP          ProtDCal (see Table 2 in [51])  40    0.871   0.903   0.887    0.775    0.825   0.867   0.865   0.357
             iLearn G1                       19    0.871   0.774   0.823    0.648    0.723   0.807   0.804   0.244
             iLearn G6                       18    0.742   0.903   0.823    0.654    0.781   0.903   0.898   0.392
             iLearn G7                       31    0.774   0.839   0.807    0.614    0.825   0.887   0.885   0.387
             iLearn Ensemble*                -     -       -       -        -        -       -       -       -
             iLearn ProtDCal Ensemble        28    0.839   0.903   0.871    0.743    0.830   0.872   0.870   0.366
             ESM-1b                          28    0.871   0.839   0.855    0.710    0.888   0.878   0.878   0.404
AVP          ProtDCal (see Table 2 in [51])  96    0.764   0.892   0.828    0.662    0.742   0.873   0.860   0.476
             iLearn G1                       81    0.807   0.900   0.854    0.711    0.584   0.878   0.848   0.374
             iLearn G6                       27    0.761   0.889   0.825    0.656    0.723   0.866   0.852   0.452
             iLearn G7                       33    0.769   0.904   0.836    0.679    0.706   0.843   0.829   0.406
             iLearn Ensemble                 174   0.793   0.915   0.854    0.713    0.680   0.876   0.856   0.438
             iLearn ProtDCal Ensemble        97    0.785   0.912   0.848    0.702    0.750   0.872   0.859   0.478
             ESM-1b                          70    0.783   0.891   0.837    0.678    0.921   0.868   0.873   0.585

* The RF models built on this subset did not satisfy the rule of thumb that each variable in a model should explain the variance of at least five cases.


However, the results obtained on the external datasets reveal that the RF models based on nHPDs were always better than all the RF models based on HPDs regarding the MCCext values (see Table 2). Specifically, the RF models based on nHPDs achieved MCCext values equal to 0.843, 0.856, 0.821, 0.404 and 0.585 in the classification of general-AMPs, ABPs, AFPs, APPs and AVPs, respectively, whereas the best RF models based on HPDs achieved MCCext values equal to 0.799, 0.82, 0.666, 0.392 and 0.478. Thus, the RF models based on nHPDs were better than the RF models based on HPDs by 5.51, 4.39, 23.27, 3.06 and 22.38%, respectively. The highest improvements were obtained in the classification of AFPs and AVPs, due mainly to a better classification of true AFPs and true AVPs (SNext metric). In fact, on these two endpoints, the highest SNext values obtained by the models based on HPDs were equal to 0.747 and 0.75, whereas the ones achieved by the models based on nHPDs were equal to 0.831 (11.24% better) and 0.921 (22.8% better), respectively.

These results are in correspondence with the No Free Lunch theorem [94], which states that ‘over the space of all possible problems, every machine learning method will perform as well as every other one on average’. In this case, there is no single model fed with nHPDs that will always perform better than any other model fed with HPDs, or vice versa. Nevertheless, these results suggest that the nHPDs obtained from the ESM-1b Transformer model [83] present better modeling abilities than the HPDs calculated with the ProtDCal [62] and iLearn [85] software, because the models built with them consistently yielded improvements across the external datasets. We state this assertion because the external datasets present a distribution different from the training datasets, unlike the test datasets, which were built from a rational division based on clustering (see Section 2.1 in [51]). Consequently, the external datasets are more suitable to evaluate the robustness and stability of the created models. Furthermore, a statistical study based on Bayesian estimation [95] (see Supplementary Table S3 for the input values) confirmed our previous assertion, as shown in Supplementary Figure S1. From this figure, it can be concluded that models based on nHPDs have a higher probability (0.7) of performing better predictions than models based on HPDs only. Finally, notice also that the models created with the nHPDs were mostly simpler, in terms of the number of variables, than those created with the HPDs (Ockham’s razor principle [96]). However, it is important to remark that these results do not imply that nHPDs are sufficient to build good models to identify AMPs. Indeed, there is no statistical difference between the models created with both types of PDs (see the rope zone in Supplementary Figure S1). This recommends studying whether HPDs and nHPDs can be combined to achieve better predictions; a sketch of the kind of Bayesian comparison used here is given below.
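As an illustration, the sketch below applies the baycomp package (which implements the Bayesian estimation procedure of [95]) to the external-dataset MCC values from Table 2. The paper's actual inputs are in Supplementary Table S3, and the rope width of 0.01 is an illustrative assumption, not the paper's setting.

```python
import numpy as np
from baycomp import two_on_multiple  # pip install baycomp

# MCCext of the paired models on the five endpoints (values from Table 2)
mcc_nhpd = np.array([0.843, 0.856, 0.821, 0.404, 0.585])  # nHPD-based RF models
mcc_hpd = np.array([0.799, 0.820, 0.666, 0.392, 0.478])   # best HPD-based RF models

# Probabilities that the nHPD models are better, practically equivalent
# (rope region), or worse than the HPD models
p_nhpd, p_rope, p_hpd = two_on_multiple(mcc_nhpd, mcc_hpd, rope=0.01)
```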

Does the combination of nHPDs and HPDs improve the identification of AMPs?

To fulfill the aim of this section, several RF models were created on subsets containing both nHPDs and HPDs. These subsets were created by combining the nHPD subsets with the HPD subsets obtained for each training dataset in the modeling process described in Schema 1. Then, a total of 15 reduced subsets were additionally built per each new subset comprised of nHPDs and HPDs as follows: (i) by applying the six feature selectors used in this work; (ii) by joining the outputs of those six feature selectors; and (iii) by applying the wrapper-type selector used in this work to the seven previous subsets (steps (i) and (ii)) as well as to the non-reduced nHPD and HPD subset. Finally, an RF model was built on each subset comprised of nHPDs and HPDs. These models were grouped according to the types of PDs, i.e. those based on ESM-1b & iLearn PDs, those based on ESM-1b & ProtDCal PDs, and those based on ESM-1b & iLearn & ProtDCal PDs. The best RF model (i.e. the one with the highest MCCtrain) in each of these groups was used for further analyses (see Supplementary Table S4 for the training metrics and Table 3 for the performance metrics on the test/external datasets).

Table 3

Performance metrics on the test and external datasets obtained by the best RF models when combining HPDs and nHPDs. The column ‘Size’ refers to the number of descriptors included in the models.

Endpoint     Type of descriptors           Size  SNtest  SPtest  ACCtest  MCCtest  SNext   SPext   ACCext  MCCext
General-AMP  ESM-1b & iLearn               85    0.857   0.954   0.906    0.815    0.906   0.936   0.927   0.833
             ESM-1b & ProtDCal             134   0.859   0.962   0.911    0.826    0.916   0.929   0.925   0.830
             ESM-1b & iLearn & ProtDCal    104   0.866   0.960   0.913    0.830    0.915   0.934   0.928   0.837
ABP          ESM-1b & iLearn               96    0.898   0.965   0.932    0.865    0.903   0.959   0.942   0.863
             ESM-1b & ProtDCal             71    0.877   0.973   0.925    0.855    0.896   0.948   0.932   0.841
             ESM-1b & iLearn & ProtDCal    49    0.898   0.973   0.935    0.873    0.907   0.958   0.942   0.864
AFP          ESM-1b & iLearn               73    0.912   0.986   0.949    0.900    0.865   0.955   0.928   0.828
             ESM-1b & ProtDCal             63    0.870   0.981   0.926    0.857    0.783   0.956   0.903   0.767
             ESM-1b & iLearn & ProtDCal    111   0.930   0.981   0.956    0.913    0.851   0.952   0.921   0.813
APP          ESM-1b & iLearn               30    0.871   0.935   0.903    0.808    0.891   0.909   0.908   0.463
             ESM-1b & ProtDCal             25    0.774   0.968   0.871    0.756    0.898   0.927   0.926   0.511
             ESM-1b & iLearn & ProtDCal    24    0.903   0.903   0.903    0.806    0.898   0.916   0.916   0.483
AVP          ESM-1b & iLearn               97    0.788   0.894   0.841    0.686    0.907   0.881   0.884   0.598
             ESM-1b & ProtDCal             77    0.705   0.872   0.788    0.584    0.912   0.873   0.877   0.588
             ESM-1b & iLearn & ProtDCal    167   0.785   0.899   0.842    0.688    0.911   0.892   0.894   0.621

Figure 2 depicts (for each endpoint) the MCCtest and MCCext values achieved by the best RF models created from the subsets containing ESM-1b & iLearn PDs, ESM-1b & ProtDCal PDs, and ESM-1b & ProtDCal & iLearn PDs, respectively. The MCCtest and MCCext values obtained by the best RF models built only with ESM-1b nHPDs (see Table 2) are also depicted (bars in blue) and are considered as the baseline in this analysis. Figure 2 reveals that the RF models derived from the subsets comprised of both nHPDs and HPDs generally yielded better performances than the models built only with nHPDs. Specifically, the combination of iLearn HPDs with ESM-1b nHPDs (bars in light blue) always contributed to better results, except on the general-AMP external dataset (see Figure 2B). From Figure 2, it can also be noted that the models based on both ProtDCal HPDs and ESM-1b nHPDs (bars in yellow) achieved inferior performances relative to the baseline models on half of the datasets, which suggests that the ProtDCal HPDs do not contribute to consistently building better models when merged with the ESM-1b nHPDs.

Figure 2. Matthews correlation coefficients achieved on the test (A) and external (B) datasets by the best models built from the nHPD and HPD subsets. The performance of the best models built only with nHPDs is also shown (baseline).

However, the combination of iLearn and ProtDCal HPDs with ESM-1b nHPDs (bars in red) contributed to developing the best models. Indeed, the highest MCCtest values (see Figure 2A) were obtained by the models based on the three types of PDs considered (i.e. iLearn, ProtDCal and ESM-1b PDs), except on the APP test dataset, where the best developed model obtained the second highest MCCtest value. The models based on the three types of PDs also attained the highest MCCext values on the ABP and AVP external datasets, and the second highest MCCext values on the general-AMP and APP external datasets (see Figure 2B). All these results are supported by a statistical analysis based on Bayesian estimation [95] (see Supplementary Tables S5–S7 for the input values). This analysis reveals (see Supplementary Figure S2) that the models developed with iLearn, ProtDCal and ESM-1b PDs (0.92) (see Supplementary Figure S2A), as well as with iLearn and ESM-1b PDs (0.82) (see Supplementary Figure S2B), have a higher probability of performing better predictions than the models built only with ESM-1b nHPDs. Moreover, it can be noted from Supplementary Figure S2C that the models created with ProtDCal and ESM-1b PDs present a low probability (0.37) of performing good predictions.

Finally, Figure 3 depicts the number of nHPDs and HPDs included in the RF models built. For each endpoint, the first bar corresponds to the best model created only with ESM-1b nHPDs, the second bar to the best model created with both ESM-1b nHPDs and iLearn HPDs, the third bar to the best model created with both ESM-1b nHPDs and ProtDCal HPDs, and the fourth bar to the best model created with the three types of PDs accounted for (i.e. iLearn, ProtDCal and ESM-1b PDs). On the one hand, it can be observed that combining nHPDs with HPDs led to more complex models according to the total number of variables. However, the statistical outcomes explained above justify that increase in complexity. On the other hand, Figure 3 reveals that a greater number of nHPDs were generally included in the best built models. This confirms that, in the classification of AMPs, the chemical information related to the variation of protein sequences is rather different from the information encoded by the HPDs calculated with the ProtDCal [62] and iLearn [85] tools. Nonetheless, the number of HPDs included in the models also indicates that they contain chemical information that is not present in the ESM-1b nHPDs. This suggests complementarity between both types of features, which leads to better predictions when they are combined. That complementarity is studied in depth below.

Figure 3. Number of nHPDs and HPDs included in the best RF models created from the nHPD and HPD subsets. The first bar corresponds to the models created only with ESM-1b nHPDs; the second bar to the models created with both ESM-1b nHPDs and iLearn HPDs; the third bar to the models created with both ESM-1b nHPDs and ProtDCal HPDs; and the fourth bar to the models built with ESM-1b nHPDs, iLearn HPDs and ProtDCal HPDs.

nHPD and HPD space visualization comparisons

In this section, we analyze the effectiveness of nHPDs and HPDs using feature space visualization comparisons. Feature space visualizations of all the nHPDs and HPDs in 2D space were generated via the t-SNE algorithm [97] (see Source Code S1). Specifically, this analysis was performed on the feature datasets used to train the models to predict general-AMPs, ABPs, AFPs and AVPs. These feature datasets are comprised of iLearn and ProtDCal HPDs (see Data S1B) as well as of ESM-1b nHPDs (see Data S1C). The visualizations of the HPDs and nHPDs are shown in Figures 4A and 4B, respectively. As can be seen, the feature space boundary between positive (in red) and negative peptide sequences in the ESM-1b nHPD visualization is distinctly defined, whereas there is no clear distinction between the sequences of both classes in the HPD visualization. This confirms that the evolutionary information encoded in the ESM-1b nHPDs is fundamental for the classification of AMPs. It also suggests that shallow learners fed with nHPDs will be able to make better decisions than when fed with HPDs only. These findings explain why the RF models using ESM-1b nHPDs presented comparable-to-superior generalization abilities relative to the RF models based only on the iLearn and/or ProtDCal HPDs (see Tables 2 and 3). These results are also in correspondence with conclusions established elsewhere [98].
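The authors' actual script is provided as Source Code S1; a minimal sketch of this kind of projection, using scikit-learn's t-SNE implementation, could look as follows (X and y are assumed to be NumPy arrays of descriptors and binary class labels):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(X, y, title):
    """Project a descriptor matrix X (peptides x descriptors) to 2D with t-SNE
    and color the points by class, as in Figure 4."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=5, c="gray", label="negative")
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=5, c="red", label="positive")
    plt.title(title)
    plt.legend()
    plt.show()
```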

Figure 4. Feature space 2D visualizations of the iLearn and ProtDCal HPDs (A) and the ESM-1b nHPDs (B).

Analysis of the HPD and nHPD used to build the best models by using chemometric and explainable artificial intelligence (XAI) methods

Here, we study the relevance and complementarity of the nHPDs and HPDs included in the best model created for each endpoint (see Data S1F). A SE-based relevance analysis [86] was first performed to quantify the information content of the PDs. Relevant PDs for modeling should present high SE values as an indicator of their good ability to distinguish among proteins (peptides) with different sequences. A Gini index-based importance (IG) analysis [99] was also carried out to assess how regularly a PD was selected for a split and how large its general discriminative ability was for the RF classifier. Moreover, a permutation-based importance (IP) study was performed to measure the increase in the prediction error of a model after permuting the values of each of its features (PDs) [100]. In this study, a PD is ‘important’ when shuffling its values increases the model error. Finally, an interaction analysis based on Friedman’s H-statistic [101] was performed to assess how much of the variation of the predictions depends on the cooperation of the PDs. These two last studies are model-agnostic global interpretation methods, and their implementations in the Interpretable Machine Learning (iml) package [102] were used (see Source Code S2).
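The permutation importance analysis was run with the iml R package; an equivalent Python sketch of the IP ratio (error after permutation divided by original error, so IP > 1 marks an important PD) using scikit-learn would be:

```python
import numpy as np
from sklearn.inspection import permutation_importance

def ip_ratio(fitted_model, X, y, n_repeats=10):
    """IP = error(permuted) / error(original) per descriptor; a PD is
    'important' when IP > 1, as in the analysis above."""
    base_error = 1.0 - fitted_model.score(X, y)  # 1 - accuracy
    result = permutation_importance(fitted_model, X, y,
                                    n_repeats=n_repeats, random_state=0)
    # mean accuracy drop per feature equals the mean error increase
    permuted_error = base_error + result.importances_mean
    return permuted_error / max(base_error, 1e-12)
```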

Figure 5 depicts (for each endpoint) the boxplots corresponding to the entropy, importance and interaction values calculated with the analyses described above. These boxplots are depicted both for the ESM-1b nHPDs alone and for the iLearn and ProtDCal HPDs together, as used to build the models based on the three types of PDs (see Data S1F). On the one hand, Figure 5A shows the results obtained with the SE-based relevance analysis (see Supplementary Table S8 for the SE values). This analysis reveals that the nHPDs included in the analyzed models present greater information content than the HPDs. The maximum SE value (mSE) that a PD can obtain in the general-AMP, antibacterial, antifungal, antiparasitic and antiviral training datasets is equal to 14.25, 13.68, 10.6, 7.63 and 12.18 bits, respectively. As can be observed, the ESM-1b nHPDs always achieved SE values greater than 11.8 (82.8% of the mSE), 11.4 (83.3% of the mSE), 8.2 (77.3% of the mSE), 6.3 (82.6% of the mSE) and 7.6 (62.4% of the mSE) bits, respectively. Notice that the HPDs mostly obtained SE values below these thresholds. This superior behavior of the nHPDs can also be observed in Supplementary Figure S3, except for the AFP dataset. These outcomes suggest that the superior performance achieved by the models based on ESM-1b nHPDs may also stem from a greater sensitivity to progressive changes of protein (peptide) sequences than that of the HPDs, so that they better characterize peptides (proteins) with different sequences. The latter is a desirable attribute for any molecular descriptor [103]. This study was performed with the IMMAN software [90] using a binning schema (discretization) equal to the number of sequences in each training dataset.

Figure 5. Boxplots corresponding to the results obtained with the SE-based relevance analysis (A), the Gini index-based importance analysis (B), the permutation-based importance analysis (C) and the interaction-based importance analysis (D). These boxplots are depicted for both the nHPDs and HPDs used in the best model built for each endpoint.

On the other hand, Figure 5B shows the IG values corresponding to the nHPDs and HPDs that were jointly used to create the analyzed models (see Data S1F). Supplementary Table S9 contains the IG value obtained for each PD. It can be observed that the nHPDs generally achieved better IG values than the HPDs in the classification of ABPs and APPs, whereas in the classification of general-AMPs and AVPs, the HPDs were the ones that mostly obtained the best IG values. Both types of PDs had comparable IG value distributions in the classification of AFPs, although the nHPDs were the ones that achieved the highest IG values. These results indicate that both types of PDs are relevant to distinguishing between peptide sequences with positive and negative activities in the recognition of AMPs. Indeed, it can be observed in Supplementary Figure S4 that there are no outlier IG values, which means that no PD was significantly more important than the others in the decision-making of the RF learner. The IG values were obtained from the WEKA tool output [89].

Moreover, Figure 5C depicts the outcomes obtained with the permutation-based importance analysis. Supplementary Table S10 contains the IP value obtained for each PD. In this figure, it can be noted that the IP value distributions of the nHPDs were better than those of the HPDs, except on the AVP dataset. It can additionally be noted that nHPDs were always important in the predictions of the analyzed models, since at least one nHPD obtained an IP value greater than 1. Notice that no HPD was important to recognize APPs (IP ≤ 1). However, HPDs were the ones that achieved the highest IP values to identify ABPs, and they are also among the most important ones to classify AVPs. On this last endpoint, HPDs also presented the best distribution of IG values (see Figure 5B). Despite these outcomes, it is important to mention that there are no highly important PDs, because their IP values were not much greater than 1 (IP ≈ 1). This suggests that the best built models did not rely significantly more on the nHPDs than on the HPDs (or vice versa), because permuting their values did not imply large increases in the prediction error of the models.

Figure 6. Boxplots corresponding to the performance (MCCext) achieved on the general-AMP, ABP, AFP and AVP external datasets by the 30 models built with the AMPScanner, APIN and APIN-Fusion architectures, respectively.

Finally, Figure 5D depicts the outcomes obtained with the interaction analysis based on the H-statistic [101]. Supplementary Table S11 contains the interaction strength (ISH) values obtained for each PD with respect to all the other PDs. The results first reveal that the interaction effects between the PDs are generally very weak, because almost all the ISH values are less than 0.1 (10% of the explained variance). Only two HPDs achieved ISH values above this threshold, in the recognition of ABPs and APPs. It can also be observed that the highest ISH values were obtained by HPDs in the classification of AFPs and AVPs. Only in the classification of general-AMPs did an nHPD obtain the highest ISH value. Nonetheless, it can be seen in Figure 5D that the nHPDs have a better distribution of ISH values in the classification of general-AMPs and AVPs. These findings suggest that the variation of the predictions depended more on HPDs than on nHPDs, since the former obtained the highest ISH value(s) in 4 out of the 5 endpoints analyzed. All these results demonstrate that the nHPDs and HPDs are complementary, because neither type was consistently better than the other across all the relevance, importance and interaction studies performed. This justifies why the combination of nHPDs and HPDs leads to better predictive models than using nHPDs or HPDs independently.

Performance regarding state-of-the-art DNN architectures

This section compares the performance of the RF models built with nHPDs against the performance achieved by the AMPScanner [39], APIN [45] and APIN-Fusion [45] DNN architectures, which were proposed in the literature to classify AMP sequences. As the training of these architectures is not deterministic, each of them was retrained 30 times on the training datasets detailed in Table 1, except for the APP training dataset, which contains too few sequences. Files S1 in [75] and Data S2 hold the models built (.h5 files) with the AMPScanner and the APIN (and APIN-Fusion) architectures, respectively. Each built model was evaluated on the corresponding external dataset (see Supplementary Tables S1–S4 in [75] for the MCCext values obtained by the AMPScanner architecture, and Supplementary Tables S12 and S13 for the MCCext values obtained by the APIN and APIN-Fusion architectures, respectively). Retraining these three architectures guarantees a fair comparison with the RF models. The RF models used in the comparisons were the ones based only on the ESM-1b nHPDs (see Data S1C), as well as the ones based on the subsets comprised of ESM-1b, iLearn and ProtDCal PDs (see Data S1F).
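The retraining protocol can be summarized by the loop below. It is a schematic sketch only: build_model is a hypothetical stand-in for the AMPScanner/APIN/APIN-Fusion constructors (which are not reproduced here), and a 0.5 probability threshold is assumed for the binary decision.

```python
# Schematic sketch of the repeated-retraining protocol: since DNN training is
# stochastic (weight initialization, batch shuffling, dropout), each
# architecture is retrained n_runs times and every resulting model is scored
# with MCC on the same fixed external dataset. `build_model` is hypothetical.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def repeated_mcc_ext(build_model, X_tr, y_tr, X_ext, y_ext, n_runs=30):
    scores = []
    for run in range(n_runs):
        model = build_model(seed=run)   # fresh stochastic initialization
        model.fit(X_tr, y_tr)
        y_hat = (model.predict(X_ext) > 0.5).astype(int)
        scores.append(matthews_corrcoef(y_ext, y_hat))
    return np.array(scores)             # one MCC_ext per retrained model
```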

On the one hand, Figure 6 depicts the boxplots corresponding to the MCCext values obtained by the 30 models created with the AMPScanner, APIN and APIN-Fusion architectures to predict general-AMPs, ABPs, AFPs and AVPs, respectively. It can first be noted that the MCCext values obtained with the AMPScanner architecture are the most scattered of all, presenting standard deviations equal to 3.35, 2.25, 3.61 and 4.74 on the general-AMP, ABP, AFP and AVP external datasets, respectively. This indicates that the AMPScanner performance is not stable across runs and that its highest performances may be due to chance. In fact, its highest MCCext value on the AVP dataset is an outlier, while most of its MCCext values were less than 0.8 on the general-AMP and ABP external datasets, and less than 0.6 and 0.4 on the AFP and AVP external datasets, respectively. Moreover, the APIN and APIN-Fusion architectures achieved the best performances, with both obtaining MCCext values mostly above the aforementioned thresholds. APIN yielded the least scattered MCCext values (standard deviations equal to 1.48, 1.04, 1 and 1.58 on the general-AMP, ABP, AFP and AVP datasets, respectively), whereas APIN-Fusion yielded the highest MCCext values (standard deviations equal to 4.86, 2.29, 1.61 and 2.59, respectively). The high standard deviation of APIN-Fusion on the general-AMP dataset was due to three inferior outlier MCCext values; after removing them, the standard deviation dropped to 2.67.
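These dispersion figures can be reproduced from the per-run MCCext values as sketched below. The sketch assumes the usual boxplot convention (outliers beyond 1.5 times the interquartile range); the exact outlier rule used here is not stated, so this is indicative only.

```python
# Hedged sketch: standard deviation of per-run MCC_ext values before and after
# removing boxplot outliers (points beyond 1.5 x IQR, the usual convention).
import numpy as np

def spread_with_and_without_outliers(mcc):
    mcc = np.asarray(mcc, dtype=float)
    q1, q3 = np.percentile(mcc, [25, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    inliers = mcc[(mcc >= lo) & (mcc <= hi)]
    return mcc.std(ddof=1), inliers.std(ddof=1), mcc[(mcc < lo) | (mcc > hi)]

# Usage: std_all, std_trimmed, outliers = spread_with_and_without_outliers(scores)
```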

On the other hand, Figure 7 shows the highest MCCext values obtained by both the RF models and the DNN models, together with the average MCCext values (excluding outliers) of the DNN models. In general, the RF models achieved performances comparable or superior to those of the DNN architectures. The average and highest performances achieved by the AMPScanner [39] and APIN [45] architectures were always inferior to those achieved by the RF models. The APIN-Fusion architecture [45] was the only one that outperformed the two RF models, by 1.42 and 2.15% on the general-AMP external dataset (see Figure 7A), whereas it obtained the same performance as the best RF model on the ABP external dataset (see Figure 7B). On the AFP and AVP datasets, APIN-Fusion was always remarkably worse than the RF models. AMPScanner is composed of a convolutional layer and a long short-term memory (LSTM)-based recurrent layer, whereas APIN is a multi-scale convolutional network. APIN-Fusion is APIN combined with amino acid composition (AAC) and dipeptide composition (DPC) HPDs. That fusion was the only architecture that obtained a slightly better performance than the RF models, and only on a single dataset.

Figure 7: Highest MCCext values achieved by both the DNN models and the RF-based shallow models on the general-AMP, ABP, AFP and AVP external datasets. The average MCCext values for the DNN models are also shown.

In addition to the experiments performed with the AMPScanner [39], APIN [45] and APIN-Fusion [45] DNN architectures, we also carried out comparisons against DNN models based on LSTM, Bidirectional Encoder Representations from Transformers (BERT) and Attention, which were recently created by Ma et al. [104] to identify general-AMPs from the human gut microbiome. The consensus performance of these DNN models was also considered in the comparisons (i.e. a sequence is classified as AMP only if the LSTM-, BERT- and Attention-based models all classify it as AMP; otherwise, it is classified as non-AMP). This benchmarking analysis was only performed on the testing and external datasets for general-AMPs (see Table 1). Figure 8 depicts a bar chart of the MCCtest and MCCext values achieved by the DNN models (Supplementary Table S14 shows the outputs and Supplementary Table S15 the performance metrics), by the best RF model based on the ESM-1b nHPDs alone (see Table 2 and Data S1C), and by the best RF model based on the ESM-1b, iLearn and ProtDCal PDs together (see Table 3 and Data S1F). As can be observed, the RF models were always better (MCCtest > 0.79) than the DNN architectures (MCCtest = 0.5923 for Attention, 0.6365 for BERT, 0.5743 for LSTM and 0.6279 for the consensus) on the test dataset, whereas the consensus of the three DNNs (MCCext = 0.8530) was the only one that slightly outperformed both the RF model based on the ESM-1b nHPDs (MCCext = 0.843) and the RF model based on the ESM-1b, iLearn and ProtDCal PDs (MCCext = 0.837) on the external dataset. The models based on BERT (MCCext = 0.8093), Attention (MCCext = 0.8023) and LSTM (MCCext = 0.6782) were inferior to the two RF models by at least 3.31% on the external dataset.
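The consensus rule is simply a unanimous vote, as the sketch below illustrates (binary 0/1 integer predictions assumed, with 1 = AMP):

```python
# Sketch of the consensus rule of Ma et al. [104] as described above:
# a sequence is labeled AMP only if all three models predict AMP.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def consensus(pred_lstm, pred_bert, pred_attn):
    """Inputs are 0/1 integer arrays (1 = AMP); unanimity is required."""
    return np.asarray(pred_lstm) & np.asarray(pred_bert) & np.asarray(pred_attn)

# mcc = matthews_corrcoef(y_true, consensus(p_lstm, p_bert, p_attn))
```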

Figure 8: Comparison between the DNN models proposed by Ma et al. [104] and the RF models based on ESM-1b nHPDs and on ESM-1b, iLearn and ProtDCal PDs, regarding the MCCtest and MCCext values obtained on the general-AMP testing and external datasets (see Table 1).

In general, these outcomes corroborate that computationally more complex classifiers do not necessarily lead to better performances, as recently warned in [75]. These findings also suggest that nHPDs derived from the ESM-1b Transformer model [83] carry better modeling abilities and chemical information than nHPDs obtained through DNN architectures trained on small, labeled AMP datasets. Thus, building DNN architectures on small datasets only to extract features that are then fed to shallow classifiers to identify AMPs may not be the most suitable approach, even though it has recently become a common practice [105, 106]. Before concluding, it is important to highlight that a comparison between the RF models using ESM-1b nHPDs and several state-of-the-art models was not performed, because the baseline models based on ProtDCal HPDs were already better than several models freely available in thirteen programs (see Table 5 in [51]). Consequently, all the models built in this work whose performances were better than those of the baseline models (see Tables 2 and 3) are also better than the state-of-the-art models analyzed in [51].

Conclusions

We have studied the use of non-handcrafted features (nHPDs) to develop shallow models to identify potential AMPs. We used the ESM-1b Transformer model [83] to calculate these nHPDs. We also considered handcrafted features (HPDs) to compare the performance of shallow models built with both types of features. As a result, nHPDs led to shallow models with notably higher performances than shallow models built with HPDs only. However, the experimental results comparing HPDs and nHPDs show that both types of descriptors extract different and complementary information from the input AMP sequences. Thus, the combination of nHPDs with HPDs led to shallow models with better performances than using nHPDs only, although in most cases it implies an increase in the size of the models. Consequently, we recommend always studying whether the increase in model complexity is justified, by performing relevance and importance analyses such as the ones carried out in this work. Finally, we can also conclude that building DNN architectures on small, labeled datasets only to extract features to feed shallow classifiers may not be the most suitable approach, since features learned via self-supervision seem to present better modeling abilities. Nonetheless, it is important to highlight that self-supervision is a suitable technique when large datasets are used; otherwise, the learned features (nHPDs) can be less valuable than HPDs (see Tables 3A and 3E in [75] for results obtained from a model based on self-supervision but trained with 556 603 protein sequences).

Key Points
  • The modeling ability of learned features to classify AMPs is studied

  • The combination of learned and calculated features to classify AMPs is studied

  • Learned features lead to better models than calculated features alone

  • Better shallow models are built by combining both types of features

  • Shallow models built with both types of features perform better than deep models

Data and software availability

Data S1 contains the handcrafted and non-handcrafted descriptor sets, trained models and WEKA software outputs corresponding to the best RF model built on each family (or ensemble) of features considered. Data S2 contains the trained models (.h5 files) of the APIN and APIN-Fusion architectures. Because these models were created for benchmarking purposes only, no software was built.

Acknowledgement

C.R.G.J. acknowledges the program ‘Cátedras CONACYT’ of the ‘Consejo Nacional de Ciencia y Tecnología (CONACYT), México’ for the support of the endowed chair 501/2018 at the ‘Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE)’. C.A.B. acknowledges CONACYT for support under grant A1-S-20638.

Author Biographies

César R. García-Jacas is a CONACYT Researcher at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico. He received his Ph.D. in Technical Sciences (2015) and M.Sc. in Computer Science (2013) from the Universidad Central “Marta Abreu” de las Villas, Cuba, as well as his B.Eng. in Informatics Sciences (2009) from the Universidad de las Ciencias Informáticas, La Habana, Cuba.

Luis A. García-González is a Ph.D. student in Computer Science, Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Mexico. He earned his B.Sc. in Computer Science (2021) from the Universidad de Oriente, Cuba.

Felix Martinez-Rios is a Full Professor at the School of Engineering, Universidad Panamericana (UP). He does research in artificial intelligence, quantum computing and theory of computation.

Issac P. Tapia-Contreras is an M.Sc. student in Computer Science (2022) at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico.

Carlos A. Brizuela is an Associate Professor at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico. He develops machine learning and optimization algorithms for the discovery and design of bioactive peptides.

Description of the Organization: The Center for Scientific Research and Higher Education at Ensenada (in Spanish: Centro de Investigación Científica y de Educación Superior de Ensenada, CICESE) is a public research center sponsored by the National Council of Science and Technology of Mexico (CONACYT) in the city of Ensenada, Baja California, and specialized in Earth Sciences, Oceanography, Applied Physics, Experimental and Applied Biology and Computer Sciences (https://www.cicese.edu.mx/).

References

1. WHO. Antimicrobial resistance. https://www.who.int/news-room/fact-sheets/detail/antimicrobial-resistance (8 October 2020, date last accessed).
2. CDC. Antibiotic/Antimicrobial Resistance (AR/AMR). https://www.cdc.gov/drugresistance/index.html (8 October 2020, date last accessed).
3. Cassini A, Högberg LD, Plachouras D, et al. Attributable deaths and disability-adjusted life-years caused by infections with antibiotic-resistant bacteria in the EU and the European economic area in 2015: a population-level modelling analysis. Lancet Infect Dis 2019;19:56–66.
4. Tacconelli E, Pezzani MD. Public health burden of antimicrobial resistance in Europe. Lancet Infect Dis 2019;19:4–6.
5. Gasser M, Zingg W, Cassini A, et al. Attributable deaths and disability-adjusted life-years caused by infections with antibiotic-resistant bacteria in Switzerland. Lancet Infect Dis 2019;19:17–8.
6. Dadgostar P. Antimicrobial resistance: implications and costs. Infect Drug Resist 2019;12:3903–10.
7. Laxminarayan R, Duse A, Wattal C, et al. Antibiotic resistance—the need for global solutions. Lancet Infect Dis 2013;13:1057–98.
8. Roberts M, Driggs D, Thorpe M, et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 2021;3:199–217.
10. Zhang L-j, Gallo RL. Antimicrobial peptides. Curr Biol 2016;26:R14–9.
11. Waghu FH, Joseph S, Ghawali S, et al. Designing antibacterial peptides with enhanced killing kinetics. Front Microbiol 2018;9:1–10.
12. Liu Y, Ding S, Shen J, et al. Nonribosomal antibacterial peptides that target multidrug-resistant bacteria. Nat Prod Rep 2019;36:573–92.
13. Mor A. Multifunctional host defense peptides: antiparasitic activities. FEBS J 2009;276:6474–82.
14. Lacerda AF, Pelegrini PB, de Oliveira DM, et al. Anti-parasitic peptides from arthropods and their application in drug therapy. Front Microbiol 2016;7:1–11.
15. Pretzel J, Mohring F, Rahlfs S, et al. Antiparasitic peptides. In: Vilcinskas A (ed). Yellow Biotechnology I: Insect Biotechnologie in Drug Discovery and Preclinical Research. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, 157–92.
16. Devi MS, Sashidhar RB. Antiaflatoxigenic effects of selected antifungal peptides. Peptides 2019;115:15–26.
17. Fernández de Ullivarri M, Arbulu S, Garcia-Gutierrez E, et al. Antifungal peptides as therapeutic agents. Front Cell Infect Microbiol 2020;10:1–22.
18. Vilas Boas LCP, Campos ML, Berlanda RLA, et al. Antiviral peptides as promising therapeutic drugs. Cell Mol Life Sci 2019;76:3525–42.
19. David CB, Gill D. Antiviral activities of human host defense peptides. Curr Med Chem 2020;27:1420–43.
20. Kristensen SL, Rørth R, Jhund PS, et al. Cardiovascular, mortality, and kidney outcomes with GLP-1 receptor agonists in patients with type 2 diabetes: a systematic review and meta-analysis of cardiovascular outcome trials. Lancet Diabetes Endocrinol 2019;7:776–85.
21. Jin G, Weinberg A. Human antimicrobial peptides and cancer. Semin Cell Dev Biol 2019;88:156–62.
22. Ghosh SK, McCormick TS, Weinberg A. Human Beta Defensins and cancer: contradictions and common ground. Front Oncol 2019;9:1–8.
23. Lau JL, Dunn MK. Therapeutic peptides: historical perspectives, current development trends, and future directions. Bioorg Med Chem 2018;26:2700–7.
24. Huan Y, Kong Q, Mou H, et al. Antimicrobial peptides: classification, design, application and research progress in multiple fields. Front Microbiol 2020;11:1–21.
25. Maccari G, Di Luca M, Nifosì R. In silico design of antimicrobial peptides. In: Zhou P, Huang J (eds). Computational Peptidology. New York, NY: Springer New York, 2015, 195–219.
26. Kuczera K. Molecular modeling of peptides. In: Zhou P, Huang J (eds). Computational Peptidology. New York, NY: Springer New York, 2015, 15–41.
27. Gupta S, Kapoor P, Chaudhary K, et al. Peptide toxicity prediction. In: Zhou P, Huang J (eds). Computational Peptidology. New York, NY: Springer New York, 2015, 143–57.
28. Torrent M, Di Tommaso P, Pulido D, et al. AMPA: an automated web server for prediction of protein antimicrobial regions. Bioinformatics 2011;28:130–1.
29. Thakur N, Qureshi A, Kumar M. AVPpred: collection and prediction of highly effective antiviral peptides. Nucleic Acids Res 2012;40:W199–204.
30. Fernandes FC, Rigden DJ, Franco OL. Prediction of antimicrobial peptides based on the adaptive neuro-fuzzy inference system application. Pept Sci 2012;98:280–7.
31. Joseph S, Karnik S, Nilawe P, et al. ClassAMP: a prediction tool for classification of antimicrobial peptides. IEEE/ACM Trans Comput Biol Bioinform 2012;9:1535–8.
32. Xiao X, Wang P, Lin W-Z, et al. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 2013;436:168–77.
33. Lee H-T, Lee C-C, Yang J-R, et al. A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015;2015:475062.
34. Waghu FH, Barai RS, Gurung P, et al. CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res 2015;44:D1094–7.
35. Lin W, Xu D. Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types. Bioinformatics 2016;32:3745–52.
36. Meher PK, Sahu TK, Saini V, et al. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 2017;7:42362.
37. Agrawal P, Bhalla S, Chaudhary K, et al. In silico approach for prediction of antifungal peptides. Front Microbiol 2018;9:1–13.
38. Bhadra P, Yan J, Li J, et al. AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 2018;8:1697.
39. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 2018;34:2740–7.
40. Hamid M-N, Friedberg I. Identifying antimicrobial peptides using word embedding with deep recurrent neural networks. Bioinformatics 2018;35:2009–16.
41. Youmans M, Spainhour JCG, Qiu P. Classification of antibacterial peptides using long short-term memory recurrent neural networks. IEEE/ACM Trans Comput Biol Bioinform 2020;17:1134–40.
42. Chung C-R, Kuo T-R, Wu L-C, et al. Characterization and identification of antimicrobial peptides with different functional activities. Brief Bioinform 2019;21:1098–114.
43. Lin Y, Cai Y, Liu J, et al. An advanced approach to identify antimicrobial peptides and their function types for penaeus through machine learning strategies. BMC Bioinf 2019;20:291.
44. Wei L, Zhou C, Su R, et al. PEPred-Suite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics 2019;35:4272–80.
45. Su X, Xu J, Yin Y, et al. Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinf 2019;20:730.
46. Li J, Pu Y, Tang J, et al. DeepAVP: a dual-channel deep neural network for identifying variable-length antiviral peptides. IEEE J Biomed Health Inform 2020;24:3012–9.
47. Yan J, Bhadra P, Li A, et al. Deep-AmPEP30: improve short antimicrobial peptides prediction with deep learning. Mol Ther Nucleic Acids 2020;20:882–94.
48. Fu H, Cao Z, Li M, et al. ACEP: improving antimicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC Genomics 2020;21:597.
49. Sharma R, Shrivastava S, Kumar Singh S, et al. Deep-ABPpred: identifying antibacterial peptides in protein sequences using bidirectional LSTM with word2vec. Brief Bioinform 2021;22:1–19. https://doi.org/10.1093/bib/bbab065.
50. Zhang Y, Lin J, Zhao L, et al. A novel antibacterial peptide recognition algorithm based on BERT. Brief Bioinform 2021;22:1–11. https://doi.org/10.1093/bib/bbab200.
51. Pinacho-Castellanos SA, García-Jacas CR, Gilson MK, et al. Alignment-free antimicrobial peptide predictors: improving performance by a thorough analysis of the largest available data set. J Chem Inf Model 2021;61:3141–57.
52. Sharma R, Shrivastava S, Kumar Singh S, et al. AniAMPpred: artificial intelligence guided discovery of novel antimicrobial peptides in animal kingdom. Brief Bioinform 2021;22:1–23.
53. Dubchak I, Muchnik I, Holbrook SR, et al. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A 1995;92:8700.
54. Chou K-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct, Funct, Bioinf 2001;43:246–55.
55. Chou K-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2004;21:10–9.
56. Shen J, Zhang J, Luo X, et al. Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci U S A 2007;104:4337.
57. Ruiz-Blanco YB, García Y, Sotomayor-Torres CM, et al. New set of 2D/3D thermodynamic indices for proteins. A formalism based on “molten globule” theory. Physics Procedia 2010;8:63–72.
58. Chen X, Qiu J-D, Shi S-P, et al. Incorporating key position and amino acid residue features to identify general and species-specific ubiquitin conjugation sites. Bioinformatics 2013;29:1614–22.
59. Marrero-Ponce Y, Teran JE, Contreras-Torres E, et al. LEGO-based generalized set of two linear algebraic 3D bio-macro-molecular descriptors: theory and validation by QSARs. J Theor Biol 2020;485:110039.
60. Li ZR, Lin HH, Han LY, et al. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 2006;34:W32–7.
61. Chen Z, Zhao P, Li F, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34:2499–502.
62. Romero-Molina S, Ruiz-Blanco YB, Green JR, et al. ProtDCal-Suite: a web server for the numerical codification and functional analysis of proteins. Protein Sci 2019;28:1734–43.
63. Contreras-Torres E, Marrero-Ponce Y, Terán JE, et al. MuLiMs-MCoMPAs: a novel multiplatform framework to compute tensor algebra-based three-dimensional protein descriptors. J Chem Inf Model 2020;60:1042–59.
64. Aguilera-Mendoza L, Marrero-Ponce Y, García-Jacas CR, et al. Automatic construction of molecular similarity networks for visual graph mining in chemical space of bioactive peptides: an unsupervised learning approach. Sci Rep 2020;10:18074.
65. Barigye SJ, Gómez-Ganau S, Serrano-Candelas E, et al. PeptiDesCalculator: software for computation of peptide descriptors. Definition, implementation and case studies for 9 bioactivity endpoints. Proteins: Struct, Funct, Bioinf 2021;89:174–84.
66. Todeschini R, Consonni V. Molecular Descriptors for Chemoinformatics. Weinheim: WILEY-VCH, 2009.
67. Bolón-Canedo V, Alonso-Betanzos A. Ensembles for feature selection: a review and future trends. Inf Fusion 2019;52:1–12.
68. Pes B. Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains. Neural Comput Applic 2020;32:5951–73.
69. Yanofsky C, Horn V, Thorpe D. Protein structure relationships revealed by mutational analysis. Science 1964;146:1593–4.
70. Altschuh D, Lesk AM, Bloomer AC, et al. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J Mol Biol 1987;193:693–707.
71. Altschuh D, Vernet T, Berti P, et al. Coordinated amino acid changes in homologous protein families. Protein Eng Des Sel 1988;2:193–9.
72. Hughes AL, Yeager M. Coordinated amino acid changes in the evolution of mammalian defensins. J Mol Evol 1997;44:675–82.
73. Mohammadi A, Zahiri J, Mohammadi S, et al. PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biol Methods Protoc 2022;7:1–11.
74. Zielezinski A, Vinga S, Almeida J, et al. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017;18:186.
75. García-Jacas CR, Pinacho-Castellanos SA, García-González LA, et al. Do deep learning models make a difference in the identification of antimicrobial peptides? Brief Bioinform 2022;23(3):1–16. https://doi.org/10.1093/bib/bbac094.
76. Aguilera-Mendoza L, Marrero-Ponce Y, Tellez-Ibarra R, et al. Overlap and diversity in antimicrobial peptide databases: compiling a non-redundant set of sequences. Bioinformatics 2015;31:2553–9.
77. Aguilera-Mendoza L, Marrero-Ponce Y, Beltran JA, et al. Graph-based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics 2019;35:4739–47.
78. Oyedare T, Park JJ. Estimating the required training dataset size for transmitter classification using deep learning. In: 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN). Newark, NJ, USA: IEEE, 2019, 1–10.
79. Jiang J, Wang R, Wang M, et al. Boosting tree-assisted multitask deep learning for small scientific datasets. J Chem Inf Model 2020;60:1235–44.
80. Manibardo EL, Laña I, Ser JD. Deep learning for road traffic forecasting: does it make a difference? IEEE Trans Intell Transp Syst 2021;23:1–25.
81. The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2018;47:D506–15.
82. Mistry J, Chuguransky S, Williams L, et al. Pfam: the protein families database in 2021. Nucleic Acids Res 2020;49:D412–9.
83. Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A 2021;118:e2016239118.
84. Bairoch A, Apweiler R, Wu CH, et al. The universal protein resource (UniProt). Nucleic Acids Res 2005;33:D154–9.
85. Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2019;21:1047–57.
86. Godden JW, Stahura FL, Bajorath J. Variability of molecular descriptors in compound databases revealed by Shannon entropy calculations. J Chem Inf Comput Sci 2000;40:796–800.
87. Hall MA. Correlation-based Feature Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.
88. Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn 2003;53:23–69.
89. WEKA software. http://www.cs.waikato.ac.nz/ml/weka/ (2 April 2019, date last accessed).
90. Urias RWP, Barigye SJ, Marrero-Ponce Y, et al. IMMAN: free software for information theory-based chemometric analysis. Mol Divers 2015;19:305–19.
91. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997;97:273–324.
92. Breiman L. Random forests. Mach Learn 2001;45:5–32.
93. Golbraikh A, Shen M, Xiao Z, et al. Rational selection of training and test sets for the development of validated QSAR models. J Comput Aided Mol Des 2003;17:241–53.
94. Wolpert DH. What is important about the no free lunch theorems? In: Pardalos PM, Rasskazova V, Vrahatis MN (eds). Black Box Optimization, Machine Learning, and No-Free Lunch Theorems. Cham: Springer International Publishing, 2021, 373–88.
95. Benavoli A, Corani G, Demšar J, et al. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. J Mach Learn Res 2017;18:1–36.
96. Lazar N. Ockham's razor. Wiley Interdiscip Rev Comput Stat 2010;2:243–6.
97. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.
98. Naseer S, Hussain W, Khan YD, et al. Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations. Anal Biochem 2021;615:114069.
99. Qi Y. Random forest for bioinformatics. In: Zhang C, Ma Y (eds). Ensemble Machine Learning: Methods and Applications. Boston, MA: Springer US, 2012, 307–23.
100. Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: learning a variable's importance by studying an entire class of prediction models simultaneously. J Mach Learn Res 2019;20:1–81.
101. Friedman JH, Popescu BE. Predictive learning via rule ensembles. Ann Appl Stat 2008;2:916–54.
102. Molnar C. iml: Interpretable Machine Learning. https://cran.r-project.org/web/packages/iml/index.html.
103. Randić M. Generalized molecular descriptors. J Math Chem 1991;7:155–68.
104. Ma Y, Guo Z, Xia B, et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat Biotechnol 2022;40:921–31.
105. Xiao X, Shao Y-T, Cheng X, et al. iAMP-CA2L: a new CNN-BiLSTM-SVM classifier based on cellular automata image for identifying antimicrobial peptides and their functional types. Brief Bioinform 2021;22:1–10. https://doi.org/10.1093/bib/bbab209.
106. Singh V, Shrivastava S, Kumar Singh S, et al. StaBle-ABPpred: a stacked ensemble predictor based on biLSTM and attention mechanism for accelerated discovery of antibacterial peptides. Brief Bioinform 2021;23:1–17.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)