César R García-Jacas, Luis A García-González, Felix Martinez-Rios, Issac P Tapia-Contreras, Carlos A Brizuela, Handcrafted versus non-handcrafted (self-supervised) features for the classification of antimicrobial peptides: complementary or redundant?, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac428, https://doi.org/10.1093/bib/bbac428
Abstract
Antimicrobial peptides (AMPs) have received a great deal of attention given their potential to become a plausible option to fight multi-drug resistant bacteria as well as other pathogens. Quantitative sequence-activity models (QSAMs) have been helpful to discover new AMPs because they allow the exploration of a large universe of peptide sequences and help reduce the number of wet-lab experiments. A key aspect in building QSAMs based on shallow learning is determining an optimal set of protein descriptors (features) required to discriminate between sequences with different antimicrobial activities. These features are generally handcrafted from peptide sequence datasets labeled with specific antimicrobial activities. However, recent developments have shown that unsupervised approaches can be used to determine features that outperform human-engineered (handcrafted) features. Thus, knowing which of these two approaches contributes to a better classification of AMPs is a fundamental question for designing more accurate models. Here, we present a systematic and rigorous study to compare both types of features. Experimental outcomes show that non-handcrafted features lead to better performance than handcrafted features. However, the experiments also prove that a further improvement in performance is achieved when both types of features are merged. A relevance analysis reveals that non-handcrafted features have higher information content than handcrafted features, while an interaction-based importance analysis reveals that handcrafted features are more important. These findings suggest that the two types of features are complementary. Comparisons against state-of-the-art deep models show that shallow models yield better performance both when fed with non-handcrafted features alone and when fed with non-handcrafted and handcrafted features together.
Introduction
Antimicrobial resistance (AMR) [1, 2] has emerged as a major threat to human health [3–6], mainly due to the misuse and overuse of conventional drugs [7]. AMR arises when microorganisms (i.e. bacteria, viruses, fungi and parasites) develop the ability to defeat the drugs that once affected them. For instance, according to reports submitted to the Global Antimicrobial Resistance and Use Surveillance System [8], the rate of resistance to ciprofloxacin (an antibiotic often used to treat common infections such as those of the urinary tract) ranges between 8.4 and 92.9% for Escherichia coli and between 4.1 and 79.4% for Klebsiella pneumoniae. This growing AMR makes infections harder to treat, increasing the risk of disease spread, severe illness and death [3–6]. Indeed, according to the United States Centers for Disease Control and Prevention, more than 35 000 deaths are caused by antibiotic-resistant diseases each year [9].
One of the current efforts to tackle AMR is aimed at designing novel antibiotics and therapies based on antimicrobial peptides (AMPs). AMPs [10] are a vital part of the innate immune system, and the interest in them as potential drug candidates stems from their ability to kill a wide spectrum of bacteria [11, 12], parasites [13–15], fungi [16, 17] and viruses [18, 19], as well as to treat non-communicable diseases such as diabetes [20] and cancer [21, 22]. However, translating promising AMPs into clinical therapeutics is challenging because of their inherent bio- and physicochemical properties [23, 24]. That is, AMPs are water-soluble and thus generally show limited ability to diffuse across bio-membranes. They are also biologically unstable, as they are easily hydrolyzed by proteolytic enzymes, which gives them short plasma half-lives [23, 24]. Thus, comprehensive studies of bioactivity, pharmacodynamic, pharmacokinetic and toxicological profiles need to be integrated into the different phases of AMP-based drug discovery (PDD) [25–27]. We use the term AMP (or general-AMP) to refer to peptide sequences with some antimicrobial activity (e.g. antibacterial, antifungal, antiparasitic, antiviral, among others [10]). Otherwise, a specific notation according to the activity of the peptide will be used.
In PDD workflows, quantitative sequence-activity models (QSAMs) play a vital role [28–52] because they can be applied to screen large numbers of peptide sequences to identify likely AMPs. QSAMs are built with machine learning methods to achieve good mathematical correlations between the sequential and compositional features of known AMP sequence datasets and their activity. Thus, QSAMs can be analyzed in two groups according to the learning method used: (i) those based on shallow learning and (ii) those based on deep learning. Hereafter, we use the term shallow learning to refer to all non-deep learning approaches; that is, shallow learning encompasses both traditional learning (e.g. Support Vector Machine, k-Nearest Neighbors, Decision Trees, among others) and ensemble learning (e.g. AdaBoost, Random Forest (RF), among others).
A crucial step in building QSAMs based on shallow learning (henceforth also termed shallow models) is the calculation and selection of the ‘best’ set of protein descriptors (PDs). To date, tens of thousands of PDs [53–59] can be calculated with different software [60–65]. For instance, more than 250 000 different PDs can be obtained from the ProtDCal [62] and starPep [64] software. These PDs, also known as handcrafted PDs (HPDs), are derived from human-engineered algorithms that extract chemical information contained in the protein sequences themselves. The large number of extant HPDs exists because no single HPD family is able to codify all the chemical information for different datasets, since their relevance depends on the protein sequences to be studied [66]. Consequently, thousands of HPDs must first be calculated so that the most important ones can subsequently be selected [67, 68] to model the endpoint of interest.
Although there is a wide spectrum of HPDs and feature selectors [67, 68], we posit that the sole use of HPDs limits the building of shallow models with better performance because HPDs are derived only from each protein sequence itself and thus do not contain relevant information related to the variation (long-term dependencies) of protein sequences selected across evolution [69–72]. To tackle this issue, position-specific score matrix (PSSM)-derived features can be used [73]. However, this leads to alignment-dependent models, which are not the most suitable to process data coming from sequencing technologies because they are memory- and time-consuming [74]. Given this scenario, deep learning methods have emerged as an alternative to create alignment-free models to identify AMPs [39–41, 45–49, 52], mainly because they are able to autonomously learn features that encode the semantics of the input protein sequences, obviating the feature engineering process required by shallow models. However, we recently demonstrated [75] that deep learning does not yield better performance than shallow learning in the context of the supervised classification of AMPs. In addition to bad modeling practices, one of the main reasons for this outcome is the limited number of AMPs (<14 k) used to train the deep models (see Table 1 in [75]). That is, because the universe of currently known AMP sequences is small [76, 77], the size of the training datasets is not large enough [78–80] to autonomously obtain learned (non-handcrafted) features with better modeling abilities than the handcrafted features that can be calculated on the same datasets.
However, there are currently several public databases [81, 82] containing millions of protein sequences spanning evolutionary diversity [69–72], and self-supervised methods are the most suitable ones to learn from that diversity. Self-supervision is a learning paradigm that, unlike supervised learning, does not require manual annotation of the input data and can therefore leverage much larger amounts of data [83]. This approach automatically learns useful hierarchical representations from unlabeled datasets, which can then be fine-tuned on small datasets for supervised downstream tasks, such as the identification of AMPs via shallow learning. Rives et al. [83] built a large Transformer-based model (termed ESM-1b Transformer) that was trained on 86 billion amino acids across 250 million sequences obtained from UniParc [84]. They proved that the learned representations were useful in the supervised prediction of mutational effects and secondary structure, and that they improved on state-of-the-art features (e.g. those derived from PSSMs [73]) for long-range residue–residue contact prediction. Moreover, Zhang et al. [50] also introduced a model based on self-supervision that was trained with 556 603 sequences obtained from UniProt [81]. However, their results were inferior to the ones obtained by shallow models fed with ProtDCal HPDs (see Table 3A and E in [75]). The latter confirms that (i) shallow classifiers are more suitable to identify AMPs and (ii) large amounts of data are required to autonomously determine good learned (non-handcrafted) representations.
All in all, this work aims to examine the use of non-handcrafted protein features (nHPDs) obtained via self-supervision in the supervised classification of AMPs using shallow learning, as well as to assess the modeling abilities of nHPDs when combined with handcrafted protein features (HPDs). In this way, we analyze whether both types of features are complementary for building good shallow models. To this end, we built RF models fed with nHPDs derived from the ESM-1b Transformer model [83], as well as RF models fed with state-of-the-art HPDs. We also built RF models by joining the nHPDs and HPDs accounted for. The models were built and evaluated across five benchmarking datasets. We also analyzed both types of features via chemometric and model-agnostic explainability methods. Finally, we performed comparisons with respect to state-of-the-art deep neural network (DNN) architectures. The RF classifier was the only shallow learner used in this work to guarantee fair analyses and comparability of results with the baseline models considered [51]. All the shallow and deep models were built for benchmarking purposes only and not to introduce new models.
Material and methods
AMP datasets
We used five benchmarking datasets (see Table 1 for a description) recently proposed in [51] to develop RF models to identify general-AMPs as well as antibacterial (ABP), antifungal (AFP), antiparasitic (APP) and antiviral (AVP) peptide sequences. A detailed explanation of how these datasets were created can be found in Section 2.1 in [51]. Each of these benchmarking datasets comprises training, test and external datasets. Unlike the external datasets, the training and testing datasets were created from a rational division based on clustering (see Section 2.1 in [51]) and thus they present similar distributions. Consequently, the external datasets used in this work are more suitable than the testing datasets to assess the generalization ability of the models to be built. Both the datasets and the RF models proposed in [51] are considered as the baseline in this work to guarantee fair analyses and comparisons.
Table 1

| Endpoint | Aim | Total | Positive | Negative |
| --- | --- | --- | --- | --- |
| General-AMP | Training | 19 548 | 9781 | 9767 |
|  | Test | 5125 | 2564 | 2561 |
|  | External | 15 685 | 4914 | 10 771 |
| ABP | Training | 13 166 | 6583 | 6583 |
|  | Test | 3390 | 1695 | 1695 |
|  | External | 15 528 | 4757 | 10 771 |
| AFP | Training | 1556 | 778 | 778 |
|  | Test | 430 | 215 | 215 |
|  | External | 15 528 | 4757 | 10 771 |
| APP | Training | 198 | 99 | 99 |
|  | Test | 62 | 31 | 31 |
|  | External | 11 182 | 411 | 10 771 |
| AVP | Training | 4642 | 2321 | 2321 |
|  | Test | 1246 | 623 | 623 |
|  | External | 12 001 | 1230 | 10 771 |
Extraction of handcrafted protein descriptors (HPDs)
We used HPDs calculated with the ProtDCal [62] and iLearn [85] software. On the one hand, it is important to mention that the calculation and selection of the ProtDCal HPDs was performed in [51] to build the baseline RF models. This process was performed on each training dataset from a pool initially comprised of 96 026 ProtDCal HPDs (see Section 2.2 in [51]). The importance of the ProtDCal HPDs finally included in the baseline RF models was widely discussed in Section 3.4 in [51]. On the other hand, a total of 3774 HPDs were additionally calculated in this work with the iLearn tool [85]. The iLearn HPDs calculated on each training dataset were grouped according to their definitions as follows: Group 1 (G1) comprises 2420 HPDs, of which 1600 are Composition of K-Spaced Amino Acid Pairs (CKSAAP), 20 are Amino Acid Composition (AAC), and 400 each are Dipeptide Composition (DPC) and Dipeptide Deviation from Expected Mean (DDE); Group 2 (G2) comprises 255 HPDs, of which 100 are Composition of K-Spaced Amino Acid Group Pairs (CKSAAGP), 5 are Grouped Amino Acid Composition (GAAC), 25 are Grouped Dipeptide Composition (GDPC) and 125 are Grouped Tripeptide Composition (GTPC); Group 3 (G3) comprises 32 Normalized Moreau-Broto Autocorrelation (NMBroto) HPDs; Group 4 (G4) comprises 273 Composition/Transition/Distribution (CTD) HPDs; Group 5 (G5) comprises 686 HPDs, of which 343 each are Conjoint Triad (CTriad) and K-Spaced Conjoint Triad (KSCTriad); Group 6 (G6) contains 56 HPDs, of which 8 are Sequence-Order-Coupling Number (SOCNumber) and 48 are Quasi-Sequence-Order (QSOrder); and Group 7 (G7) comprises 52 HPDs, of which 24 are Pseudo-Amino Acid Composition (PAAC) and 28 are Amphiphilic Pseudo-Amino Acid Composition (APAAC). The lambda parameter used to compute the PAAC and APAAC HPDs was set to 4.
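To make the nature of these handcrafted families concrete, the sketch below computes two of the simplest ones, AAC (20 values) and DPC (400 values), for an arbitrary peptide string. It is an illustrative Python re-implementation of the descriptor definitions, not the iLearn code itself, and the example sequence is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Amino Acid Composition: relative frequency of each of the 20 standard residues."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq):
    """Dipeptide Composition: relative frequency of each of the 400 residue pairs."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = {p: 0 for p in pairs}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in pairs]

# Hypothetical peptide used only to illustrate the descriptor dimensionality
peptide = "GLFDIVKKVVGALG"
features = aac(peptide) + dpc(peptide)
print(len(features))  # 420 handcrafted descriptor values (20 AAC + 400 DPC)
```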
Extraction of non-handcrafted protein descriptors (nHPDs)
A total of 1280 nHPDs were calculated on the datasets described in Table 1 using the ESM-1b Transformer model proposed by Rives et al. [83]. This 33-layer (∼650 M parameters) model was trained on 86 billion amino acids across 250 million protein sequences downloaded from the UniParc database [84]. Using the script provided by the authors (https://github.com/facebookresearch/esm), we extracted, for each peptide sequence, the average embedding (dim: 1 × 1280; script parameter: --include mean) of the per-amino-acid embeddings of the final layer (dim: sequence_length × 1280; script parameter: --repr_layers 33). That average embedding is the vector of nHPDs obtained for each sequence. The average embedding was used because it yielded the best results in the studies of secondary structure prediction (see Table 6 in [83]) and long-range residue–residue contact prediction (see Table 7 in [83]) performed by Rives et al. [83].
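The same mean-layer-33 embedding can also be obtained programmatically. The sketch below is a minimal example based on the public fair-esm Python API rather than the authors' extract.py call; the peptide sequence is a hypothetical placeholder, and the averaging over residue positions (excluding the special start/end tokens) mirrors the --include mean behavior described above.

```python
import torch
import esm  # pip install fair-esm

# Load the pretrained 33-layer (~650M parameter) ESM-1b model and its alphabet
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Hypothetical peptide sequence used only for illustration
data = [("peptide_1", "GLFDIVKKVVGALGSL")]
_, seqs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)

# Per-residue embeddings of the final layer: shape (batch, seq_len + 2, 1280)
per_residue = out["representations"][33]

# Average over the residue positions (positions 1..L; 0 and L+1 are special tokens)
seq_len = len(seqs[0])
mean_embedding = per_residue[0, 1:seq_len + 1].mean(dim=0)
print(mean_embedding.shape)  # torch.Size([1280]) -> the nHPD vector for this peptide
```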
Feature selection and modeling methodology
On each descriptor matrix (seven holding iLearn HPDs and one holding ESM-1b nHPDs) obtained for each training dataset corresponding to each endpoint (i.e. general-AMPs, ABPs, AFPs, APPs and AVPs; see Table 1), a Shannon Entropy (SE)-based unsupervised filter [86] was applied to discard those PDs (hereafter, this notation is used to refer to protein descriptors in general) with an information content (SE value) lower than 10% of the maximum entropy that a PD can achieve on each training dataset. Relevant PDs for modeling studies should present high SE values [86] as a measure of their ability to discriminate among proteins (peptides) with different sequences. The number of discretization bins was set equal to the number of peptides in each training dataset. Thus, the maximum entropy that a PD can achieve in the general-AMP, antibacterial, antifungal, antiparasitic and antiviral training datasets is 14.25, 13.68, 10.6, 7.63 and 12.18 bits, respectively. The retained PDs were sorted in ascending order according to their SE values, and the Spearman correlation coefficient between each pair of PDs was then calculated to remove redundant ones: if a pair of PDs was correlated above 0.95, the PD with the lower SE value was filtered out.
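A minimal sketch of this unsupervised filtering step is given below. It assumes a descriptor matrix X with one row per peptide, uses the thresholds stated in the text (10% of the maximum entropy and a Spearman cutoff of 0.95), and sets the number of bins to the number of peptides, so that the maximum reachable entropy is log2(N) bits (e.g. log2(19 548) ≈ 14.25 bits for the general-AMP training set). The exact tie-handling of the IMMAN implementation is not reproduced.

```python
import numpy as np
from scipy.stats import spearmanr

def shannon_entropy_bits(x, n_bins):
    """Shannon entropy (bits) of one descriptor discretized into n_bins bins."""
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def unsupervised_filter(X, min_fraction=0.10, rho_cutoff=0.95):
    """Indices of descriptors kept after the SE filter and Spearman redundancy removal.

    X has shape (n_peptides, n_descriptors); with n_bins = n_peptides the maximum
    reachable entropy is log2(n_peptides) bits.
    """
    n_peptides, n_desc = X.shape
    max_entropy = np.log2(n_peptides)
    se = np.array([shannon_entropy_bits(X[:, j], n_peptides) for j in range(n_desc)])
    candidates = np.where(se >= min_fraction * max_entropy)[0]

    # For each pair correlated above the cutoff, keep the higher-entropy member
    kept = []
    for j in candidates[np.argsort(-se[candidates])]:  # visit highest-entropy PDs first
        rho = [abs(spearmanr(X[:, j], X[:, k]).correlation) for k in kept]
        if all(r <= rho_cutoff for r in rho):
            kept.append(j)
    return kept
```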
Afterward, a supervised selection using the Correlation Subset method [87] was applied on each reduced descriptor matrix to determine a subset of PDs highly correlated with the activity and lowly correlated among themselves. Five ranking-type feature selection techniques were also applied on each reduced descriptor matrix, with the number of top-ranked PDs retained by these methods equal to the size of the subset obtained with the Correlation Subset method. One of the ranking-type procedures is the SE-based unsupervised method explained above, whereas the other ranking-type methods are Relief-F [88], Information Gain [89], Symmetrical Uncertainty [89] and Chi Squared [89], which are supervised feature selection approaches. With the Relief-F method, the PDs that best discriminate between peptide sequences that are near each other are retained, whereas with the Symmetrical Uncertainty and Information Gain methods, the PDs with the highest information content with respect to the activity are selected. Finally, the PDs most dependent on the activity are retained with the Chi Squared method. The SE-based selector was applied using the IMMAN tool [90], whereas the others were used with their default settings in the WEKA tool v3.8 [89].
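As an illustration only, the following scikit-learn sketch mimics two of these supervised rankings (an Information Gain analogue via mutual information, and Chi Squared); the study itself used the WEKA v3.8 implementations, and Relief-F, Symmetrical Uncertainty and the Correlation Subset method are not reproduced here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

def top_k_rankings(X, y, k):
    """Indices of the top-k descriptors under two supervised ranking criteria."""
    rankings = {}
    # Information Gain analogue: mutual information between each descriptor and the activity
    mi = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    rankings["information_gain"] = np.argsort(-mi.scores_)[:k]
    # Chi Squared requires non-negative inputs, hence the min-max scaling
    X_pos = MinMaxScaler().fit_transform(X)
    ch = SelectKBest(chi2, k=k).fit(X_pos, y)
    rankings["chi_squared"] = np.argsort(-ch.scores_)[:k]
    return rankings
```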
Additionally, because no single feature selector can guarantee an optimal feature subset, an additional (ensemble) descriptor subset was created for the iLearn HPDs and for the ESM-1b nHPDs by joining the descriptor subsets determined with the six selection methods explained above. The importance of merging different feature selectors has been explained elsewhere [67, 68]. Thus far, a total of 43 iLearn HPD subsets and 7 ESM-1b nHPD subsets were created per training dataset. Furthermore, another 43 subsets containing HPDs calculated with the ProtDCal software [62] were also created per training dataset. These subsets were built by combining each of the iLearn HPD subsets with the corresponding ProtDCal HPD (baseline) subset proposed in [51]. All these subsets containing a diversity of HPDs were created to build RF models as robust as possible in order to ensure rigorous comparisons with the RF models based on the ESM-1b nHPDs.
On all the built descriptor subsets (93 per training dataset), a wrapper-type selector [91] was applied using the Genetic Algorithm (GA) metaheuristic as the search method, accuracy as the fitness measure and RF [92] as the learning procedure. In this way, a total of 93 subsets comprised of the ‘most suitable’ features for the RF learner were additionally obtained. Therefore, a total of 186 descriptor subsets were finally taken into account to model each endpoint. Then, an RF-based model was created on each descriptor subset. RF was used to guarantee comparability of results with the baseline RF models introduced in [51], which are based on HPDs calculated with the ProtDCal software [62]. All the built RF models were assessed on their corresponding testing and external datasets. The wrapper-type selector, RF classifier and GA metaheuristic were applied with their default settings in the WEKA tool v3.8. Schema 1 shows the workflow explained above.
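The evaluation of each descriptor subset can be summarized as follows. This is a simplified scikit-learn sketch of the final step (10-fold cross-validated MCCtrain plus MCCtest on a held-out set); the actual models were built with the WEKA defaults, and the GA-based wrapper search is omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict

def evaluate_subset(X_train, y_train, X_test, y_test, feature_idx, seed=1):
    """Fit an RF on one descriptor subset and report MCC_train (10-fold CV) and MCC_test."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # 10-fold cross-validated predictions on the training set
    cv_pred = cross_val_predict(rf, X_train[:, feature_idx], y_train, cv=10)
    mcc_train = matthews_corrcoef(y_train, cv_pred)
    # Refit on the whole training set and score the held-out test set
    rf.fit(X_train[:, feature_idx], y_train)
    mcc_test = matthews_corrcoef(y_test, rf.predict(X_test[:, feature_idx]))
    return mcc_train, mcc_test
```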

Workflow applied on each training dataset to build the RF models.
Results and discussion
Do nHPDs have better modeling abilities than HPDs?
This section compares the modeling ability of HPDs and nHPDs. Figure 1 shows the boxplots of the training Matthews correlation coefficients (MCCtrain) obtained via 10-fold cross-validation by the best RF models built on each family (i.e. iLearn G1, …, iLearn G7 and ESM-1b) and ensemble (i.e. iLearn, and iLearn & ProtDCal) of PDs accounted for. The best RF model selected for each family (and ensemble) was the one with the highest MCCtrain value (see Data S1 for the PD sets, trained models and WEKA software outputs). The MCCtrain values obtained by the baseline models (last boxplot in Figure 1) are also shown. As can be noted in Figure 1, models with suitable variance

Training Matthews correlation coefficients obtained by the models built from each HPD family calculated with the iLearn software, from each ensemble of HPDs (i.e. iLearn, and iLearn & ProtDCal), and from the set of nHPDs obtained with the ESM-1b Transformer model. The last boxplot corresponds to the baseline RF models, which were created with ProtDCal HPDs.
Moreover, Table 2 shows the performance [i.e. sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC)] on the testing and external datasets (see Table 1) of the best RF models built. The performance of the RF models based on the ProtDCal HPDs (baseline) [51] is shown as well. Supplementary Table S2 shows the performance metrics for the worst models. On the one hand, it can first be noted that the best built models achieved good performance according to the MCC values on the testing datasets (MCCtest), which are generally comparable (or superior) to the MCCtest values obtained by the baseline RF models (i.e. those based only on ProtDCal HPDs). This suggests that the best built models are not overfitted, according to their MCCtrain (see Supplementary Table S1) and MCCtest values, and consequently they are suitable for further analyses. On the other hand, it can be observed that the best models based on HPDs mostly presented better performance than the RF models based on nHPDs (i.e. those obtained with the ESM-1b model) according to the MCCtest values. Specifically, the highest MCCtest values achieved by the models based on HPDs were better than the ones based on nHPDs by 3.54, 2.46, 0.11, 9.15 and 5.16% on the general-AMP, ABP, AFP, APP and AVP testing datasets, respectively. However, these better performances were mostly achieved using a greater number of HPDs than nHPDs. In general, these results on the testing datasets could suggest that our premise is uncertain, that is, that the sole use of HPDs does not limit the building of better models even though they do not contain the evolutionary information present in protein sequences [69–72].
Table 2. Performance metrics on the test and external datasets (see Table 1) obtained by the best RF models built with different types of handcrafted descriptors as well as with ESM-1b non-handcrafted descriptors. The best results are highlighted in bold and italic. The column ‘Size’ refers to the number of descriptors included in the models
| Endpoint | Type of descriptors | Size | SNtest | SPtest | ACCtest | MCCtest | SNext | SPext | ACCext | MCCext |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General-AMP | ProtDCal (see Table 2 in [51]) | 207 | 0.862 | 0.945 | 0.904 | 0.810 | 0.911 | 0.909 | 0.910 | 0.799 |
|  | iLearn G1 | 37 | 0.856 | 0.938 | 0.897 | 0.796 | 0.875 | 0.899 | 0.892 | 0.756 |
|  | iLearn G6 | 39 | 0.851 | 0.942 | 0.897 | 0.797 | 0.904 | 0.908 | 0.906 | 0.791 |
|  | iLearn G7 | 32 | 0.850 | 0.941 | 0.895 | 0.794 | 0.893 | 0.912 | 0.906 | 0.788 |
|  | iLearn Ensemble | 87 | 0.849 | 0.945 | 0.897 | 0.797 | 0.896 | 0.917 | 0.911 | 0.797 |
|  | iLearn ProtDCal Ensemble | 144 | 0.866 | 0.952 | 0.909 | 0.820 | 0.910 | 0.910 | 0.910 | 0.799 |
|  | ESM-1b | 85 | 0.840 | 0.947 | 0.894 | 0.792 | 0.904 | 0.945 | 0.932 | 0.843 |
| ABP | ProtDCal (see Table 2 in [51]) | 93 | 0.912 | 0.957 | 0.935 | 0.870 | 0.897 | 0.923 | 0.915 | 0.805 |
|  | iLearn G1 | 44 | 0.899 | 0.950 | 0.925 | 0.850 | 0.873 | 0.909 | 0.898 | 0.766 |
|  | iLearn G6 | 40 | 0.896 | 0.961 | 0.928 | 0.858 | 0.886 | 0.926 | 0.914 | 0.801 |
|  | iLearn G7 | 32 | 0.903 | 0.949 | 0.926 | 0.853 | 0.880 | 0.932 | 0.916 | 0.805 |
|  | iLearn Ensemble | 111 | 0.901 | 0.966 | 0.934 | 0.870 | 0.890 | 0.934 | 0.921 | 0.816 |
|  | iLearn ProtDCal Ensemble | 86 | 0.910 | 0.961 | 0.936 | 0.873 | 0.896 | 0.934 | 0.922 | 0.820 |
|  | ESM-1b | 47 | 0.894 | 0.956 | 0.925 | 0.852 | 0.906 | 0.953 | 0.939 | 0.856 |
| AFP | ProtDCal (see Table 2 in [51]) | 144 | 0.921 | 0.963 | 0.942 | 0.884 | 0.747 | 0.908 | 0.859 | 0.664 |
|  | iLearn G1 | 53 | 0.902 | 0.977 | 0.940 | 0.882 | 0.686 | 0.912 | 0.842 | 0.619 |
|  | iLearn G6 | 49 | 0.907 | 0.963 | 0.935 | 0.871 | 0.709 | 0.922 | 0.857 | 0.654 |
|  | iLearn G7 | 32 | 0.907 | 0.963 | 0.935 | 0.871 | 0.640 | 0.932 | 0.843 | 0.615 |
|  | iLearn Ensemble | 111 | 0.907 | 0.963 | 0.935 | 0.871 | 0.702 | 0.910 | 0.846 | 0.630 |
|  | iLearn ProtDCal Ensemble | 35 | 0.930 | 0.963 | 0.947 | 0.893 | 0.739 | 0.914 | 0.860 | 0.666 |
|  | ESM-1b | 58 | 0.902 | 0.986 | 0.944 | 0.892 | 0.831 | 0.967 | 0.925 | 0.821 |
| APP | ProtDCal (see Table 2 in [51]) | 40 | 0.871 | 0.903 | 0.887 | 0.775 | 0.825 | 0.867 | 0.865 | 0.357 |
|  | iLearn G1 | 19 | 0.871 | 0.774 | 0.823 | 0.648 | 0.723 | 0.807 | 0.804 | 0.244 |
|  | iLearn G6 | 18 | 0.742 | 0.903 | 0.823 | 0.654 | 0.781 | 0.903 | 0.898 | 0.392 |
|  | iLearn G7 | 31 | 0.774 | 0.839 | 0.807 | 0.614 | 0.825 | 0.887 | 0.885 | 0.387 |
|  | iLearn Ensemble* | - | - | - | - | - | - | - | - | - |
|  | iLearn ProtDCal Ensemble | 28 | 0.839 | 0.903 | 0.871 | 0.743 | 0.830 | 0.872 | 0.870 | 0.366 |
|  | ESM-1b | 28 | 0.871 | 0.839 | 0.855 | 0.710 | 0.888 | 0.878 | 0.878 | 0.404 |
| AVP | ProtDCal (see Table 2 in [51]) | 96 | 0.764 | 0.892 | 0.828 | 0.662 | 0.742 | 0.873 | 0.860 | 0.476 |
|  | iLearn G1 | 81 | 0.807 | 0.900 | 0.854 | 0.711 | 0.584 | 0.878 | 0.848 | 0.374 |
|  | iLearn G6 | 27 | 0.761 | 0.889 | 0.825 | 0.656 | 0.723 | 0.866 | 0.852 | 0.452 |
|  | iLearn G7 | 33 | 0.769 | 0.904 | 0.836 | 0.679 | 0.706 | 0.843 | 0.829 | 0.406 |
|  | iLearn Ensemble | 174 | 0.793 | 0.915 | 0.854 | 0.713 | 0.680 | 0.876 | 0.856 | 0.438 |
|  | iLearn ProtDCal Ensemble | 97 | 0.785 | 0.912 | 0.848 | 0.702 | 0.750 | 0.872 | 0.859 | 0.478 |
|  | ESM-1b | 70 | 0.783 | 0.891 | 0.837 | 0.678 | 0.921 | 0.868 | 0.873 | 0.585 |
*The RF models built on this subset did not satisfy the rule of thumb that each variable in a model should explain the variance of at least five cases.
However, the results obtained on the external datasets reveal that the RF models based on nHPDs were always better than all the RF models based on HPDs in terms of the MCCext values (see Table 2). Specifically, the RF models based on nHPDs achieved MCCext values of 0.843, 0.856, 0.821, 0.404 and 0.585 in the classification of general-AMPs, ABPs, AFPs, APPs and AVPs, respectively, whereas the best RF models based on HPDs achieved MCCext values of 0.799, 0.82, 0.666, 0.392 and 0.478. Thus, the RF models based on nHPDs were better than the RF models based on HPDs by 5.51, 4.39, 23.27, 3.06 and 22.38%, respectively. The highest improvements were obtained in the classification of AFPs and AVPs, due mainly to a better classification of true-AFPs and true-AVPs (SNext metric). In fact, in these two endpoints the highest SNext values obtained by the models based on HPDs were 0.747 and 0.75, whereas the ones achieved by the models based on nHPDs were 0.831 (11.24% better) and 0.921 (22.8% better), respectively.
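As a check, the relative improvements quoted above follow directly from the MCCext values in Table 2; for the general-AMP endpoint, for example,

$$\frac{0.843 - 0.799}{0.799} \times 100 \approx 5.51\%,$$

and the remaining percentages are obtained in the same way from the corresponding MCCext (or SNext) pairs.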
These results are in line with the No Free Lunch theorem [94], which states that ‘over the space of all possible problems, every machine learning method will perform as well as every other one on average’. In this case, there is no single model fed with nHPDs that will always perform better than any other model fed with HPDs, or vice versa. Nevertheless, these results suggest that the nHPDs obtained from the ESM-1b Transformer model [83] present better modeling abilities than the HPDs calculated with the ProtDCal [62] and iLearn [85] software, because the models built with them consistently yielded improvements across the external datasets. We make this assertion because the external datasets present a distribution different from that of the training datasets, unlike the testing datasets, which were built from a rational division based on clustering (see Section 2.1 in [51]). Consequently, the external datasets are more suitable to evaluate the robustness and stability of the created models. Moreover, a statistical study based on Bayesian estimation [95] (see Supplementary Table S3 for input values) confirmed our previous assertion, as shown in Supplementary Figure S1. From this figure, it can be concluded that models based on nHPDs have higher probability
Does the combination of nHPDs and HPDs improve the identification of AMPs?
To fulfill the aim of this section, several RF models were created on subsets containing both nHPDs and HPDs. These subsets were created by combining the nHPD subsets with the HPD subsets obtained for each training dataset in the modeling process described in Schema 1. Then, a total of 15 reduced subsets were additionally built for each new subset comprised of nHPDs and HPDs as follows: (i) by applying the six feature selectors used in this work; (ii) by joining the output of those six feature selectors; and (iii) by applying the wrapper-type selector used in this work on the seven previous subsets (steps (i) and (ii)) as well as on the non-reduced nHPD and HPD subset. Finally, an RF model was built on each subset comprised of nHPDs and HPDs. These models were grouped according to the types of PDs, i.e. those based on ESM-1b & iLearn PDs, those based on ESM-1b & ProtDCal PDs, and those based on ESM-1b & iLearn & ProtDCal PDs. The best RF model (i.e. the one with the highest MCCtrain) in each of these groups was used for the further analyses (see Supplementary Table S4 for the training metrics and Table 3 for the performance metrics on the testing/external datasets).
Table 3. Performance metrics on the test and external datasets obtained by the best RF models when combining HPDs and nHPDs. The column ‘Size’ refers to the number of descriptors included in the models
| Endpoint | Type of descriptors | Size | SNtest | SPtest | ACCtest | MCCtest | SNext | SPext | ACCext | MCCext |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General-AMP | ESM-1b and iLearn | 85 | 0.857 | 0.954 | 0.906 | 0.815 | 0.906 | 0.936 | 0.927 | 0.833 |
|  | ESM-1b and ProtDCal | 134 | 0.859 | 0.962 | 0.911 | 0.826 | 0.916 | 0.929 | 0.925 | 0.830 |
|  | ESM-1b and iLearn and ProtDCal | 104 | 0.866 | 0.960 | 0.913 | 0.830 | 0.915 | 0.934 | 0.928 | 0.837 |
| ABP | ESM-1b and iLearn | 96 | 0.898 | 0.965 | 0.932 | 0.865 | 0.903 | 0.959 | 0.942 | 0.863 |
|  | ESM-1b and ProtDCal | 71 | 0.877 | 0.973 | 0.925 | 0.855 | 0.896 | 0.948 | 0.932 | 0.841 |
|  | ESM-1b and iLearn and ProtDCal | 49 | 0.898 | 0.973 | 0.935 | 0.873 | 0.907 | 0.958 | 0.942 | 0.864 |
| AFP | ESM-1b and iLearn | 73 | 0.912 | 0.986 | 0.949 | 0.900 | 0.865 | 0.955 | 0.928 | 0.828 |
|  | ESM-1b and ProtDCal | 63 | 0.870 | 0.981 | 0.926 | 0.857 | 0.783 | 0.956 | 0.903 | 0.767 |
|  | ESM-1b and iLearn and ProtDCal | 111 | 0.930 | 0.981 | 0.956 | 0.913 | 0.851 | 0.952 | 0.921 | 0.813 |
| APP | ESM-1b and iLearn | 30 | 0.871 | 0.935 | 0.903 | 0.808 | 0.891 | 0.909 | 0.908 | 0.463 |
|  | ESM-1b and ProtDCal | 25 | 0.774 | 0.968 | 0.871 | 0.756 | 0.898 | 0.927 | 0.926 | 0.511 |
|  | ESM-1b and iLearn and ProtDCal | 24 | 0.903 | 0.903 | 0.903 | 0.806 | 0.898 | 0.916 | 0.916 | 0.483 |
| AVP | ESM-1b and iLearn | 97 | 0.788 | 0.894 | 0.841 | 0.686 | 0.907 | 0.881 | 0.884 | 0.598 |
|  | ESM-1b and ProtDCal | 77 | 0.705 | 0.872 | 0.788 | 0.584 | 0.912 | 0.873 | 0.877 | 0.588 |
|  | ESM-1b and iLearn and ProtDCal | 167 | 0.785 | 0.899 | 0.842 | 0.688 | 0.911 | 0.892 | 0.894 | 0.621 |
Figure 2 depicts, for each endpoint, the MCCtest and MCCext values achieved by the best RF models created from the subsets containing ESM-1b & iLearn PDs, ESM-1b & ProtDCal PDs, and ESM-1b & ProtDCal & iLearn PDs, respectively. The MCCtest and MCCext values obtained by the best RF models built only with ESM-1b nHPDs (see Table 2) are also depicted (blue bars) and are considered as the baseline in this analysis. Figure 2 reveals that the RF models derived from the subsets comprised of both nHPDs and HPDs generally yielded better performance than the models built only with nHPDs. Specifically, the combination of iLearn HPDs with ESM-1b nHPDs (light blue bars) always contributed to better results, except for the general-AMP external dataset (see Figure 2B). From Figure 2, it can also be noted that the models based on both ProtDCal HPDs and ESM-1b nHPDs (yellow bars) achieved inferior performance with respect to the baseline models in half of the datasets, which suggests that the ProtDCal HPDs do not consistently contribute to building better models when merged with the ESM-1b nHPDs.

Matthews correlation coefficient achieved on the testing (A) and external (B) datasets by the best models built from the nHPD and HPD subsets. The performance of the best models built only with nHPDs is also shown (baseline).
However, the combination of iLearn and ProtDCal HPDs with ESM-1b nHPDs (red bars) contributed to developing the best models. Indeed, the highest MCCtest values (see Figure 2A) were obtained by the models based on the three types of PDs considered (i.e. iLearn, ProtDCal and ESM-1b PDs), except on the APP testing dataset, where the best developed model obtained the second highest MCCtest value. The models based on the three types of PDs also attained the highest MCCext values on the ABP and AVP external datasets, and the second highest MCCext values on the general-AMP and APP external datasets (see Figure 2B). All these results are supported by a statistical analysis based on Bayesian estimation [95] (see Supplementary Tables S5 to S7 for input values). This analysis reveals (see Supplementary Figure S2) that the models developed with iLearn, ProtDCal and ESM-1b PDs
Finally, Figure 3 depicts the number of nHPDs and HPDs included in the RF models built. For each endpoint, the first bar corresponds to the best model created only with ESM-1b nHPDs, the second bar to the best model created with both ESM-1b nHPDs and iLearn HPDs, the third bar to the best model created with both ESM-1b nHPDs and ProtDCal HPDs, and the fourth bar to the best model created with the three types of PDs accounted for (i.e. iLearn, ProtDCal and ESM-1b PDs). On the one hand, it can be observed that the combination of nHPDs with HPDs led to more complex models in terms of the total number of variables. However, the statistical outcomes explained above justify that increase in complexity. On the other hand, Figure 3 reveals that a greater number of nHPDs were generally included in the best built models. This confirms that, in the classification of AMPs, the chemical information related to the variation of protein sequences is rather different from the one encoded by the HPDs calculated with the ProtDCal [62] and iLearn [85] tools. Nonetheless, the number of HPDs included in the models also indicates that they contain chemical information that is not present in the ESM-1b nHPDs. This suggests complementarity between both types of features, which leads to better predictions when they are combined. That complementarity is studied in depth below.

Number of nHPDs and HPDs included in the best RF models created from the nHPD and HPD subsets. The first bar corresponds to the models created only with ESM-1b nHPDs; the second bar to the models created with both ESM-1b nHPDs and iLearn HPDs; the third bar to the models created with both ESM-1b nHPDs and ProtDCal HPDs; and the fourth bar to the models built with ESM-1b nHPDs, iLearn HPDs and ProtDCal HPDs.
nHPD and HPD space visualization comparisons
In this section, we analyze the effectiveness of nHPDs and HPDs using feature space visualization comparisons. Feature space visualizations of all the nHPDs and HPDs in 2D space were generated via the t-SNE algorithm [97] (see Source Code S1). Specifically, the feature datasets used to train the models to predict general-AMPs, ABPs, AFPs and AVPs were the ones used to perform this analysis. These feature datasets comprise iLearn and ProtDCal HPDs (see Data S1B) as well as ESM-1b nHPDs (see Data S1C). The visualizations of the HPDs and nHPDs are shown in Figure 4A and B, respectively. As can be seen, the feature space boundary between positive (in red) and negative peptide sequences in the ESM-1b nHPD visualization is distinctly defined, whereas there is no clear distinction between the sequences of both classes in the HPD visualization. This confirms that the evolutionary information codified in the ESM-1b nHPDs is fundamental for the classification of AMPs. It also suggests that shallow learners fed with nHPDs will be able to make better decisions than when fed with HPDs only. These findings explain why the RF models using ESM-1b nHPDs presented comparable-to-superior generalization abilities with respect to the RF models based only on the iLearn and/or ProtDCal HPDs (see Tables 2 and 3). These results are also in correspondence with conclusions established elsewhere [98].
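For reference, a minimal t-SNE projection of a descriptor matrix can be produced as sketched below. This is a generic scikit-learn analogue and does not reproduce the exact settings of Source Code S1; the names X, y and the example call are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(X, y, title):
    """Project a descriptor matrix to 2D with t-SNE and color the points by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=4, c="gray", label="negative")
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=4, c="red", label="positive (AMP)")
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_feature_space(X_esm1b, y_train, "ESM-1b nHPDs")  # hypothetical variable names
```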

Feature space 2D visualizations of the iLearn and ProtDCal HPDs (part A) and ESM-1b nHPDs (part B).
Analysis of the HPDs and nHPDs used to build the best models using chemometric and explainable artificial intelligence (XAI) methods
Here, we study the relevance and complementarity of the nHPDs and HPDs included in the best model created for each endpoint (see Data S1F). A SE-based relevance analysis [86] was first performed to quantify the information content of the PDs. Relevant PDs for modeling should present high SE values as an indicator of their ability to distinguish among proteins (peptides) with different sequences. A Gini index-based importance (IG) analysis [99] was also carried out to assess how regularly a PD was selected for a split and how large its overall discriminative ability was for the RF classifier. Moreover, a permutation-based importance (IP) study was performed to measure the increase in the prediction error of a model after permuting the values of each of its features (PDs) [100]. In this study, a PD is ‘important’ if the model error increases when its values are shuffled. Finally, an interaction analysis based on Friedman’s H-statistic [101] was performed to assess how much of the variation of the predictions depends on the cooperation of the PDs. These last two studies are model-agnostic global interpretation methods, and their implementations in the Interpretable Machine Learning (iml) package [102] were used (see Source Code S2).
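A simplified sketch of the two RF-importance measures is shown below using scikit-learn as an analogue of the tools actually employed (WEKA for IG and the iml R package for IP). Note that iml reports permutation importance as an error ratio (values above 1 indicate an important feature), whereas sklearn's permutation_importance reports the mean drop in score, so the scales differ; the Friedman H-statistic interaction analysis is not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def descriptor_importances(X, y, names, seed=1):
    """Gini-based (IG analogue) and permutation-based (IP analogue) importances of an RF."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    gini = dict(zip(names, rf.feature_importances_))    # mean decrease in Gini impurity
    perm = permutation_importance(rf, X, y, n_repeats=10, random_state=seed)
    drop = dict(zip(names, perm.importances_mean))      # mean drop in score after shuffling
    return gini, drop
```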
Figure 5 depicts (for each endpoint) the boxplots corresponding to the entropy, importance and interaction values calculated with the analyses described above. These boxplots are depicted for both the ESM-1b nHPDs alone and the iLearn and ProtDCal HPDs together that were used to build the models based on the three types of PDs (see Data S1F). On the one hand, Figure 5A shows the results obtained with the SE-based relevance analysis (see Supplementary Table S8 for the SE values). This analysis reveals that the nHPDs included in the analyzed models present greater information content than the HPDs. The maximum SE value (mSE) that a PD can obtain in the general-AMP, antibacterial, antifungal, antiparasitic and antiviral training datasets is equal to 14.25, 13.68, 10.6, 7.63 and 12.18 bits, respectively. As can be observed, the ESM-1b nHPDs always achieved SE values greater than 11.8 (82.8% of the mSE), 11.4 (83.3% of the mSE), 8.2 (77.3% of the mSE), 6.3 (82.6% of the mSE) and 7.6 (62.4% of the mSE) bits, respectively. Notice that the HPDs mostly obtained SE values below these thresholds. This superior behavior of the nHPDs can also be observed in Supplementary Figure S3, except for the AFP dataset. These outcomes suggest that the superior performance achieved by the models based on ESM-1b nHPDs may also be because they have a greater sensitivity to progressive changes in protein (peptide) sequences than the HPDs, and thus, they better characterize peptides (proteins) with different sequences. The latter is a desirable attribute for any molecular descriptor [103]. This study was performed with the IMMAN software [90] using a binning schema (discretization) with the number of bins equal to the number of sequences in each training dataset.
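Since the SE analysis itself was computed with the IMMAN software [90], the fragment below is only a simplified sketch of the underlying idea: each PD is discretized into as many bins as training sequences and the Shannon entropy (in bits) of the resulting frequency distribution is taken as its information content. Function and variable names are illustrative.

```python
# Simplified sketch of the Shannon entropy (SE) relevance analysis; the actual
# values in Supplementary Table S8 come from the IMMAN software, so this only
# approximates the concept.
import numpy as np

def shannon_entropy(values: np.ndarray, n_bins: int) -> float:
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())          # entropy in bits

# X: (n_sequences, n_descriptors) matrix of PD values for one training dataset.
def se_per_descriptor(X: np.ndarray) -> np.ndarray:
    n_bins = X.shape[0]       # binning schema equal to the number of sequences
    return np.array([shannon_entropy(X[:, j], n_bins) for j in range(X.shape[1])])
```

Under this binning schema, the maximum attainable SE for a dataset with n sequences is log2(n) bits, which is consistent with the mSE values quoted above.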

Boxplots corresponding to the results obtained with the SE-based relevance analysis (A), the Gini index-based importance analysis (B), the permutation-based importance analysis (C) and interaction-based importance analysis (D). These boxplots are depicted for both the nHPDs and HPDs that were used in the best model built for each endpoint.
On the other hand, Figure 5B represents the IG values corresponding to the nHPDs and HPDs that were jointly used to create the analyzed models (see Data S1F). Supplementary Table S9 contains the IG value obtained for each PD. It can be observed that the nHPDs generally achieved better IG values than the HPDs in the classification of ABPs and APPs, whereas in the classification of general-AMPs and AVPs, the HPDs were the ones that mostly obtained the best IG values. Both types of PDs had comparable IG value distributions in the classification of AFPs, although the nHPDs were the ones that achieved the highest IG values. These results indicate that both types of PDs are relevant to distinguish between peptide sequences with positive and negative activities in the recognition of AMPs. Indeed, it can be observed in Supplementary Figure S4 that there are no outlier IG values, which means that no PD was significantly more important than the others in the decision-making of the RF learner. The IG values were obtained from the WEKA tool output [89].
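The IG values reported here were taken from the WEKA output; the sketch below only mirrors the concept with scikit-learn's mean-decrease-in-impurity importance, which likewise rewards descriptors that are frequently chosen for splits and produce large Gini impurity reductions. It is an approximation, not the procedure used to generate Supplementary Table S9.

```python
# Hedged sketch of a Gini index-based importance (IG) analysis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gini_importances(X: np.ndarray, y: np.ndarray, names: list[str]) -> dict[str, float]:
    # Mean decrease in Gini impurity accumulated over all trees of the forest,
    # normalized so that the importances sum to 1.
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    return dict(zip(names, rf.feature_importances_))
```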
Moreover, Figure 5C depicts the outcomes obtained with the permutation-based importance analysis. Supplementary Table S10 contains the IP value obtained for each PD. In this figure, it can be noted that the IP value distributions of the nHPDs were better than those of the HPDs, except in the AVP dataset. It can additionally be noted that nHPDs were always important in the predictions of the analyzed models since at least one nHPD obtained an IP value greater than 1. Notice that no HPD was important to recognize APPs.
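The IP values were computed with the iml package (Source Code S2). As a rough Python analogue, and under the assumption that an IP greater than 1 is read as the ratio between the permuted and the original model error, a ratio-based permutation importance can be sketched as follows; names and the error measure are illustrative.

```python
# Simplified ratio-based permutation importance (IP) sketch.
import numpy as np
from sklearn.metrics import zero_one_loss

def permutation_importance_ratio(model, X, y, rng=np.random.default_rng(0)):
    base_err = zero_one_loss(y, model.predict(X))
    ratios = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])        # destroy the feature's signal
        perm_err = zero_one_loss(y, model.predict(Xp))
        ratios.append(perm_err / max(base_err, 1e-12))   # IP > 1 => error increased
    return np.array(ratios)
```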

Boxplots corresponding to the performance (MCCext) achieved on the general-AMP, ABP, AFP and AVP external datasets by the 30 models that were built with the AMPScanner, APIN and APIN-Fusion architectures, respectively.
Finally, Figure 5D depicts the outcomes obtained with the interaction analysis based on the H-statistic [101]. Supplementary Table S11 contains the interaction strength (ISH) values obtained for each PD regarding all the other PDs. The results first reveal that the interaction effects between the PDs are generally very weak because almost all the ISH values are less than 0.1 (10% of the explained variance). Only two HPDs achieved ISH values above this threshold, in the recognition of ABPs and APPs. It can also be observed that the highest ISH values were obtained by HPDs in the classification of AFPs and AVPs; only in the classification of general-AMPs did an nHPD obtain the highest ISH value. Nonetheless, it can be seen in Figure 5D that the nHPDs have a better distribution of ISH values in the classification of general-AMPs and AVPs. These findings suggest that the variation of the predictions depended more on the HPDs than on the nHPDs since the former obtained the highest ISH value(s) in 4 out of the 5 endpoints analyzed. Altogether, these results demonstrate that the nHPDs and HPDs are complementary, because neither type was consistently better than the other across all the relevance, importance and interaction studies performed. This justifies why the combination of nHPDs and HPDs leads to better predictive models than using either type independently.
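Friedman's H-statistic was likewise computed with the iml package; the fragment below is only a simplified, subsampled re-implementation of its one-versus-all form (the share of prediction variance explained by interactions involving a given feature), included to make the quantity summarized in Figure 5D concrete. The subsample size is an assumption made to keep the double loop over partial dependence estimates tractable.

```python
# Simplified sketch of the overall (one-vs-all) Friedman H-statistic for feature j.
import numpy as np

def h_statistic_overall(model, X, j, n_samples=200, rng=np.random.default_rng(0)):
    idx = rng.choice(X.shape[0], size=min(n_samples, X.shape[0]), replace=False)
    Xs = X[idx]
    f = model.predict_proba(Xs)[:, 1]                    # full-model predictions

    pd_j = np.empty(len(Xs))                             # partial dependence on x_j
    pd_rest = np.empty(len(Xs))                          # partial dependence on x_{-j}
    for i in range(len(Xs)):
        Xj = Xs.copy(); Xj[:, j] = Xs[i, j]              # fix x_j, marginalize the rest
        pd_j[i] = model.predict_proba(Xj)[:, 1].mean()
        Xr = np.repeat(Xs[i:i + 1], len(Xs), axis=0)     # fix x_{-j}, marginalize x_j
        Xr[:, j] = Xs[:, j]
        pd_rest[i] = model.predict_proba(Xr)[:, 1].mean()

    # Center each term before comparing, as the H-statistic definition requires.
    f, pd_j, pd_rest = f - f.mean(), pd_j - pd_j.mean(), pd_rest - pd_rest.mean()
    return float(np.sum((f - pd_j - pd_rest) ** 2) / np.sum(f ** 2))
```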
Performance regarding state-of-the-art DNN architectures
This section compares the performance of the RF models built with nHPDs against that achieved by the AMPScanner [39], APIN [45] and APIN-Fusion [45] DNN architectures, which were proposed in the literature to classify AMP sequences. As the training of these architectures is not deterministic, each of them was retrained 30 times on the training datasets detailed in Table 1, except for the APP training dataset because it has very few sequences. Files S1 in [75] and Data S2 hold the models built (.h5 files) with the AMPScanner and APIN (and APIN-Fusion) architectures, respectively. Each built model was evaluated on the corresponding external dataset (see Supplementary Tables S1–S4 in [75] for the MCCext values obtained by the AMPScanner architecture, and Supplementary Tables S12 and S13 for the MCCext values obtained by the APIN and APIN-Fusion architectures, respectively). By retraining these three architectures, a fair comparison with the RF models is guaranteed. The RF models used in the comparisons were the ones based only on the ESM-1b nHPDs (see Data S1C) as well as the ones based on the subsets comprised of ESM-1b, iLearn and ProtDCal PDs (see Data S1F).
On the one hand, Figure 6 depicts the boxplots corresponding to the MCCext values obtained by the 30 models created with the AMPScanner, APIN and APIN-Fusion architectures to predict general-AMPs, ABPs, AFPs and AVPs, respectively. It can first be noted that the MCCext values obtained with the AMPScanner architecture are the most scattered of all, presenting a standard deviation equal to 3.35, 2.25, 3.61 and 4.74 for the general-AMP, ABP, AFP and AVP external datasets, respectively. This indicates that the AMPScanner performance is not stable across runs and that its highest performances may be obtained by chance. In fact, the highest MCCext value on the AVP dataset is an outlier, while most of the MCCext values were less than 0.8 for the general-AMP and ABP external datasets, as well as less than 0.6 and 0.4 for the AFP and AVP external datasets, respectively. Moreover, it can be noted that the APIN and APIN-Fusion architectures were the ones that achieved the best performances, with both obtaining MCCext values mostly above the aforementioned thresholds. APIN yielded the least scattered MCCext values (with a standard deviation equal to 1.48, 1.04, 1 and 1.58 for the general-AMP, ABP, AFP and AVP datasets, respectively), whereas APIN-Fusion yielded the highest MCCext values (with a standard deviation equal to 4.86, 2.29, 1.61 and 2.59 for the general-AMP, ABP, AFP and AVP datasets, respectively). The high standard deviation of APIN-Fusion on the general-AMP dataset was due to three low outlier MCCext values; after removing them, a standard deviation equal to 2.67 was obtained.
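For reference, the per-run evaluation summarized in Figure 6 can be reproduced along the following lines, assuming the 30 retrained Keras models (.h5 files referenced above) and that the external peptide sequences have already been encoded into the input format each network expects. The directory layout, the function name and the 0.5 decision threshold are assumptions for illustration, not the exact evaluation scripts used in this work.

```python
# Sketch: summarize MCCext over the 30 retrained DNN models of one architecture.
import glob
import numpy as np
from tensorflow.keras.models import load_model
from sklearn.metrics import matthews_corrcoef

def mcc_over_runs(model_dir: str, X_ext, y_ext) -> tuple[float, float]:
    mccs = []
    for path in sorted(glob.glob(f"{model_dir}/*.h5")):      # 30 retrained models
        model = load_model(path)
        y_pred = (model.predict(X_ext).ravel() >= 0.5).astype(int)
        mccs.append(matthews_corrcoef(y_ext, y_pred))
    return float(np.mean(mccs)), float(np.std(mccs))          # mean and spread
```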
On the other hand, Figure 7 shows the highest MCCext values obtained by both the RF models and the DNN models. The average MCCext values (without including outliers) for the DNN models are also shown. In a general sense, it can be observed that the RF models achieved comparable-to-superior performances to the DNN architectures. The average and highest performances achieved by the AMPScanner [39] and APIN [45] architectures were always inferior to the ones achieved by the RF models. The APIN-Fusion architecture [45] was the only one that outperformed the two RF models by 1.42 and 2.15% on the general-AMP external dataset (see Figure 7A), whereas it obtained the same performance as the best RF model on the ABP external dataset (see Figure 7B). For the AFP and AVP datasets, APIN-Fusion was always remarkably worse than the RF models. AMPScanner is composed of a convolutional layer and a long short-term memory (LSTM)-based recurrent layer, whereas APIN is a multi-scale convolutional network. APIN-Fusion is APIN but combined with amino acid composition (AAC) and dipeptide composition (DPC) HPDs. That fusion was the only one that obtained a slightly better performance than the RF models, and it was only on a single dataset.

Highest MCCext values achieved by both the DNN models and RF-based shallow models on the general-AMP, ABP, AFP and AVP external datasets. The average MCCext values for the DNN models are also shown.
In addition to the experiments performed with the AMPScanner [39], APIN [45] and APIN-Fusion [45] DNN architectures, we also carried out comparisons regarding DNN models based on LSTM, Bidirectional Encoder Representations from Transformers (BERT) and Attention, which were recently created by Ma et al. [104] to identify general-AMPs from the human gut microbiome. The consensus performance of these DNN models was also considered in the comparisons (i.e. a sequence is classified as AMP if the LSTM-, BERT- and Attention-based models all classify it as AMP; otherwise, it is classified as non-AMP). This benchmarking analysis was only performed on the testing and external datasets for general-AMPs (see Table 1). Figure 8 depicts a bar chart corresponding to the MCCtest and MCCext values achieved by the DNN models (Supplementary Table S14 shows the outputs and Supplementary Table S15 shows the performance metrics), by the best RF model based on the ESM-1b nHPDs alone (see Table 2 and Data S1C), and by the best RF model based on the ESM-1b, iLearn and ProtDCal PDs together (see Table 3 and Data S1F). As can be observed, the RF models were always better than these DNN models.
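A minimal sketch of this consensus rule, assuming the three sets of binary predictions (1 = AMP, 0 = non-AMP) have already been obtained from the models of Ma et al. [104]:

```python
# Consensus rule: a sequence is labeled AMP only if all three models agree.
import numpy as np

def consensus(pred_lstm: np.ndarray, pred_bert: np.ndarray, pred_attn: np.ndarray) -> np.ndarray:
    # Inputs are 0/1 integer arrays of equal length (one entry per sequence).
    return (pred_lstm & pred_bert & pred_attn).astype(int)
```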
In general, these outcomes corroborate that computationally more complex classifiers do not necessarily lead to better performances, as recently warned in [75]. These findings also suggest that nHPDs derived from the ESM-1b Transformer model [83] have better modeling abilities and richer chemical information than nHPDs obtained through DNN architectures trained on small, labeled AMP datasets. Thus, building DNN architectures on small datasets only to extract features that are then fed to shallow classifiers to identify AMPs may not be the most suitable approach, even though it has recently become a common practice [105, 106]. Before concluding, it is important to highlight that a comparison between the RF models using ESM-1b nHPDs and several state-of-the-art models was not performed because the baseline models based on ProtDCal HPDs were already better than several models freely available in thirteen programs (see Table 5 in [51]). Consequently, all the models built in this work whose performances were better than those of the baseline models (see Tables 2 and 3) are also better than the state-of-the-art models analyzed in [51].
Conclusions
We have studied the use of non-handcrafted features (nHPDs) to develop shallow models to identify potential AMPs. We used the ESM-1b Transformer model [83] to calculate these nHPDs. We also considered handcrafted features (HPDs) to compare the performance of shallow models when built with both types of features. As a result, nHPDs lead to shallow models with notably higher performances than shallow models built with HPDs only. However, the experimental results comparing HPDs and nHPDs show that both types of descriptors extract complementary and different information from the input AMP sequences. Thus, the combination of nHPDs with HPDs leads to shallow models with better performances than using nHPDs only, although it implies an increase in the size of the models in most cases. Consequently, we recommend always studying whether the increase in model complexity is justified, by performing relevance and importance analyses such as the ones carried out in this work. Finally, we can also conclude that building DNN architectures on small, labeled datasets only to extract features that are then fed to shallow classifiers may not be the most suitable approach, since features learned via self-supervision seem to present better modeling abilities. Nonetheless, it is important to highlight that self-supervision is a suitable technique when large datasets are used; otherwise, the learned features (nHPDs) can be less valuable than HPDs (see Tables 3A and 3E in [75] for results obtained from a model based on self-supervision but trained with 556 603 protein sequences).
The modeling ability of learned features to classify AMPs is studied
The combination of learned and calculated features to classify AMPs is studied
Learned features lead to better models than calculated features alone
Better shallow models are built by combining both types of features
Shallow models built with both types of features perform better than deep models
Data and software availability
Data S1 contains the handcrafted and non-handcrafted descriptor sets, trained models and Weka software outputs corresponding to the best RF model built on each family (or ensemble) of features considered. Data S2 contains the trained models (.h5 files) of the APIN and APIN-Fusion architectures. Because these models were created for benchmarking purposes only, no software was built.
Acknowledgement
C.R.G.J. acknowledges the program ‘Cátedras CONACYT’ from ‘Consejo Nacional de Ciencia y Tecnología (CONACYT), México’ for the support of the endowed chair 501/2018 at ‘Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE)’. This work was also supported by CONACYT under grant A1-S-20638 to C.A.B.
Author Biographies
César R. García-Jacas is a Conacyt Researcher at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico. He received his Ph.D. in Technical Science (2015) and M.Sc. in Computer Science (2013) from the Universidad Central “Marta Abreu” de las Villas, Cuba, as well as his B.Eng. in Informatics Science (2009) from the Universidad de las Ciencias Informáticas, La Habana, Cuba.
Luis A. García-González is a Ph.D. student in Computer Science, Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Mexico. He earned his B.Sc. in Computer Science (2021) from the Universidad de Oriente, Cuba.
Felix Martinez-Rios is a Full Professor at the School of Engineering, Universidad Panamericana (UP). He does research in artificial intelligence, quantum computing and theory of computation.
Issac P. Tapia-Contreras is an M.Sc. student in Computer Science (2022) at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico.
Carlos A. Brizuela is an Associate Professor at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico. He develops machine learning and optimization algorithms for the discovery and design of bioactive peptides.
Description of the Organization: The Center for Scientific Research and Higher Education at Ensenada (in Spanish: Centro de Investigación Científica y de Educación Superior de Ensenada, CICESE) is a public research center sponsored by the National Council of Science and Technology of Mexico (CONACYT) in the city of Ensenada, Baja California, and specialized in Earth Sciences, Oceanography, Applied Physics, Experimental and Applied Biology and Computer Sciences (https://www.cicese.edu.mx/).