César R García-Jacas, Luis A García-González, Felix Martinez-Rios, Issac P Tapia-Contreras, Carlos A Brizuela, Handcrafted versus non-handcrafted (self-supervised) features for the classification of antimicrobial peptides: complementary or redundant?, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac428, https://doi.org/10.1093/bib/bbac428
Abstract
Antimicrobial peptides (AMPs) have received a great deal of attention given their potential to become a plausible option to fight multi-drug resistant bacteria as well as other pathogens. Quantitative sequence-activity models (QSAMs) have been helpful to discover new AMPs because they allow the exploration of a large universe of peptide sequences and help reduce the number of wet-lab experiments. A key aspect in building QSAMs based on shallow learning is determining an optimal set of protein descriptors (features) required to discriminate between sequences with different antimicrobial activities. These features are generally handcrafted from peptide sequence datasets labeled with specific antimicrobial activities. However, recent developments have shown that unsupervised approaches can be used to determine features that outperform human-engineered (handcrafted) features. Thus, knowing which of these two approaches contributes to a better classification of AMPs is a fundamental question for designing more accurate models. Here, we present a systematic and rigorous study to compare both types of features. Experimental outcomes show that non-handcrafted features lead to better performance than handcrafted features. However, the experiments also prove that a further improvement in performance is achieved when both types of features are merged. A relevance analysis reveals that non-handcrafted features have higher information content than handcrafted features, while an interaction-based importance analysis reveals that handcrafted features are more important. These findings suggest that the two types of features are complementary. Comparisons against state-of-the-art deep models show that shallow models yield better performance both when fed with non-handcrafted features alone and when fed with non-handcrafted and handcrafted features together.
Introduction
Antimicrobial resistance (AMR) [1, 2] has emerged as a major threat to human health [3–6], mainly due to the misuse and overuse of conventional drugs [7]. AMR arises when microorganisms (i.e. bacteria, viruses, fungi and parasites) develop the ability to defeat the drugs that once affected them. For instance, according to reports submitted to the Global Antimicrobial Resistance and Use Surveillance System [8], the rate of resistance to ciprofloxacin (an antibiotic often used to treat common infections such as those of the urinary tract) ranges between 8.4 and 92.9% for Escherichia coli and between 4.1 and 79.4% for Klebsiella pneumoniae. This growing AMR makes infections harder to treat, increasing the risk of disease spread, severe illness and death [3–6]. Indeed, according to the United States Centers for Disease Control and Prevention, more than 35 000 deaths are caused by antibiotic-resistant diseases each year [9].
One of the current efforts to tackle AMR is aimed at designing novel antibiotics and therapies based on antimicrobial peptides (AMPs). AMPs [10] are a vital part of the innate immune system, and the interest in them as potential drug candidates stems from their ability to kill a wide spectrum of bacteria [11, 12], parasites [13–15], fungi [16, 17] and viruses [18, 19], as well as to treat non-communicable diseases such as diabetes [20] and cancer [21, 22]. However, translating promising AMPs into clinical therapeutics is challenging because of their inherent bio- and physicochemical properties [23, 24]. That is, AMPs are water-soluble and thus generally show limited ability to diffuse across bio-membranes. They are also biologically unstable, as they are easily hydrolyzed by proteolytic enzymes, which gives them short plasma half-lives [23, 24]. Thus, comprehensive studies of bioactivity, pharmacodynamic, pharmacokinetic and toxicological profiles need to be integrated into the different phases of AMP-based drug discovery (PDD) [25–27]. We use the term AMP (or general-AMP) to refer to peptide sequences with some antimicrobial activity (e.g. antibacterial, antifungal, antiparasitic, antiviral, among others [10]). Otherwise, a specific notation according to the activity of the peptide will be used.
In PDD workflows, quantitative sequence-activity models (QSAMs) play a vital role [28–52] because they can be applied to screen large numbers of peptide sequences to identify likely AMPs. QSAMs are built with machine learning methods to achieve good mathematical correlations between the sequential and compositional features of known AMP sequence datasets and their activity. Thus, QSAMs can be analyzed in two groups according to the learning method used: (i) those based on shallow learning and (ii) those based on deep learning. Hereafter, we use the term shallow learning to refer to all non-deep learning approaches; that is, shallow learning encompasses both traditional learning (e.g. Support Vector Machine, k-Nearest Neighbors, Decision Trees, among others) and ensemble learning (e.g. AdaBoost, Random Forest (RF), among others).
A crucial step in building QSAMs based on shallow learning (henceforth also termed shallow models) is the calculation and selection of the ‘best’ set of protein descriptors (PDs). To date, tens of thousands of PDs [53–59] can be calculated with different software [60–65]. For instance, more than 250 000 different PDs can be obtained from the ProtDCal [62] and starPep [64] software. These PDs, also known as handcrafted PDs (HPDs), are derived from human-engineered algorithms that extract chemical information contained in the protein sequences themselves. The large number of extant HPDs exists because no single HPD family is able to codify all the chemical information for different datasets, since their relevance depends on the protein sequences to be studied [66]. Consequently, thousands of HPDs must first be calculated so that the most important ones can subsequently be selected [67, 68] to model the endpoint of interest.
Although there is a wide spectrum of HPDs and feature selectors [67, 68], we posit that the sole use of HPDs limits the building of shallow models with better performance because HPDs are derived only from each protein sequence itself and thus do not contain relevant information related to the variation (long-term dependencies) of protein sequences selected across evolution [69–72]. To tackle this issue, position-specific score matrix (PSSM)-derived features can be used [73]. However, this leads to alignment-dependent models, which are not the most suitable to process data coming from sequencing technologies because they are memory- and time-consuming [74]. Given this scenario, deep learning methods have emerged as an alternative to create alignment-free models to identify AMPs [39–41, 45–49, 52], mainly because they are able to autonomously learn features that encode the semantics of the input protein sequences, obviating the feature engineering process required by shallow models. However, we recently demonstrated [75] that deep learning does not yield better performance than shallow learning in the context of the supervised classification of AMPs. In addition to bad modeling practices, one of the main reasons for this outcome is the limited number of AMPs (<14 k) used to train the deep models (see Table 1 in [75]). That is, because the universe of currently known AMP sequences is small [76, 77], the size of the training datasets is not large enough [78–80] to autonomously obtain learned (non-handcrafted) features with better modeling abilities than the handcrafted features that can be calculated on the same datasets.
However, there are currently several public databases [81, 82] containing millions of protein sequences spanning evolutionary diversity [69–72], and self-supervised methods are the most suitable ones to learn from that diversity. Self-supervision is a learning paradigm that, unlike supervised learning, does not require manual annotation of the input data and can therefore leverage much larger amounts of data [83]. This approach automatically learns useful hierarchical representations from unlabeled datasets, which can then be fine-tuned on small datasets for supervised downstream tasks, such as the identification of AMPs via shallow learning. Rives et al. [83] built a large Transformer-based model (termed ESM-1b Transformer) that was trained on 86 billion amino acids across 250 million sequences obtained from UniParc [84]. They proved that the learned representations were useful in the supervised prediction of mutational effects and secondary structure, and that they improved on state-of-the-art features (e.g. those derived from PSSMs [73]) for long-range residue–residue contact prediction. Moreover, Zhang et al. [50] also introduced a model based on self-supervision that was trained with 556 603 sequences obtained from UniProt [81]. However, their results were inferior to the ones obtained by shallow models fed with ProtDCal HPDs (see Table 3A and E in [75]). The latter confirms that (i) shallow classifiers are more suitable to identify AMPs and (ii) large amounts of data are required to autonomously determine good learned (non-handcrafted) representations.
All in all, this work aims to examine the use of non-handcrafted protein features (nHPDs) obtained via self-supervision in the supervised classification of AMPs using shallow learning, as well as to assess the modeling abilities of nHPDs when combined with handcrafted protein features (HPDs). In this way, we analyze whether both types of features are complementary for building good shallow models. To this end, we built RF models fed with nHPDs derived from the ESM-1b Transformer model [83], as well as RF models fed with state-of-the-art HPDs. We also built RF models by joining the nHPDs and HPDs accounted for. The models were built and evaluated across five benchmarking datasets. We also analyzed both types of features via chemometric and model-agnostic explainability methods. Finally, we performed comparisons with respect to state-of-the-art deep neural network (DNN) architectures. The RF classifier was the only shallow learner used in this work to guarantee fair analyses and comparability of results with the baseline models considered [51]. All the shallow and deep models were built for benchmarking purposes only and not to introduce new models.
Material and methods
AMP datasets
We used five benchmarking datasets (see Table 1 for a description) recently proposed in [51] to develop RF models to identify general-AMPs as well as antibacterial (ABP), antifungal (AFP), antiparasitic (APP) and antiviral (AVP) peptide sequences. A detailed explanation of how these datasets were created can be found in Section 2.1 in [51]. Each of these benchmarking datasets comprises training, test and external datasets. Unlike the external datasets, the training and testing datasets were created from a rational division based on clustering (see Section 2.1 in [51]) and thus they present similar distributions. Consequently, the external datasets used in this work are more suitable than the testing datasets to assess the generalization ability of the models to be built. Both the datasets and the RF models proposed in [51] are considered as the baseline in this work to guarantee fair analyses and comparisons.
Table 1

| Endpoint | Aim | Total | Positive | Negative |
| --- | --- | --- | --- | --- |
| General-AMP | Training | 19 548 | 9781 | 9767 |
|  | Test | 5125 | 2564 | 2561 |
|  | External | 15 685 | 4914 | 10 771 |
| ABP | Training | 13 166 | 6583 | 6583 |
|  | Test | 3390 | 1695 | 1695 |
|  | External | 15 528 | 4757 | 10 771 |
| AFP | Training | 1556 | 778 | 778 |
|  | Test | 430 | 215 | 215 |
|  | External | 15 528 | 4757 | 10 771 |
| APP | Training | 198 | 99 | 99 |
|  | Test | 62 | 31 | 31 |
|  | External | 11 182 | 411 | 10 771 |
| AVP | Training | 4642 | 2321 | 2321 |
|  | Test | 1246 | 623 | 623 |
|  | External | 12 001 | 1230 | 10 771 |
Extraction of handcrafted protein descriptors (HPDs)
We used HPDs calculated with the ProtDCal [62] and iLearn [85] software. On the one hand, it is important to mention that the calculation and selection of the ProtDCal HPDs was performed in [51] to build the baseline RF models. This process was performed on each training dataset from a pool initially comprised of 96 026 ProtDCal HPDs (see Section 2.2 in [51]). The importance of the ProtDCal HPDs finally included in the baseline RF models was widely discussed in Section 3.4 in [51]. On the other hand, a total of 3774 HPDs were additionally calculated in this work with the iLearn tool [85]. The iLearn HPDs calculated on each training dataset were grouped according to their definitions as follows: Group 1 (G1) comprises 2420 HPDs, of which 1600 are Composition of K-Spaced Amino Acid Pairs (CKSAAP), 20 are Amino Acid Composition (AAC), and 400 each are Dipeptide Composition (DPC) and Dipeptide Deviation from Expected Mean (DDE); Group 2 (G2) comprises 255 HPDs, of which 100 are Composition of K-Spaced Amino Acid Group Pairs (CKSAAGP), 5 are Grouped Amino Acid Composition (GAAC), 25 are Grouped Dipeptide Composition (GDPC) and 125 are Grouped Tripeptide Composition (GTPC); Group 3 (G3) comprises 32 Normalized Moreau-Broto Autocorrelation (NMBroto) HPDs; Group 4 (G4) comprises 273 Composition/Transition/Distribution (CTD) HPDs; Group 5 (G5) comprises 686 HPDs, of which 343 each are Conjoint Triad (CTriad) and K-Spaced Conjoint Triad (KSCTriad); Group 6 (G6) contains 56 HPDs, of which 8 are Sequence-Order-Coupling Number (SOCNumber) and 48 are Quasi-Sequence-Order (QSOrder); and Group 7 (G7) comprises 52 HPDs, of which 24 are Pseudo-Amino Acid Composition (PAAC) and 28 are Amphiphilic Pseudo-Amino Acid Composition (APAAC). The lambda parameter used to compute the PAAC and APAAC HPDs was set to 4.
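To make the nature of these handcrafted families concrete, the sketch below computes two of the simplest ones, AAC (20 values) and DPC (400 values), for an arbitrary peptide string. It is an illustrative Python re-implementation of the descriptor definitions, not the iLearn code itself, and the example sequence is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Amino Acid Composition: relative frequency of each of the 20 standard residues."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq):
    """Dipeptide Composition: relative frequency of each of the 400 residue pairs."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = {p: 0 for p in pairs}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in pairs]

# Hypothetical peptide used only to illustrate the descriptor dimensionality
peptide = "GLFDIVKKVVGALG"
features = aac(peptide) + dpc(peptide)
print(len(features))  # 420 handcrafted descriptor values (20 AAC + 400 DPC)
```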
Extraction of non-handcrafted protein descriptors (nHPDs)
A total of 1280 nHPDs were calculated on the datasets described in Table 1 using the ESM-1b Transformer model proposed by Rives et al. [83]. This 33-layer (∼650 M parameters) model was trained on 86 billion amino acids across 250 million protein sequences downloaded from the UniParc database [84]. Using the script provided by the authors (https://github.com/facebookresearch/esm), we extracted, for each peptide sequence, the average embedding (dim: 1 × 1280; script parameter: --include mean) of the per-amino-acid embeddings of the final layer (dim: sequence_length × 1280; script parameter: --repr_layers 33). That average embedding is the vector of nHPDs obtained for each sequence. The average embedding was used because it yielded the best results in the studies of secondary structure prediction (see Table 6 in [83]) and long-range residue–residue contact prediction (see Table 7 in [83]) performed by Rives et al. [83].
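The same mean-layer-33 embedding can also be obtained programmatically. The sketch below is a minimal example based on the public fair-esm Python API rather than the authors' extract.py call; the peptide sequence is a hypothetical placeholder, and the averaging over residue positions (excluding the special start/end tokens) mirrors the --include mean behavior described above.

```python
import torch
import esm  # pip install fair-esm

# Load the pretrained 33-layer (~650M parameter) ESM-1b model and its alphabet
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Hypothetical peptide sequence used only for illustration
data = [("peptide_1", "GLFDIVKKVVGALGSL")]
_, seqs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)

# Per-residue embeddings of the final layer: shape (batch, seq_len + 2, 1280)
per_residue = out["representations"][33]

# Average over the residue positions (positions 1..L; 0 and L+1 are special tokens)
seq_len = len(seqs[0])
mean_embedding = per_residue[0, 1:seq_len + 1].mean(dim=0)
print(mean_embedding.shape)  # torch.Size([1280]) -> the nHPD vector for this peptide
```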
Feature selection and modeling methodology
On each descriptor matrix (seven holding iLearn HPDs and one holding ESM-1b nHPDs) obtained for each training dataset corresponding to each endpoint (i.e. general-AMPs, ABPs, AFPs, APPs and AVPs; see Table 1), a Shannon Entropy (SE)-based unsupervised filter [86] was applied to discard those PDs (hereafter, this notation is used to refer to protein descriptors in general) with an information content (SE value) lower than 10% of the maximum entropy that a PD can achieve on each training dataset. Relevant PDs for modeling studies should present high SE values [86] as a measure of their ability to discriminate among proteins (peptides) with different sequences. The number of discretization bins was set equal to the number of peptides in each training dataset. Thus, the maximum entropy that a PD can achieve in the general-AMP, antibacterial, antifungal, antiparasitic and antiviral training datasets is 14.25, 13.68, 10.6, 7.63 and 12.18 bits, respectively. The retained PDs were sorted in ascending order according to their SE values, and the Spearman correlation coefficient between each pair of PDs was then calculated to remove redundant ones: if a pair of PDs was correlated above 0.95, the PD with the lower SE value was filtered out.
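A minimal sketch of this unsupervised filtering step is given below. It assumes a descriptor matrix X with one row per peptide, uses the thresholds stated in the text (10% of the maximum entropy and a Spearman cutoff of 0.95), and sets the number of bins to the number of peptides, so that the maximum reachable entropy is log2(N) bits (e.g. log2(19 548) ≈ 14.25 bits for the general-AMP training set). The exact tie-handling of the IMMAN implementation is not reproduced.

```python
import numpy as np
from scipy.stats import spearmanr

def shannon_entropy_bits(x, n_bins):
    """Shannon entropy (bits) of one descriptor discretized into n_bins bins."""
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def unsupervised_filter(X, min_fraction=0.10, rho_cutoff=0.95):
    """Indices of descriptors kept after the SE filter and Spearman redundancy removal.

    X has shape (n_peptides, n_descriptors); with n_bins = n_peptides the maximum
    reachable entropy is log2(n_peptides) bits.
    """
    n_peptides, n_desc = X.shape
    max_entropy = np.log2(n_peptides)
    se = np.array([shannon_entropy_bits(X[:, j], n_peptides) for j in range(n_desc)])
    candidates = np.where(se >= min_fraction * max_entropy)[0]

    # For each pair correlated above the cutoff, keep the higher-entropy member
    kept = []
    for j in candidates[np.argsort(-se[candidates])]:  # visit highest-entropy PDs first
        rho = [abs(spearmanr(X[:, j], X[:, k]).correlation) for k in kept]
        if all(r <= rho_cutoff for r in rho):
            kept.append(j)
    return kept
```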
Afterward, a supervised selection using the Correlation Subset method [87] was applied on each reduced descriptor matrix to determine a subset of PDs highly correlated with the activity and lowly correlated among themselves. Five ranking-type feature selection techniques were also applied on each reduced descriptor matrix, with the number of top-ranked PDs retained by these methods equal to the size of the subset obtained with the Correlation Subset method. One of the ranking-type procedures is the SE-based unsupervised method explained above, whereas the other ranking-type methods are Relief-F [88], Information Gain [89], Symmetrical Uncertainty [89] and Chi Squared [89], which are supervised feature selection approaches. With the Relief-F method, the PDs that best discriminate between peptide sequences that are near each other are retained, whereas with the Symmetrical Uncertainty and Information Gain methods, the PDs with the highest information content with respect to the activity are selected. Finally, the PDs most dependent on the activity are retained with the Chi Squared method. The SE-based selector was applied using the IMMAN tool [90], whereas the others were used with their default settings in the WEKA tool v3.8 [89].
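As an illustration only, the following scikit-learn sketch mimics two of these supervised rankings (an Information Gain analogue via mutual information, and Chi Squared); the study itself used the WEKA v3.8 implementations, and Relief-F, Symmetrical Uncertainty and the Correlation Subset method are not reproduced here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

def top_k_rankings(X, y, k):
    """Indices of the top-k descriptors under two supervised ranking criteria."""
    rankings = {}
    # Information Gain analogue: mutual information between each descriptor and the activity
    mi = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    rankings["information_gain"] = np.argsort(-mi.scores_)[:k]
    # Chi Squared requires non-negative inputs, hence the min-max scaling
    X_pos = MinMaxScaler().fit_transform(X)
    ch = SelectKBest(chi2, k=k).fit(X_pos, y)
    rankings["chi_squared"] = np.argsort(-ch.scores_)[:k]
    return rankings
```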
Additionally, because no single feature selector can guarantee an optimal feature subset, an additional (ensemble) descriptor subset was created for the iLearn HPDs and for the ESM-1b nHPDs by joining the descriptor subsets determined with the six selection methods explained above. The importance of merging different feature selectors has been explained elsewhere [67, 68]. Thus far, a total of 43 iLearn HPD subsets and 7 ESM-1b nHPD subsets were created per training dataset. Furthermore, another 43 subsets containing HPDs calculated with the ProtDCal software [62] were also created per training dataset. These subsets were built by combining each of the iLearn HPD subsets with the corresponding ProtDCal HPD (baseline) subset proposed in [51]. All these subsets containing a diversity of HPDs were created to build RF models as robust as possible in order to ensure rigorous comparisons with the RF models based on the ESM-1b nHPDs.
On all the built descriptor subsets (93 per training dataset), a wrapper-type selector [91] was applied using the Genetic Algorithm (GA) metaheuristic as the search method, accuracy as the fitness measure and RF [92] as the learning procedure. In this way, a total of 93 subsets comprised of the ‘most suitable’ features for the RF learner were additionally obtained. Therefore, a total of 186 descriptor subsets were finally taken into account to model each endpoint. Then, an RF-based model was created on each descriptor subset. RF was used to guarantee comparability of results with the baseline RF models introduced in [51], which are based on HPDs calculated with the ProtDCal software [62]. All the built RF models were assessed on their corresponding testing and external datasets. The wrapper-type selector, RF classifier and GA metaheuristic were applied with their default settings in the WEKA tool v3.8. Schema 1 shows the workflow explained above.
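The evaluation of each descriptor subset can be summarized as follows. This is a simplified scikit-learn sketch of the final step (10-fold cross-validated MCCtrain plus MCCtest on a held-out set); the actual models were built with the WEKA defaults, and the GA-based wrapper search is omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict

def evaluate_subset(X_train, y_train, X_test, y_test, feature_idx, seed=1):
    """Fit an RF on one descriptor subset and report MCC_train (10-fold CV) and MCC_test."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # 10-fold cross-validated predictions on the training set
    cv_pred = cross_val_predict(rf, X_train[:, feature_idx], y_train, cv=10)
    mcc_train = matthews_corrcoef(y_train, cv_pred)
    # Refit on the whole training set and score the held-out test set
    rf.fit(X_train[:, feature_idx], y_train)
    mcc_test = matthews_corrcoef(y_test, rf.predict(X_test[:, feature_idx]))
    return mcc_train, mcc_test
```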

Workflow applied on each training dataset to build the RF models.
Results and discussion
Do nHPDs have better modeling abilities than HPDs?
This section compares the modeling ability of HPDs and nHPDs. Figure 1 shows the boxplots of the training Matthews correlation coefficients (MCCtrain) obtained via 10-fold cross-validation by the best RF models built on each family (i.e. iLearn G1, …, iLearn G7 and ESM-1b) and ensemble (i.e. iLearn, and iLearn & ProtDCal) of PDs accounted for. The best RF model selected for each family (and ensemble) was the one with the highest MCCtrain value (see Data S1 for the PD sets, trained models and WEKA software outputs). The MCCtrain values obtained by the baseline models (last boxplot in Figure 1) are also shown. As can be noted in Figure 1, models with suitable variance

Training Matthews correlation coefficients obtained by the models built from each HPD family calculated with the iLearn software, from each ensemble of HPDs (i.e. iLearn, and iLearn & ProtDCal), and from the set of nHPDs obtained with the ESM-1b Transformer model. The last boxplot corresponds to the baseline RF models, which were created with ProtDCal HPDs.
Moreover, Table 2 shows the performance [i.e. sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC)] on the testing and external datasets (see Table 1) of the best RF models built. The performance of the RF models based on the ProtDCal HPDs (baseline) [51] is shown as well. Supplementary Table S2 shows the performance metrics for the worst models. On the one hand, it can first be noted that the best built models achieved good performance according to the MCC values on the testing datasets (MCCtest), which are generally comparable (or superior) to the MCCtest values obtained by the baseline RF models (i.e. those based only on ProtDCal HPDs). This suggests that the best built models are not overfitted, according to their MCCtrain (see Supplementary Table S1) and MCCtest values, and consequently they are suitable for further analyses. On the other hand, it can be observed that the best models based on HPDs mostly presented better performance than the RF models based on nHPDs (i.e. those obtained with the ESM-1b model) according to the MCCtest values. Specifically, the highest MCCtest values achieved by the models based on HPDs were better than the ones based on nHPDs by 3.54, 2.46, 0.11, 9.15 and 5.16% on the general-AMP, ABP, AFP, APP and AVP testing datasets, respectively. However, these better performances were mostly achieved using a greater number of HPDs than nHPDs. In general, these results on the testing datasets could suggest that our premise is uncertain, that is, that the sole use of HPDs does not limit the building of better models even though they do not contain the evolutionary information present in protein sequences [69–72].
Table 2. Performance metrics on the test and external datasets (see Table 1) obtained by the best RF models built with different types of handcrafted descriptors as well as with ESM-1b non-handcrafted descriptors. The best results are highlighted in bold and italic. The column ‘Size’ refers to the number of descriptors included in the models
| Endpoint | Type of descriptors | Size | SNtest | SPtest | ACCtest | MCCtest | SNext | SPext | ACCext | MCCext |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General-AMP | ProtDCal (see Table 2 in [51]) | 207 | 0.862 | 0.945 | 0.904 | 0.810 | 0.911 | 0.909 | 0.910 | 0.799 |
|  | iLearn G1 | 37 | 0.856 | 0.938 | 0.897 | 0.796 | 0.875 | 0.899 | 0.892 | 0.756 |
|  | iLearn G6 | 39 | 0.851 | 0.942 | 0.897 | 0.797 | 0.904 | 0.908 | 0.906 | 0.791 |
|  | iLearn G7 | 32 | 0.850 | 0.941 | 0.895 | 0.794 | 0.893 | 0.912 | 0.906 | 0.788 |
|  | iLearn Ensemble | 87 | 0.849 | 0.945 | 0.897 | 0.797 | 0.896 | 0.917 | 0.911 | 0.797 |
|  | iLearn ProtDCal Ensemble | 144 | 0.866 | 0.952 | 0.909 | 0.820 | 0.910 | 0.910 | 0.910 | 0.799 |
|  | ESM-1b | 85 | 0.840 | 0.947 | 0.894 | 0.792 | 0.904 | 0.945 | 0.932 | 0.843 |
| ABP | ProtDCal (see Table 2 in [51]) | 93 | 0.912 | 0.957 | 0.935 | 0.870 | 0.897 | 0.923 | 0.915 | 0.805 |
|  | iLearn G1 | 44 | 0.899 | 0.950 | 0.925 | 0.850 | 0.873 | 0.909 | 0.898 | 0.766 |
|  | iLearn G6 | 40 | 0.896 | 0.961 | 0.928 | 0.858 | 0.886 | 0.926 | 0.914 | 0.801 |
|  | iLearn G7 | 32 | 0.903 | 0.949 | 0.926 | 0.853 | 0.880 | 0.932 | 0.916 | 0.805 |
|  | iLearn Ensemble | 111 | 0.901 | 0.966 | 0.934 | 0.870 | 0.890 | 0.934 | 0.921 | 0.816 |
|  | iLearn ProtDCal Ensemble | 86 | 0.910 | 0.961 | 0.936 | 0.873 | 0.896 | 0.934 | 0.922 | 0.820 |
|  | ESM-1b | 47 | 0.894 | 0.956 | 0.925 | 0.852 | 0.906 | 0.953 | 0.939 | 0.856 |
| AFP | ProtDCal (see Table 2 in [51]) | 144 | 0.921 | 0.963 | 0.942 | 0.884 | 0.747 | 0.908 | 0.859 | 0.664 |
|  | iLearn G1 | 53 | 0.902 | 0.977 | 0.940 | 0.882 | 0.686 | 0.912 | 0.842 | 0.619 |
|  | iLearn G6 | 49 | 0.907 | 0.963 | 0.935 | 0.871 | 0.709 | 0.922 | 0.857 | 0.654 |
|  | iLearn G7 | 32 | 0.907 | 0.963 | 0.935 | 0.871 | 0.640 | 0.932 | 0.843 | 0.615 |
|  | iLearn Ensemble | 111 | 0.907 | 0.963 | 0.935 | 0.871 | 0.702 | 0.910 | 0.846 | 0.630 |
|  | iLearn ProtDCal Ensemble | 35 | 0.930 | 0.963 | 0.947 | 0.893 | 0.739 | 0.914 | 0.860 | 0.666 |
|  | ESM-1b | 58 | 0.902 | 0.986 | 0.944 | 0.892 | 0.831 | 0.967 | 0.925 | 0.821 |
| APP | ProtDCal (see Table 2 in [51]) | 40 | 0.871 | 0.903 | 0.887 | 0.775 | 0.825 | 0.867 | 0.865 | 0.357 |
|  | iLearn G1 | 19 | 0.871 | 0.774 | 0.823 | 0.648 | 0.723 | 0.807 | 0.804 | 0.244 |
|  | iLearn G6 | 18 | 0.742 | 0.903 | 0.823 | 0.654 | 0.781 | 0.903 | 0.898 | 0.392 |
|  | iLearn G7 | 31 | 0.774 | 0.839 | 0.807 | 0.614 | 0.825 | 0.887 | 0.885 | 0.387 |
|  | iLearn Ensemble* | - | - | - | - | - | - | - | - | - |
|  | iLearn ProtDCal Ensemble | 28 | 0.839 | 0.903 | 0.871 | 0.743 | 0.830 | 0.872 | 0.870 | 0.366 |
|  | ESM-1b | 28 | 0.871 | 0.839 | 0.855 | 0.710 | 0.888 | 0.878 | 0.878 | 0.404 |
| AVP | ProtDCal (see Table 2 in [51]) | 96 | 0.764 | 0.892 | 0.828 | 0.662 | 0.742 | 0.873 | 0.860 | 0.476 |
|  | iLearn G1 | 81 | 0.807 | 0.900 | 0.854 | 0.711 | 0.584 | 0.878 | 0.848 | 0.374 |
|  | iLearn G6 | 27 | 0.761 | 0.889 | 0.825 | 0.656 | 0.723 | 0.866 | 0.852 | 0.452 |
|  | iLearn G7 | 33 | 0.769 | 0.904 | 0.836 | 0.679 | 0.706 | 0.843 | 0.829 | 0.406 |
|  | iLearn Ensemble | 174 | 0.793 | 0.915 | 0.854 | 0.713 | 0.680 | 0.876 | 0.856 | 0.438 |
|  | iLearn ProtDCal Ensemble | 97 | 0.785 | 0.912 | 0.848 | 0.702 | 0.750 | 0.872 | 0.859 | 0.478 |
|  | ESM-1b | 70 | 0.783 | 0.891 | 0.837 | 0.678 | 0.921 | 0.868 | 0.873 | 0.585 |
*The RF models built on this subset did not satisfy the rule of thumb that each variable in a model should explain the variance of at least five cases.
However, the results obtained on the external datasets reveal that the RF models based on nHPDs were always better than all the RF models based on HPDs in terms of the MCCext values (see Table 2). Specifically, the RF models based on nHPDs achieved MCCext values of 0.843, 0.856, 0.821, 0.404 and 0.585 in the classification of general-AMPs, ABPs, AFPs, APPs and AVPs, respectively, whereas the best RF models based on HPDs achieved MCCext values of 0.799, 0.82, 0.666, 0.392 and 0.478. Thus, the RF models based on nHPDs were better than the RF models based on HPDs by 5.51, 4.39, 23.27, 3.06 and 22.38%, respectively. The highest improvements were obtained in the classification of AFPs and AVPs, due mainly to a better classification of true-AFPs and true-AVPs (SNext metric). In fact, in these two endpoints the highest SNext values obtained by the models based on HPDs were 0.747 and 0.75, whereas the ones achieved by the models based on nHPDs were 0.831 (11.24% better) and 0.921 (22.8% better), respectively.
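As a check, the relative improvements quoted above follow directly from the MCCext values in Table 2; for the general-AMP endpoint, for example,

$$\frac{0.843 - 0.799}{0.799} \times 100 \approx 5.51\%,$$

and the remaining percentages are obtained in the same way from the corresponding MCCext (or SNext) pairs.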
These results are in line with the No Free Lunch theorem [94], which states that ‘over the space of all possible problems, every machine learning method will perform as well as every other one on average’. In this case, there is no single model fed with nHPDs that will always perform better than any other model fed with HPDs, or vice versa. Nevertheless, these results suggest that the nHPDs obtained from the ESM-1b Transformer model [83] present better modeling abilities than the HPDs calculated with the ProtDCal [62] and iLearn [85] software, because the models built with them consistently yielded improvements across the external datasets. We make this assertion because the external datasets present a distribution different from that of the training datasets, unlike the testing datasets, which were built from a rational division based on clustering (see Section 2.1 in [51]). Consequently, the external datasets are more suitable to evaluate the robustness and stability of the created models. Moreover, a statistical study based on Bayesian estimation [95] (see Supplementary Table S3 for input values) confirmed our previous assertion, as shown in Supplementary Figure S1. From this figure, it can be concluded that models based on nHPDs have higher probability
Does the combination of nHPDs and HPDs improve the identification of AMPs?
To fulfill the aim of this section, several RF models were created on subsets containing both nHPDs and HPDs. These subsets were created by combining the nHPD subsets with the HPD subsets obtained for each training dataset in the modeling process described in Schema 1. Then, a total of 15 reduced subsets were additionally built for each new subset comprised of nHPDs and HPDs as follows: (i) by applying the six feature selectors used in this work; (ii) by joining the output of those six feature selectors; and (iii) by applying the wrapper-type selector used in this work on the seven previous subsets (steps (i) and (ii)) as well as on the non-reduced nHPD and HPD subset. Finally, an RF model was built on each subset comprised of nHPDs and HPDs. These models were grouped according to the types of PDs, i.e. those based on ESM-1b & iLearn PDs, those based on ESM-1b & ProtDCal PDs, and those based on ESM-1b & iLearn & ProtDCal PDs. The best RF model (i.e. the one with the highest MCCtrain) in each of these groups was used for the further analyses (see Supplementary Table S4 for the training metrics and Table 3 for the performance metrics on the testing/external datasets).
Table 3. Performance metrics on the test and external datasets obtained by the best RF models when combining HPDs and nHPDs. The column ‘Size’ refers to the number of descriptors included in the models
| Endpoint | Type of descriptors | Size | SNtest | SPtest | ACCtest | MCCtest | SNext | SPext | ACCext | MCCext |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General-AMP | ESM-1b and iLearn | 85 | 0.857 | 0.954 | 0.906 | 0.815 | 0.906 | 0.936 | 0.927 | 0.833 |
|  | ESM-1b and ProtDCal | 134 | 0.859 | 0.962 | 0.911 | 0.826 | 0.916 | 0.929 | 0.925 | 0.830 |
|  | ESM-1b and iLearn and ProtDCal | 104 | 0.866 | 0.960 | 0.913 | 0.830 | 0.915 | 0.934 | 0.928 | 0.837 |
| ABP | ESM-1b and iLearn | 96 | 0.898 | 0.965 | 0.932 | 0.865 | 0.903 | 0.959 | 0.942 | 0.863 |
|  | ESM-1b and ProtDCal | 71 | 0.877 | 0.973 | 0.925 | 0.855 | 0.896 | 0.948 | 0.932 | 0.841 |
|  | ESM-1b and iLearn and ProtDCal | 49 | 0.898 | 0.973 | 0.935 | 0.873 | 0.907 | 0.958 | 0.942 | 0.864 |
| AFP | ESM-1b and iLearn | 73 | 0.912 | 0.986 | 0.949 | 0.900 | 0.865 | 0.955 | 0.928 | 0.828 |
|  | ESM-1b and ProtDCal | 63 | 0.870 | 0.981 | 0.926 | 0.857 | 0.783 | 0.956 | 0.903 | 0.767 |
|  | ESM-1b and iLearn and ProtDCal | 111 | 0.930 | 0.981 | 0.956 | 0.913 | 0.851 | 0.952 | 0.921 | 0.813 |
| APP | ESM-1b and iLearn | 30 | 0.871 | 0.935 | 0.903 | 0.808 | 0.891 | 0.909 | 0.908 | 0.463 |
|  | ESM-1b and ProtDCal | 25 | 0.774 | 0.968 | 0.871 | 0.756 | 0.898 | 0.927 | 0.926 | 0.511 |
|  | ESM-1b and iLearn and ProtDCal | 24 | 0.903 | 0.903 | 0.903 | 0.806 | 0.898 | 0.916 | 0.916 | 0.483 |
| AVP | ESM-1b and iLearn | 97 | 0.788 | 0.894 | 0.841 | 0.686 | 0.907 | 0.881 | 0.884 | 0.598 |
|  | ESM-1b and ProtDCal | 77 | 0.705 | 0.872 | 0.788 | 0.584 | 0.912 | 0.873 | 0.877 | 0.588 |
|  | ESM-1b and iLearn and ProtDCal | 167 | 0.785 | 0.899 | 0.842 | 0.688 | 0.911 | 0.892 | 0.894 | 0.621 |
Figure 2 depicts, for each endpoint, the MCCtest and MCCext values achieved by the best RF models created from the subsets containing ESM-1b & iLearn PDs, ESM-1b & ProtDCal PDs, and ESM-1b & ProtDCal & iLearn PDs, respectively. The MCCtest and MCCext values obtained by the best RF models built only with ESM-1b nHPDs (see Table 2) are also depicted (blue bars) and are considered as the baseline in this analysis. Figure 2 reveals that the RF models derived from the subsets comprised of both nHPDs and HPDs generally yielded better performance than the models built only with nHPDs. Specifically, the combination of iLearn HPDs with ESM-1b nHPDs (light blue bars) always contributed to better results, except for the general-AMP external dataset (see Figure 2B). From Figure 2, it can also be noted that the models based on both ProtDCal HPDs and ESM-1b nHPDs (yellow bars) achieved inferior performance with respect to the baseline models in half of the datasets, which suggests that the ProtDCal HPDs do not consistently contribute to building better models when merged with the ESM-1b nHPDs.

Matthews correlation coefficient achieved on the testing (A) and external (B) datasets by the best models built from the nHPD and HPD subsets. The performance of the best models built only with nHPDs is also shown (baseline).
However, the combination of iLearn and ProtDCal HPDs with ESM-1b nHPDs (red bars) contributed to developing the best models. Indeed, the highest MCCtest values (see Figure 2A) were obtained by the models based on the three types of PDs considered (i.e. iLearn, ProtDCal and ESM-1b PDs), except on the APP testing dataset, where the best developed model obtained the second highest MCCtest value. The models based on the three types of PDs also attained the highest MCCext values on the ABP and AVP external datasets, and the second highest MCCext values on the general-AMP and APP external datasets (see Figure 2B). All these results are supported by a statistical analysis based on Bayesian estimation [95] (see Supplementary Tables S5 to S7 for input values). This analysis reveals (see Supplementary Figure S2) that the models developed with iLearn, ProtDCal and ESM-1b PDs
Finally, Figure 3 depicts the number of nHPDs and HPDs included in the RF models built. For each endpoint, the first bar corresponds to the best model created only with ESM-1b nHPDs, the second bar to the best model created with both ESM-1b nHPDs and iLearn HPDs, the third bar to the best model created with both ESM-1b nHPDs and ProtDCal HPDs, and the fourth bar to the best model created with the three types of PDs accounted for (i.e. iLearn, ProtDCal and ESM-1b PDs). On the one hand, it can be observed that the combination of nHPDs with HPDs led to more complex models in terms of the total number of variables. However, the statistical outcomes explained above justify that increase in complexity. On the other hand, Figure 3 reveals that a greater number of nHPDs were generally included in the best built models. This confirms that, in the classification of AMPs, the chemical information related to the variation of protein sequences is rather different from the one encoded by the HPDs calculated with the ProtDCal [62] and iLearn [85] tools. Nonetheless, the number of HPDs included in the models also indicates that they contain chemical information that is not present in the ESM-1b nHPDs. This suggests complementarity between both types of features, which leads to better predictions when they are combined. That complementarity is studied in depth below.

Number of nHPDs and HPDs included in the best RF models created from the nHPD and HPD subsets. The first bar corresponds to the models created only with ESM-1b nHPDs; the second bar to the models created with both ESM-1b nHPDs and iLearn HPDs; the third bar to the models created with both ESM-1b nHPDs and ProtDCal HPDs; and the fourth bar to the models built with ESM-1b nHPDs, iLearn HPDs and ProtDCal HPDs.
nHPD and HPD space visualization comparisons
In this section, we analyze the effectiveness of nHPDs and HPDs using feature space visualization comparisons. Feature space visualizations of all the nHPDs and HPDs in 2D space were generated via the t-SNE algorithm [97] (see Source Code S1). Specifically, the feature datasets used to train the models to predict general-AMPs, ABPs, AFPs and AVPs were the ones used to perform this analysis. These feature datasets comprise iLearn and ProtDCal HPDs (see Data S1B) as well as ESM-1b nHPDs (see Data S1C). The visualizations of the HPDs and nHPDs are shown in Figure 4A and B, respectively. As can be seen, the feature space boundary between positive (in red) and negative peptide sequences in the ESM-1b nHPD visualization is distinctly defined, whereas there is no clear distinction between the sequences of both classes in the HPD visualization. This confirms that the evolutionary information codified in the ESM-1b nHPDs is fundamental for the classification of AMPs. It also suggests that shallow learners fed with nHPDs will be able to make better decisions than when fed with HPDs only. These findings explain why the RF models using ESM-1b nHPDs presented comparable-to-superior generalization abilities with respect to the RF models based only on the iLearn and/or ProtDCal HPDs (see Tables 2 and 3). These results are also in correspondence with conclusions established elsewhere [98].
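For reference, a minimal t-SNE projection of a descriptor matrix can be produced as sketched below. This is a generic scikit-learn analogue and does not reproduce the exact settings of Source Code S1; the names X, y and the example call are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(X, y, title):
    """Project a descriptor matrix to 2D with t-SNE and color the points by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=4, c="gray", label="negative")
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=4, c="red", label="positive (AMP)")
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_feature_space(X_esm1b, y_train, "ESM-1b nHPDs")  # hypothetical variable names
```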

Feature space 2D visualizations of the iLearn and ProtDCal HPDs (part A) and ESM-1b nHPDs (part B).
Analysis of the HPDs and nHPDs used to build the best models using chemometric and explainable artificial intelligence (XAI) methods
Here, we study the relevance and complementarity of the nHPDs and HPDs included in the best model created for each endpoint (see Data S1F). A SE-based relevance analysis [86] was first performed to quantify the information content of the PDs. Relevant PDs for modeling should present high SE values as an indicator of their ability to distinguish among proteins (peptides) with different sequences. A Gini index-based importance (IG) analysis [99] was also carried out to assess how regularly a PD was selected for a split and how large its overall discriminative ability was for the RF classifier. Moreover, a permutation-based importance (IP) study was performed to measure the increase in the prediction error of a model after permuting the values of each of its features (PDs) [100]. In this study, a PD is ‘important’ if the model error increases when its values are shuffled. Finally, an interaction analysis based on Friedman’s H-statistic [101] was performed to assess how much of the variation of the predictions depends on the cooperation of the PDs. These last two studies are model-agnostic global interpretation methods, and their implementations in the Interpretable Machine Learning (iml) package [102] were used (see Source Code S2).
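A simplified sketch of the two RF-importance measures is shown below using scikit-learn as an analogue of the tools actually employed (WEKA for IG and the iml R package for IP). Note that iml reports permutation importance as an error ratio (values above 1 indicate an important feature), whereas sklearn's permutation_importance reports the mean drop in score, so the scales differ; the Friedman H-statistic interaction analysis is not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def descriptor_importances(X, y, names, seed=1):
    """Gini-based (IG analogue) and permutation-based (IP analogue) importances of an RF."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    gini = dict(zip(names, rf.feature_importances_))    # mean decrease in Gini impurity
    perm = permutation_importance(rf, X, y, n_repeats=10, random_state=seed)
    drop = dict(zip(names, perm.importances_mean))      # mean drop in score after shuffling
    return gini, drop
```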
Figure 5 depicts (for each endpoint) the boxplots corresponding to the entropy, importance and interaction values calculated with the analyses described above. These boxplots are depicted for both the ESM-1b nHPDs alone and the iLearn and ProtDCal HPDs together that were used to build the models based on the three types of PDs (see Data S1F). On the one hand, Figure 5A shows the results obtained with the SE-based relevance analysis (see Supplementary Table S8 for the SE values). This analysis reveals that the nHPDs included in the analyzed models present greater information content than the HPDs. The maximum SE value (mSE) that a PD can obtain in the general-AMP, antibacterial, antifungal, antiparasitic and antiviral training datasets is equal to 14.25, 13.68, 10.6, 7.63 and 12.18 bits, respectively. As can be observed, the ESM-1b nHPDs always achieved SE values greater than 11.8 (82.8% of the mSE), 11.4 (83.3% of the mSE), 8.2 (77.3% of the mSE), 6.3 (82.6% of the mSE) and 7.6 (62.4% of the mSE) bits, respectively. Notice that the HPDs mostly obtained SE values below these thresholds. This superior behavior of the nHPDs can also be observed in Supplementary Figure S3, except for the AFP dataset. These outcomes suggest that the superior performance achieved by the models based on ESM-1b nHPDs may also be because they have a greater sensitivity to progressive changes in protein (peptide) sequences than the HPDs, and thus, they better characterize peptides (proteins) with different sequences. The latter is a desirable attribute for any molecular descriptor [103]. This study was performed with the IMMAN software [90] using a binning schema (discretization) with the number of bins equal to the number of sequences in each training dataset.
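Since the SE analysis itself was computed with the IMMAN software [90], the fragment below is only a simplified sketch of the underlying idea: each PD is discretized into as many bins as training sequences and the Shannon entropy (in bits) of the resulting frequency distribution is taken as its information content. Function and variable names are illustrative.

```python
# Simplified sketch of the Shannon entropy (SE) relevance analysis; the actual
# values in Supplementary Table S8 come from the IMMAN software, so this only
# approximates the concept.
import numpy as np

def shannon_entropy(values: np.ndarray, n_bins: int) -> float:
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())          # entropy in bits

# X: (n_sequences, n_descriptors) matrix of PD values for one training dataset.
def se_per_descriptor(X: np.ndarray) -> np.ndarray:
    n_bins = X.shape[0]       # binning schema equal to the number of sequences
    return np.array([shannon_entropy(X[:, j], n_bins) for j in range(X.shape[1])])
```

Under this binning schema, the maximum attainable SE for a dataset with n sequences is log2(n) bits, which is consistent with the mSE values quoted above.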

Boxplots corresponding to the results obtained with the SE-based relevance analysis (A), the Gini index-based importance analysis (B), the permutation-based importance analysis (C) and interaction-based importance analysis (D). These boxplots are depicted for both the nHPDs and HPDs that were used in the best model built for each endpoint.
On the other hand, Figure 5B represents the IG values corresponding to the nHPDs and HPDs that were jointly used to create the analyzed models (see Data S1F). Supplementary Table S9 contains the IG value obtained for each PD. It can be observed that the nHPDs generally achieved better IG values than the HPDs in the classification of ABPs and APPs, whereas in the classification of general-AMPs and AVPs, the HPDs were the ones that mostly obtained the best IG values. Both types of PDs had comparable IG value distributions in the classification of AFPs, although the nHPDs were the ones that achieved the highest IG values. These results indicate that both types of PDs are relevant to distinguish between peptide sequences with positive and negative activities in the recognition of AMPs. Indeed, it can be observed in Supplementary Figure S4 that there are no outlier IG values, which means that no PD was significantly more important than the others in the decision-making of the RF learner. The IG values were obtained from the WEKA tool output [89].
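The IG values reported here were taken from the WEKA output; the sketch below only mirrors the concept with scikit-learn's mean-decrease-in-impurity importance, which likewise rewards descriptors that are frequently chosen for splits and produce large Gini impurity reductions. It is an approximation, not the procedure used to generate Supplementary Table S9.

```python
# Hedged sketch of a Gini index-based importance (IG) analysis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gini_importances(X: np.ndarray, y: np.ndarray, names: list[str]) -> dict[str, float]:
    # Mean decrease in Gini impurity accumulated over all trees of the forest,
    # normalized so that the importances sum to 1.
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    return dict(zip(names, rf.feature_importances_))
```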
Moreover, Figure 5C depicts the outcomes obtained with the permutation-based importance analysis. Supplementary Table S10 contains the IP value obtained for each PD. In this figure, it can be noted that the IP value distributions of the nHPDs were better than those of the HPDs, except in the AVP dataset. It can additionally be noted that nHPDs were always important in the predictions of the analyzed models since at least one nHPD obtained an IP value greater than 1. Notice that no HPD was important to recognize APPs.
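The IP values were computed with the iml package (Source Code S2). As a rough Python analogue, and under the assumption that an IP greater than 1 is read as the ratio between the permuted and the original model error, a ratio-based permutation importance can be sketched as follows; names and the error measure are illustrative.

```python
# Simplified ratio-based permutation importance (IP) sketch.
import numpy as np
from sklearn.metrics import zero_one_loss

def permutation_importance_ratio(model, X, y, rng=np.random.default_rng(0)):
    base_err = zero_one_loss(y, model.predict(X))
    ratios = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])        # destroy the feature's signal
        perm_err = zero_one_loss(y, model.predict(Xp))
        ratios.append(perm_err / max(base_err, 1e-12))   # IP > 1 => error increased
    return np.array(ratios)
```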

Boxplots corresponding to the performance (MCCext) achieved on the general-AMP, ABP, AFP and AVP external datasets by the 30 models that were built with the AMPScanner, APIN and APIN-Fusion architectures, respectively.
Finally, Figure 5D depicts the outcomes obtained with the interaction analysis based on the H-statistic [101]. Supplementary Table S11 contains the interaction strength (ISH) values obtained for each PD regarding all the other PDs. The results first reveal that the interaction effects between the PDs are generally very weak because almost all the ISH values are less than 0.1 (10% of the explained variance). Only two HPDs achieved ISH values above this threshold, in the recognition of ABPs and APPs. It can also be observed that the highest ISH values were obtained by HPDs in the classification of AFPs and AVPs; only in the classification of general-AMPs did an nHPD obtain the highest ISH value. Nonetheless, it can be seen in Figure 5D that the nHPDs have a better distribution of ISH values in the classification of general-AMPs and AVPs. These findings suggest that the variation of the predictions depended more on the HPDs than on the nHPDs since the former obtained the highest ISH value(s) in 4 out of the 5 endpoints analyzed. Altogether, these results demonstrate that the nHPDs and HPDs are complementary, because neither type was consistently better than the other across all the relevance, importance and interaction studies performed. This justifies why the combination of nHPDs and HPDs leads to better predictive models than using either type independently.
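Friedman's H-statistic was likewise computed with the iml package; the fragment below is only a simplified, subsampled re-implementation of its one-versus-all form (the share of prediction variance explained by interactions involving a given feature), included to make the quantity summarized in Figure 5D concrete. The subsample size is an assumption made to keep the double loop over partial dependence estimates tractable.

```python
# Simplified sketch of the overall (one-vs-all) Friedman H-statistic for feature j.
import numpy as np

def h_statistic_overall(model, X, j, n_samples=200, rng=np.random.default_rng(0)):
    idx = rng.choice(X.shape[0], size=min(n_samples, X.shape[0]), replace=False)
    Xs = X[idx]
    f = model.predict_proba(Xs)[:, 1]                    # full-model predictions

    pd_j = np.empty(len(Xs))                             # partial dependence on x_j
    pd_rest = np.empty(len(Xs))                          # partial dependence on x_{-j}
    for i in range(len(Xs)):
        Xj = Xs.copy(); Xj[:, j] = Xs[i, j]              # fix x_j, marginalize the rest
        pd_j[i] = model.predict_proba(Xj)[:, 1].mean()
        Xr = np.repeat(Xs[i:i + 1], len(Xs), axis=0)     # fix x_{-j}, marginalize x_j
        Xr[:, j] = Xs[:, j]
        pd_rest[i] = model.predict_proba(Xr)[:, 1].mean()

    # Center each term before comparing, as the H-statistic definition requires.
    f, pd_j, pd_rest = f - f.mean(), pd_j - pd_j.mean(), pd_rest - pd_rest.mean()
    return float(np.sum((f - pd_j - pd_rest) ** 2) / np.sum(f ** 2))
```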
Performance regarding state-of-the-art DNN architectures
This section compares the performance of the RF models built with nHPDs against that achieved by the AMPScanner [39], APIN [45] and APIN-Fusion [45] DNN architectures, which were proposed in the literature to classify AMP sequences. As the training of these architectures is not deterministic, each of them was retrained 30 times on the training datasets detailed in Table 1, except for the APP training dataset because it has very few sequences. Files S1 in [75] and Data S2 hold the models built (.h5 files) with the AMPScanner and APIN (and APIN-Fusion) architectures, respectively. Each built model was evaluated on the corresponding external dataset (see Supplementary Tables S1–S4 in [75] for the MCCext values obtained by the AMPScanner architecture, and Supplementary Tables S12 and S13 for the MCCext values obtained by the APIN and APIN-Fusion architectures, respectively). By retraining these three architectures, a fair comparison with the RF models is guaranteed. The RF models used in the comparisons were the ones based only on the ESM-1b nHPDs (see Data S1C) as well as the ones based on the subsets comprised of ESM-1b, iLearn and ProtDCal PDs (see Data S1F).
On the one hand, Figure 6 depicts the boxplots corresponding to the MCCext values obtained by the 30 models created with the AMPScanner, APIN and APIN-Fusion architectures to predict general-AMPs, ABPs, AFPs and AVPs, respectively. It can first be noted that the MCCext values obtained with the AMPScanner architecture are the most scattered of all, presenting a standard deviation equal to 3.35, 2.25, 3.61 and 4.74 for the general-AMP, ABP, AFP and AVP external datasets, respectively. This indicates that the AMPScanner performance is not stable across runs and that its highest performances may be obtained by chance. In fact, the highest MCCext value on the AVP dataset is an outlier, while most of the MCCext values were less than 0.8 for the general-AMP and ABP external datasets, as well as less than 0.6 and 0.4 for the AFP and AVP external datasets, respectively. Moreover, it can be noted that the APIN and APIN-Fusion architectures were the ones that achieved the best performances, with both obtaining MCCext values mostly above the aforementioned thresholds. APIN yielded the least scattered MCCext values (with a standard deviation equal to 1.48, 1.04, 1 and 1.58 for the general-AMP, ABP, AFP and AVP datasets, respectively), whereas APIN-Fusion yielded the highest MCCext values (with a standard deviation equal to 4.86, 2.29, 1.61 and 2.59 for the general-AMP, ABP, AFP and AVP datasets, respectively). The high standard deviation of APIN-Fusion on the general-AMP dataset was due to three low outlier MCCext values; after removing them, a standard deviation equal to 2.67 was obtained.
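For reference, the per-run evaluation summarized in Figure 6 can be reproduced along the following lines, assuming the 30 retrained Keras models (.h5 files referenced above) and that the external peptide sequences have already been encoded into the input format each network expects. The directory layout, the function name and the 0.5 decision threshold are assumptions for illustration, not the exact evaluation scripts used in this work.

```python
# Sketch: summarize MCCext over the 30 retrained DNN models of one architecture.
import glob
import numpy as np
from tensorflow.keras.models import load_model
from sklearn.metrics import matthews_corrcoef

def mcc_over_runs(model_dir: str, X_ext, y_ext) -> tuple[float, float]:
    mccs = []
    for path in sorted(glob.glob(f"{model_dir}/*.h5")):      # 30 retrained models
        model = load_model(path)
        y_pred = (model.predict(X_ext).ravel() >= 0.5).astype(int)
        mccs.append(matthews_corrcoef(y_ext, y_pred))
    return float(np.mean(mccs)), float(np.std(mccs))          # mean and spread
```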
On the other hand, Figure 7 shows the highest MCCext values obtained by both the RF models and the DNN models. The average MCCext values (without including outliers) for the DNN models are also shown. In a general sense, it can be observed that the RF models achieved comparable-to-superior performances to the DNN architectures. The average and highest performances achieved by the AMPScanner [39] and APIN [45] architectures were always inferior to the ones achieved by the RF models. The APIN-Fusion architecture [45] was the only one that outperformed the two RF models by 1.42 and 2.15% on the general-AMP external dataset (see Figure 7A), whereas it obtained the same performance as the best RF model on the ABP external dataset (see Figure 7B). For the AFP and AVP datasets, APIN-Fusion was always remarkably worse than the RF models. AMPScanner is composed of a convolutional layer and a long short-term memory (LSTM)-based recurrent layer, whereas APIN is a multi-scale convolutional network. APIN-Fusion is APIN but combined with amino acid composition (AAC) and dipeptide composition (DPC) HPDs. That fusion was the only one that obtained a slightly better performance than the RF models, and it was only on a single dataset.

Highest MCCext values achieved by both the DNN models and RF-based shallow models on the general-AMP, ABP, AFP and AVP external datasets. The average MCCext values for the DNN models are also shown.
In addition to the experiments performed with the AMPScanner [39], APIN [45] and APIN-Fusion [45] DNN architectures, we also carried out comparisons regarding DNN models based on LSTM, Bidirectional Encoder Representations from Transformers (BERT) and Attention, which were recently created by Ma et al. [104] to identify general-AMPs from the human gut microbiome. The consensus performance of these DNN models was also considered in the comparisons (i.e. a sequence is classified as AMP if the LSTM-, BERT- and Attention-based models all classify it as AMP; otherwise, it is classified as non-AMP). This benchmarking analysis was only performed on the testing and external datasets for general-AMPs (see Table 1). Figure 8 depicts a bar chart corresponding to the MCCtest and MCCext values achieved by the DNN models (Supplementary Table S14 shows the outputs and Supplementary Table S15 shows the performance metrics), by the best RF model based on the ESM-1b nHPDs alone (see Table 2 and Data S1C), and by the best RF model based on the ESM-1b, iLearn and ProtDCal PDs together (see Table 3 and Data S1F). As can be observed, the RF models were always better than these DNN models.
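A minimal sketch of this consensus rule, assuming the three sets of binary predictions (1 = AMP, 0 = non-AMP) have already been obtained from the models of Ma et al. [104]:

```python
# Consensus rule: a sequence is labeled AMP only if all three models agree.
import numpy as np

def consensus(pred_lstm: np.ndarray, pred_bert: np.ndarray, pred_attn: np.ndarray) -> np.ndarray:
    # Inputs are 0/1 integer arrays of equal length (one entry per sequence).
    return (pred_lstm & pred_bert & pred_attn).astype(int)
```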
In general, these outcomes corroborate that computationally more complex classifiers do not necessarily lead to better performances, as recently warned in [75]. These findings also suggest that nHPDs derived from the ESM-1b Transformer model [83] have better modeling abilities and richer chemical information than nHPDs obtained through DNN architectures trained on small, labeled AMP datasets. Thus, building DNN architectures on small datasets only to extract features that are then fed to shallow classifiers to identify AMPs may not be the most suitable approach, even though it has recently become a common practice [105, 106]. Before concluding, it is important to highlight that a comparison between the RF models using ESM-1b nHPDs and several state-of-the-art models was not performed because the baseline models based on ProtDCal HPDs were already better than several models freely available in thirteen programs (see Table 5 in [51]). Consequently, all the models built in this work whose performances were better than those of the baseline models (see Tables 2 and 3) are also better than the state-of-the-art models analyzed in [51].
Conclusions
We have studied the use of non-handcrafted features (nHPDs) to develop shallow models to identify potential AMPs. We used the ESM-1b Transformer model [83] to calculate these nHPDs. We also considered handcrafted features (HPDs) to compare the performance of shallow models when built with both types of features. As a result, nHPDs lead to shallow models with notably higher performances than shallow models built with HPDs only. However, the experimental results comparing HPDs and nHPDs show that both types of descriptors extract complementary and different information from the input AMP sequences. Thus, the combination of nHPDs with HPDs leads to shallow models with better performances than using nHPDs only, although it implies an increase in the size of the models in most cases. Consequently, we recommend always studying whether the increase in model complexity is justified, by performing relevance and importance analyses such as the ones carried out in this work. Finally, we can also conclude that building DNN architectures on small, labeled datasets only to extract features that are then fed to shallow classifiers may not be the most suitable approach, since features learned via self-supervision seem to present better modeling abilities. Nonetheless, it is important to highlight that self-supervision is a suitable technique when large datasets are used; otherwise, the learned features (nHPDs) can be less valuable than HPDs (see Tables 3A and 3E in [75] for results obtained from a model based on self-supervision but trained with 556 603 protein sequences).
The modeling ability of learned features to classify AMPs is studied
The combination of learned and calculated features to classify AMPs is studied
Learned features lead to better models than calculated features alone
Better shallow models are built by combining both types of features
Shallow models built with both types of features perform better than deep models
Data and software availability
Data S1 contains the handcrafted and non-handcrafted descriptor sets, trained models and Weka software outputs corresponding to the best RF model built on each family (or ensemble) of features considered. Data S2 contains the trained models (.h5 files) of the APIN and APIN-Fusion architectures. Because these models were created for benchmarking purposes only, no software was built.
Acknowledgement
C.R.G.J. acknowledges the program ‘Cátedras CONACYT’ from ‘Consejo Nacional de Ciencia y Tecnología (CONACYT), México’ for the support of the endowed chair 501/2018 at ‘Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE)’. This work was also supported by CONACYT under grant A1-S-20638 to C.A.B.
Author Biographies
César R. García-Jacas is a Conacyt Researcher at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico. He received his Ph.D. in Technical Science (2015) and M.Sc. in Computer Science (2013) from the Universidad Central “Marta Abreu” de las Villas, Cuba, as well as his B.Eng. in Informatics Science (2009) from the Universidad de las Ciencias Informáticas, La Habana, Cuba.
Luis A. García-González is a Ph.D. student in Computer Science, Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Mexico. He earned his B.Sc. in Computer Science (2021) from the Universidad de Oriente, Cuba.
Felix Martinez-Rios is a Full Professor at the School of Engineering, Universidad Panamericana (UP). He does research in artificial intelligence, quantum computing and theory of computation.
Issac P. Tapia-Contreras is an M.Sc. student in Computer Science (2022) at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico.
Carlos A. Brizuela is an Associate Professor at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico. He develops machine learning and optimization algorithms for the discovery and design of bioactive peptides.
Description of the Organization: The Center for Scientific Research and Higher Education at Ensenada (in Spanish: Centro de Investigación Científica y de Educación Superior de Ensenada, CICESE) is a public research center sponsored by the National Council of Science and Technology of Mexico (CONACYT) in the city of Ensenada, Baja California, and specialized in Earth Sciences, Oceanography, Applied Physics, Experimental and Applied Biology and Computer Sciences (https://www.cicese.edu.mx/).