A framework for predicting variable-length epitopes of human-adapted viruses using machine learning methods

Yin, Rui; Zhu, Xianghe; Zeng, Min; Wu, Pengfei; Li, Min; Kwoh, Chee Keong

doi:10.1093/bib/bbac281

Abstract

The coronavirus disease 2019 pandemic has alerted people to the threat posed by viruses. Vaccines are the most effective way to prevent the disease from spreading. The interaction between antibodies and antigens clears infectious organisms from the host. Identifying B-cell epitopes is critical in vaccine design, development of disease diagnostics and antibody production. However, traditional experimental methods to determine epitopes are time-consuming and expensive, and the predictive performance of existing in silico methods is not satisfactory. This paper develops a general framework to predict variable-length linear B-cell epitopes specific to human-adapted viruses with machine learning approaches based on the ProtVec representation of peptides and physicochemical properties of amino acids. QR decomposition is incorporated into the embedding process, which enables our models to handle variable-length sequences. Experimental results on large immune epitope datasets show that our proposed model outperforms state-of-the-art methods in terms of AUROC (0.827) and AUPR (0.831) on the testing set. Moreover, sequence analysis also provides the viral category of the corresponding predicted epitopes with high precision. Therefore, this framework is shown to reliably identify linear B-cell epitopes of human-adapted viruses given protein sequences and could provide assistance in potential future pandemics and epidemics.

Epitope prediction, Human viruses, Variable-length, Machine learning

Introduction

An antibody is a type of blood protein produced to counteract a specific antigen [1]. It is a fundamental component of the humoral immune response, used by the immune system to recognize and neutralize a unique molecule of a pathogen such as a bacterium or virus [2, 3]. The region of an antigen recognized by the antibody is referred to as a B-cell epitope [4], and antibodies can inhibit the function of antigens by binding the epitope regions [5]. B-cell epitopes can be categorized into two groups, namely continuous and discontinuous. Continuous epitopes are linear stretches of the amino acid sequence, while in discontinuous epitopes the amino acids are brought into close contact by the 3D conformation of the protein. Accurate detection of B-cell epitopes is critical for biomedical applications such as vaccine design [6], disease diagnostics [7] and therapeutic antibodies [8].

Figure 1

The basic workflow of the proposed framework. (A) Viral peptide inclusion criteria. Viral sequences of human-adapted viruses were extracted from IEDB. Identical sequences and virus categories with fewer than 50 samples were excluded. Peptides shorter than 4 residues were also removed, and the remaining sequences were exported with labels. (B) Feature representation. The sequences were encoded using ProtVec and AAindex combined with QR decomposition and CTD transformation to handle variable-length sequences. (C) Feature transformation process. Mathematical formulations of the QR decomposition and CTD transformation used to embed the input viral peptides. (D) Classification and evaluation. The filtered and processed samples were divided into training (90%) and testing (10%) datasets, and 5-fold cross-validation was performed on the training set using multiple classifiers for epitope prediction and species identification. The final performance was quantified by metrics such as accuracy, precision and recall on the held-out testing set.

The identification of B-cell epitopes includes experimental determination and computational prediction. Experimental methods mainly use X-ray crystallography and traditional B-cell epitope mapping [9, 10]. However, most of these methods are costly, laborious and time-consuming and are not able to identify all epitopes. Advancement in bioinformatics has significantly contributed to the development of immunoinformatics involving the discovery of antibody structures [11], epitope analysis [12] and immune network modeling [13]. Various computational methods have been proposed to predict B-cell epitopes based on sequence and structural data as well as their co-evolutionary information [14, 15]. Early epitope prediction methods used only one or several physicochemical properties of amino acids (e.g. polarity, solvent accessibility and hydrophilicity) with conventional computational approaches. These methods usually calculate the average propensity value over a sliding window and have been shown to perform poorly in practice [16–19]. With improved computing ability and a growing number of experimentally identified epitopes, methods based on machine learning techniques have been applied to improve the prediction performance by incorporating new amino acid-based features. In addition, the advent of proteomics and the construction of databases of Ag-Ab crystal structures have made it easier to carry out deeper analysis of conformational epitopes. For instance, PEPOP is a structure-based approach that uses the 3D coordinates of a protein to predict clusters of surface-accessible segments and residues that might correspond to discontinuous epitopes and can be used for immunogenic peptide design [20]. CBTOPE was proposed for the prediction of discontinuous epitopes from antigen primary structure by combining amino acid properties and traditional features of physicochemical profiles [21]. Zhang et al. followed the work of CBTOPE and introduced an ensemble learning technique to build a prediction model that explored more potential sequence-derived features relevant to conformational epitopes [22].

Though the majority of B-cell epitopes are conformational, much attention has been paid to the prediction of continuous epitopes, as they are critical for the design of peptide-based vaccines [23–25]. ABCPred was the first method to predict B-cell epitopes in an antigenic sequence using standard feed-forward and recurrent neural networks (NNs) [26]. Chen et al. proposed a new amino acid pair (AAP) antigenicity scale and combined it with a support vector machine (SVM) to find B-cell epitopes [27]. This method was based on the finding that B-cell epitopes favor particular AAPs, and it was trained on a homology-reduced dataset of linear B-cell epitopes. EL-Manzalawy et al. explored two machine learning approaches for predicting flexible-length linear B-cell epitopes using the subsequence kernel [28]. Furthermore, Lian et al. presented a novel linear B-cell epitope prediction model using multiple linear regression based on the antigen’s primary sequence information [29]. Larsen et al. introduced BepiPred for predicting continuous epitopes by combining two residue properties with a Hidden Markov Model [30]. Jespersen et al. extended it by developing BepiPred-2.0, which was trained on epitopes annotated from antibody-antigen protein structures using random forest (RF) [31]. More recently, Collatz et al. described EpiDope, a Python tool that leverages a deep neural network (DNN) to detect linear B-cell epitope regions on individual protein sequences [32]. Bahai et al. presented a new method named EpitopeVec, which uses a combination of residue properties, modified antigenicity scales and protein language model-based representations as features of peptides for linear B-cell epitope prediction [33].

Despite these advancements, most of the existing methods concentrate on epitopes of fixed length, but linear B-cell epitopes can vary over a broad range in length. Identifying epitope sequences without trimming them will allow us to provide prediction tools to experiment with any arbitrary epitope lengths. In 2020, we witnessed and experienced a global pandemic due to severe acute respiratory syndrome coronavirus 2 that caused infections in millions of people and a large number of deaths. The viruses are still mutating and circulating rapidly with emerging new variants. Understanding humoral responses to these viruses is critical for improving therapeutics and vaccines and for preparing for future pandemics or epidemics led by other potential novel viruses. This work developed a novel framework that combines the physicochemical properties of amino acids and a new continuous distributed representation of biological sequences (ProtVec) for epitope prediction (Fig. 1). This model focuses on human-adapted viruses and can predict variable-length epitopes. We trained and tested our model using multiple traditional machine learning and deep learning methods, and the results indicate that XGBoost achieved the best performance. Further experiments demonstrate that our proposed model outperformed the other existing state-of-the-art methods. This model also could reveal the viral species of predicted epitopes with high accuracy. We hope this high predictive power of the model would facilitate the in vitro experimental validation for linear epitopes and accelerate the development of serological assays and antibody production. The contributions and innovations of this work are listed below:

We proposed a machine learning framework that can not only predict viral epitopes but also identify the host of the viruses.
This framework can be applied to sequences of variable length from human-adapted viruses.
The experimental results demonstrate superior performance over the other existing methods, and the model can serve to facilitate vaccine design.

Materials and methods

Data collection

We downloaded all the linear viral peptides of B-cells of the human host from the Immune Epitope Database (IEDB) [34] on 30 March 2021, which resulted in 210 311 raw protein sequences. To ensure the quality of training samples, we selected virus species with over 50 cases and eliminated the peptides that were shorter than 4 residues. Next, we merged the identical sequences from each category and kept the information of their verified regions. For peptides with multiple assays, we classified each peptide as an epitope or non-epitope based on the majority of the assay outcomes. Peptides with equal numbers of negative and positive assay results were also eliminated from the dataset. Finally, we ended up with 17 different human-adapted viruses containing 4975 epitopes and 4956 non-epitopes ranging from 4 to 50 residues in length. The detailed information is shown in Table 1.
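The labeling rule described above (majority vote over assays, ties discarded, short peptides removed) can be illustrated with a minimal sketch. The function name `label_peptides`, the record layout and the toy peptides are assumptions made for illustration; they are not taken from IEDB or from the authors' code, and the per-virus filter of 50 cases is omitted for brevity.

```python
from collections import defaultdict

MIN_LENGTH = 4  # peptides shorter than this are dropped, as stated in the text


def label_peptides(assays):
    """Collapse (peptide, is_positive) assay records into one label per peptide.

    Returns a dict mapping peptide -> 1 (epitope) or 0 (non-epitope).
    Peptides shorter than MIN_LENGTH or with tied assay counts are omitted.
    """
    counts = defaultdict(lambda: [0, 0])  # peptide -> [negatives, positives]
    for peptide, is_positive in assays:
        counts[peptide][int(is_positive)] += 1

    labels = {}
    for peptide, (neg, pos) in counts.items():
        if len(peptide) < MIN_LENGTH or neg == pos:
            continue  # too short, or no majority among the assays
        labels[peptide] = 1 if pos > neg else 0
    return labels


# Toy records (hypothetical peptides, not real IEDB entries).
records = [
    ("ACDEFGHIK", True), ("ACDEFGHIK", True), ("ACDEFGHIK", False),
    ("LMNPQRST", False),
    ("VWYACD", True), ("VWYACD", False),  # tie: dropped
    ("ACD", True),                         # shorter than 4: dropped
]
labels = label_peptides(records)
```

Merging identical sequences happens implicitly here because the peptide string is used as the dictionary key.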

Table 1

Detailed information on the epitopes and non-epitopes of the selected human-adapted viruses

Category of virus	Sequence	Epitopes	Non-epitopes	Min. length	Max. length	Avg. length
Alphapapillomavirus	179	168	11	7	30	12.9
Orthonairovirus	226	32	194	12	20	12.9
Dengue virus	904	428	476	5	46	10.3
Ebolavirus	650	74	576	10	25	15.0
Enterovirus	354	122	232	7	38	17.8
Hepacivirus	3354	1988	1366	5	48	14.3
Human alphaherpesvirus	819	134	685	6	31	19.2
Human immunodeficiency virus	454	447	7	4	47	14.1
Human orthopneumovirus	217	72	145	6	41	11.2
Influenza A virus	239	233	6	7	50	20.1
Measles morbillivirus	181	134	47	7	37	13.8
Orthohantavirus	385	101	284	8	45	19.7
Orthohepevirus A	287	176	111	6	48	23.1
Primate erythroparvovirus 1	276	61	215	7	41	20.1
Primate T-lymphotropic virus 1	197	181	16	6	47	17.8
Severe acute respiratory syndrome coronavirus	1001	533	468	4	50	17.6
Zika virus	208	91	117	8	30	17.7

Feature representation

Previous methods of mapping a variable-length amino acid sequence into a fixed-length feature vector include amino acid composition, dipeptide composition (DPC), the AAP antigenicity scale and k-mer representation. Once the variable-length sequences are converted into fixed-length feature vectors, machine learning algorithms can be applied to various prediction tasks. Recently, continuous representations have been explored in bioinformatics applications. ProtVec [35] has been demonstrated to be an efficient embedding method for biological sequences in a variety of biomedical problems [36–38]. Inspired by the continuous vector representation of words in natural language processing, this method regards 3-mers of amino acids as words and was trained on 546 790 protein sequences collected from the Swiss-Prot database. A skip-gram NN was applied to this dataset to maximize the probability of the observed sequences, yielding 100-dimensional numerical vectors that represent 3-mer peptides. In this work, we leveraged ProtVec to encode the sequences. Given a sequence of length N, we split it into N-2 consecutive 3-mers and each 3-mer was represented by a 100-dimensional vector mapped from ProtVec. Therefore, each input sequence is represented by N-2 vectors of 100 dimensions each.
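As a minimal sketch of this encoding step, the snippet below splits a peptide into its N-2 overlapping 3-mers and looks each one up in an embedding table. The table here is filled with random numbers as a stand-in for the trained ProtVec vectors, and the names `split_into_3mers`, `embed` and `toy_table` are hypothetical.

```python
import random

EMBED_DIM = 100  # ProtVec dimension stated in the text


def split_into_3mers(sequence):
    """Return the N-2 overlapping 3-mers of a sequence of length N."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def embed(sequence, table):
    """Map each 3-mer to its vector; the result has N-2 rows of EMBED_DIM values."""
    return [table[kmer] for kmer in split_into_3mers(sequence)]


# Stand-in lookup table: random vectors instead of the trained ProtVec values.
rng = random.Random(0)
peptide = "ACDEFGHIK"  # hypothetical 9-residue peptide
toy_table = {
    kmer: [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for kmer in split_into_3mers(peptide)
}

vectors = embed(peptide, toy_table)  # 7 vectors of length 100
```

Because the number of rows depends on N, peptides of different lengths produce matrices of different sizes, which is the issue the QR step addresses.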

To address the issue of different lengths of input viral peptides, we introduced QR decomposition. Assume a real matrix |$X \in \mathbb{R}^{m\times n}$| that can be written as

$$\begin{align}& X = QR = Q \left[\begin{array}{c}R^{\prime}\\O\end{array}\right], \end{align}$$

(1)

where |$Q \in \mathbb{R}^{m\times m}$| is an orthogonal matrix (|$QQ^T = Q^TQ = I_m$|), |$R^{\prime} \in \mathbb{R}^{n\times n}$| is an upper triangular matrix (|$r_{ij} = 0$| for all |$i> j$|) and |$O \in \mathbb{R}^{(m-n) \times n}$| is a zero matrix. We apply the Gram–Schmidt process to perform QR decomposition. Specifically, for matrix |$X = [x_1, x_2,..., x_n]$|, where |${x}_i$| are column vectors, we have

$$\begin{align}& {a}_1 = {x}_1, {e}_1 = \frac{{a}_1}{\|{a}_1\|}; \end{align}$$

(2)

$$\begin{align}& {a}_i = {x}_i - \sum_{k=1}^{i-1}({x}_i^T \cdot{e}_{k}){e}_{k}, {e}_i = \frac{{a}_i}{\|{a}_i\|}, \end{align}$$

(3)

where |$i \in \{2,3,\cdots , n\} $| and |$\|\cdot \|$| is the |$L_2$| norm. The matrix |$X$| can then be formulated as |$X = QR$|, where |$Q = [e_1, e_2,..., e_n]$| and |$R$| is an upper triangular matrix. To prove that this process yields such a factorization, we show that

$$\begin{align}& x_{i}=\sum_{k=1}^{i} r_{k i} e_{k}, \end{align}$$

(4)

for all |$i \in \{1,2, \ldots , n\}$| and suitable coefficients |$r_{k i}$|.

First, notice that when |$i=1$|, (4) holds by choosing |$r_{11}=\left \|a_{1}\right \|$|. When |$i=2$|, |$e_{2}$| is orthogonal to |$e_{1}$| since |$e_{2}^{T} \cdot e_{1}=\frac{1}{\left \|a_{2}\right \|} a_{2}^{T} \cdot e_{1}=\frac{1}{\left \|a_{2}\right \|}\left [x_{2}-\left (x_{2}^{T} \cdot e_{1}\right ) e_{1}\right ]^{T} \cdot e_{1}=\frac{1}{\left \|a_{2}\right \|}\left (x_{2}^{T} \cdot e_{1}\right )\left (1-e_{1}^{T} e_{1}\right )=0$|. Moreover, we have |$\left (x_{2}^{T} \cdot e_{1}\right ) e_{1}+\left \|a_{2}\right \| e_{2}=\left (x_{2}^{T} \cdot e_{1}\right ) e_{1}+a_{2}=x_{2}$|, so (4) holds with |$r_{12}=x_{2}^{T} \cdot e_{1}$| and |$r_{22}=\left \|a_{2}\right \|$|. For |$i \geq 3$|, (4) can be obtained by the same argument, setting |$r_{i j}=x_{j}^{T} \cdot e_{i}$| for |$i \leq j-1$| and |$r_{j j}=\left \|a_{j}\right \|$|. A QR decomposition also exists when the matrix does not have full column rank, although the normalization in (2) and (3) then requires special handling of any |$a_{i}=0$|.
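The construction in (2)–(4) can be checked numerically. The following is a minimal NumPy sketch of the classical Gram–Schmidt process for a matrix with linearly independent columns; it returns the reduced factors (|$Q$| with |$n$| orthonormal columns and the |$n \times n$| block |$R^{\prime}$|) rather than the full |$m \times m$| matrix of (1). The function name `gram_schmidt_qr` and the example matrix are illustrative assumptions.

```python
import numpy as np


def gram_schmidt_qr(X):
    """Classical Gram-Schmidt QR of a full-column-rank matrix X (m x n, m >= n).

    Returns Q (m x n, orthonormal columns) and R (n x n, upper triangular)
    such that X = Q @ R.
    """
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        a = X[:, i].copy()
        for k in range(i):
            R[k, i] = X[:, i] @ Q[:, k]  # r_ki = x_i^T e_k
            a -= R[k, i] * Q[:, k]       # remove the component along e_k
        R[i, i] = np.linalg.norm(a)      # r_ii = ||a_i||
        if R[i, i] < 1e-12:
            raise ValueError("columns are linearly dependent")
        Q[:, i] = a / R[i, i]
    return Q, R


X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
Q, R = gram_schmidt_qr(X)
```

Classical Gram–Schmidt is used here because it mirrors the derivation; library routines typically rely on numerically more stable schemes.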

In our work, we applied the QR decomposition to all the embedded sequences, each of which is represented as a |$100\times (N-2)$| matrix. Each sequence can be written as the product of |$Q \in \mathbb{R}^{100\times 100 }$| and |$R \in \mathbb{R}^{100\times (N-2)}$|. Considering that |$R$| is an upper triangular matrix with many zero values, and that |$R^{\prime}$| occupies only a small proportion of |$R$|, we used only the matrix |$Q$| to represent each input sequence when building the model. As a result, all the peptides can be transformed into fixed-length matrices after embedding.
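A minimal sketch of how this yields a fixed-size representation is given below, assuming NumPy's `np.linalg.qr` with `mode="complete"` as the decomposition routine (the source does not state which implementation was used) and random matrices in place of real ProtVec embeddings. The helper name `fixed_length_q` is hypothetical.

```python
import numpy as np

EMBED_DIM = 100  # ProtVec dimension


def fixed_length_q(embedded):
    """Return the 100 x 100 orthogonal factor Q of a 100 x (N-2) matrix."""
    X = np.asarray(embedded, dtype=float)
    if X.shape[0] != EMBED_DIM:
        raise ValueError("expected one 100-dimensional column per 3-mer")
    # mode="complete" returns the full square Q and a 100 x (N-2) factor R.
    Q, _R = np.linalg.qr(X, mode="complete")
    return Q


rng = np.random.default_rng(0)
short = rng.normal(size=(EMBED_DIM, 4 - 2))    # stand-in for a peptide of length 4
long_ = rng.normal(size=(EMBED_DIM, 50 - 2))   # stand-in for a peptide of length 50

q_short = fixed_length_q(short)
q_long = fixed_length_q(long_)
```

Both outputs have shape 100 × 100 regardless of peptide length. Note that only the first N-2 columns of the full Q are determined by the input (up to sign); the remaining columns are an orthonormal completion chosen by the routine.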

In addition, to make full use of the sequence information, we added physicochemical properties of amino acids as additional features through the Composition–Transition–Distribution (CTD) method [39]. This method maps each protein sequence into a fixed-length vector based on amino acid properties. We selected seven representative physicochemical properties [40] (net charge, hydrophobicity, polarizability, normalized van der Waals volume, polarity, solvent accessibility and secondary structure) and converted the sequences into numerical vectors using AAindex [41], a database of numerical indices representing various physicochemical and biochemical properties of amino acids. For each property, we divided all amino acids into three different groups [42] (Supplementary Materials S1). We derived 147 features for each peptide sequence, that is, 21 descriptors (3 composition, 3 transition and 15 distribution values) for each of the seven properties, using the CTD descriptors formulated below.

$$\begin{align}& Composition = \left(\frac{C_{G1}}{N_{p}}, \frac{C_{G2}}{N_{p}}, \frac{C_{G3}}{N_{p}}\right); \end{align}$$

(5)

$$\begin{align}& Transition = \left(\frac{T_{G1G2}}{N_{p}-1}, \frac{T_{G1G3}}{N_{p}-1}, \frac{T_{G2G3}}{N_{p}-1}\right); \end{align}$$

(6)

$$\begin{align}& Distribution = \left(\frac{D_{i0}}{N_{p}}, \frac{D_{i25}}{N_{p}}, \frac{D_{i50}}{N_{p}}, \frac{D_{i75}}{N_{p}}, \frac{D_{i100}}{N_{p}}\right), \end{align}$$

(7)

where composition represents the fraction of amino acids in the peptide that belong to a particular property group. |$N_p$| is the total number of amino acids of the peptide |$p$| and |$C_{Gi}$| is the number of amino acids of group i in the sequence. Transition characterizes the frequency with which an amino acid from one group is followed by an amino acid from another group, denoted as |$T_{GiGj}$|. It counts the cases where group i is followed by group j or the other way around, with i, j = 1,2,3 and |$G_{i} \neq G_{j}$|. The third descriptor measures the distribution of each group along the sequence: |$D_{i0}$| denotes the position of the first amino acid of group i, and |$D_{i25}$|, |$D_{i50}$|, |$D_{i75}$| and |$D_{i100}$| denote the positions at which 25%, 50%, 75% and 100% of the amino acids of group i have been reached.
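To make (5)–(7) concrete, the following sketch computes the 21 CTD values for a single property. The three-group division in `GROUPS` is an assumed, hydrophobicity-like grouping used only for illustration (the actual groups are given in Supplementary Materials S1), and the rule for locating the 25%, 50%, 75% and 100% residues (floor of the count, with a minimum of one) is likewise an assumption following common CTD implementations.

```python
import math

# Assumed three-group division for one property; illustrative only.
GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")


def ctd_one_property(seq, groups=GROUPS):
    """Return 3 composition, 3 transition and 15 distribution values."""
    n = len(seq)
    # Encode each residue by the index of its group (standard residues only).
    code = [next(g for g, members in enumerate(groups) if aa in members)
            for aa in seq]

    composition = [code.count(g) / n for g in range(3)]

    # Count adjacent pairs that switch between two given groups, either order.
    pairs = list(zip(code, code[1:]))
    transition = [
        sum(1 for a, b in pairs if {a, b} == {g, h}) / (n - 1)
        for g, h in ((0, 1), (0, 2), (1, 2))
    ]

    distribution = []
    for g in range(3):
        positions = [i + 1 for i, c in enumerate(code) if c == g]  # 1-based
        if not positions:
            distribution += [0.0] * 5  # group absent from this peptide
            continue
        count = len(positions)
        picks = [1] + [max(1, math.floor(count * q))
                       for q in (0.25, 0.5, 0.75, 1.0)]
        distribution += [positions[k - 1] / n for k in picks]

    return composition + transition + distribution


features = ctd_one_property("RKGACL")  # hypothetical 6-residue peptide
```

Repeating this for seven properties and concatenating the results gives 7 × 21 = 147 features, matching the count stated in the text.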

Classification

The feature transformation process enabled us to convert variable-length peptides into numerical matrices with identical dimensions that can be handled by machine learning models for classification tasks. Here, we leveraged several methods, including traditional machine learning algorithms and DNN architectures, to predict viral epitopes among all the encoded peptides. The traditional machine learning algorithms included SVM, K-nearest neighbor (KNN), naive Bayes (NB), RF, NN and XGBoost. The SVM classifier learns a hyperplane that separates the positive and negative samples, using different kernel functions to maximize the geometric margin. KNN is a nonparametric classification method where the function is approximated locally and all computation is deferred until function evaluation. The NB classifier is based on Bayes’ theorem and can be used for both binary and multiclass classification problems. RFs use multiple decision trees on various subsamples of the dataset to improve the predictive performance. The classic NN classifier implements a multilayer perceptron that is trained using backpropagation. XGBoost is an optimized distributed gradient boosting method that provides parallel tree boosting to solve classification problems in an efficient and accurate way. For comparison, we also employed DNN models that were previously shown to be effective for predictive tasks on protein sequences by encoding various physicochemical and structural information [43–45]. The DNN architectures comprise several variants of convolutional and recurrent NNs, including AlexNet [46], VGG [47], SqueezeNet [48], long short-term memory (LSTM) [49] and gated recurrent unit (GRU) [50].

Experimental setup

We implemented all the approaches with Scikit-learn [51] and PyTorch [52]. The exported data were randomly divided into training and independent testing sets in a ratio of 9:1. We applied 5-fold cross-validation on the training set, where each subset contained an equal number of peptides. Four of the subsets were used to train the model and the remaining subset was used for validation. The procedure was repeated five times and the final prediction was the average of the five validation results over the cross-validation. The parameters for both traditional machine learning methods and DNN models can be found in Supplementary Materials S2. We used several metrics to evaluate our models, including accuracy, precision, recall and F1. These metrics are defined as follows, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively, which measure whether the binary targets are correctly predicted.

$$\begin{align}& Accuracy = \frac{TP + TN} {TP + FP + TN + FN} \end{align}$$

(8)

$$\begin{align}& Precision = \frac{TP}{TP + FP} \end{align}$$

(9)

$$\begin{align}& Recall = \frac{TP}{TP + FN} \end{align}$$

(10)

$$\begin{align}& F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{align}$$

(11)
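The protocol and the metrics in (8)–(11) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the split is a simple unstratified shuffle, the function names `split_indices` and `binary_metrics` are hypothetical, and the confusion-matrix counts in the example are made up.

```python
import random


def split_indices(n_samples, test_fraction=0.1, n_folds=5, seed=0):
    """Shuffle indices, hold out a test set, and cut the rest into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test, train = indices[:n_test], indices[n_test:]
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [train[k::n_folds] for k in range(n_folds)]
    return train, test, folds


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 as in Eqs. (8)-(11).

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# 4975 epitopes + 4956 non-epitopes = 9931 peptides in the filtered dataset.
train, test, folds = split_indices(4975 + 4956)

# Made-up confusion-matrix counts, for illustration only.
acc, prec, rec, f1 = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

In each cross-validation round, one element of `folds` would serve as the validation subset and the other four as training data.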

To further demonstrate the superiority of our proposed framework, we compared our model against several existing tools for linear B-cell epitope prediction on the independent testing set. The latest versions of these tools and methods were used, including BepiPred-2.0 [31], LBtope [53], iBCE-EL [54], EpitopeVec [33], Parker hydrophilicity prediction [55], Chou and Fasman beta-turn prediction [56], Emini surface accessibility prediction [57], Kolaskar and Tongaonkar antigenicity prediction [58] and Karplus and Schulz flexibility prediction [59]. In addition, we added two more approaches with different feature representations, namely the AAP antigenicity scale and DPC, for epitope prediction. AAP uses a scale based on the observation that certain AAPs are favored in epitope regions [27]. Here, we only compared this method with a sequence length of 15, denoted as AAP15. DPC is a vector specifying the abundance of dipeptides, which can reflect the correlation of adjacent amino acids in a protein sequence [60]. We followed the original settings to implement these methods; the detailed parameters can be found in Supplementary Materials S3. To compare the performance of the models, we adopted the areas under the receiver operating characteristic (AUROC) and precision-recall (AUPR) curves, which provide an effective way to summarize the relationship between sensitivity and specificity when assessing models and are good practice for imbalanced datasets.
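For readers unfamiliar with these two summary measures, the sketch below computes them from labels and predicted scores in plain Python. AUROC is computed through its rank interpretation (the probability that a random positive is scored above a random negative), and AUPR is estimated by average precision, which is one common estimator and not necessarily the one used in this work. The function names and the toy scores are illustrative assumptions.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_score, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits


# Toy labels and scores, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

roc_value = auroc(y_true, y_score)
pr_value = average_precision(y_true, y_score)
```

The pairwise AUROC computation is quadratic in the number of samples, which is acceptable for a small illustration but not for large datasets.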

Results and discussion

The length distribution of epitopes and non-epitopes

The lengths of the collected B-cell epitopes and non-epitopes vary over a broad range across viruses. Thus, it is natural to examine the distribution of the sequences over different lengths. Figure 2 displays the length distribution of the unique linear B-cell epitopes and non-epitopes in the filtered dataset. We found that most of the peptides range from 6 to 20 in length, with sequences of length 20 accounting for almost half. For both epitopes and non-epitopes, peptides of length 20 formed the largest group compared with peptides of other lengths. Peptides over 30 in length accounted for only 1% of all the samples. We nevertheless kept these samples for training as they meet the inclusion criteria, and our proposed model is capable of handling peptides of different lengths without extending or trimming them.

Figure 2

The length distribution of unique linear B-cell epitopes and non-epitopes in the dataset.

Performance comparison on combination of features and classifiers

We then investigated the performance of different machine learning models with the ProtVec and CTD features and their combination. We trained and tested the models on the training set with 5-fold cross-validation and the performance was averaged over the hold-out folds (Table 2). In more detail, traditional machine learning methods (KNN, SVM, NB, RF, NN and XGBoost) showed better performance than deep learning models (VGG, AlexNet, SqueezeNet, RNN, LSTM and GRU). One possible explanation is that deep learning models, owing to their complex multilayer structures, require much more data to function properly than traditional machine learning models. In addition, the loss of information from the input sequences due to QR decomposition might hinder the DNNs from obtaining better results. Among all the models, XGBoost outperformed the other classifiers for predicting epitopes in terms of accuracy (0.753), F-score (0.740), AUROC (0.754) and AUPR (0.825) using the combination of ProtVec and CTD features, while NB showed the best precision (0.835) and GRU obtained the highest recall (0.864). In addition, the combination of ProtVec and CTD features achieved better results than the ProtVec or CTD features alone for all the models except SVM, where the ProtVec features alone gave better performance. These observations indicate that combining ProtVec and CTD features improves model performance for epitope prediction of human-adapted viruses.

Table 2

Performance comparison of feature combinations in epitope prediction using different machine learning models

Model		Feature	Accuracy	Precision	Recall	F-score	AUROC	AUPR
Traditional classifier	KNN	ProtVec	0.611	0.583	0.809	0.678	0.608	0.692
		CTD	0.588	0.591	0.602	0.597	0.588	0.687
		ProtVec+CTD	0.633	0.640	0.630	0.635	0.633	0.719
	SVM	ProtVec	0.686	0.708	0.648	0.677	0.687	0.778
		CTD	0.619	0.638	0.572	0.603	0.619	0.675
		ProtVec+CTD	0.620	0.631	0.577	0.602	0.621	0.645
	NB	ProtVec	0.578	0.823	0.212	0.338	0.582	0.620
		CTD	0.573	0.585	0.536	0.560	0.573	0.574
		ProtVec+CTD	0.579	0.835	0.211	0.337	0.584	0.622
	RF	ProtVec	0.683	0.745	0.570	0.646	0.685	0.770
		CTD	0.693	0.691	0.690	0.690	0.751	0.788
		ProtVec+CTD	0.707	0.765	0.608	0.678	0.708	0.777
	NN	ProtVec	0.670	0.684	0.670	0.675	0.668	0.748
		CTD	0.546	0.804	0.139	0.237	0.552	0.609
		ProtVec+CTD	0.671	0.674	0.678	0.676	0.671	0.738
	XGBoost	ProtVec	0.721	0.752	0.669	0.708	0.721	0.810
		CTD	0.690	0.698	0.685	0.692	0.690	0.780
		ProtVec+CTD	0.753	0.791	0.695	0.740	0.754	0.825
Deep learning model	VGG	ProtVec	0.583	0.617	0.465	0.530	0.521	0.539
		CTD	0.552	0.586	0.397	0.473	0.469	0.490
		ProtVec+CTD	0.597	0.628	0.503	0.558	0.529	0.525
	AlexNet	ProtVec	0.570	0.608	0.487	0.463	0.511	0.532
		CTD	0.577	0.592	0.530	0.559	0.485	0.577
		ProtVec+CTD	0.590	0.608	0.534	0.569	0.520	0.529
	SqueezeNet	ProtVec	0.575	0.570	0.650	0.607	0.547	0.571
		CTD	0.589	0.613	0.508	0.556	0.498	0.661
		ProtVec+CTD	0.667	0.679	0.648	0.663	0.559	0.646
	RNN	ProtVec	0.575	0.717	0.266	0.388	0.425	0.504
		CTD	0.585	0.630	0.437	0.516	0.462	0.517
		ProtVec+CTD	0.602	0.714	0.357	0.476	0.479	0.484
	LSTM	ProtVec	0.542	0.588	0.324	0.417	0.487	0.527
		CTD	0.493	0.500	0.021	0.041	0.503	0.513
		ProtVec+CTD	0.568	0.565	0.644	0.602	0.522	0.535
	GRU	ProtVec	0.550	0.659	0.234	0.346	0.479	0.488
		CTD	0.499	0.503	0.864	0.636	0.506	0.533
		ProtVec+CTD	0.571	0.645	0.339	0.445	0.468	0.551

Model		Feature	Accuracy	Precision	Recall	F-score	AUROC	AUPR
Traditional classifier	KNN	ProtVec	0.611	0.583	0.809	0.678	0.608	0.692
		CTD	0.588	0.591	0.602	0.597	0.588	0.687
		ProtVec+CTD	0.633	0.640	0.630	0.635	0.633	0.719
	SVM	ProtVec	0.686	0.708	0.648	0.677	0.687	0.778
		CTD	0.619	0.638	0.572	0.603	0.619	0.675
		ProtVec+CTD	0.620	0.631	0.577	0.602	0.621	0.645
	NB	ProtVec	0.578	0.823	0.212	0.338	0.582	0.620
		CTD	0.573	0.585	0.536	0.560	0.573	0.574
		ProtVec+CTD	0.579	0.835	0.211	0.337	0.584	0.622
	RF	ProtVec	0.683	0.745	0.570	0.646	0.685	0.770
		CTD	0.693	0.691	0.690	0.690	0.751	0.788
		ProtVec+CTD	0.707	0.765	0.608	0.678	0.708	0.777
	NN	ProtVec	0.670	0.684	0.670	0.675	0.668	0.748
		CTD	0.546	0.804	0.139	0.237	0.552	0.609
		ProtVec+CTD	0.671	0.674	0.678	0.676	0.671	0.738
	XGBoost	ProtVec	0.721	0.752	0.669	0.708	0.721	0.810
		CTD	0.690	0.698	0.685	0.692	0.690	0.780
		ProtVec+CTD	0.753	0.791	0.695	0.740	0.754	0.825
Deep learning model	VGG	ProtVec	0.583	0.617	0.465	0.530	0.521	0.539
		CTD	0.552	0.586	0.397	0.473	0.469	0.490
		ProtVec+CTD	0.597	0.628	0.503	0.558	0.529	0.525
	AlexNet	ProtVec	0.570	0.608	0.487	0.463	0.511	0.532
		CTD	0.577	0.592	0.530	0.559	0.485	0.577
		ProtVec+CTD	0.590	0.608	0.534	0.569	0.520	0.529
	SqueezeNet	ProtVec	0.575	0.570	0.650	0.607	0.547	0.571
		CTD	0.589	0.613	0.508	0.556	0.498	0.661
		ProtVec+CTD	0.667	0.679	0.648	0.663	0.559	0.646
	RNN	ProtVec	0.575	0.717	0.266	0.388	0.425	0.504
		CTD	0.585	0.630	0.437	0.516	0.462	0.517
		ProtVec+CTD	0.602	0.714	0.357	0.476	0.479	0.484
	LSTM	ProtVec	0.542	0.588	0.324	0.417	0.487	0.527
		CTD	0.493	0.500	0.021	0.041	0.503	0.513
		ProtVec+CTD	0.568	0.565	0.644	0.602	0.522	0.535
	GRU	ProtVec	0.550	0.659	0.234	0.346	0.479	0.488
		CTD	0.499	0.503	0.864	0.636	0.506	0.533
		ProtVec+CTD	0.571	0.645	0.339	0.445	0.468	0.551

Table 2

Open in new tab

Performance comparison of combination of features in epitope prediction using different machine learning models

Model		Feature	Accuracy	Precision	Recall	F-score	AUROC	AUPR
Traditional classifier	KNN	ProtVec	0.611	0.583	0.809	0.678	0.608	0.692
		CTD	0.588	0.591	0.602	0.597	0.588	0.687
		ProtVec+CTD	0.633	0.640	0.630	0.635	0.633	0.719
	SVM	ProtVec	0.686	0.708	0.648	0.677	0.687	0.778
		CTD	0.619	0.638	0.572	0.603	0.619	0.675
		ProtVec+CTD	0.620	0.631	0.577	0.602	0.621	0.645
	NB	ProtVec	0.578	0.823	0.212	0.338	0.582	0.620
		CTD	0.573	0.585	0.536	0.560	0.573	0.574
		ProtVec+CTD	0.579	0.835	0.211	0.337	0.584	0.622
	RF	ProtVec	0.683	0.745	0.570	0.646	0.685	0.770
		CTD	0.693	0.691	0.690	0.690	0.751	0.788
		ProtVec+CTD	0.707	0.765	0.608	0.678	0.708	0.777
	NN	ProtVec	0.670	0.684	0.670	0.675	0.668	0.748
		CTD	0.546	0.804	0.139	0.237	0.552	0.609
		ProtVec+CTD	0.671	0.674	0.678	0.676	0.671	0.738
	XGBoost	ProtVec	0.721	0.752	0.669	0.708	0.721	0.810
		CTD	0.690	0.698	0.685	0.692	0.690	0.780
		ProtVec+CTD	0.753	0.791	0.695	0.740	0.754	0.825
Deep learning model	VGG	ProtVec	0.583	0.617	0.465	0.530	0.521	0.539
		CTD	0.552	0.586	0.397	0.473	0.469	0.490
		ProtVec+CTD	0.597	0.628	0.503	0.558	0.529	0.525
	AlexNet	ProtVec	0.570	0.608	0.487	0.463	0.511	0.532
		CTD	0.577	0.592	0.530	0.559	0.485	0.577
		ProtVec+CTD	0.590	0.608	0.534	0.569	0.520	0.529
	SqueezeNet	ProtVec	0.575	0.570	0.650	0.607	0.547	0.571
		CTD	0.589	0.613	0.508	0.556	0.498	0.661
		ProtVec+CTD	0.667	0.679	0.648	0.663	0.559	0.646
	RNN	ProtVec	0.575	0.717	0.266	0.388	0.425	0.504
		CTD	0.585	0.630	0.437	0.516	0.462	0.517
		ProtVec+CTD	0.602	0.714	0.357	0.476	0.479	0.484
	LSTM	ProtVec	0.542	0.588	0.324	0.417	0.487	0.527
		CTD	0.493	0.500	0.021	0.041	0.503	0.513
		ProtVec+CTD	0.568	0.565	0.644	0.602	0.522	0.535
	GRU	ProtVec	0.550	0.659	0.234	0.346	0.479	0.488
		CTD	0.499	0.503	0.864	0.636	0.506	0.533
		ProtVec+CTD	0.571	0.645	0.339	0.445	0.468	0.551

Comparison with existing methods

We further compared our proposed framework with several existing state-of-the-art methods for epitope prediction on independent testing data. We used XGBoost for the comparison because it outperformed the other classifiers. Figure 3 shows the ROC and PR curves for epitope prediction. Our proposed models leveraging ProtVec and CTD features achieved the highest areas under both the ROC and PR curves. The combination of ProtVec and CTD features gave the best results among all the methods, with an AUROC of 0.827 and an AUPR of 0.831. The substantial performance gap between some of the compared models (e.g. BepiPred2 and LBtope) and the proposed models suggests limited generalization ability of the previous methods. This is probably because the features that the existing methods derive from the data do not separate epitope from non-epitope sequences effectively. Our proposed model significantly improved the predictive performance by incorporating the ProtVec representation and CTD-based physicochemical properties of amino acids. Moreover, our framework can predict epitopes of variable length, whereas many of the compared methods mainly focus on fixed-length sequences. Therefore, our framework is superior in predictive performance and generalizability for identifying human-adapted viral epitopes and has the potential to be applied to other types of viruses.
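As an illustration of the two summary metrics used in this comparison, the following minimal Python sketch computes AUROC and AUPR with scikit-learn. The labels and scores are hypothetical toy values, not data from this study, and `average_precision_score` is a step-wise estimate of the area under the PR curve that may differ slightly from the estimator used to produce Figure 3.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical toy data (NOT results from the paper):
# 1 = epitope, 0 = non-epitope, with a predicted probability per peptide.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.91, 0.20, 0.65, 0.48, 0.55, 0.10, 0.80, 0.30])

# Area under the ROC curve: probability that a random epitope
# receives a higher score than a random non-epitope.
auroc = roc_auc_score(y_true, y_score)

# Area under the precision-recall curve, estimated as average precision.
aupr = average_precision_score(y_true, y_score)

print(f"AUROC = {auroc:.3f}, AUPR = {aupr:.3f}")
```

Both functions take continuous scores rather than thresholded class labels, so the classifier's predicted probabilities should be passed in directly.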

Figure 3

Performance of the proposed framework and compared methods at predicting epitopes of human-adapted viruses on independent testing set evaluated by Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves.

The identification of viral epitope species

Following the viral epitope prediction, we retrained the framework to identify the viral category of the predicted epitopes. Figure 4 presents the performance of identifying viral species for predicted epitopes. Classifiers based only on ProtVec features achieved the best precision, with NNs reaching 0.857. RF and KNN achieved comparable performance with ProtVec features. Interestingly, this differs from viral epitope prediction, where the combination of ProtVec and CTD features gives the best results for most classifiers. The primary reason, according to the results, is that CTD features are not good indicators of viral species: using CTD features alone yielded an unsatisfactory precision of only 0.57 on average. Nevertheless, our proposed framework can not only predict human-adapted variable-length viral epitopes but also identify the viral species of the epitope.
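For a multi-class task such as species identification, precision has to be aggregated over classes. The Python sketch below shows per-class and macro-averaged precision with scikit-learn. The class names `virus_A`, `virus_B` and `virus_C` and all labels are hypothetical, and macro averaging is an assumption made here for illustration; the averaging scheme behind Figure 4 is not stated in this section.

```python
from sklearn.metrics import precision_score

# Hypothetical species labels for six predicted epitopes (toy values).
labels = ["virus_A", "virus_B", "virus_C"]
y_true = ["virus_A", "virus_B", "virus_C", "virus_B", "virus_A", "virus_C"]
y_pred = ["virus_A", "virus_B", "virus_B", "virus_B", "virus_A", "virus_C"]

# Precision for each species: correct predictions of that species
# divided by all predictions of that species.
per_class = precision_score(y_true, y_pred, labels=labels, average=None)

# Macro average: unweighted mean of the per-class precisions (assumed scheme).
macro = precision_score(y_true, y_pred, labels=labels, average="macro")

for name, value in zip(labels, per_class):
    print(f"{name}: precision = {value:.3f}")
print(f"macro precision = {macro:.3f}")
```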

Figure 4

Precision in identifying the viral species of input epitopes and non-epitopes with combinations of ProtVec and CTD features using multiple classifiers.

Model evaluation on different length groups for epitope prediction

To further estimate the efficacy of the proposed framework for identifying epitopes at specific peptide lengths, we conducted experiments to predict the epitopes of human-adapted viruses in different length groups. Because there are too few samples at each individual length, the peptides are divided into four length groups, i.e. ‘≤11’, ‘12–15’, ‘16–19’ and ‘≥20’, which consist of 223, 282, 147 and 342 testing samples, respectively. Figure 5 shows the predictive performance in terms of accuracy, precision, recall and F1-score. The model used XGBoost as the classifier on combined ProtVec and CTD features, with the same settings as in Table 2. Our model achieved comparable performance across the length groups. The accuracy ranged from 0.656 to 0.803, with the ‘≤11’, ‘16–19’ and ‘≥20’ groups performing better than the ‘12–15’ group. Precision and F1-score showed similar trends. However, the ‘≤11’ group clearly outperformed the other three groups in recall, with a value of 0.853, suggesting that our model is more capable of identifying shorter epitopes. Although the model displayed the best overall performance on the ‘≤11’ group, the predictive outcomes on the other three groups were also compelling. Therefore, the evaluation on testing data of different length groups further demonstrates the generalizability of our proposed framework to epitope prediction within individual length groups.
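The length-stratified evaluation described above can be sketched in Python as follows. The group boundaries follow the text; the function names `length_group` and `metrics_by_length`, the toy peptides and the toy predictions are hypothetical and only illustrate the bookkeeping, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

GROUPS = ("<=11", "12-15", "16-19", ">=20")

def length_group(n):
    """Map a peptide length to one of the four length groups."""
    if n <= 11:
        return "<=11"
    if n <= 15:
        return "12-15"
    if n <= 19:
        return "16-19"
    return ">=20"

def metrics_by_length(peptides, y_true, y_pred):
    """Compute accuracy, precision, recall and F1 within each length group."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    groups = np.array([length_group(len(p)) for p in peptides])
    results = {}
    for g in GROUPS:
        mask = groups == g
        if not mask.any():
            continue  # skip groups with no peptides
        t, p = y_true[mask], y_pred[mask]
        results[g] = {
            "n": int(mask.sum()),
            "accuracy": accuracy_score(t, p),
            # zero_division=0 avoids warnings when a group has no positives
            "precision": precision_score(t, p, zero_division=0),
            "recall": recall_score(t, p, zero_division=0),
            "f1": f1_score(t, p, zero_division=0),
        }
    return results

# Hypothetical toy peptides (only their lengths matter here) and labels.
peptides = ["A" * 9, "A" * 11, "A" * 13, "A" * 14, "A" * 17, "A" * 22]
y_true = [1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1]

res = metrics_by_length(peptides, y_true, y_pred)
for group, scores in res.items():
    print(group, scores)
```

With groups as small as some of those reported here (e.g. 147 testing samples), per-group metrics carry more sampling noise than the overall scores, which is worth keeping in mind when comparing groups.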

Figure 5

Comparison of different length groups in epitope prediction in terms of accuracy, precision, recall and F1-score on the testing set.

Conclusion

One challenge in predicting linear B-cell epitopes is that these epitopes vary in length, ranging from 4 to 50 amino acids in this work. This paper addresses this problem by describing a general computational framework for predicting variable-length linear B-cell epitopes, focusing on human-adapted viruses. We introduce ProtVec and QR decomposition, which allow us to convert peptides of different lengths into feature vectors of the same dimension. The addition of physicochemical properties of amino acids based on CTD descriptors further improves the prediction model. Experimental results indicate that our proposed framework not only outperforms the existing computational methods but can also identify the viral species of the epitope with high precision. The superior predictive power of the proposed framework enables more precise identification of epitopes, facilitating peptide-based vaccine design and the development of medical treatments efficiently and cost-effectively. In future work, we will incorporate more features, such as information on the tertiary structures of peptides, into the prediction model. In addition, it would be interesting to explore how this framework could be integrated into the real-world vaccine design process.

Key Points

A machine learning framework is proposed to predict variable-length epitopes of human-adapted viruses, which could facilitate the process of vaccine design.
Experimental results indicate that the XGBoost model performs best with combined ProtVec and CTD features.
Our proposed model markedly outperforms the existing methods on the testing dataset and can identify the viral species of the predicted epitope.

Author contributions statement

R.Y., X.Z. and C.K.K. designed the experiments; R.Y. and X.Z. conducted the experiments and analyzed the results. R.Y. wrote the manuscript and M.Z. and P.W. revised the manuscript. All the authors reviewed the manuscript.

Acknowledgments

This project is supported by AcRF Tier 2 grant MOE2014-T2-2-023, Ministry of Education, Singapore and A*STAR-NTU-SUTD AI Partnership grant, RGANS1905.

Availability and Implementation

The source codes, data and supplementary materials are publicly available at https://github.com/Rayin-saber/Epitope-prediction-of-human-adapted-viruses

Author Biographies

Rui Yin is a research fellow at the Department of Biomedical Informatics, Harvard Medical School. His research interests focus on data mining and machine learning to make sense of big heterogeneous data for real-world application in biomedical fields.

Xianghe Zhu is currently an MSc student in Statistical Science at the University of Oxford. His research interests include statistical and probabilistic network analysis, machine learning and bioinformatics.

Min Zeng is an Assistant Professor in the School of Computer Science and Engineering, Central South University, Changsha, Hunan, P. R. China. His research interests include machine learning and deep learning techniques for bioinformatics and computational biology.

Pengfei Wu is a pediatrician at the Department of Genetics and Endocrinology, National Children’s Medical Center for South Central Region, Guangzhou, China. His research interest is medical genomics and bioinformatics.

Min Li is a professor and vice dean at the School of Computer Science and Engineering, Central South University, Changsha, Hunan, P.R. China. Her research interests include computational biology, systems biology and bioinformatics.

Chee Keong Kwoh is an associate professor at the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include data mining, soft computing and graph-based inference; application areas include bioinformatics and biomedical engineering.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
