Rui Yin, Xianghe Zhu, Min Zeng, Pengfei Wu, Min Li, Chee Keong Kwoh, A framework for predicting variable-length epitopes of human-adapted viruses using machine learning methods, Briefings in Bioinformatics, Volume 23, Issue 5, September 2022, bbac281, https://doi.org/10.1093/bib/bbac281
Abstract
The coronavirus disease 2019 pandemic has alerted people to the threat posed by viruses. Vaccination is the most effective way to prevent a disease from spreading. The interaction between antibodies and antigens clears infectious organisms from the host. Identifying B-cell epitopes is critical in vaccine design, development of disease diagnostics and antibody production. However, traditional experimental methods to determine epitopes are time-consuming and expensive, and the predictive performance of existing in silico methods is not satisfactory. This paper develops a general framework to predict variable-length linear B-cell epitopes specific to human-adapted viruses with machine learning approaches based on the ProtVec representation of peptides and physicochemical properties of amino acids. QR decomposition is incorporated into the embedding process, which enables our models to handle variable-length sequences. Experimental results on large immune epitope datasets show that our proposed model outperforms the state-of-the-art methods in terms of AUROC (0.827) and AUPR (0.831) on the testing set. Moreover, sequence analysis also identifies the viral category of the predicted epitopes with high precision. Therefore, this framework is shown to reliably identify linear B-cell epitopes of human-adapted viruses given protein sequences and could provide assistance in potential future pandemics and epidemics.
Introduction
An antibody is a type of blood protein produced to counteract a specific antigen [1]. It is a fundamental component of the humoral immune response, used by the immune system to recognize and neutralize a unique molecule of a pathogen such as a bacterium or virus [2, 3]. The region of an antigen recognized by the antibody is referred to as a B-cell epitope [4], and antibodies can inhibit the function of antigens by binding the epitope regions [5]. B-cell epitopes can be categorized into two groups: continuous and discontinuous. Continuous epitopes are linear stretches of amino acids in the sequence, while in discontinuous epitopes the amino acids are brought into close contact by the 3D conformation of the protein. Accurate detection of B-cell epitopes is critical for biomedical applications such as vaccine design [6], disease diagnostics [7] and therapeutic antibodies [8].

Figure 1. The basic workflow of the proposed framework. (A) Viral peptide inclusion criteria. Viral sequences of human-adapted viruses were extracted from IEDB. Identical sequences and virus categories with fewer than 50 samples were excluded. Peptides shorter than 4 residues were also removed, and the remaining sequences were exported with labels. (B) Feature representation. The sequences were encoded using ProtVec and AAindex combined with QR decomposition and CTD transformation to handle variable-length sequences. (C) Feature transformation process. Mathematical formulations of the QR decomposition and CTD transformation used to embed the input viral peptides. (D) Classification and evaluation. The filtered and processed samples were divided into training (90%) and testing (10%) datasets, and 5-fold cross-validation was performed on the training set using multiple classifiers for epitope prediction and species identification. The final performance was quantified by metrics such as accuracy, precision and recall on the held-out testing set.
The identification of B-cell epitopes includes experimental determination and computational prediction. Experimental methods mainly use X-ray crystallography and traditional B-cell epitope mapping [9, 10]. However, most of these methods are costly, laborious and time-consuming, and they cannot identify all epitopes. Advances in bioinformatics have contributed significantly to the development of immunoinformatics, including the discovery of antibody structures [11], epitope analysis [12] and immune network modeling [13]. Various computational methods have been proposed to predict B-cell epitopes based on sequence and structural data as well as their co-evolutionary information [14, 15]. Early epitope prediction used only one or several physicochemical properties of amino acids (e.g. polarity, solvent accessibility and hydrophilicity) with conventional computational approaches. These methods usually calculate the average propensity value over a sliding window, and they have been shown to perform poorly in practice [16–19]. With improved computing power and a growing number of experimentally identified epitopes, machine learning methods have been applied to improve prediction performance by incorporating new amino acid-based features. In addition, the advent of proteomics and the construction of databases of Ag-Ab crystal structures have made deeper analysis of conformational epitopes easier. For instance, PEPOP is a structure-based approach that uses the 3D coordinates of a protein both to predict clusters of surface-accessible segments and residues that might correspond to discontinuous epitopes and to support immunogenic peptide design [20]. CBTOPE was proposed for the prediction of discontinuous epitopes from antigen primary structure by combining amino acid properties and traditional features of physicochemical profiles [21]. Zhang et al. followed the work of CBTOPE and introduced an ensemble learning technique to build a prediction model that explored more sequence-derived features potentially relevant to conformational epitopes [22].
Though the majority of B-cell epitopes are conformational, much attention has been paid to the prediction of continuous epitopes, as they are critical for the design of peptide-based vaccines [23–25]. ABCPred was the first method to predict B-cell epitopes in an antigenic sequence using standard feed-forward and recurrent neural networks (NNs) [26]. Chen et al. proposed a new amino acid pair (AAP) antigenicity scale and combined it with a support vector machine (SVM) to find B-cell epitopes [27]. This method was based on the finding that B-cell epitopes favor particular AAPs, and it was trained on a homology-reduced dataset of linear B-cell epitopes. EL-Manzalawy et al. explored two machine learning approaches for predicting flexible-length linear B-cell epitopes using the subsequence kernel [28]. Furthermore, Lian et al. presented a linear B-cell epitope prediction model using multiple linear regression based on the antigen’s primary sequence information [29]. Larsen et al. introduced BepiPred for predicting continuous epitopes by combining two residue properties with a Hidden Markov Model [30]. Jespersen et al. extended it by developing BepiPred-2.0, which was trained on epitopes annotated from antibody-antigen protein structures using random forest (RF) [31]. More recently, Collatz et al. described EpiDope, a Python tool that uses a deep neural network (DNN) to detect linear B-cell epitope regions on individual protein sequences [32]. Bahai et al. presented a method named EpitopeVec, which uses a combination of residue properties, modified antigenicity scales and protein language model-based representations as peptide features for linear B-cell epitope prediction [33].
Despite these advances, most existing methods concentrate on epitopes of fixed length, whereas linear B-cell epitopes vary over a broad range of lengths. Identifying epitope sequences without trimming them makes it possible to provide prediction tools that work with arbitrary epitope lengths. In 2020, we witnessed and experienced a global pandemic caused by severe acute respiratory syndrome coronavirus 2, which infected millions of people and caused a large number of deaths. The viruses are still mutating and circulating rapidly, with new variants emerging. Understanding humoral responses to these viruses is critical for improving therapeutics and vaccines and for preparing for future pandemics or epidemics caused by other novel viruses. This work developed a novel framework that combines the physicochemical properties of amino acids with a continuous distributed representation of biological sequences (ProtVec) for epitope prediction (Fig. 1). The model focuses on human-adapted viruses and can predict variable-length epitopes. We trained and tested our model using multiple traditional machine learning and deep learning methods, and the results indicate that XGBoost achieved the best performance. Further experiments demonstrate that our proposed model outperformed other existing state-of-the-art methods. The model can also reveal the viral species of predicted epitopes with high accuracy. We hope that the predictive power of the model will facilitate in vitro experimental validation of linear epitopes and accelerate the development of serological assays and antibody production. The contributions and innovations of this work are listed below:
We proposed a machine learning framework that can not only predict viral epitopes but also identify the host of the viruses.
This framework can be applied to sequences of variable length from human-adapted viruses.
The experimental results demonstrate superior performance over other existing methods, and the model can be used to facilitate vaccine design.
Materials and methods
Data collection
We downloaded all the linear viral B-cell peptides with a human host from the Immune Epitope Database (IEDB) [34] on 30 March 2021, which resulted in 210 311 raw protein sequences. To ensure the quality of training samples, we selected virus species with over 50 cases and eliminated peptides shorter than 4 residues. Next, we merged the identical sequences within each category and kept the information on their verified regions. For peptides with multiple assays, we classified each peptide as an epitope or a non-epitope according to the majority of its assay results. Peptides with equal numbers of negative and positive assay results were eliminated from the dataset. Finally, we ended up with 17 different human-adapted viruses containing 4975 epitopes and 4956 non-epitopes ranging from 4 to 50 residues in length. The detailed information is shown in Table 1.
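The filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(virus, peptide, assay_is_positive)`, the function name `build_dataset` and the order in which the filters are applied are hypothetical and are not taken from the original pipeline, and the threshold is read as "at least 50 peptides per virus".

```python
from collections import Counter, defaultdict

def build_dataset(records, min_len=4, min_per_virus=50):
    """records: iterable of (virus, peptide, assay_is_positive) tuples (assumed layout).

    Returns a list of (virus, peptide, label), label 1 = epitope, 0 = non-epitope.
    """
    # Merge identical peptides within each virus category and pool their assay outcomes.
    votes = defaultdict(Counter)
    for virus, peptide, positive in records:
        if len(peptide) >= min_len:            # drop peptides shorter than 4 residues
            votes[(virus, peptide)][bool(positive)] += 1

    labelled = []
    for (virus, peptide), count in votes.items():
        if count[True] == count[False]:        # equal positive and negative assays
            continue
        labelled.append((virus, peptide, int(count[True] > count[False])))

    # Keep only virus categories that still have enough peptides.
    per_virus = Counter(virus for virus, _, _ in labelled)
    return [row for row in labelled if per_virus[row[0]] >= min_per_virus]

records = [
    ("Zika virus", "ACDEFGHIK", True),
    ("Zika virus", "ACDEFGHIK", True),
    ("Zika virus", "ACDEFGHIK", False),   # majority positive: kept as an epitope
    ("Zika virus", "LMNPQRSTV", True),
    ("Zika virus", "LMNPQRSTV", False),   # tie: dropped
    ("Zika virus", "ACD", True),          # shorter than 4 residues: dropped
]
toy = build_dataset(records, min_per_virus=1)  # threshold lowered for the toy input
print(toy)
```

With the default threshold of 50, the toy input above would be discarded entirely, because the single remaining Zika virus peptide does not reach the per-virus minimum.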
Table 1. Detailed information on the epitopes and non-epitopes of the selected human-adapted viruses
Category of virus | Sequences | Epitopes | Non-epitopes | Min. length | Max. length | Avg. length |
---|---|---|---|---|---|---|
Alphapapillomavirus | 179 | 168 | 11 | 7 | 30 | 12.9 |
Orthonairovirus | 226 | 32 | 194 | 12 | 20 | 12.9 |
Dengue virus | 904 | 428 | 476 | 5 | 46 | 10.3 |
Ebolavirus | 650 | 74 | 576 | 10 | 25 | 15.0 |
Enterovirus | 354 | 122 | 232 | 7 | 38 | 17.8 |
Hepacivirus | 3354 | 1988 | 1366 | 5 | 48 | 14.3 |
Human alphaherpesvirus | 819 | 134 | 685 | 6 | 31 | 19.2 |
Human immunodeficiency virus | 454 | 447 | 7 | 4 | 47 | 14.1 |
Human orthopneumovirus | 217 | 72 | 145 | 6 | 41 | 11.2 |
Influenza A virus | 239 | 233 | 6 | 7 | 50 | 20.1 |
Measles morbillivirus | 181 | 134 | 47 | 7 | 37 | 13.8 |
Orthohantavirus | 385 | 101 | 284 | 8 | 45 | 19.7 |
Orthohepevirus A | 287 | 176 | 111 | 6 | 48 | 23.1 |
Primate erythroparvovirus 1 | 276 | 61 | 215 | 7 | 41 | 20.1 |
Primate T-lymphotropic virus 1 | 197 | 181 | 16 | 6 | 47 | 17.8 |
Severe acute respiratory syndrome coronavirus | 1001 | 533 | 468 | 4 | 50 | 17.6 |
Zika virus | 208 | 91 | 117 | 8 | 30 | 17.7 |
Feature representation
Previous methods of mapping a variable-length amino acid sequence into a fixed-length feature vector include amino acid composition, dipeptide composition (DPC), the AAP antigenicity scale and k-mer representation. Once the variable-length sequences are converted into fixed-length feature vectors, machine learning algorithms can be applied to various prediction tasks. Recently, continuous representations have been explored in bioinformatics applications. ProtVec [35] has been demonstrated to be an efficient embedding method for biological sequences in a variety of biomedical problems [36–38]. Inspired by the continuous vector representation of words in natural language processing, this method regards 3-mers of amino acids as words and uses 546 790 protein sequences from the Swiss-Prot database as the training set. A skip-gram NN was applied to this dataset to maximize the probability of the observed sequences, and 100-dimensional numerical vectors were calculated to represent 3-mer peptides. In this work, we leveraged ProtVec to encode the sequences. Given a sequence of length N, we split it into N-2 consecutive overlapping 3-mers, and each 3-mer was represented by a 100-dimensional vector taken from ProtVec. Therefore, each input sequence is represented by an (N-2)×100 matrix.
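The 3-mer encoding can be illustrated with a short Python sketch. The lookup table below is filled with random numbers purely as a stand-in for the pretrained ProtVec vectors, and the names `split_kmers`, `embed_peptide` and `toy_table`, as well as the zero vector used for unseen 3-mers, are assumptions made for illustration.

```python
import random

def split_kmers(sequence, k=3):
    """Return the overlapping k-mers of a peptide (N - k + 1 of them)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def embed_peptide(sequence, table, dim=100):
    """One `dim`-dimensional vector per 3-mer: N - 2 rows of length `dim`."""
    unknown = [0.0] * dim                      # assumption: zero vector for unseen 3-mers
    return [table.get(kmer, unknown) for kmer in split_kmers(sequence)]

# Stand-in for the pretrained ProtVec table: random vectors, NOT real embeddings.
rng = random.Random(0)
peptide = "GPGRAFVTI"                          # arbitrary example peptide, N = 9
toy_table = {kmer: [rng.gauss(0.0, 1.0) for _ in range(100)]
             for kmer in split_kmers(peptide)}

embedded = embed_peptide(peptide, toy_table)
print(len(embedded), len(embedded[0]))         # 7 rows (N - 2) of 100 values each
```

The number of rows still depends on the peptide length, which is why a further step is needed before a fixed-size classifier input is available.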
First, notice that for $i=1$ the statement holds by choosing $r_{11}=\left\|a_{1}\right\|$. For $i=2$, $e_{2}$ is orthogonal to $e_{1}$ since $e_{2}^{T} \cdot e_{1}=\frac{1}{\left\|a_{2}\right\|} a_{2}^{T} \cdot e_{1}=\frac{1}{\left\|a_{2}\right\|}\left[x_{2}-\left(x_{2}^{T} \cdot e_{1}\right) e_{1}\right]^{T} \cdot e_{1}=\frac{1}{\left\|a_{2}\right\|}\left(x_{2}^{T} \cdot e_{1}\right)\left(1-e_{1}^{T} e_{1}\right)=0$. Moreover, we have $\left(x_{2}^{T} \cdot e_{1}\right) e_{1}+\left\|a_{2}\right\| e_{2}=\left(x_{2}^{T} \cdot e_{1}\right) e_{1}+a_{2}=x_{2}$, so (4) holds with $r_{12}=x_{2}^{T} \cdot e_{1}$ and $r_{22}=\left\|a_{2}\right\|$. For $i \geq 3$, (4) can be obtained by a similar process, setting $r_{ij}=x_{j}^{T} \cdot e_{i}$ for $i \leq j-1$ and $r_{jj}=\left\|a_{j}\right\|$. Therefore, a QR decomposition always exists, even if the matrix does not have full rank.
In our work, we applied the QR decomposition to all the embedded sequences, each of which is represented as a $100\times (N-2)$ matrix. Each sequence can then be written as the product of $Q \in \mathbb{R}^{100\times 100}$ and $R \in \mathbb{R}^{100\times (N-2)}$. Considering that $R$ is an upper triangular matrix with many zero entries, and that $R^{\prime}$ occupies only a small proportion of $R$, we used only the matrix $Q$ to represent each input sequence when building the model. As a result, all the peptides are transformed into fixed-size matrices after embedding.
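The following sketch illustrates why the $Q$ factor has the same size for every peptide length. It assumes NumPy and uses random matrices in place of real ProtVec embeddings; the helper name `qr_fixed_size` is hypothetical, and nothing here is claimed to match the authors' implementation.

```python
import numpy as np

def qr_fixed_size(embedded):
    """Factor a 100 x (N-2) embedding into Q (100 x 100) and R (100 x (N-2))."""
    x = np.asarray(embedded, dtype=float)
    # mode="complete" returns a square Q and an R with the same shape as x.
    q, r = np.linalg.qr(x, mode="complete")
    return q, r

rng = np.random.default_rng(0)
for n in (4, 20, 50):                              # peptide lengths within the dataset's range
    x = rng.normal(size=(100, n - 2))              # random stand-in for ProtVec columns
    q, r = qr_fixed_size(x)
    assert q.shape == (100, 100)                   # same size for every peptide length
    assert r.shape == (100, n - 2)
    assert np.allclose(q @ r, x)                   # the factorization reproduces x
    assert np.allclose(q.T @ q, np.eye(100))       # Q is orthogonal
    assert np.allclose(np.tril(r, -1), 0.0)        # R is upper triangular
```

One point worth keeping in mind with this kind of sketch: only the first $N-2$ columns of $Q$ are tied to the input, while the remaining columns are an orthonormal completion chosen by the numerical routine, so different QR implementations may return different values there.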
Classification
The feature transformation process enabled us to convert variable-length peptides into numerical matrices with identical dimensions that can be handled by machine learning models for classification tasks. Here, we leveraged several methods, including traditional machine learning algorithms and DNN architectures, to predict viral epitopes among the encoded peptides. The traditional machine learning algorithms included SVM, K-nearest neighbor (KNN), naïve Bayes (NB), RF, NN and XGBoost. The SVM classifier learns a hyperplane that separates the positive and negative samples, using different kernel functions to maximize the geometric margin. KNN is a nonparametric classification method in which the function is approximated locally and all computation is deferred until function evaluation. The NB classifier is based on Bayes’ theorem and can be used for both binary and multiclass classification problems. RFs use multiple decision trees built on various subsamples of the dataset to improve predictive performance. The classic NN classifier implements a multilayer perceptron that is trained using backpropagation. XGBoost is an optimized distributed gradient boosting method that provides parallel tree boosting to solve classification problems efficiently and accurately. For comparison, we also employed DNN models that were previously shown to be effective at encoding various physicochemical and structural information for predictive tasks on protein sequences [43–45]. The DNN architectures consist of several variants of convolutional and recurrent NNs, including AlexNet [46], VGG [47], SqueezeNet [48], long short-term memory (LSTM) [49] and gated recurrent unit (GRU) [50].
Experimental setup
To further demonstrate the superiority of our proposed framework, we compared our model against several existing tools for linear B-cell epitope prediction on the independent testing set. These tools and methods, all in their latest versions, were BepiPred-2.0 [31], LBtope [53], iBCE-EL [54], EpitopeVec [33], Parker hydrophilicity prediction [55], Chou and Fasman beta-turn prediction [56], Emini surface accessibility prediction [57], Kolaskar and Tongaonkar antigenicity prediction [58] and Karplus and Schulz flexibility prediction [59]. In addition, we added two more approaches with different feature representations, namely the AAP antigenicity scale and DPC, for epitope prediction. AAP uses a scale based on the observation that certain AAPs are favored in epitope regions [27]. Here, we only compare this method at a sequence length of 15, denoted as AAP15. DPC is a vector specifying the abundance of dipeptides, which reflects the correlation of adjacent amino acids in a protein sequence [60]. We followed the original settings to implement these methods, and the detailed parameters can be found in Supplementary Materials S3. To compare the models, we adopted the areas under the receiver operating characteristic curve (AUROC) and the precision-recall curve (AUPR). The ROC curve summarizes the trade-off between sensitivity and specificity, and the precision-recall curve is good practice for imbalanced datasets.
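As a rough illustration of the two summary metrics, the sketch below computes AUROC as the fraction of correctly ordered positive/negative pairs and estimates AUPR by average precision. Both functions are simplified stand-ins written for this example (the names `auroc` and `average_precision` and the toy scores are assumptions), they expect at least one positive and one negative label, and they are not the evaluation code used in the study.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (one common AUPR estimate)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank          # precision at each recalled positive
    return total / hits

labels = [1, 0, 1, 1, 0, 0]               # toy ground truth: 1 = epitope
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # toy classifier scores
print(auroc(labels, scores))              # 7 of the 9 positive/negative pairs are ordered correctly
print(average_precision(labels, scores))
```

The pairwise AUROC loop is quadratic in the number of samples, which is acceptable for a test set of about a thousand peptides but would be replaced by a rank-based formula for larger data.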
Results and discussion
The length distribution of epitopes and non-epitopes
The length of the collected B-cell epitopes and non-epitopes varies over a broad range across viruses, so it is natural to examine how the sequences are distributed over different lengths. Figure 2 displays the length distribution of unique linear B-cell epitopes and non-epitopes in the filtered dataset. We found that most of the peptides range from 6 to 20 residues in length, with sequences of length 20 accounting for almost half. For both epitopes and non-epitopes, peptides of length 20 were the largest group. Peptides longer than 30 residues accounted for only 1% of all the samples. We kept these samples for training, as they meet the inclusion criteria, and our proposed model is capable of handling peptides of different lengths without extending or trimming them.

Figure 2. The length distribution of unique linear B-cell epitopes and non-epitopes in the dataset.
Performance comparison on combination of features and classifiers
We then investigated the performance of different machine learning models with the ProtVec features, the CTD features and their combination. We trained and tested the models on the training set with 5-fold cross-validation, and the performance was averaged over the hold-out folds (Table 2). Overall, traditional machine learning methods (KNN, SVM, NB, RF, NN and XGBoost) showed better performance than deep learning models (VGG, AlexNet, SqueezeNet, RNN, LSTM and GRU). One possible explanation is that, because of their complex multilayer structures, deep learning models require much more data to function properly than traditional machine learning models. In addition, the loss of information about the input sequences caused by the QR decomposition might prevent the DNNs from obtaining better results. Among all the models, XGBoost outperformed the other classifiers for predicting epitopes in terms of accuracy (0.753), F-score (0.740), AUROC (0.754) and AUPR (0.825) using the combination of ProtVec and CTD features, while NB showed the best precision (0.835) and GRU obtained the highest recall (0.864). In addition, the combination of ProtVec and CTD features achieved better results than the ProtVec or CTD features alone for all the models except SVM, for which the ProtVec features alone gave better performance. These observations indicate that combining ProtVec and CTD features improves model performance for epitope prediction of human-adapted viruses.
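The evaluation protocol (a 10% held-out test set, then 5-fold cross-validation on the remaining 90% with the metric averaged over folds) can be sketched as follows. This is a plain, unstratified split written for illustration; the function names, the random seed and the callback interface `fit_and_score` are assumptions rather than details from the study.

```python
import random

def holdout_and_folds(n_samples, test_fraction=0.1, n_folds=5, seed=0):
    """Return (test_indices, [(train_idx, valid_idx), ...]) for a hold-out split plus k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test, train = indices[:n_test], indices[n_test:]
    folds = []
    for k in range(n_folds):
        valid = train[k::n_folds]                  # every n_folds-th sample of the shuffled set
        valid_set = set(valid)
        folds.append(([i for i in train if i not in valid_set], valid))
    return test, folds

def cross_validate(fit_and_score, folds):
    """Average a user-supplied metric over the validation folds."""
    scores = [fit_and_score(train_idx, valid_idx) for train_idx, valid_idx in folds]
    return sum(scores) / len(scores)

test_idx, folds = holdout_and_folds(9931)          # 4975 epitopes + 4956 non-epitopes
print(len(test_idx), [len(valid) for _, valid in folds])
```

Here `fit_and_score` would train a classifier on `train_idx` and return, for example, the AUROC on `valid_idx`; a stratified split that preserves the epitope/non-epitope ratio in every fold would be a natural refinement.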
Table 2. Performance comparison of feature combinations in epitope prediction using different machine learning models
Category | Model | Feature | Accuracy | Precision | Recall | F-score | AUROC | AUPR |
---|---|---|---|---|---|---|---|---|
Traditional classifier | KNN | ProtVec | 0.611 | 0.583 | 0.809 | 0.678 | 0.608 | 0.692 |
Traditional classifier | KNN | CTD | 0.588 | 0.591 | 0.602 | 0.597 | 0.588 | 0.687 |
Traditional classifier | KNN | ProtVec+CTD | 0.633 | 0.640 | 0.630 | 0.635 | 0.633 | 0.719 |
Traditional classifier | SVM | ProtVec | 0.686 | 0.708 | 0.648 | 0.677 | 0.687 | 0.778 |
Traditional classifier | SVM | CTD | 0.619 | 0.638 | 0.572 | 0.603 | 0.619 | 0.675 |
Traditional classifier | SVM | ProtVec+CTD | 0.620 | 0.631 | 0.577 | 0.602 | 0.621 | 0.645 |
Traditional classifier | NB | ProtVec | 0.578 | 0.823 | 0.212 | 0.338 | 0.582 | 0.620 |
Traditional classifier | NB | CTD | 0.573 | 0.585 | 0.536 | 0.560 | 0.573 | 0.574 |
Traditional classifier | NB | ProtVec+CTD | 0.579 | 0.835 | 0.211 | 0.337 | 0.584 | 0.622 |
Traditional classifier | RF | ProtVec | 0.683 | 0.745 | 0.570 | 0.646 | 0.685 | 0.770 |
Traditional classifier | RF | CTD | 0.693 | 0.691 | 0.690 | 0.690 | 0.751 | 0.788 |
Traditional classifier | RF | ProtVec+CTD | 0.707 | 0.765 | 0.608 | 0.678 | 0.708 | 0.777 |
Traditional classifier | NN | ProtVec | 0.670 | 0.684 | 0.670 | 0.675 | 0.668 | 0.748 |
Traditional classifier | NN | CTD | 0.546 | 0.804 | 0.139 | 0.237 | 0.552 | 0.609 |
Traditional classifier | NN | ProtVec+CTD | 0.671 | 0.674 | 0.678 | 0.676 | 0.671 | 0.738 |
Traditional classifier | XGBoost | ProtVec | 0.721 | 0.752 | 0.669 | 0.708 | 0.721 | 0.810 |
Traditional classifier | XGBoost | CTD | 0.690 | 0.698 | 0.685 | 0.692 | 0.690 | 0.780 |
Traditional classifier | XGBoost | ProtVec+CTD | 0.753 | 0.791 | 0.695 | 0.740 | 0.754 | 0.825 |
Deep learning model | VGG | ProtVec | 0.583 | 0.617 | 0.465 | 0.530 | 0.521 | 0.539 |
Deep learning model | VGG | CTD | 0.552 | 0.586 | 0.397 | 0.473 | 0.469 | 0.490 |
Deep learning model | VGG | ProtVec+CTD | 0.597 | 0.628 | 0.503 | 0.558 | 0.529 | 0.525 |
Deep learning model | AlexNet | ProtVec | 0.570 | 0.608 | 0.487 | 0.463 | 0.511 | 0.532 |
Deep learning model | AlexNet | CTD | 0.577 | 0.592 | 0.530 | 0.559 | 0.485 | 0.577 |
Deep learning model | AlexNet | ProtVec+CTD | 0.590 | 0.608 | 0.534 | 0.569 | 0.520 | 0.529 |
Deep learning model | SqueezeNet | ProtVec | 0.575 | 0.570 | 0.650 | 0.607 | 0.547 | 0.571 |
Deep learning model | SqueezeNet | CTD | 0.589 | 0.613 | 0.508 | 0.556 | 0.498 | 0.661 |
Deep learning model | SqueezeNet | ProtVec+CTD | 0.667 | 0.679 | 0.648 | 0.663 | 0.559 | 0.646 |
Deep learning model | RNN | ProtVec | 0.575 | 0.717 | 0.266 | 0.388 | 0.425 | 0.504 |
Deep learning model | RNN | CTD | 0.585 | 0.630 | 0.437 | 0.516 | 0.462 | 0.517 |
Deep learning model | RNN | ProtVec+CTD | 0.602 | 0.714 | 0.357 | 0.476 | 0.479 | 0.484 |
Deep learning model | LSTM | ProtVec | 0.542 | 0.588 | 0.324 | 0.417 | 0.487 | 0.527 |
Deep learning model | LSTM | CTD | 0.493 | 0.500 | 0.021 | 0.041 | 0.503 | 0.513 |
Deep learning model | LSTM | ProtVec+CTD | 0.568 | 0.565 | 0.644 | 0.602 | 0.522 | 0.535 |
Deep learning model | GRU | ProtVec | 0.550 | 0.659 | 0.234 | 0.346 | 0.479 | 0.488 |
Deep learning model | GRU | CTD | 0.499 | 0.503 | 0.864 | 0.636 | 0.506 | 0.533 |
Deep learning model | GRU | ProtVec+CTD | 0.571 | 0.645 | 0.339 | 0.445 | 0.468 | 0.551 |
Comparison with existing methods
We further compared our proposed framework with several existing state-of-the-art methods for epitope prediction on independent testing data. We used XGBoost for the comparison because it outperformed the other classifiers. Figure 3 shows the ROC and PR curves for each method. Our models leveraging ProtVec and CTD features led in both AUROC and AUPR. The combination of ProtVec and CTD features gave the best results among all the methods, with an AUROC of 0.827 and an AUPR of 0.831. The substantial performance gap between some of the compared models (e.g. BepiPred2 and LBtope) and the proposed models suggests that the previous methods generalize poorly. This is probably because the features that the existing methods derive from the data do not distinguish epitope from non-epitope sequences effectively. Our proposed model improved the predictive performance by incorporating the ProtVec representation and CTD-based physicochemical properties of amino acids. Moreover, our framework can predict epitopes of variable length, whereas many of the compared methods mainly focus on fixed-length sequences. Therefore, our framework is superior in predictive performance and generalizability for identifying human-adapted viral epitopes and has the potential to be applied to other types of viruses.

Figure 3. Performance of the proposed framework and the compared methods at predicting epitopes of human-adapted viruses on the independent testing set, evaluated by receiver operating characteristic (ROC) and precision-recall (PR) curves.
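To make the two summary metrics used in this comparison concrete, the following minimal Python sketch computes AUROC and AUPR from binary labels and prediction scores. It is an illustration only: the function names and toy data are hypothetical and not taken from the released code, and AUPR is estimated here as average precision, which is one common estimator and may differ from the one used to draw the published curves.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0  # positive scored above negative
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUPR estimated as average precision (mean precision at each positive).

    Assumes untied scores; tied scores are ranked in input order.
    """
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy data: 1 = epitope, 0 = non-epitope.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auroc(toy_labels, toy_scores), 3))
print(round(average_precision(toy_labels, toy_scores), 3))
```

The pairwise AUROC loop is quadratic in the number of samples, which is acceptable for a test set of this size; a rank-based implementation would be preferable for much larger data.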
The identification of viral epitope species
Following the viral epitope prediction, we retrained the framework to identify the viral category of the predicted epitopes. Figure 4 presents the precision of viral species identification for each classifier. Classifiers based only on ProtVec features achieved the best precision, with NNs reaching 0.857. RF and KNN achieved comparable performance with ProtVec features. Interestingly, this differs from viral epitope prediction, where the combination of ProtVec and CTD features gives the best results for most classifiers. The primary reason, according to the results, is that CTD features are poor indicators of viral species: using CTD features alone yielded an unsatisfactory precision of only 0.57 on average. Nevertheless, our proposed framework can both predict variable-length epitopes of human-adapted viruses and identify the viral species of each epitope.

Figure 4. The precision of identifying the viral species of input epitopes and non-epitopes with combinations of ProtVec and CTD features using multiple classifiers.
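Precision for a multi-class task such as viral species identification can be computed per class and then averaged. The sketch below assumes macro-averaging, since the text does not state how precision was aggregated across species; the class labels and helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_precision(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Precision for each class: correct predictions / all predictions of that class."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    classes = sorted(set(y_true) | set(y_pred))
    # A class that is never predicted gets precision 0.0 by convention here.
    return {c: (correct[c] / predicted[c]) if predicted[c] else 0.0 for c in classes}


def macro_precision(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class precisions (assumed averaging scheme)."""
    scores = per_class_precision(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical species labels for six predicted epitopes.
true_species = ["virus_A", "virus_A", "virus_B", "virus_B", "virus_C", "virus_C"]
pred_species = ["virus_A", "virus_B", "virus_B", "virus_B", "virus_C", "virus_A"]
print(per_class_precision(true_species, pred_species))
print(round(macro_precision(true_species, pred_species), 3))
```

Macro-averaging weights every species equally regardless of sample count; a sample-weighted average would be an equally plausible choice if the species are strongly imbalanced.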
Model evaluation on different length groups for epitope prediction
To further estimate the efficacy of the proposed framework at specific peptide lengths, we evaluated epitope prediction on different length groups of human-adapted viruses. Because there were too few samples at each individual length, the peptides were divided into four length groups, i.e. ‘≤11’, ‘12–15’, ‘16–19’ and ‘≥20’, consisting of 223, 282, 147 and 342 testing samples, respectively. Figure 5 shows the predictive performance in terms of accuracy, precision, recall and F1-score. The model used XGBoost as the classifier with combined ProtVec and CTD features and the same settings as in Table 2. The model achieved comparable performance across the length groups. The accuracy ranged from 0.656 to 0.803, with the ‘≤11’, ‘16–19’ and ‘≥20’ groups performing better than the ‘12–15’ group. Precision and F1-score showed a similar pattern. However, the ‘≤11’ group clearly outperformed the other three groups in recall, with a value of 0.853, suggesting that our model is more capable of identifying shorter epitopes. Although the model performed best overall on the ‘≤11’ group, the results on the other three groups were also compelling. Therefore, the evaluation on testing data of different length groups further demonstrated the generalizability of our proposed framework for epitope prediction on sequences of specific lengths.

Figure 5. Comparison of different length groups in epitope prediction in terms of accuracy, precision, recall and F1-score, validated on the testing set.
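The grouping and per-group evaluation described above can be sketched as follows. This is a minimal illustration rather than the authors' evaluation script: the helper names, group labels and toy peptides are hypothetical, and metrics with a zero denominator are set to 0.0 by assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def length_group(peptide: str) -> str:
    """Assign a peptide to one of the four length groups used in the evaluation."""
    n = len(peptide)
    if n <= 11:
        return "<=11"
    if n <= 15:
        return "12-15"
    if n <= 19:
        return "16-19"
    return ">=20"


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = epitope)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    # Zero denominators fall back to 0.0 (an assumption of this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def metrics_by_length(
    peptides: Sequence[str], y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, Dict[str, float]]:
    """Split predictions by length group and score each group separately."""
    grouped: Dict[str, Tuple[List[int], List[int]]] = defaultdict(lambda: ([], []))
    for pep, t, p in zip(peptides, y_true, y_pred):
        truths, preds = grouped[length_group(pep)]
        truths.append(t)
        preds.append(p)
    return {g: binary_metrics(t, p) for g, (t, p) in grouped.items()}


# Hypothetical peptides of length 9, 11, 14 and 20 with toy labels.
toy_peptides = ["ACDEFGHIK", "ACDEFGHIKLM", "ACDEFGHIKLMNPQ", "ACDEFGHIKLMNPQRSTVWY"]
toy_true = [1, 0, 1, 1]
toy_pred = [1, 1, 0, 1]
for group, scores in metrics_by_length(toy_peptides, toy_true, toy_pred).items():
    print(group, scores)
```

Groups with no test peptides are simply absent from the result, so a caller that expects all four groups should check for missing keys.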
Conclusion
One challenge in predicting linear B-cell epitopes is that they vary in length, ranging from 4 to 50 amino acids in this work. This paper addresses this problem by describing a general computational framework for predicting variable-length linear B-cell epitopes, focusing on human-adapted viruses. We introduce ProtVec and QR decomposition, which allow us to convert peptides of different lengths into feature vectors of the same dimension. The addition of physicochemical properties of amino acids based on CTD descriptors further improves the prediction model. Experimental results indicate that our proposed framework not only outperforms the existing computational methods but can also identify the viral species of the epitope with high precision. The superior predictive power of the proposed framework enables more precise identification of epitopes, facilitating peptide-based vaccine design and the development of medical treatments efficiently and cost-effectively. In future work, we will incorporate more features, such as the tertiary structure of peptides, into the prediction model. In addition, it would be interesting to explore how this framework could be integrated into, and support, real-world vaccine design.
Key Points
- A machine learning framework is proposed to predict variable-length epitopes of human-adapted viruses, which could facilitate vaccine design.
- Experimental results indicate that the XGBoost model performs best with combined ProtVec and CTD features.
- Our proposed model outperforms the existing methods on the testing dataset and can identify the viral species of the predicted epitope.
Author contributions statement
R.Y., X.Z. and C.K.K. designed the experiments; R.Y. and X.Z. conducted the experiments and analyzed the results. R.Y. wrote the manuscript and M.Z. and P.W. revised the manuscript. All the authors reviewed the manuscript.
Acknowledgments
This project is supported by AcRF Tier 2 grant MOE2014-T2-2-023, Ministry of Education, Singapore and A*STAR-NTU-SUTD AI Partnership grant, RGANS1905.
Availability and Implementation
The source code, data and supplementary materials are publicly available at https://github.com/Rayin-saber/Epitope-prediction-of-human-adapted-viruses
Author Biographies
Rui Yin is a research fellow at the Department of Biomedical Informatics, Harvard Medical School. His research interests focus on data mining and machine learning to make sense of big heterogeneous data for real-world application in biomedical fields.
Xianghe Zhu is currently an MSc student in Statistical Science at the University of Oxford. His research interests include statistical and probabilistic network analysis, machine learning and bioinformatics.
Min Zeng is an Assistant Professor in the School of Computer Science and Engineering, Central South University, Changsha, Hunan, P. R. China. His research interests include machine learning and deep learning techniques for bioinformatics and computational biology.
Pengfei Wu is a pediatrician at the Department of Genetics and Endocrinology, National Children’s Medical Center for South Central Region, Guangzhou, China. His research interest is medical genomics and bioinformatics.
Min Li is a professor and vice dean at the School of Computer Science and Engineering, Central South University, Changsha, Hunan, P.R. China. Her research interests include computational biology, systems biology and bioinformatics.
Chee Keong Kwoh is an associate professor at the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include data mining, soft computing and graph-based inference; applications areas include bioinformatics and biomedical engineering.