Abstract

The coronavirus disease 2019 pandemic has alerted people to the threat posed by viruses. Vaccination is the most effective way to prevent the disease from spreading. The interaction between antibodies and antigens clears infectious organisms from the host. Identifying B-cell epitopes is critical for vaccine design, the development of disease diagnostics and antibody production. However, traditional experimental methods to determine epitopes are time-consuming and expensive, and the predictive performance of existing in silico methods is not satisfactory. This paper develops a general framework to predict variable-length linear B-cell epitopes specific to human-adapted viruses with machine learning approaches based on the ProtVec representation of peptides and physicochemical properties of amino acids. QR decomposition is incorporated during the embedding process, which enables our models to handle variable-length sequences. Experimental results on large immune epitope datasets validate that our proposed model’s performance is superior to the state-of-the-art methods in terms of AUROC (0.827) and AUPR (0.831) on the testing set. Moreover, sequence analysis also identifies the viral category of the corresponding predicted epitopes with high precision. Therefore, this framework is shown to reliably identify linear B-cell epitopes of human-adapted viruses given protein sequences and could provide assistance for potential future pandemics and epidemics.

Introduction

An antibody is a type of blood protein produced to counteract a specific antigen [1]. It is a fundamental component of the humoral immune response, used by the immune system to recognize and neutralize unique molecules of pathogens such as bacteria and viruses [2, 3]. The region of an antigen recognized by an antibody is referred to as a B-cell epitope [4], and antibodies can inhibit the function of antigens by binding the epitope regions [5]. B-cell epitopes can be categorized into two groups, i.e. continuous and discontinuous. Continuous epitopes are linear stretches of the amino acid sequence, whereas in discontinuous epitopes the amino acids are brought into close contact by the 3D protein conformation. Accurate detection of B-cell epitopes is critical for biomedical applications such as vaccine design [6], disease diagnostics [7] and therapeutic antibodies [8].

Figure 1

The basic workflow of the proposed framework. (A) Viral peptide inclusion criteria. Viral sequences of human-adapted viruses were extracted from IEDB. Duplicate sequences were merged, and virus categories with fewer than 50 samples were excluded. Peptides shorter than 4 residues were also removed, and the remaining sequences were exported with labels. (B) Feature representation. The sequences were encoded using ProtVec and AAindex combined with QR decomposition and CTD transformation to handle variable-length sequences. (C) Feature transformation process. Mathematical formulations of QR decomposition and CTD transformation implement the embedding of input viral peptides. (D) Classification and evaluation. The filtered and processed samples were divided into training (90%) and testing (10%) datasets; 5-fold cross-validation was performed on the training set using multiple classifiers for epitope prediction and species identification. The final performance was quantified by metrics such as accuracy, precision and recall on the held-out testing set.

The identification of B-cell epitopes includes experimental determination and computational prediction. Experimental methods mainly use X-ray crystallography and traditional B-cell epitope mapping [9, 10]. However, most of these methods are costly, laborious and time-consuming, and they are not able to identify all epitopes. Advancement in bioinformatics has significantly contributed to the development of immunoinformatics, involving the discovery of antibody structures [11], epitope analysis [12] and immune network modeling [13]. Various computational methods have been proposed to predict B-cell epitopes based on sequence and structural data as well as their co-evolutionary information [14, 15]. Early epitope prediction methods used only one or a few physicochemical properties of amino acids (e.g. polarity, solvent accessibility and hydrophilicity) with conventional computational approaches. These methods usually calculate an average propensity value within a sliding window and have been shown to perform poorly in practice [16–19]. With improved computing power and an increasing number of experimentally identified epitopes, machine learning techniques have been applied to improve prediction performance by incorporating new amino acid-based features. For example, the advent of proteomics and the construction of databases of Ag-Ab crystal structures make it easier to carry out a deeper analysis of conformational epitopes. For instance, PEPOP is a structure-based approach that uses the 3D coordinates of a protein to predict clusters of surface-accessible segments and residues that might correspond to discontinuous epitopes and that can be used for immunogenic peptide design [20]. CBTOPE was proposed for the prediction of discontinuous epitopes from antigen primary structure by combining amino acid properties with traditional physicochemical profile features [21]. Zhang et al. followed the work of CBTOPE and introduced an ensemble learning technique to build the prediction model, exploring more potential sequence-derived features relevant to conformational epitopes [22].

Though the majority of B-cell epitopes are conformational, much attention has been devoted to the prediction of continuous epitopes, as they are critical for the design of peptide-based vaccines [23–25]. ABCPred was the first method to predict B-cell epitopes in an antigenic sequence using standard feed-forward and recurrent neural networks (NNs) [26]. Chen et al. proposed a new amino acid pair (AAP) antigenicity scale, combined with a support vector machine (SVM), to find B-cell epitopes [27]. This method was based on the finding that B-cell epitopes favor particular AAPs and was trained on a homology-reduced dataset of linear B-cell epitopes. El-Manzalawy et al. explored two machine learning approaches for predicting flexible-length linear B-cell epitopes using the subsequence kernel [28]. Furthermore, Lian et al. presented a novel linear B-cell epitope prediction model using multiple linear regression based on the antigen’s primary sequence information [29]. Larsen et al. introduced BepiPred for predicting continuous epitopes by combining two residue properties with a hidden Markov model [30]. Jespersen et al. extended it by developing BepiPred-2.0, which was trained on epitopes annotated from antibody-antigen protein structures using random forest (RF) [31]. More recently, Collatz et al. described EpiDope, a Python tool that leverages a deep neural network (DNN) to detect linear B-cell epitope regions on individual protein sequences [32]. Bahai et al. illustrated a new method named EpitopeVec, which uses a combination of residue properties, modified antigenicity scales and protein language model-based representations as peptide features for linear B-cell epitope prediction [33].

Despite these advancements, most of the existing methods concentrate on epitopes of fixed length, but linear B-cell epitopes can vary over a broad range in length. Identifying epitope sequences without trimming them allows us to provide prediction tools for arbitrary epitope lengths. In 2020, we witnessed and experienced a global pandemic due to severe acute respiratory syndrome coronavirus 2, which caused infections in millions of people and a large number of deaths. The virus is still mutating and circulating rapidly, with new variants emerging. Understanding humoral responses to these viruses is critical for improving therapeutics and vaccines and for preparing for future pandemics or epidemics caused by other potential novel viruses. This work developed a novel framework that combines the physicochemical properties of amino acids and a continuous distributed representation of biological sequences (ProtVec) for epitope prediction (Fig. 1). The model focuses on human-adapted viruses and can predict variable-length epitopes. We trained and tested our model using multiple traditional machine learning and deep learning methods, and the results indicate that XGBoost achieved the best performance. Further experiments demonstrate that our proposed model outperformed the other existing state-of-the-art methods. The model can also reveal the viral species of predicted epitopes with high accuracy. We hope the high predictive power of the model will facilitate in vitro experimental validation of linear epitopes and accelerate the development of serological assays and antibody production. The contributions and innovations of this work are listed below:

  • We proposed a machine learning framework that can not only predict viral epitopes but also identify the viral species of the predicted epitopes.

  • This framework can be applied to variable-length sequences of human-adapted viruses.

  • The experimental results demonstrate superior performance over the other existing methods, and the model can serve to facilitate vaccine design.

Materials and methods

Data collection

We downloaded all the linear viral B-cell peptides with human hosts from the Immune Epitope Database (IEDB) [34] on 30 March 2021, which resulted in 210 311 raw protein sequences. To ensure the quality of training samples, we selected virus species with over 50 cases and eliminated peptides shorter than 4 residues. Next, we merged the identical sequences from each category and kept the information on their verified regions. For peptides with multiple assays, we classified the peptides as epitopes or non-epitopes based on the majority of the assay counts. Peptides with equal numbers of negative and positive assay results were also eliminated from the dataset. Finally, we ended up with 17 different human-adapted viruses containing 4975 epitopes and 4956 non-epitopes ranging from 4 to 50 residues in length. The detailed information is shown in Table 1.
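As an illustration of these inclusion criteria, the following is a minimal sketch in Python (pandas). It assumes a hypothetical IEDB export with columns `sequence`, `virus` and `assay_result`; the actual export fields and file names differ.

```python
import pandas as pd

# Hypothetical IEDB export: one row per assay with columns
# 'sequence', 'virus' and 'assay_result' ('Positive' or 'Negative').
df = pd.read_csv("iedb_linear_bcell_human_viruses.csv")

# Majority vote over assays for each unique (virus, peptide) pair; ties are dropped.
votes = (df.assign(positive=df["assay_result"].eq("Positive"))
           .groupby(["virus", "sequence"])["positive"]
           .agg(pos="sum", total="count")
           .reset_index())
votes["neg"] = votes["total"] - votes["pos"]
votes = votes[votes["pos"] != votes["neg"]]                 # remove equal pos/neg counts
votes["label"] = (votes["pos"] > votes["neg"]).astype(int)  # 1 = epitope, 0 = non-epitope

# Keep peptides of length >= 4 and virus species with at least 50 samples.
votes = votes[votes["sequence"].str.len() >= 4]
counts = votes["virus"].value_counts()
votes = votes[votes["virus"].isin(counts[counts >= 50].index)]

votes[["virus", "sequence", "label"]].to_csv("filtered_peptides.csv", index=False)
```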

Table 1

Detailed information on epitopes and non-epitopes of selected human-adapted viruses

Category of virus | Sequences | Epitopes | Non-epitopes | Min. length | Max. length | Avg. length
Alphapapillomavirus | 179 | 168 | 11 | 7 | 30 | 12.9
Orthonairovirus | 226 | 32 | 194 | 12 | 20 | 12.9
Dengue virus | 904 | 428 | 476 | 5 | 46 | 10.3
Ebolavirus | 650 | 74 | 576 | 10 | 25 | 15.0
Enterovirus | 354 | 122 | 232 | 7 | 38 | 17.8
Hepacivirus | 3354 | 1988 | 1366 | 5 | 48 | 14.3
Human alphaherpesvirus | 819 | 134 | 685 | 6 | 31 | 19.2
Human immunodeficiency virus | 454 | 447 | 7 | 4 | 47 | 14.1
Human orthopneumovirus | 217 | 72 | 145 | 6 | 41 | 11.2
Influenza A virus | 239 | 233 | 6 | 7 | 50 | 20.1
Measles morbillivirus | 181 | 134 | 47 | 7 | 37 | 13.8
Orthohantavirus | 385 | 101 | 284 | 8 | 45 | 19.7
Orthohepevirus A | 287 | 176 | 111 | 6 | 48 | 23.1
Primate erythroparvovirus 1 | 276 | 61 | 215 | 7 | 41 | 20.1
Primate T-lymphotropic virus 1 | 197 | 181 | 16 | 6 | 47 | 17.8
Severe acute respiratory syndrome coronavirus | 1001 | 533 | 468 | 4 | 50 | 17.6
Zika virus | 208 | 91 | 117 | 8 | 30 | 17.7

Feature representation

Previous methods of mapping a variable-length amino acid sequence into a fixed-length feature vector include amino acid composition, dipeptide composition (DPC), the AAP antigenicity scale and k-mer representations, etc. Once the variable-length sequences are converted into fixed-length feature vectors, machine learning algorithms can be applied to various prediction tasks. Recently, continuous representations have been explored in bioinformatics applications. ProtVec [35] has been demonstrated to be an efficient embedding method for biological sequences in a variety of biomedical problems [36–38]. Inspired by the continuous vector representation of words in natural language processing, this method regards 3-mers of amino acids as words and was trained on 546 790 protein sequences collected from the Swiss-Prot database. A skip-gram NN was applied to this dataset to maximize the probability of observed sequences, and 100-dimensional numerical vectors were learned to represent 3-mer peptides. In this work, we leveraged ProtVec to encode the sequences. Given a sequence of length N, we split it into N-2 consecutive 3-mers, and each 3-mer was represented by a 100-dimensional vector mapped from ProtVec. Therefore, each input sequence is represented by an (N-2)×100-dimensional matrix.
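A minimal sketch of this embedding step is given below. It assumes a pretrained ProtVec table is available as a local file (the file name and format are assumptions), and it maps unseen 3-mers to zero vectors, which is our choice rather than a detail stated in the text.

```python
import numpy as np

def load_protvec(path, dim=100):
    """Load a pretrained ProtVec table mapping each 3-mer to a 100-d vector.
    Assumes a whitespace/tab-separated file with the 3-mer in the first column."""
    table = {}
    with open(path) as fh:
        for line in fh:
            parts = line.replace('"', '').split()
            if len(parts) == dim + 1:
                table[parts[0]] = np.array(parts[1:], dtype=float)
    return table

def embed_sequence(seq, protvec, dim=100):
    """Split a length-N peptide into N-2 overlapping 3-mers and stack their
    ProtVec vectors into an (N-2) x 100 matrix."""
    kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    rows = [protvec.get(k, np.zeros(dim)) for k in kmers]   # unseen 3-mers -> zeros
    return np.vstack(rows)

# protvec = load_protvec("protVec_100d_3grams.csv")   # hypothetical file name
# X = embed_sequence("SLYNTVATLY", protvec)           # shape: (8, 100)
```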

To address the issue of different lengths of input viral peptides, we introduced QR decomposition. Assume a real matrix $X \in \mathbb{R}^{m\times n}$ (with $m \geq n$) that can be written as

$$X = Q \begin{bmatrix} R^{\prime} \\ O \end{bmatrix} \qquad (1)$$

where $Q \in \mathbb{R}^{m\times m}$ is an orthogonal matrix ($QQ^{T} = Q^{T}Q = I_m$), $R^{\prime} \in \mathbb{R}^{n\times n}$ is an upper triangular matrix ($r_{ij} = 0$ for all $i > j$) and $O \in \mathbb{R}^{(m-n)\times n}$ is a zero matrix. We apply the Gram–Schmidt process to perform the QR decomposition. Specifically, for a matrix $X = [x_1, x_2, \ldots, x_n]$, where the $x_i$ are column vectors, we have

$$a_1 = x_1, \quad e_1 = \frac{a_1}{\|a_1\|} \qquad (2)$$

$$a_i = x_i - \sum_{j=1}^{i-1}\left(x_i^{T} \cdot e_j\right) e_j, \quad e_i = \frac{a_i}{\|a_i\|} \qquad (3)$$

where $i \in \{2, 3, \cdots, n\}$ and $\|\cdot\|$ is the $L_2$ norm. The matrix $X = [x_1, x_2, \ldots, x_n]$ can then be factorized as $X = QR$, where $Q = [e_1, e_2, \ldots, e_n]$ and $R$ is an upper triangular matrix. We now show that this process indeed yields the decomposition by proving

$$x_i = \sum_{k=1}^{i} r_{ki}\, e_k \qquad (4)$$

where $i \in \{1, 2, \ldots, n\}$, for suitable coefficients $r_{ki}$.

First notice that when $i = 1$ it obviously holds by choosing $r_{11} = \|a_{1}\|$. When $i = 2$, $e_{2}$ is orthogonal to $e_{1}$ since $e_{2}^{T} \cdot e_{1} = \frac{1}{\|a_{2}\|} a_{2}^{T} \cdot e_{1} = \frac{1}{\|a_{2}\|}\left[x_{2} - \left(x_{2}^{T} \cdot e_{1}\right) e_{1}\right]^{T} \cdot e_{1} = \frac{1}{\|a_{2}\|}\left(x_{2}^{T} \cdot e_{1}\right)\left(1 - e_{1}^{T} e_{1}\right) = 0$. Moreover, we have $\left(x_{2}^{T} \cdot e_{1}\right) e_{1} + \|a_{2}\| e_{2} = \left(x_{2}^{T} \cdot e_{1}\right) e_{1} + a_{2} = x_{2}$, so (4) holds with $r_{12} = x_{2}^{T} \cdot e_{1}$ and $r_{22} = \|a_{2}\|$. For $i \geq 3$, (4) can be obtained by the same argument, setting $r_{ij} = x_{j}^{T} \cdot e_{i}$ for $i \leq j - 1$ and $r_{jj} = \|a_{j}\|$. Therefore, the QR decomposition always exists even if the matrix does not have full rank.
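The construction above can be checked numerically. Below is a minimal NumPy sketch (not the authors' code) of the reduced Gram–Schmidt QR factorization described by Equations (2)–(4).

```python
import numpy as np

def gram_schmidt_qr(X):
    """Reduced QR factorization of X (m x n, m >= n) via classical Gram-Schmidt:
    e_j is the normalized residual of x_j and R collects the coefficients r_ij."""
    m, n = X.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        a = X[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ X[:, j]      # r_ij = x_j^T . e_i
            a -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(a)          # r_jj = ||a_j||
        Q[:, j] = a / R[j, j] if R[j, j] > 0 else a
    return Q, R

X = np.random.randn(100, 18)                 # e.g. an embedded 20-mer: 100 x (N-2)
Q, R = gram_schmidt_qr(X)
assert np.allclose(Q @ R, X) and np.allclose(Q.T @ Q, np.eye(18))
```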

In our work, we applied the QR decomposition to all the embedded sequences, each represented as a $100\times (N-2)$ matrix. Each sequence can thus be written as the product of $Q \in \mathbb{R}^{100\times 100}$ and $R \in \mathbb{R}^{100\times (N-2)}$. Considering that $R$ is an upper triangular matrix with plenty of zero values, and that $R^{\prime}$ occupies only a small portion of $R$, we used only the matrix $Q$ to represent each input sequence when building the model. As a result, all the peptides can be transformed into fixed-size matrices after embedding.
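A sketch of this step using NumPy's built-in full ('complete') QR decomposition is shown below; whether $Q$ is flattened into a vector or kept as a matrix for the downstream classifiers is our assumption, not a detail specified in the text.

```python
import numpy as np

def qr_features(embedded):
    """Map an embedded peptide, given as an (N-2) x 100 matrix, to a fixed-size
    representation by keeping only the orthogonal factor Q of its transpose."""
    X = embedded.T                            # 100 x (N-2)
    Q, _ = np.linalg.qr(X, mode="complete")   # Q: 100 x 100, R is discarded
    return Q.flatten()                        # flattened here for tabular classifiers

short = qr_features(np.random.randn(8, 100))   # a 10-residue peptide
long_ = qr_features(np.random.randn(48, 100))  # a 50-residue peptide
assert short.shape == long_.shape == (100 * 100,)
```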

In addition, to make full use of the sequence information, we added physicochemical properties of amino acids as additional features through the Composition–Transition–Distribution (CTD) method [39]. This method maps each peptide sequence into a fixed-length vector based on amino acid properties. We selected seven representative physicochemical properties [40] (net charge, hydrophobicity, polarizability, normalized van der Waals volume, polarity, solvent accessibility and secondary structure) and converted the sequences into numerical vectors using AAindex [41], a database of numerical indices representing various physicochemical and biochemical properties of amino acids. We divided all amino acids into three different groups for each physicochemical property [42] (Supplementary Materials S1). We derived 147 features for each peptide sequence using the CTD descriptors formulated below.
$$C_{G_i} = \frac{N_{G_i}}{N_p}, \quad i = 1, 2, 3 \qquad (5)$$

$$T_{G_iG_j} = \frac{N_{G_iG_j} + N_{G_jG_i}}{N_p - 1}, \quad i, j = 1, 2, 3,\ G_i \neq G_j \qquad (6)$$

$$D_{i,q} = \frac{P_{i,q}}{N_p}, \quad q \in \{\text{first}, 25\%, 50\%, 75\%, 100\%\} \qquad (7)$$
where composition $C_{G_i}$ represents the frequency of amino acids with the property of group $i$, i.e. the number of such amino acids $N_{G_i}$ divided by $N_p$, the total number of amino acids in the peptide $p$. Transition $T_{G_iG_j}$ characterizes the frequency with which an amino acid from group $i$ is followed by one from group $j$ or the other way around, where the numerator $N_{G_iG_j} + N_{G_jG_i}$ counts such adjacent pairs and $i, j = 1, 2, 3$ with $G_i \neq G_j$. The third descriptor measures the distribution of each attribute along the sequence: $D_{i,q}$ denotes the relative position $P_{i,q}$, as a fraction of the sequence length, at which the first, 25%, 50%, 75% and 100% of the amino acids of group $i$ are located.
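To make the three descriptors concrete, here is a minimal sketch for a single property. The three-group split shown is a hypothetical grouping for one property, whereas the groupings actually used follow Supplementary Materials S1; each property yields 3 + 3 + 15 = 21 values, so the seven properties give the 147 features mentioned above.

```python
import numpy as np

# Hypothetical three-group split for one property; the groupings actually used
# are listed in Supplementary Materials S1. Assumes the 20 standard amino acids.
GROUPS = {1: set("RKEDQN"), 2: set("GASTPHY"), 3: set("CLVIMFW")}

def ctd_one_property(seq):
    n = len(seq)
    g = [next(k for k, members in GROUPS.items() if aa in members) for aa in seq]

    # Composition (Eq. 5): fraction of residues falling in each group.
    comp = [g.count(k) / n for k in (1, 2, 3)]

    # Transition (Eq. 6): frequency of adjacent residues switching between two groups.
    pairs = list(zip(g, g[1:]))
    trans = [sum(1 for a, b in pairs if {a, b} == {i, j}) / (n - 1)
             for i, j in ((1, 2), (1, 3), (2, 3))]

    # Distribution (Eq. 7): relative positions of the first, 25%, 50%, 75% and 100%
    # occurrences of each group.
    dist = []
    for k in (1, 2, 3):
        pos = [i + 1 for i, grp in enumerate(g) if grp == k]
        if not pos:
            dist += [0.0] * 5
        else:
            for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
                idx = max(int(np.ceil(frac * len(pos))), 1) - 1
                dist.append(pos[idx] / n)
    return comp + trans + dist                   # 3 + 3 + 15 = 21 features per property

print(len(ctd_one_property("SLYNTVATLYCVHQR")))  # 21; 7 properties -> 147 features
```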

Classification

The feature transformation process enabled us to convert variable-length peptides into numerical matrices with identical dimensions that can be handled by machine learning models for classification tasks. Here, we leveraged several methods, including traditional machine learning algorithms and DNN architectures, to predict viral epitopes among all the encoded peptides. The traditional machine learning algorithms included SVM, K-nearest neighbor (KNN), naïve Bayes (NB), RF, NN and XGBoost. The SVM classifier learns a hyperplane to differentiate the positive and negative samples using different kernel functions to maximize the geometric margin. KNN is a nonparametric classification method where the function is approximated locally and all computation is deferred until function evaluation. The NB classifier is based on Bayes' theorem, adapted for use in binary and multiclass classification problems. RFs use multiple decision trees on various subsamples of the dataset to improve the predictive performance. The classic NN classifier implements a multilayer perceptron algorithm trained with backpropagation. XGBoost is an optimized distributed gradient boosting method that provides parallel tree boosting to solve classification problems in an efficient and accurate way. In comparison, we also employed DNN models that were previously used and proved effective at encoding various physicochemical and structural information for predictive tasks on protein sequences [43–45]. The DNN architectures consist of several variants of convolutional and recurrent NNs, including AlexNet [46], VGG [47], SqueezeNet [48], a vanilla recurrent NN (RNN), long short-term memory (LSTM) [49] and gated recurrent unit (GRU) [50].
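A minimal sketch of how the traditional classifiers could be instantiated with scikit-learn and the xgboost package is given below; the hyperparameter values shown are illustrative only, while the settings actually used are listed in Supplementary Materials S2.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Illustrative settings; the hyperparameters used in this work are given in
# Supplementary Materials S2.
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", probability=True),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=500),
    "NN": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500),
    "XGBoost": XGBClassifier(n_estimators=500, learning_rate=0.1),
}
```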

Experimental setup

We implemented all the approaches with Scikit-learn [51] and PyTorch [52]. The exported data were randomly divided into training and independent testing sets in a ratio of 9:1. We applied 5-fold cross-validation on the training set, where each subset contains an equal number of peptides. Four of the subsets were used to train the model and the remaining subset was used for validation. The procedure was repeated five times and the final prediction was the average of the five validation results over the cross-validation. The parameters for both the traditional machine learning methods and the DNN models can be found in Supplementary Materials S2. We used several metrics to evaluate our models, including accuracy, precision, recall and F1. The definitions of these metrics are given below, where TP, TN, FP and FN are true positives, true negatives, false positives and false negatives, respectively, which measure whether the binary targets are correctly predicted or not.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (9)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (10)$$

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (11)$$
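A sketch of this evaluation protocol (90/10 split, 5-fold cross-validation, the metrics of Equations 8–11 and the AUROC/AUPR used below) with scikit-learn is shown here; the random data are placeholders for the feature matrix and labels produced by the steps above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)
from xgboost import XGBClassifier

# Placeholder data standing in for the encoded peptides and their labels.
X, y = np.random.rand(500, 147), np.random.randint(0, 2, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, stratify=y, random_state=0)

# 5-fold cross-validation on the training set.
fold_scores = []
for tr_idx, va_idx in StratifiedKFold(5, shuffle=True, random_state=0).split(X_tr, y_tr):
    model = XGBClassifier().fit(X_tr[tr_idx], y_tr[tr_idx])
    pred = model.predict(X_tr[va_idx])
    fold_scores.append([accuracy_score(y_tr[va_idx], pred),
                        precision_score(y_tr[va_idx], pred),
                        recall_score(y_tr[va_idx], pred),
                        f1_score(y_tr[va_idx], pred)])
print("CV accuracy/precision/recall/F1:", np.mean(fold_scores, axis=0))

# Final evaluation on the held-out testing set.
prob = XGBClassifier().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("AUROC:", roc_auc_score(y_te, prob), "AUPR:", average_precision_score(y_te, prob))
```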

To further demonstrate the superiority of our proposed framework, we compared our model against several existing tools for linear B-cell epitope prediction on the independent testing set. These tools and methods were used in their latest versions, including BepiPred-2.0 [31], LBtope [53], iBCE-EL [54], EpitopeVec [33], Parker hydrophilicity prediction [55], Chou and Fasman beta turn prediction [56], Emini surface accessibility prediction [57], Kolaskar and Tongaonkar antigenicity prediction [58] and Karplus and Schulz flexibility prediction [59]. In addition, we added two more approaches with different feature representations, namely the AAP antigenicity scale and DPC, for epitope prediction. AAP uses a scale based on the finding that certain AAPs are favored in epitope regions [27]. Here, we only compared this method with a sequence length of 15, denoted as AAP15. DPC is a vector specifying the abundance of dipeptides that can reflect the correlation of adjacent amino acids in a protein sequence [60]. We followed the original settings to implement these methods, and the detailed parameters can be found in Supplementary Materials S3. To compare the models, we adopted the areas under the receiver operating characteristic (AUROC) and precision-recall (AUPR) curves, which efficiently summarize the relationship between sensitivity and specificity for assessing models and are good practice for imbalanced datasets.

Results and discussion

The length distribution of epitopes and non-epitopes

The lengths of the collected B-cell epitopes and non-epitopes vary over a broad range across viruses. Thus, it is natural to examine the length distribution of the sequences. Figure 2 displays the length distribution of unique linear B-cell epitopes and non-epitopes in the filtered dataset. We found that most of the peptides range from 6 to 20 in length, with sequences of length 20 accounting for almost half. Both epitope and non-epitope peptides of length 20 had the largest number of samples compared with other lengths. Peptides longer than 30 residues accounted for only 1% of all samples. We kept these samples for training as they meet the inclusion criteria. Nevertheless, our proposed model is capable of handling different lengths of peptides without extending or trimming them.

Figure 2

The length distribution of unique linear B-cell epitopes and non-epitopes in the dataset.

Performance comparison on combination of features and classifiers

We then investigated the performance of different machine learning models using the ProtVec and CTD features and their combination. We trained and tested the models on the training set with 5-fold cross-validation, and the performance was averaged over the hold-out folds (Table 2). In more detail, the traditional machine learning methods (KNN, SVM, NB, RF, NN and XGBoost) showed better performance than the deep learning models (VGG, AlexNet, SqueezeNet, RNN, LSTM, GRU). One possible explanation is that, owing to their complex multilayer structures, deep learning models require much more data than traditional machine learning models to function properly. In addition, the loss of information from the input sequences due to QR decomposition might hinder the DNNs from obtaining better results. Among all the models, XGBoost outperformed the other classifiers for predicting epitopes in terms of accuracy (0.753), F-score (0.740), AUROC (0.754) and AUPR (0.825) using the combination of ProtVec and CTD features, while NB showed the best precision (0.835) and GRU obtained the highest recall (0.864). In addition, the combination of ProtVec and CTD features achieved better results than using either feature set alone for all the models except SVM, where ProtVec features alone performed better. These observations indicate that incorporating both ProtVec and CTD features improves model performance for epitope prediction of human-adapted viruses.

Table 2

Performance comparison of combination of features in epitope prediction using different machine learning models

Model | Feature | Accuracy | Precision | Recall | F-score | AUROC | AUPR

Traditional classifiers
KNN | ProtVec | 0.611 | 0.583 | 0.809 | 0.678 | 0.608 | 0.692
KNN | CTD | 0.588 | 0.591 | 0.602 | 0.597 | 0.588 | 0.687
KNN | ProtVec+CTD | 0.633 | 0.640 | 0.630 | 0.635 | 0.633 | 0.719
SVM | ProtVec | 0.686 | 0.708 | 0.648 | 0.677 | 0.687 | 0.778
SVM | CTD | 0.619 | 0.638 | 0.572 | 0.603 | 0.619 | 0.675
SVM | ProtVec+CTD | 0.620 | 0.631 | 0.577 | 0.602 | 0.621 | 0.645
NB | ProtVec | 0.578 | 0.823 | 0.212 | 0.338 | 0.582 | 0.620
NB | CTD | 0.573 | 0.585 | 0.536 | 0.560 | 0.573 | 0.574
NB | ProtVec+CTD | 0.579 | 0.835 | 0.211 | 0.337 | 0.584 | 0.622
RF | ProtVec | 0.683 | 0.745 | 0.570 | 0.646 | 0.685 | 0.770
RF | CTD | 0.693 | 0.691 | 0.690 | 0.690 | 0.751 | 0.788
RF | ProtVec+CTD | 0.707 | 0.765 | 0.608 | 0.678 | 0.708 | 0.777
NN | ProtVec | 0.670 | 0.684 | 0.670 | 0.675 | 0.668 | 0.748
NN | CTD | 0.546 | 0.804 | 0.139 | 0.237 | 0.552 | 0.609
NN | ProtVec+CTD | 0.671 | 0.674 | 0.678 | 0.676 | 0.671 | 0.738
XGBoost | ProtVec | 0.721 | 0.752 | 0.669 | 0.708 | 0.721 | 0.810
XGBoost | CTD | 0.690 | 0.698 | 0.685 | 0.692 | 0.690 | 0.780
XGBoost | ProtVec+CTD | 0.753 | 0.791 | 0.695 | 0.740 | 0.754 | 0.825

Deep learning models
VGG | ProtVec | 0.583 | 0.617 | 0.465 | 0.530 | 0.521 | 0.539
VGG | CTD | 0.552 | 0.586 | 0.397 | 0.473 | 0.469 | 0.490
VGG | ProtVec+CTD | 0.597 | 0.628 | 0.503 | 0.558 | 0.529 | 0.525
AlexNet | ProtVec | 0.570 | 0.608 | 0.487 | 0.463 | 0.511 | 0.532
AlexNet | CTD | 0.577 | 0.592 | 0.530 | 0.559 | 0.485 | 0.577
AlexNet | ProtVec+CTD | 0.590 | 0.608 | 0.534 | 0.569 | 0.520 | 0.529
SqueezeNet | ProtVec | 0.575 | 0.570 | 0.650 | 0.607 | 0.547 | 0.571
SqueezeNet | CTD | 0.589 | 0.613 | 0.508 | 0.556 | 0.498 | 0.661
SqueezeNet | ProtVec+CTD | 0.667 | 0.679 | 0.648 | 0.663 | 0.559 | 0.646
RNN | ProtVec | 0.575 | 0.717 | 0.266 | 0.388 | 0.425 | 0.504
RNN | CTD | 0.585 | 0.630 | 0.437 | 0.516 | 0.462 | 0.517
RNN | ProtVec+CTD | 0.602 | 0.714 | 0.357 | 0.476 | 0.479 | 0.484
LSTM | ProtVec | 0.542 | 0.588 | 0.324 | 0.417 | 0.487 | 0.527
LSTM | CTD | 0.493 | 0.500 | 0.021 | 0.041 | 0.503 | 0.513
LSTM | ProtVec+CTD | 0.568 | 0.565 | 0.644 | 0.602 | 0.522 | 0.535
GRU | ProtVec | 0.550 | 0.659 | 0.234 | 0.346 | 0.479 | 0.488
GRU | CTD | 0.499 | 0.503 | 0.864 | 0.636 | 0.506 | 0.533
GRU | ProtVec+CTD | 0.571 | 0.645 | 0.339 | 0.445 | 0.468 | 0.551

Comparison with existing methods

We further compared our proposed framework with several existing state-of-the-art methods for epitope prediction on the independent testing data. We used XGBoost for this comparison as it demonstrated superior performance over the other classifiers. Figure 3 shows the ROC and PR curves of epitope prediction performance. Our proposed models leveraging ProtVec and CTD features presented leading outcomes in both ROC and PR AUCs. The combination of ProtVec and CTD features had the best results among all the methods, with an AUROC of 0.827 and an AUPR of 0.831. The substantial performance difference between some of the compared models (e.g. BepiPred-2.0 and LBtope) and the proposed models suggests a lack of generalization ability of the previous methods. This is probably because the features derived by the existing methods are not effective at distinguishing epitope from non-epitope sequences. Our proposed model significantly improved the predictive performance by incorporating ProtVec representations and CTD-based physicochemical properties of amino acids. Meanwhile, our framework can predict epitopes of variable length, while many of the compared methods mainly focus on fixed-length sequences. Therefore, our framework is superior in terms of predictive performance and generalizability for the identification of human-adapted viral epitopes and has potential for application to other types of viruses.

Figure 3

Performance of the proposed framework and compared methods at predicting epitopes of human-adapted viruses on independent testing set evaluated by Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves.

The identification of viral epitope species

Following the viral epitope prediction, we retrained the framework to identify the viral category of the predicted epitopes. Figure 4 presents the performance of identifying viral species for the predicted epitopes. We observed that the classifiers based only on ProtVec features achieved better predictive precision, with the NN classifier reaching 0.857. Meanwhile, RF and KNN also achieved comparable performance based on ProtVec features. Interestingly, this differs from predicting viral epitopes, where the combination of ProtVec and CTD features gives the best results for most classifiers. The primary reason is that, according to the results, CTD features are not good indicators for identifying viral species: using CTD features alone to predict virus species yielded an unsatisfactory average precision of only 0.57. Nevertheless, our proposed framework not only demonstrated its ability to predict variable-length epitopes of human-adapted viruses but can also identify the viral species of the epitope.

Figure 4

The precision on identifying viral species of input epitope and non-epitopes with combinations of ProtVec and CTD features using multiple classifiers.

Model evaluation on different length groups for epitope prediction

To further estimate the efficacy of the proposed framework for identifying epitopes at specific peptide lengths, we conducted experiments to predict epitopes in different length groups of human-adapted viruses. Because individual lengths have insufficient samples, the peptides were divided into several length groups, i.e. '≤11', '12–15', '16–19' and '≥20', consisting of 223, 282, 147 and 342 testing samples, respectively. Figure 5 shows the predictive performance in terms of accuracy, precision, recall and F1-score. The model used XGBoost as the classifier with combined ProtVec and CTD features, using the same settings as in Table 2. Our model achieved comparable performance across the different length groups. The accuracy ranged from 0.656 to 0.803, with the '≤11', '16–19' and '≥20' groups obtaining better performance than the '12–15' group. The other metrics indicated similar results for precision and F-score. However, the '≤11' group evidently outperformed the other three groups in recall, with a value of 0.853, suggesting that our model is more capable of identifying epitopes of shorter lengths. Although the model displayed an overall superior performance on the '≤11' group, the predictive outcomes on the other three groups were also compelling. Therefore, the evaluation on testing data of different length groups further demonstrates the generalizability of our proposed framework for epitope prediction beyond single fixed-length sequences.

Figure 5

Comparison of different length groups in the epitope prediction in terms of accuracy, precision, recall and F1 score validated with testing set.

Conclusion

One challenge in predicting linear B-cell epitopes is that these epitopes vary in length, ranging from 4 to 50 amino acids in this work. This paper addresses this problem by describing a general computational framework for predicting variable-length linear B-cell epitopes, focusing on human-adapted viruses. We introduce ProtVec embedding and QR decomposition, which allow us to convert peptides of different lengths into feature representations of the same dimension. The addition of physicochemical properties of amino acids based on CTD descriptors further improves the prediction model. Experimental results indicate that our proposed framework not only outperforms the other existing computational methods but can also identify the viral species of the epitope with high precision. The superior predictive power of the proposed framework enables more precise identification of epitopes, facilitating peptide-based vaccine design and the development of medical treatments efficiently and cost-effectively. In future work, we will incorporate more features, such as information on the tertiary structures of peptides, into the prediction model. In addition, it would be interesting to explore how this framework could be integrated into real-world vaccine design processes.

Key Points
  • A machine learning framework is proposed to predict variable-length epitopes of human-adapted viruses that could facilitate the process of vaccine design.

  • Experimental results indicate that the XGBoost model performs best with combined ProtVec and CTD features.

  • Our proposed model remarkably outperforms the existing methods on the testing dataset and can identify the viral species of the predicted epitope.

Author contributions statement

R.Y., X.Z. and C.K.K. designed the experiments; R.Y. and X.Z. conducted the experiments and analyzed the results. R.Y. wrote the manuscript and M.Z. and P.W. revised the manuscript. All the authors reviewed the manuscript.

Acknowledgments

This project is supported by AcRF Tier 2 grant MOE2014-T2-2-023, Ministry of Education, Singapore and A*STAR-NTU-SUTD AI Partnership grant, RGANS1905.

Availability and Implementation

The source codes, data and supplementary materials are publicly available at https://github.com/Rayin-saber/Epitope-prediction-of-human-adapted-viruses

Author Biographies

Rui Yin is a research fellow at the Department of Biomedical Informatics, Harvard Medical School. His research interests focus on data mining and machine learning to make sense of big heterogeneous data for real-world application in biomedical fields.

Xianghe Zhu is currently an MSc student in Statistical Science at the University of Oxford. His research interests include statistical and probabilistic network analysis, machine learning and bioinformatics.

Min Zeng is an Assistant Professor in the School of Computer Science and Engineering, Central South University, Changsha, Hunan, P. R. China. His research interests include machine learning and deep learning techniques for bioinformatics and computational biology.

Pengfei Wu is a pediatrician at the Department of Genetics and Endocrinology, National Children’s Medical Center for South Central Region, Guangzhou, China. His research interest is medical genomics and bioinformatics.

Min Li is a professor and vice dean at the School of Computer Science and Engineering, Central South University, Changsha, Hunan, P.R. China. Her research interests include computational biology, systems biology and bioinformatics.

Chee Keong Kwoh is an associate professor at the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include data mining, soft computing and graph-based inference; applications areas include bioinformatics and biomedical engineering.

References

1. Reth M. Matching cellular dimensions with molecular sizes. Nat Immunol 2013;14(8):765–7.
2. Baumgarth N, Tung JW, Herzenberg LA. Inherent specificities in natural antibodies: a key to immune defense against pathogen invasion. In: Springer Seminars in Immunopathology, Vol. 26, No. 4. Springer, 2005.
3. Murphy K, Travers P, Walport M, et al. Immunobiology. NY: Garland Science New York, 2012.
4. Kringelum JV, Lundegaard C, Lund O, et al. Reliable B cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput Biol 2012;8(12):e1002829.
5. Ekiert DC, Bhabha G, Elsliger M-A, et al. Antibody recognition of a highly conserved influenza virus epitope. Science 2009;324(5924):246–51.
6. Yin R, Yu Z, Zhou X, et al. Time series computational prediction of vaccines for influenza A H3N2 with recurrent neural networks. J Bioinform Comput Biol 2020;18(01):2040002.
7. Ahmad TA, Eweida AE, Sheweita SA. B-cell epitope mapping for the design of vaccines and effective diagnostics. Trials in Vaccinology 2016;5:71–83.
8. Kametani Y, Miyamoto A, Tsuda B, et al. B cell epitope-based vaccination therapy. Antibodies 2015;4(3):225–39.
9. Gershoni JM, Roitburd-Berman A, Siman-Tov DD, et al. Epitope mapping. BioDrugs 2007;21(3):145–56.
10. Huang J, Ru B, Dai P. Bioinformatics resources and tools for phage display. Molecules 2011;16(1):694–709.
11. Shirai H, Ikeda K, Yamashita K, et al. High-resolution modeling of antibody structures by a combination of bioinformatics, expert knowledge, and molecular simulations. Proteins: Structure, Function, and Bioinformatics 2014;82(8):1624–35.
12. El-Manzalawy Y, Honavar V. Recent advances in B-cell epitope prediction methods. Immunome Research 2010;6(2):1–9.
13. Segel LA, Perelson AS. Computations in shape space: a new approach to immune network theory. In: Theoretical Immunology. CRC Press, 2018, 321–43.
14. Hu L, Chan KCC. Extracting coevolutionary features from protein sequences for predicting protein-protein interactions. IEEE/ACM Trans Comput Biol Bioinform 2016;14(1):155–66.
15. Hu L, Hu P, Luo X, et al. Incorporating the coevolving information of substrates in predicting HIV-1 protease cleavage sites. IEEE/ACM Trans Comput Biol Bioinform 2019;17(6):2017–28.
16. Kolaskar AS, Kulkarni-Kale U. Prediction of three-dimensional structure and mapping of conformational epitopes of envelope glycoprotein of Japanese encephalitis virus. Virology 1999;261(1):31–42.
17. Yin R, Zhou X, Zheng J, et al. Computational identification of physicochemical signatures for host tropism of influenza A virus. J Bioinform Comput Biol 2018;16(06):1840023.
18. Blythe MJ, Flower DR. Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci 2005;14(1):246–8.
19. Zhou X, Yin R, Kwoh C-K, et al. A context-free encoding scheme of protein sequences for predicting antigenicity of diverse influenza A viruses. BMC Genomics 2018;19(10):145–54.
20. Moreau V, Fleury C, Piquer D, et al. PEPOP: computational design of immunogenic peptides. BMC Bioinformatics 2008;9(1):1–15.
21. Ansari HR, Raghava GPS. Identification of conformational B-cell epitopes in an antigen from its primary sequence. Immunome Research 2010;6(1):1–9.
22. Zhang W, Niu Y, Xiong Y, et al. Computational prediction of conformational B-cell epitopes from antigen primary structures by ensemble learning. PLoS One 2012;7(8):e43575.
23. Andersen PH, Nielsen M, Lund O. Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Sci 2006;15(11):2558–67.
24. Flower DR. Immunoinformatics: Predicting Immunogenicity in Silico. Vol. 409. Springer Science & Business Media, 2007.
25. Potocnakova L, Bhide M, Pulzova LB. An introduction to B-cell epitope mapping and in silico epitope prediction. J Immunol Res 2016;2016:6760830.
26. Saha S, Raghava GPS. Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins: Structure, Function, and Bioinformatics 2006;65(1):40–8.
27. Chen J, Liu H, Yang J, et al. Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 2007;33(3):423–8.
28. El-Manzalawy Y, Dobbs D, Honavar V. Predicting flexible length linear B-cell epitopes. In: Computational Systems Bioinformatics, Vol. 7. World Scientific, 2008, 121–32.
29. Lian Y, Ge M, Pan X-M. EPMLR: sequence-based linear B-cell epitope prediction method using multiple linear regression. BMC Bioinformatics 2014;15(1):1–6.
30. Larsen JEP, Lund O, Nielsen M. Improved method for predicting linear B-cell epitopes. Immunome Research 2006;2(1):1–7.
31. Jespersen MC, Peters B, Nielsen M, et al. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 2017;45(W1):W24–9.
32. Collatz M, Mock F, Barth E, et al. EpiDope: a deep neural network for linear B-cell epitope prediction. Bioinformatics 2021;37(4):448–55.
33. Bahai A, Asgari E, Mofrad MRK, et al. EpitopeVec: linear epitope prediction using deep protein sequence embeddings. Bioinformatics 2021;37(23):4517–25.
34. Vita R, Mahajan S, Overton JA, et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res 2019;47(D1):D339–43.
35. Asgari E, Mofrad MRK. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One 2015;10(11):e0141287.
36. Yin R, Luusua E, Dabrowski J, et al. Tempel: time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks. Bioinformatics 2020;36(9):2697–704.
37. Aoki G, Sakakibara Y. Convolutional neural networks for classification of alignments of non-coding RNA sequences. Bioinformatics 2018;34(13):i237–44.
38. Yin R, Thwin NN, Zhuang P, et al. IAV-CNN: a 2D convolutional neural network model to predict antigenic variants of influenza A virus. IEEE/ACM Trans Comput Biol Bioinform 2021.
39. Dubchak I, Muchnik I, Holbrook SR, et al. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci 1995;92(19):8700–4.
40. Dubchak I, Muchnik I, Mayor C, et al. Recognition of a protein fold in the context of the SCOP classification. Proteins: Structure, Function, and Bioinformatics 1999;35(4):401–7.
41. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res 2000;28(1):374.
42. Tomii K, Kanehisa M. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Engineering, Design and Selection 1996;9(1):27–36.
43. Zhou X, Yin R, Zheng J, et al. An encoding scheme capturing generic priors and properties of amino acids improves protein classification. IEEE Access 2018;7:7348–56.
44. Heinzinger M, Elnaggar A, Wang Y, et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019;20(1):1–17.
45. Yin R, Luo Z, Zhuang P, et al. VirPreNet: a weighted ensemble convolutional neural network for the virulence prediction of influenza A virus using all eight segments. Bioinformatics 2021;37(6):737–43.
46. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012, 1097–105.
47. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2015.
48. Iandola FN, Han S, Moskewicz MW, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
49. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
50. Dey R, Salem FM. Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2017, 1597–600.
51. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 2011;12(Oct):2825–30.
52. Paszke A, Gross S, Chintala S, et al. Automatic differentiation in PyTorch. 2017.
53. Singh H, Ansari HR, Raghava GPS. Improved method for linear B-cell epitope prediction using antigen's primary sequence. PLoS One 2013;8(5):e62216.
54. Manavalan B, Govindaraj RG, Shin TH, et al. iBCE-EL: a new ensemble learning framework for improved linear B-cell epitope prediction. Front Immunol 2018;9:1695.
55. Parker JMR, Guo D, Hodges RS. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 1986;25(19):5425–32.
56. Pellequer J-L, Westhof E, Van Regenmortel MHV. Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunol Lett 1993;36(1):83–99.
57. Emini EA, Hughes J, Perlow DS, et al. Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol 1985;55(3):836–9.
58. Kolaskar AS, Tongaonkar PC. A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett 1990;276(1–2):172–4.
59. Karplus PA, Schulz GE. Prediction of chain flexibility in proteins. Naturwissenschaften 1985;72(4):212–3.
60. Yu C-S, Lin C-J, Hwang J-K. Predicting subcellular localization of proteins for gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci 2004;13(5):1402–6.