Abstract

Peptide binding to major histocompatibility complex (MHC) proteins plays a critical role in T-cell recognition and the specificity of the immune response. Experimental validation of such peptides is extremely resource-intensive. As a result, accurate computational prediction of binding peptides is highly important, particularly in cancer immunotherapy applications such as the identification of neoantigens. Existing prediction methods therefore need continual improvement to meet the demands of this field. We developed ConvNeXt-MHC, a method for predicting MHC-I-peptide binding affinity. It introduces a degenerate encoding approach to enhance well-established pan-specific methods and integrates transfer learning and semi-supervised learning into the cutting-edge deep learning framework ConvNeXt. Comprehensive benchmark results demonstrate that ConvNeXt-MHC outperforms state-of-the-art methods in terms of accuracy. We expect that ConvNeXt-MHC will help foster new discoveries in the field of immunoinformatics. We constructed a user-friendly website at http://www.combio-lezhang.online/predict/, where users can access our data and application.

INTRODUCTION

The binding of short peptide fragments to major histocompatibility complex (MHC) molecules is a crucial and discriminating step in initiating an adaptive immune response [1]. Accurate identification of peptides that can bind to specific MHC molecules is of paramount importance for understanding the underlying mechanisms and discovering immunogenic epitopes, such as neoantigens [2]. Moreover, promiscuous MHC epitopes that can bind to multiple MHC alleles are critical candidates for the development of vaccines. There are two distinct classes of MHC molecules: class I (MHC-I) and class II (MHC-II) [3]. MHC-I, which presents endogenous peptides, is found on all cells except red blood cells, while MHC-II is found mainly on antigen-presenting cells such as B cells, dendritic cells and macrophages, where it binds exogenous peptides [4].

Experimental validation of peptides binding to specific MHC molecules must cover an extremely broad range of candidates and requires time-consuming and costly biological experiments [5]. Therefore, precise identification of such peptides using computational methods is a central challenge in immunoinformatics [6]. In traditional allele-specific methods, binding data for only one allele can be used to train a model, and the model can be applied only to predict peptide binding to that allele. Methods such as NetMHC [7] and SMM [8] for MHC-I, and SMM-align [9], NN-align [10] and RTA [11] for MHC-II, are all outstanding allele-specific methods with decent accuracy. To address the problem of predicting peptides binding to MHC molecules with only a small number of experimentally obtained binders, a series of computational approaches, called pan-specific methods, have been developed and have received intense interest [12]. Pan-specific methods use pseudo sequences instead of MHC alleles as inputs. A pseudo sequence consists of the MHC residues that lie within 4.0 Å of the bound peptide, measured between any pair of atoms from the MHC and the peptide in known peptide–MHC crystal structures [6]. The pan-specific predictor uses pseudo sequences as a feature in the machine learning process to classify a peptide as a binder or non-binder [13]. Constructing a pan-specific predictor generally involves the following four major steps: (i) collecting training datasets that experimentally verify the binding ability between the MHC and peptide [14], (ii) extracting and encoding pseudo sequences from the peptide and MHC sequences [6], (iii) choosing the best-performing machine learning algorithm and training the corresponding machine learning model [7] and (iv) optimizing and evaluating the model and its performance.

To date, multiple pseudo sequences have been reported. NetMHCpan [15] is a well-established pan-specific binding affinity prediction algorithm that reduces the MHC sequence to a pseudo sequence of 34 residues at the following positions: 7, 9, 24, 45, 59, 62, 63, 66, 67, 69, 70, 73, 74, 76, 77, 80, 81, 84, 95, 97, 99, 114, 116, 118, 143, 147, 150, 152, 156, 158, 159, 163, 167 and 171. MHCflurry2.0.1 [16] uses the same 34 peptide-contacting residues as well as 3 additional residues (91, 102 and 199). Moreover, MixMHCpred [17] employs 32 peptide-contacting residues as the pseudo sequence. These pseudo sequences have shown advantages over allele-specific methods. However, the pseudo sequences and peptides are simply concatenated, without distinguishing the residues of the MHC from those of the peptide. Moreover, information on pair-wise interactions between the MHC and the peptide is ignored [18]. These residue interactions are essential for physical binding, suggesting that it could be beneficial to consider them explicitly in the prediction model [19, 20].

The development of computational methods that accurately and efficiently predict peptide–MHC binding affinity can greatly contribute to immunotherapies and vaccine development. Despite the existence of various methods, none of them fully satisfies the requirements of biomedical practice. In particular, although state-of-the-art methods extract various information from MHC and peptide sequences and structures, most of them rely on simple feature fusion models to capture MHC–peptide relationships. To address this challenge, we aimed to develop a comprehensive MHC-I pseudo sequence by incorporating features of pair-wise residue interactions between MHC-I and peptides, together with more advanced deep learning techniques, to increase predictive accuracy. Three innovations are proposed in this study.

First, a new pseudo-sequence coding method, degenerate encoding, was developed by encoding the pair-wise interacting residues of pseudo sequences and peptides together in two-dimensional arrays. Hence, the pair-wise interactions are explicitly embedded in the model.

Second, in recent studies, deep learning has exhibited powerful capabilities for extracting latent patterns within biological sequences [21–33]. Our work takes advantage of the recently developed deep learning model ConvNeXt [34] for prediction. ConvNeXt is an improved convolutional neural network (CNN) model that can handle a wide range of vision tasks and has achieved state-of-the-art performance in many benchmark datasets. Our work also incorporated transfer learning [35–37] and semi-supervised learning to enhance its power.

Third, an online prediction platform was developed for predicting and visualizing the binding of MHC-I molecules to peptides and for sharing the related data, which can help users make predictions more intuitively and conveniently and improve antigen screening efficiency [38].

In summary, we developed a new method, ConvNeXt-MHC, to enhance MHC-I-peptide affinity prediction. ConvNeXt-MHC can predict the presentation of antigens (ConvNeXt-MHC_AP) and binding affinity (ConvNeXt-MHC_BA). According to our comprehensive benchmark tests, ConvNeXt-MHC outperforms state-of-the-art methods in terms of accuracy, F1-score, MCC and binding affinity correlation. Notably, ConvNeXt-MHC exhibits a remarkably enhanced ability to predict 8-mer peptides. Further analysis shows that the degenerate encoding approach is the major contributor to the improvement over traditional approaches. To facilitate the use of ConvNeXt-MHC, we constructed a user-friendly online server, which is freely available to researchers worldwide. The details of the method are described in the following sections.

MATERIALS AND METHODS

Datasets

The dataset comprises two types of data for a given MHC allele: mass spectrometry (MS)–identified peptides [39] and peptide–MHC affinity (AF) data. AF data report the ‘half maximum inhibitory concentration’ (IC50), where a lower value indicates stronger affinity [40]. The IC50 values can be transformed into a continuous target scale (log50k) in the range of [0,1] according to Supplementary Table 4. We obtained MS data from recently published reports and AF data from the Immune Epitope Database (IEDB) [41]. Overall, the collected data were divided into five sets as described below.

  • (i) MS training set for ConvNeXt-MHC_AP: 1 048 575 records with 206 515 positive data points and 842 060 negative data points. These sequences were obtained from the SysteMHC Atlas project [42] and from TransPHLA-AOMP [43], as well as from studies by Sarkizova et al. [44], Abelin et al. [45] and Karosiene et al. [46].

  • (ii) MS test set for ConvNeXt-MHC_AP: 6342 records with 3133 positive data points and 3209 negative data points were obtained from the works of Pearson et al. [47] and Bulik-Sullivan et al. [46].

  • (iii) AF training set for ConvNeXt-MHC_BA: 210 509 records obtained from the IEDB database [41]. To obtain low-affinity peptides, we used NullSeq [48] to generate 38 633 peptides randomly, which were labeled with the predicted affinities estimated by MHCflurry2.0 and NetMHCpan4.0 (predicted IC50 > 1000 nM).

  • (iv) AF test set for ConvNeXt-MHC_BA: 887 binding affinity data points obtained from the testing dataset of NetMHCpan4.1 [15].

  • (v) Additional AF test set for ConvNeXt-MHC_BA: 212 records obtained from weekly data between Oct. 2021 and June 2023 in the IEDB database [41].
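The log50k target scale mentioned above is commonly written in the MHC binding literature as 1 − log(IC50)/log(50 000), clipped to [0,1]. The following minimal sketch assumes that standard form; the exact definition used in this work is the one in Supplementary Table 4, which is not reproduced here, and the function names `ic50_to_log50k` and `log50k_to_ic50` are illustrative only.

```python
import math

# Assumption: the standard log50k transform, 1 - log(IC50) / log(50000),
# with IC50 given in nM. Function names are hypothetical.
LOG_50K = math.log(50000.0)


def ic50_to_log50k(ic50_nm: float) -> float:
    """Map an IC50 value (nM) to [0, 1]; higher means stronger binding."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    score = 1.0 - math.log(ic50_nm) / LOG_50K
    # Clip so that IC50 < 1 nM maps to 1 and IC50 > 50000 nM maps to 0.
    return min(1.0, max(0.0, score))


def log50k_to_ic50(score: float) -> float:
    """Inverse transform: recover the IC50 (nM) from a log50k score."""
    return 50000.0 ** (1.0 - score)
```

Under this form, an IC50 of 500 nM corresponds to a score of about 0.426, and the 1000 nM cut-off used for the generated low-affinity peptides corresponds to about 0.362.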

To ensure the integrity of our analysis, we addressed potential overlaps between the MS testing dataset and the MS training dataset. We identified and removed any shared data points from the testing dataset, and we also excluded non-Homo sapiens data from both sets. The same procedure was applied to the AF testing and training datasets, ensuring a consistent approach across all datasets used in our study.
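The filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: records are assumed to be (allele, peptide, label) tuples, a shared data point is assumed to mean an identical (allele, peptide) pair, and human alleles are assumed to be recognizable by the `HLA-` prefix. The function names are hypothetical.

```python
def remove_overlap(train, test):
    """Drop test records whose (allele, peptide) pair also occurs in train."""
    seen = {(allele, peptide) for allele, peptide, _ in train}
    return [rec for rec in test if (rec[0], rec[1]) not in seen]


def keep_human(records):
    """Keep only records for human alleles (assumed to start with 'HLA-')."""
    return [rec for rec in records if rec[0].startswith("HLA-")]


train = [("HLA-A*02:01", "SLYNTVATL", 1), ("HLA-B*07:02", "APRGPHGGA", 0)]
test = [
    ("HLA-A*02:01", "SLYNTVATL", 1),  # also in train, so it is removed
    ("HLA-A*02:01", "GILGFVFTL", 1),
    ("H-2-Kb", "SIINFEKL", 1),        # mouse allele, removed by keep_human
]
clean_test = keep_human(remove_overlap(train, test))
```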

Pseudo sequence: We collected a total of 552 MHC-I-peptide structure files from the PDB database [49] and eliminated those with undefined amino acids and non-9-mer peptides. Ultimately, 354 pMHC-I structures were used to determine the pseudo sequence of MHC-I. The details are described in Supplementary Method 2, Supplementary Method 6, Supplementary Figure 1 and Supplementary Figure 2.

Degenerate coding

Traditional pseudo sequences are designed to capture the sequence features of MHC alleles. Each residue of the sequence is transformed into a 21-dimensional vector using the BLOSUM62 substitution matrix or one-hot codes. For example, MHCflurry uses BLOSUM62, and NetMHCpan 4.0 uses BLOSUM50 to encode pseudo sequences [50, 51]. Such models lack information on residue contacts between peptides and MHCs, although these contacts are key for peptide–MHC binding.

To incorporate contact information between MHC-I molecules and 9-mer peptides in their three-dimensional structure, we designed two-dimensional image-like arrays (ILAs) to represent MHC-I pseudo sequences and 9-mer peptide sequences. The ILA does not include the pseudo sequence explicitly; instead, the pseudo sequence is interlocked with the positions of the peptide (Figure 2). We therefore call this degenerate coding. The ILA has a width of 9, representing the length of the 9-mer peptide; a height of 20, representing the 20 amino acid types in the pseudo sequence; and a depth of 21, representing the contact marks and the 20 amino acids of the peptide. As shown in Figure 2A, we used Supplementary Method 3.1 to generate a pseudo sequence (20 × 9 × 1) for each MHC, where 1 or 0 indicates whether or not the corresponding position of the 9-mer peptide is in contact with that type of amino acid residue (Supplementary Method 3.1 and Supplementary Figure 3).
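To make the array layout concrete, the following sketch builds a 20 × 9 × 21 ILA from a 20 × 9 contact map and a 9-mer peptide. It is one possible reading of the description above, not the authors' implementation (the actual procedure is given in Supplementary Methods 3.2 and 3.3, which are not reproduced here). The amino acid ordering, the function name `build_ila`, the placement of the contact mark in channel 0 and the repetition of the peptide one-hot vector along the height axis are all assumptions.

```python
# Assumed alphabetical ordering of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def build_ila(contact_map, peptide):
    """Return a nested list of shape 20 (height) x 9 (width) x 21 (depth).

    contact_map[h][p] is 1 if peptide position p contacts an MHC residue of
    amino acid type h, else 0 (the 20 x 9 x 1 pseudo sequence).
    """
    if len(peptide) != 9:
        raise ValueError("this sketch handles 9-mer peptides only")
    if len(contact_map) != 20 or any(len(row) != 9 for row in contact_map):
        raise ValueError("contact_map must be 20 x 9")
    ila = []
    for h in range(20):
        row = []
        for p, residue in enumerate(peptide):
            cell = [0.0] * 21
            cell[0] = float(contact_map[h][p])   # channel 0: contact mark
            cell[1 + AA_INDEX[residue]] = 1.0    # channels 1..20: peptide residue
            row.append(cell)
        ila.append(row)
    return ila


# Toy contact map: peptide position 2 (index 1) contacts a tyrosine residue.
contact = [[0] * 9 for _ in range(20)]
contact[AA_INDEX["Y"]][1] = 1
ila = build_ila(contact, "SLYNTVATL")
```

The caption of Figure 2 mentions values of 0.90 and 0.05 for the peptide channels; those could be substituted for the 1.0 and 0.0 used in this sketch.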

Development of the deep learning model

ConvNeXt-MHC was designed following the classic ConvNeXt model [34], as illustrated in Supplementary Figure 4. The specific operation of the model is shown in Equations (1) and (2) [52].

(1)
(2)

Here |$\odot$| denotes the element-wise product, and |$i,j,k$| and |$l$| are the corresponding indices. The input matrix is |$X=\left[{X}_{\left[:,:,1\right]};\dots; {X}_{\left[:,:,C\right]}\right]$|, where |$C$| is the number of channels in the third dimension of |$X$| and |$c\in C$|. |${V}_{\mathrm{c}}$| is the convolution kernel for channel |$c$|; |$K$| and |$L$| are the convolution kernel sizes, with |$k\in K$| and |$l\in L$|; and |$i$| and |$j$| index positions along the first two dimensions of |$X$|. |$DepthConv$| denotes the depthwise convolution and |$WiseConv$| the pointwise convolution. The nomenclature is listed in Supplementary Table 1.
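For orientation, the standard depthwise and pointwise convolutions can be written with the symbols defined above as in the sketch below. This is the textbook form and is offered only as an aid to reading; it is not claimed to reproduce Equations (1) and (2) exactly. The pointwise weights |$W$|, the intermediate array |$Y$| and the output channel index |$d$| are symbols introduced here for illustration.

```latex
\begin{align}
DepthConv(X)_{[i,j,c]} &= \sum_{k\in K}\sum_{l\in L} {V}_{\mathrm{c}\,[k,l]} \odot X_{[i+k,\; j+l,\; c]} \\
WiseConv(Y)_{[i,j,d]} &= \sum_{c\in C} W_{[c,d]}\, Y_{[i,j,c]}
\end{align}
```

In the depthwise step each channel |$c$| is filtered by its own kernel |${V}_{\mathrm{c}}$|; the pointwise step then mixes the channels at each position |$(i,j)$|.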

Weight initialization for the attention mechanism

As shown in Figure 3A, we employed an attention mechanism to determine the impact of the amino acid type of the residue on the 9-mer peptide by adding neurons to each layer of the ILA. Unlike in previous studies [24, 53], we initialize the weights of the attention layer via transfer learning and prior knowledge [Equation (3)] and use the attention block to obtain the attention for each layer matrix [Equation (4)]. Adding attention to the matrix results in a hybrid attention matrix X’ [Equation (5)], which can increase the accuracy and generalizability of the predictive model (Supplementary Method 4).

(3)
(4)
(5)

|${frequency}_{Amino\_ acid}\left[h\right]$| represents the frequency of amino acid |$h$|, where |$h$| is one of the 20 amino acids. The initialization parameter for the layer corresponding to amino acid |$h$| is |$init\_{weight}_h$|. Equation (3) yields the network initialization value |$init\_{weight}_h$|, which is used to initialize the corresponding neurons |$Conv1{D}_h$|. Here |$X=\left[{X}_{\left[1,:,:\right]},\dots, {X}_{\left[h,:,:\right]}\right]$|, |$I$| is the identity matrix, |${a}_h$| is the attention value and |${X}^{\prime }$| is the input matrix mixed with attention, which is passed to the subsequent network layers.

Training data augmentation via semi-supervised learning

Utilizing the semi-supervised method to generate AF data with similar characteristics to MS data can improve prediction accuracy [54]. As shown in Figure 3B, we adopt data augmentation based on semi-supervised learning to convert the MS data into AF data to pretrain our deep learning model (Supplementary Method 5).

The transformation rule between the IC50 of AF data and the discrete values of antigen presentation is shown in Equation (6) [55].

(6)

Implementation of the overall scheme

The overall scheme was implemented in the following steps. First, we extracted the pseudo sequence corresponding to the input MHC-I allele as described in Supplementary Method 3.1. Second, we generated the ILAs for both the 9-mer peptide and the MHC-I pseudo sequence, following the procedures outlined in Supplementary Methods 3.2 and 3.3. Third, attention blocks and ConvNeXt networks were constructed, as illustrated in Supplementary Figure 4. The initialization of the attention matrix parameters is detailed in Supplementary Method 4. Next, we employed semi-supervised learning techniques, as explained in Supplementary Method 5. Lastly, we established the prediction models for ConvNeXt-MHC_AP and ConvNeXt-MHC_BA based on the aforementioned steps.

RESULTS

ConvNeXt-MHC employs degenerate coding to characterize the binding modes between peptides and MHC-I molecules and combines semi-supervised learning with the ConvNeXt model to increase its predictive power (Figure 1). The tool can predict the presentation of antigens (ConvNeXt-MHC_AP) and the binding affinity (ConvNeXt-MHC_BA). Our comprehensive benchmark tests are described in the following sections.

Figure 1

The architecture of the ConvNeXt-MHC. (A) The process of degenerate coding. (B) ConvNeXt-MHC model.

Figure 2

The data structure of degenerate coding. The ILA has 21 channels, with the first channel representing the contact information between the 9-mer peptide and the amino acid types of the MHC. The remaining 20 channels use a one-hot encoding to represent the specific amino acid type at each position of the 9-mer peptide. (A) Pseudo sequence: it is generated by Supplementary Method 3.1, where 1 represents the presence of the corresponding amino acid residue at the peptide position and 0 represents its absence. (B) Peptide residues: the contact amino acid type of the 9-mer peptide is assigned 0.90 and all others 0.05. (C) The degenerate coding of peptide and pseudo sequence.

Figure 3

The workflow of the ConvNeXt-MHC model. (A) Attention mechanism and its module for initialization. (B) Data augmentation via semi-supervised learning.

The performance of ConvNeXt-MHC_AP on the MS test set

To evaluate the antigen presentation mode of ConvNeXt-MHC (ConvNeXt-MHC_AP), we conducted benchmark tests using the MS test set (see Datasets). We compared our method with state-of-the-art methods, including PickPocket1.1 [56], ANN4.0 [7], NetMHCpan4.1 [57], MHCflurry2.0 [16], DeepHLApan [25] and BigMHC [58], using the same benchmark data. The evaluation metrics were accuracy, F1-score and the Matthews correlation coefficient (MCC), which are well-established criteria for assessing binary classification models (detailed descriptions are provided in Supplementary Table 4). The testing data were categorized into four groups based on peptide length. Overall, ConvNeXt-MHC_AP outperformed the other methods, achieving the highest accuracies, F1-scores and MCCs across all four categories (Figure 4). While the differences between methods were marginal for 9-mer peptides, ConvNeXt-MHC_AP demonstrated significant advantages over the other methods, most notably an improvement of more than 30% for 8-mer peptides (Figure 4A). This gain could be attributed to the fact that 8-mer peptides share highly similar binding modes with 9-mer peptides, even though far fewer training data are available for 8-mer than for 9-mer peptides (7397 versus 142 601 samples). Thus, the encoded binding information compensates for the scarcity of 8-mer peptides in the training data. These results highlight the effectiveness of our degenerate encoding and the ConvNeXt model in predicting non-9-mer peptides, which has been a challenging task in this field. A more detailed explanation and comprehensive statistical tests are given in Supplementary Table 5, and the process is outlined in Supplementary Figure 7.
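For reference, the three metrics named above can be computed from the confusion matrix of a binary classifier as in the following sketch. It uses the standard textbook definitions; the definitions used in this work are those of Supplementary Table 4, which is not reproduced here, and the function name `binary_metrics` is illustrative.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, F1-score and MCC for 0/1 labels (standard definitions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is conventionally set to 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy example: 3 binders and 3 non-binders, one error in each class.
scores = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it informative when binders and non-binders are imbalanced.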

Figure 4

Benchmarking results for ConvNeXt-MHC_AP and other state-of-the-art methods. (A) Results for 8-mer peptides. (B) Results for 9-mer peptides. (C) Results for 10-mer peptides. (D) Results for 11-mer peptides.

The performance of ConvNeXt-MHC_BA on AF test sets

Predicting the binding affinity of peptides is more challenging than predicting antigen presentation from the MS dataset, as it requires quantifying the strength of binding. To assess the performance of ConvNeXt-MHC in this context (ConvNeXt-MHC_BA), we conducted benchmark tests using the AF test set (see Datasets). We compared our method with state-of-the-art methods that support binding affinity prediction, namely, PickPocket, ANN4.0, NetMHCpan4.1, MHCflurry2.0.1 and CapsNet-MHC [19]. The results are presented as scatter plots (Figure 5) depicting the correlation between the experimentally determined values and the values predicted by each method. ConvNeXt-MHC_BA exhibited the strongest correlation between true and predicted values: the Pearson correlation coefficients for PickPocket, MHCflurry2.0.1, ANN4.0, NetMHCpan4.1, CapsNet-MHC and ConvNeXt-MHC_BA were 0.594, 0.455, 0.625, 0.675, 0.483 and 0.755, respectively. Furthermore, the squared error of ConvNeXt-MHC_BA was significantly less than that of the other methods (Supplementary Table 6). To further evaluate the performance, we employed an additional AF test set, which was collected from the recently released weekly IEDB data. The results showed that the squared error of ConvNeXt-MHC_BA was again significantly less than that of the other methods (Supplementary Figure 8).
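The two quantities reported above can be computed as in the following sketch, which uses the standard Pearson correlation coefficient and, as an assumption, the mean of the squared errors (the text does not state whether the squared error is summed or averaged). The function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy log50k-scaled affinities: measured versus predicted.
measured = [0.20, 0.45, 0.60, 0.85]
predicted = [0.25, 0.40, 0.70, 0.80]
r = pearson_r(measured, predicted)
mse = mean_squared_error(measured, predicted)
```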

Figure 5

The performance of PickPocket (A), MHCflurry2.0.1 (B), ANN4.0 (C), NetMHCpan4.1 (D), CapsNet-MHC (E) and ConvNeXt-MHC_BA (F) on the peptide–MHC interaction test set. The Pearson correlation coefficients are shown in the top left corner.

Once again, ConvNeXt-MHC_BA demonstrated remarkable advantages over the other methods in the benchmark tests.

Degenerate coding outperforms well-established coding methods

This success can be largely attributed to the effectiveness of degenerate coding and the ConvNeXt model. To reveal the contribution of degenerate coding, we implemented a simple CNN prediction model that allows head-to-head comparisons between degenerate coding and other well-established coding approaches, including the pseudo sequence coding described in NetMHCpan4.1, MHCflurry2.0.1 (the same coding as CapsNet-MHC), DeepHLApan and ANN. These models are referred to as DE-CNN, NetMHCpan-CNN, MHCflurry-CNN, DeepHLApan-CNN and ANN-CNN, respectively. The specific positions of the three pseudo sequences are outlined in Supplementary Table 2.

To ensure a fair and meaningful comparison, we trained and tested all the models, namely, DE-CNN, NetMHCpan-CNN, MHCflurry-CNN, DeepHLApan-CNN and ANN-CNN, using identical CNN structures and rigorous 5-fold cross-validation [39, 59]. Supplementary Table 3 and Supplementary Figure 6 provide comprehensive information regarding the training and testing process.
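As an illustration of the 5-fold protocol, the sketch below generates train and test index splits in which every sample is held out exactly once. It is a generic k-fold splitter, not the authors' code; the function name `k_fold_indices` and the fixed shuffling seed are assumptions, and using the same splits for every encoding is one way to keep such a comparison fair.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed gives reproducible folds
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(23, k=5))
```

The per-fold accuracies of each encoding can then be averaged and compared, which is the kind of summary reported in Table 1.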

Through extensive experiments conducted using 5-fold cross-validation on the MS training set, we obtained average accuracies of 0.953, 0.902, 0.928, 0.923 and 0.882 for DE-CNN, NetMHCpan-CNN, MHCflurry-CNN, DeepHLApan-CNN and ANN-CNN, respectively (Table 1). These results illustrate the statistically significant improvements of DE-CNN over the other traditional approaches. Furthermore, Supplementary Figure 5 presents detailed data on accuracy, F1-score and MCC, providing further evidence of the advantages conferred by degenerate encoding.

Table 1

The Shapiro test and t-test for encoding methods of DE-CNN, NetMHCpan, MHCflurry, ANN and DeepHLApan after 5-fold cross-validation

Encoding method    Average accuracy    Std      Shapiro    T-test (P-value)
MHCflurry-CNN      0.928               0.004    0.050      0.001
NetMHCpan-CNN      0.902               0.004    0.586      0.000
DE-CNN             0.953               0.008    0.710      1.000
DeepHLApan-CNN     0.923               0.002    0.457      0.001
ANN-CNN            0.882               0.008    0.431      0.000

In order to gain a deeper understanding of the enhancements brought about by ConvNeXt compared to CNN, we conducted a performance comparison between ConvNeXt-MHC and DE-CNN. These models only differ in terms of the machine learning algorithm utilized. Supplementary Tables 7 and 8 present the results, showing that ConvNeXt-MHC outperforms DE-CNN by 1.1% in accuracy, 1.04% in F1-score and 1.04% in MCC. These findings demonstrate the contribution of ConvNeXt to our methodology.

Collectively, these experimental findings solidify the superiority of our degenerate coding method over traditional approaches.

ConvNeXt-MHC online platform

To facilitate the use of ConvNeXt-MHC, we developed a user-friendly online predictive platform at http://www.combio-lezhang.online/predict/. The platform provides a range of functionalities (Figure 6). Users can conveniently submit their data and promptly receive a comprehensive list of predicted BA and AP values. For a more detailed analysis, the platform performs motif analysis [60] of the peptide sequences that bind a given allele, giving users a deeper understanding of the underlying patterns and characteristics of these peptides. Furthermore, it provides statistical analysis and allows users to download the MHC-I data collected for our study. By integrating these functionalities, the ConvNeXt-MHC online predictive platform serves as a convenient tool for researchers and practitioners, helping them to expedite their research on MHC-I binding prediction.

Figure 6

ConvNeXt-MHC online platform. (A) The data entry, (B) The results of BA and AP prediction, (C) The sequence logo of peptides and (D) The statistics of the data used in this model.

DISCUSSION

MHC-I plays a vital role in presenting signals to the immune system for detecting and eliminating infected or cancerous cells. The development of computational methods that accurately and efficiently predict peptide–MHC binding is of paramount importance in advancing immunotherapies and vaccine development. However, despite the existence of several methods, none of them fully meets the demanding requirements. In this work, we propose a novel semi-supervised model for peptide presentation by MHC class I molecules to improve MHC-I peptide affinity prediction. Our model incorporates degenerate coding, explicitly capturing the interaction between MHC and peptides. Furthermore, we leverage the ConvNeXt model to enhance our method. Extensive experiments on MHC-I-peptide binding prediction demonstrated that ConvNeXt-MHC outperforms state-of-the-art methods in terms of accuracy, F1-score and MCC on our benchmark datasets. Notably, ConvNeXt-MHC exhibited a markedly enhanced ability to predict non-9-mer peptides.

Although ConvNeXt-MHC showed strong performance, some shortcomings remain that we aim to address in subsequent work. First, our protocol relies on a machine learning model, which inherently depends on the size and quality of the training data. In our tests, we observed a noticeable decrease in prediction accuracy for a small number of MHC alleles that occur infrequently in the training data. Second, the degenerate coding approach relies on a static residue list for pair-wise interactions between MHC-I and peptides, without considering alterations that may occur due to variations in residue types. As a result, the static residue list may introduce errors when the sizes and physicochemical properties of residues vary, because the real pair-wise interactions can change. Third, our method runs more slowly than popular methods such as NetMHCpan4.1 and MHCflurry2.0. This is primarily due to the architecture of our model, which affects the overall computational efficiency.

To facilitate the use of ConvNeXt-MHC, we developed a user-friendly web server. This web server offers a comprehensive one-stop service, which includes predicting binding affinities, peptide residue preferences, MHC allele distributions and other valuable insights. By using ConvNeXt-MHC through this web server, researchers can analyze large-scale epitope sets, fostering new discoveries in the dynamic field of immunoinformatics.

Key Points
  • To encode the pair-wise interactions between the residues of peptides and MHCs, a new pseudo sequence coding method was developed.

  • ConvNeXt-MHC integrates transfer learning and semi-supervised learning methods into the cutting-edge deep learning framework ConvNeXt.

  • Comprehensive benchmark results demonstrate that ConvNeXt-MHC outperforms state-of-the-art methods in terms of accuracy.

  • ConvNeXt-MHC is implemented in a user-friendly web server that is freely available at http://www.combio-lezhang.online/predict/.

ACKNOWLEDGEMENTS

The authors wish to thank Prof. Zhi-Xiong Xiao of Sichuan University and Prof. Yang Zhang of the National University of Singapore for invaluable discussions.

FUNDING

This work was supported by grants from the National Natural Science Foundation of China [62372316 and 81973243], National Science and Technology Major Project (Nos. 2021YFF1201200 and 2018ZX10201002, China), China Postdoctoral Science Foundation (2020M673221, China), Fundamental Research Funds for the Central Universities (2020SCU12056, China), Sichuan Science and Technology Program (2022YFS0048) and Chongqing Technology Innovation and Application Development Project (CSTB2022TIAD-KPX0067).

DATA AVAILABILITY

The source codes are available at http://www.labshare.cn/ConvNeXt-MHC/. See Supplementary Materials and Methods for more details of materials and methods.

Author Biographies

Le Zhang is currently a professor in the College of Computer Science, Sichuan University, China. More information can be found at the website of his group: http://120.78.69.27/zhangle/index_English.html#.

Wenkai Song is currently a master's student at the College of Computer Science, Sichuan University, China. His research focuses on developing AI-based methods for computational biology.

Tinghao Zhu graduated from the College of Computer Science, Sichuan University, China. She is currently a researcher at the Nuclear Power Institute of China.

Yang Liu is currently a PhD candidate at the College of Life Sciences, Sichuan University, China. He mainly focuses on developing methods for sequence analysis and computer-aided drug design.

Wei Chen is currently a professor in Chengdu University of Traditional Chinese Medicine, China. More information can be found at the website of his group: https://chenweilab.cn/.

Yang Cao received his PhD from the Institute of Biophysics, Chinese Academy of Sciences. He is currently an associate professor at the College of Life Sciences, Sichuan University. His major research interests include developing new methods for computer-aided drug and protein design.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data