Abstract

The accurate prediction of peptide-major histocompatibility complex (MHC) class I binding probabilities is a critical endeavor in immunoinformatics, with broad implications for vaccine development and immunotherapies. While recent deep neural network based approaches have shown promise in peptide-MHC (pMHC) prediction, they have two shortcomings: (i) they rely on hand-crafted pseudo-sequence extraction, and (ii) they do not generalize well across datasets, which limits the practicality of these approaches. While existing methods rely on a 34 amino acid pseudo-sequence, our findings uncover the involvement of 147 positions in direct interactions between MHC and peptide. We further show that neural architectures can learn the intricacies of pMHC binding even from full sequences. To this end, we present PerceiverpMHC, which is able to learn accurate representations on full sequences by leveraging efficient transformer based architectures. Additionally, we propose IMGT/RobustpMHC, which harnesses the potential of unlabeled data to improve the robustness of pMHC binding predictions through a self-supervised learning strategy. We extensively evaluate RobustpMHC on eight different datasets and showcase an overall improvement of over 6% in binding prediction accuracy compared to state-of-the-art approaches. We compile CrystalIMGT, a crystallography-verified dataset that presents a challenge to existing approaches due to significantly different pMHC distributions. Finally, to mitigate this distribution gap, we develop a transfer learning pipeline.

The critical molecular interactions between peptides and the major histocompatibility complex (MHC) serve as the foundation for immune recognition, the body's ability to distinguish self from non-self, and are pivotal in initiating immune responses against pathogens and malignant cells [1]. Accurate prediction of peptide-MHC (pMHC) binding is crucial to understanding the complexities of immune recognition and holds significant implications for the design of immunotherapies, vaccines, and personalized medicine [2, 3]. The MHC class I and class II proteins play a pivotal role in the adaptive branch of the immune system. T cells [4] identify specific peptides presented on cell surfaces through MHC molecules, enabling them to target and respond to threats and thereby contributing to the body's defense against infections and diseases. The MHC I molecules present peptides to CD8+ T cells, whereas the MHC II molecules present peptides to CD4+ T cells, through the endogenous (direct) and exogenous (cross-presentation) pathways [5, 6]. Specifically, the Human Leukocyte Antigen (HLA) system [7], a subset of MHC in humans, governs the presentation of peptides to T cells, thus initiating adaptive immune responses. HLA molecules are highly polymorphic, contributing to individualized immune responses and influencing susceptibility to various diseases [8]. Understanding HLA diversity and its implications is of paramount importance in fields ranging from immunology and transplantation medicine to vaccine development and disease association studies. This dynamic interplay between HLA molecules and immune responses underscores the central importance of HLA in human health.

Predicting pMHC binding probability is a challenging computational task due to the vast combinatorial space of possible peptide sequences and the subtle nuances that govern their interactions with MHC molecules. Traditional approaches, such as scoring matrices and structural modeling, have made valuable contributions to this field. However, they often struggle to capture the intricate, high-dimensional relationships and the sequence patterns that are vital for accurate predictions. In recent years, machine learning, and especially deep learning, has emerged as a powerful tool for addressing this challenge. These methods leverage vast datasets of experimentally measured pMHC binding probabilities to learn complex patterns and relationships. Nonetheless, existing models still face limitations in effectively encoding and representing the structural and sequential features of both peptide and MHC sequences. This research aims to mitigate the limitations of existing approaches by learning representations that efficiently capture the pMHC binding process and generalize well across different datasets.

In conventional approaches, predicting peptide-HLA binding probabilities has predominantly relied on two methods: scoring matrices [9] and structure-based modeling [10]. Scoring matrices assign scores to pairs of amino acids based on their co-occurrence frequencies in binding or non-binding interactions. Position-Specific Scoring Matrices [11] refine this approach by considering the position of amino acids within a peptide. While these methods are computationally efficient, they may struggle to capture subtle, high-dimensional relationships [12]. Structure-based modeling [13], on the other hand, leverages experimental three-dimensional structures of peptide-HLA complexes to analyze the physical interactions. This approach offers high accuracy when structural data are available, but is limited by the availability of experimental structures [14, 15]. Both methods have played valuable roles in this field, but they do not fully capture the complexity of the binding process [16] and consequently do not generalize well to diverse peptide-HLA combinations.

Diverging from conventional approaches that often focus only on peptide sequences, NetMHCpan proposed a pseudo-sequence encoding method to characterize the interactions between HLA and peptides, in which an HLA sequence is reduced to a pseudo amino acid sequence of length 34 [17]. According to Nielsen et al. [17], the central peptide residues can interact with different subsets of residues in the binding groove due to multiple possible conformations, and all such residues were included in the pseudo-sequence. However, our investigation into MHC data from the IMGT [18] and PDB [19] databases has unveiled significant gaps in this approach. We compiled information from the IMGT/3Dstructure-DB structure database, encompassing approximately 1453 peptide-HLA binding entries incorporating polymorphic residues from A, B, and C alleles. Analyzing residue contacts within 4.0 Å, we find that 147 HLA positions participate in direct interactions with peptides, as shown in Fig. 1, in contrast to the limited set emphasized by pseudo-sequences. Notably, highly influential positions like 5, 123, and 146 were disregarded in [17]. Moreover, while some less common positions, such as 150, made it into the pseudo-sequence, many other critical ones were excluded, as shown in Fig. 2. This suggests that the current reliance on pseudo-sequences may have overlooked crucial aspects of MHC-peptide interactions, potentially hindering predictive accuracy and therapeutic development. This pseudo-sequence encoding has been adopted by most other machine learning approaches [20–23], except DeepSequencePan [24] and its follow-up work DeepAttentionPan [25]. DeepAttentionPan employs full HLA sequences, wherein each sample is represented as a residue sequence of length 280–287 (8–15 from the peptide and 272 from the HLA); however, its performance is significantly worse than that of pseudo-sequence based approaches like NetMHCpan. In NetMHCpan and similar approaches, a sample is instead represented as a residue sequence of length 42–49 (8–15 from the peptide and 34 from the HLA).
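To make the two input conventions concrete, the sketch below contrasts the pseudo-sequence and full-sequence encodings; the listed positions are hypothetical placeholders, not the actual 34 NetMHCpan contact positions.

```python
# Sketch of the two input encodings (illustrative only; the pseudo-sequence
# positions below are hypothetical placeholders, not the real NetMHCpan set).

PSEUDO_POSITIONS = [6, 8, 23, 44, 58, 61, 62]  # stand-in for the 34 contact positions

def pseudo_sequence_input(peptide: str, hla: str) -> str:
    """Peptide concatenated with a short HLA pseudo-sequence
    (42-49 residues in total when all 34 positions are used)."""
    pseudo = "".join(hla[i] for i in PSEUDO_POSITIONS)
    return peptide + pseudo

def full_sequence_input(peptide: str, hla: str) -> str:
    """Peptide concatenated with the full 272-residue HLA sequence (280-287 in total)."""
    return peptide + hla
```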

Figure 1. Representation of interacting positions. Comparison between the existing 34-residue pseudo-sequence representation and the experimentally verified MHC-peptide interactions involving 147 interacting positions. (a) The 34-length pseudo-sequence proposed by NetMHCpan [17] for binding prediction. The columns give the HLA residue numbering according to the IMGT nomenclature, while the rows show the interactions with the nine positions of the peptide. Squares in grey outline the peptide positions estimated to have contact with the corresponding HLA residue. (b) The aggregated 3D representation illustrating the interactions between 147 positions of MHC-I and the peptide. The structure is displayed in green, with blue highlighting the interacting positions, while the peptide within the groove is depicted in red.

Figure 2. Detailed representation of 147 MHC-I positions interacting with a 16-amino acid peptide. This figure illustrates the specific residues within the MHC-I sequence that are experimentally verified to interact with a 16-amino acid long peptide in the binding groove. The HLA residue numbering follows the IMGT numbering system. Each row corresponds to one of the sixteen peptide positions, while each column represents an interacting HLA residue. The color gradient, from dark blue to light green, indicates the strength of interaction: dark blue for strong, light green for weak, and white for no interaction. These colors fill the squares to show how often the peptide positions interact with the corresponding HLA residues.

In this research, we aim to learn efficient pMHC representations from full sequences. We achieve this by harnessing the power of recent deep learning techniques, particularly efficient transformer models like Perceiver IO [26], to predict pMHC interactions [27]. To this end, we present PerceiverpMHC, which leverages the latent cross-attention design of Perceiver IO and can therefore scale efficiently to longer sequences without requiring domain-specific architectural adaptations. This adaptability makes PerceiverpMHC an ideal candidate for modeling the pMHC binding phenomenon, as it can simultaneously consider the sequence information of both peptides and MHC molecules while capturing complex inter-dependencies. In our study, instead of focusing solely on peptide sequences, we take a broader approach by considering the full MHC sequences.

By encompassing the entirety of the MHC molecule, including both the peptide-binding groove and the surrounding regions, our model gains a nuanced understanding of the structural context in which peptides bind. This holistic approach not only accounts for polymorphism within the MHC gene but also captures the dynamic interplay between residues that may influence peptide presentation on the surface of the MHC molecule. Our model is therefore poised to provide more accurate and contextually relevant predictions of pMHC binding probabilities. While recent advances in deep neural networks have shown promise in pMHC binding prediction, their performance tends to drop when confronted with datasets having a slightly different distribution of peptide lengths, which limits the applicability of these approaches. To address this issue, we enhance the robustness of PerceiverpMHC by incorporating self-supervised learning, yielding a model we refer to as RobustpMHC. In contrast to traditional supervised learning relying solely on labeled data, self-supervised learning leverages both labeled and unlabeled samples to refine the model's understanding of the underlying relationships. In the context of pMHC binding, training RobustpMHC with self-supervision proves invaluable, as it allows us to harness a broader spectrum of data, augmenting experimentally validated interactions with synthetically generated data. Data augmentation not only enhances the robustness of our predictions but also enables the model to generalize more effectively to novel peptides and MHC sequences.

Figure 3. Full sequence evaluation. Comparison of PerceiverpMHC trained on full sequences with recent approaches based on the 34-residue pseudo-sequence, as well as with a full sequence method, displayed in separate panels for (a) Anthem Independent set, (b) Anthem External set, (c) Neoantigen data, and (d) HPV data.

Figure 4. Overall performance. Comparison of RobustpMHC with recent approaches, showing their average performance across multiple datasets including the Anthem Independent set, Anthem External set, Neoantigen data, and HPV data. Note: RobustpMHC achieves an overall 6% improvement over the state of the art.

Figure 5. Robustness evaluation. Comparison of RobustpMHC with recent approaches on (a) Anthem Independent set, (b) Anthem External set, (c) Neoantigen data, (d) HPV data, (e) COVID data, and (f) Binary set from the IEDB database.

Figure 6. Peptide length distribution and transfer learning performance. (a-e) Distribution of peptide lengths for different datasets. When datasets have a different distribution, as for (d) HPV and (e) CrystalIMGT, the performance of existing methods decreases significantly. (f-h) Evaluation of transfer learning on (f) CrystalIMGT and (g,h) the Neoepitope dataset.

Contributions: This research makes the following contributions:

  1. We propose the PerceiverpMHC architecture, which leverages attention mechanisms to predict pMHC binding from full MHC sequences (Section 2.2), eliminating the need for hand-designed pseudo-sequences, as shown in Fig. 3.

  2. To improve the generalization of pMHC binding predictions, we further propose RobustpMHC (Section 2.3), which utilizes self-supervised learning to enhance prediction performance across six datasets with varying peptide lengths (Fig. 4).

  3. We introduce CrystalIMGT (Section 2.1), a new crystallography-verified dataset of pMHC pairs, which presents a challenge to existing approaches due to its significantly different data distribution. The CrystalIMGT dataset, along with the benchmark of existing approaches, will be released to the research community for evaluating generalization.

  4. We present a transfer learning technique (Section 1.3) that substantially improves performance on the CrystalIMGT dataset (Fig. 6f), and demonstrate the transferability of the learned features for predicting neoantigen immunogenicity, compared with nine state-of-the-art methods (Fig. 6g,h).

In conclusion, we present a robust pMHC binding prediction strategy for full sequences and an exhaustive evaluation benchmark, which we believe will advance the field of pMHC binding prediction. All code and datasets are available through the link submitted via Nature administration; once this paper is accepted, we will also make our implementation, trained models, and datasets publicly available to the research community to facilitate reproducibility. Additionally, a user-friendly web interface for predicting pMHC binding is available at https://www.imgt.org/RobustpMHC.

Results

We designed our experiments to investigate the following research questions:

  • RQ1. Do we need to depend on hand-crafted pseudo-sequence extraction, or can neural networks autonomously discern crucial features required for pMHC binding prediction?

  • RQ2. How robustly do the binding prediction models generalize across diverse datasets and how does RobustpMHC compare with them?

  • RQ3. How does PerceiverpMHC compare with other transformer architectures, and how does RobustpMHC's learning strategy compare against other self-supervised learning approaches from the machine learning literature, such as data augmentation and contrastive learning?

  • RQ4. To what extent does the introduction of mutations contribute to enhancing the robustness of RobustpMHC? How many mutations are sufficient, and does a large number of mutations degrade performance?

To answer (RQ1), we compare PerceiverpMHC with recent pseudo-sequence approaches and DeepAttentionPan (the only full-sequence based approach) on six datasets in Section 1.1. In relation to (RQ2), we demonstrate across eight different datasets that RobustpMHC significantly outperforms the state of the art overall, as illustrated in Fig. 4. Even at the individual level, RobustpMHC either outperforms the state-of-the-art approach or performs comparably on each dataset, as outlined in Section 1.2. Furthermore, in Section 1.3 we show that the learned representations of RobustpMHC can be transferred to datasets that differ significantly from the training dataset. Finally, we present ablation studies in Section 1.4 to answer the remaining research questions (RQ3, RQ4).

Full sequence evaluation

We initially evaluate whether neural networks can inherently learn which amino acids are crucial for binding given the full MHC sequence, or whether hand-crafted pseudo-sequences are necessary (RQ1). To this end, we evaluate the performance of PerceiverpMHC on four publicly available benchmarks: the independent and external sets from the Anthem [28] dataset, the Neoantigen [21] dataset, and the human papillomavirus (HPV) [29] dataset, and compare with state-of-the-art pseudo-sequence based approaches like TransPHLA [21], CapsNet [22], NetMHCpan-4.1 [20], and BigMHC [23].

The independent set from the Anthem dataset [28] contains 112 types of HLA alleles, whereas the external set contains five HLA alleles. The Neoantigen data [21] comprise 221 experimentally verified pHLA binders compiled from recent works on non-small-cell lung cancer, melanoma, ovarian cancer, and pancreatic cancer. The HPV dataset [21] comprises 278 experimentally verified pHLA binders derived from HPV16 proteins E6 and E7; these binders consist of 8–11-mer peptides. Following TransPHLA [21], we report four metrics on the independent and external sets: area under the ROC curve (AUC), accuracy (ACC), Matthews correlation coefficient (MCC), and F1 score. For datasets such as Neoantigen and HPV that contain only experimentally verified binders, we report the true positives and false negatives during prediction. Furthermore, we compare our PerceiverpMHC approach with DeepAttentionPan [25], which is also a full sequence based deep pHLA binding prediction approach.
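The paper does not prescribe an implementation for these metrics; a minimal sketch using scikit-learn (our choice of library, not mandated by the text) could look as follows.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute AUC, accuracy, MCC, and F1 from predicted binding probabilities."""
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "AUC": roc_auc_score(y_true, y_prob),   # uses the raw probabilities
        "ACC": accuracy_score(y_true, y_pred),  # the rest use thresholded labels
        "MCC": matthews_corrcoef(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }
```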

It can be seen from Fig. 3a-c that, even on full sequences, PerceiverpMHC is on par with the state-of-the-art pseudo-sequence based approaches on three evaluation datasets. While PerceiverpMHC is slightly inferior to TransPHLA and CapsNet on the HPV dataset (Fig. 3d), it is superior to the other approaches, including NetMHCpan-4.1 and BigMHC-EL. This showcases that deep neural networks can implicitly learn to attend to the important amino acids without external human supervision, i.e. hand-crafted pseudo-sequences. Furthermore, in comparison to the previous full sequence method DeepAttentionPan [25], PerceiverpMHC achieves far better prediction performance, which clearly demonstrates the superiority of transformer based architectures. It is worth mentioning that TransPHLA has shown superior performance over 14 popular approaches [24, 25, 28, 30–38]; therefore, for brevity, we do not compare against them directly.

Robustness evaluation

We next evaluate how well our self-supervised robust model generalizes across different datasets for peptide-HLA-I binding prediction (RQ2). To this end, we evaluate RobustpMHC on six datasets: the independent and external splits from the Anthem dataset, the HPV dataset, and the Neoantigen dataset, together with the COVID dataset and the IEDB binary evaluation set. The coronavirus disease 2019 (COVID-19) outbreak was a worldwide emergency, as its rapid spread and high mortality rate caused severe disruptions. Understanding the immune patterns of COVID-19 is crucial for elucidating the mechanisms of SARS-CoV-2-induced immune changes, their effect on disease outcomes, and their implications for potential COVID-19 treatments [39]. The COVID dataset consists of 29 experimentally verified positive and 89 negative samples from different sources [40, 41].

The IEDB dataset [42] contains three different measurement types: IC50, T1/2, and binary. Here we consider the binary set, which consists of 3845 experimentally verified EL (eluted ligand; mass spectrometry data comprising naturally presented MHC ligands [43]) binding samples and 3656 negative samples. Like Anthem, IEDB is one of the prominent publicly available benchmarks; its distribution differs slightly from Anthem's, and it therefore provides a diverse set for evaluating the generalization of the approaches.

RobustpMHC demonstrates superior overall performance compared to the state of the art across various datasets, indicating its effectiveness in pMHC binding prediction, as shown in Fig. 4. We further present an in-depth analysis of performance on individual datasets in Fig. 5. For the independent set (Fig. 5a), which has a peptide length distribution very similar to the training data, the performance of RobustpMHC is slightly lower than that of CapsNet. In contrast, on the external dataset (Fig. 5b) RobustpMHC shows superior performance. On the Neoantigen data, RobustpMHC also performs slightly better, as shown in Fig. 5c. On the HPV dataset (Fig. 5d), RobustpMHC is slightly behind TransPHLA but significantly better than CapsNet and NetMHCpan-4.1. On the COVID dataset (Fig. 5e), RobustpMHC is again superior, marginally outperforming NetMHCpan-4.1 and significantly outperforming TransPHLA and CapsNet. On the IEDB dataset (Fig. 5f), RobustpMHC performs marginally better than CapsNet and TransPHLA and notably better than NetMHCpan-4.1. In conclusion, RobustpMHC obtains state-of-the-art results on four of the six datasets. Furthermore, RobustpMHC demonstrates consistent performance across all six datasets, which showcases the robustness of our approach. It is worth noting that RobustpMHC and PerceiverpMHC work on full sequences, while the other baselines in Fig. 5 require pseudo-sequences. Finally, RobustpMHC consistently outperforms PerceiverpMHC on all six datasets. We can therefore conclude that mutation-robust training aids generalization and can significantly improve the practicality of pMHC prediction approaches.

Transfer learning evaluation

While learning based approaches have shown promising results for pHLA binding prediction, their performance steeply deteriorates on data with significantly different distributions. In Fig. 6a-e, we plot the distribution of peptide lengths across different datasets. The distribution of the Anthem independent set is very similar to the training set, and thus the performance of end-to-end approaches is better, as seen in Fig. 5a. On the other hand, for the HPV dataset, where the distribution of peptide lengths is different (compare Fig. 6d with Fig. 6a), the performance of all approaches deteriorates significantly, as depicted in Fig. 5d. Hence, for an extensive evaluation of out-of-domain generalization, we select two datasets: (i) the Neopeptide dataset presented by the very recent method BigMHC [23], comprising 198 positive and 739 negative instances, which has served as a benchmark for transfer learning; and (ii) a dataset we compiled from IMGT/3Dstructure-DB [40], which we refer to as CrystalIMGT, an experimentally verified dataset comprising 14,997 crystallographic samples. The peptide length distributions of these datasets are shown in Fig. 6e,f, respectively. It can be seen from Fig. 6f-h that existing approaches suffer significant performance degradation on these datasets, and especially on CrystalIMGT.

To tackle such large distribution shifts, we present a transfer learning paradigm for RobustpMHC. For CrystalIMGT, we randomly select 10% of the data and treat it as an independent training set for transfer learning, while for the Neopeptide dataset we follow BigMHC and use the independent training and testing sets provided by its authors for fair evaluation. For transfer learning, we keep our feature extraction module, i.e. the Perceiver IO module (see Fig. 7), frozen and retrain the projection module with a smaller learning rate ($10^{-4}$). We refer to the resulting model as RobustpMHC-TL. The evaluation results in Fig. 6f-h show that this transfer learning paradigm can significantly improve the performance of pHLA binding prediction even on novel datasets that differ significantly from the training dataset.

Figure 7. Pipeline of the proposed RobustpMHC approach. Both the MHC and peptide are mutated with small probability and concatenated to form the input sequence. We then train two identical networks containing Perceiver IO blocks for feature extraction, and projection and prediction blocks for predicting the binding probability from the Perceiver IO features. We refer to them as the teacher ($\phi(\cdot)$) and the student network ($\phi^{\prime}(\cdot)$). The teacher network receives the original sequence ($x$), while the student gets the mutated sequence ($x^{\prime}$). The teacher also has access to the true labels ($y$) and can be trained directly by minimising the prediction loss $\mathcal{L}_{p}$. We train the student by matching its latent features [$o^{\prime}, z^{\prime}$] with their counterparts from the teacher, minimising the consistency loss $\mathcal{L}_{c}$. The central idea is to make the student robust to small mutations.

Ablation studies

Comparison with data augmentation and contrastive learning: We next evaluate the contribution of robust training by comparing RobustpMHC with its variant without robust training, i.e. PerceiverpMHC. Furthermore, is the objective in Equation (1) the best choice for robust training (RQ3)? To address this question, we compare our robust training pipeline with two popular self-supervised learning approaches: (i) data augmentation [44] and (ii) contrastive learning [45]. In data augmentation, the existing dataset is augmented with synthetic data obtained by transforming the original samples. Here we consider random mutation as the transformation and retrain PerceiverpMHC on the combined dataset; we refer to this model as PerceiverpMHC-Re. In contrastive learning (specifically noise contrastive learning), each sample in a mini-batch of the training data is passed through a transformation, and the infoNCE loss [45] is used to learn representations robust to that transformation. Again considering random mutations as the transformation, we train a variant of our model that minimizes the infoNCE loss instead of the loss in Equation (1); we refer to this variant as PerceiverpMHC-CE. From supplementary Fig. S3, it can be seen that our approach (RobustpMHC) consistently outperforms PerceiverpMHC, PerceiverpMHC-Re, and PerceiverpMHC-CE, which validates the contribution of our loss function in obtaining better generalization across diverse datasets.

Other transformer architectures for full sequence pMHC binding prediction: To validate the selection of Perceiver IO as our base module for encoding pMHC sequences, we train a vanilla transformer [46] on full MHC sequences (TransformerpMHC) and another popular efficient transformer architecture, Reformer [47] (ReformerpMHC), and compare them with PerceiverpMHC. We observe from supplementary Fig. S2 that PerceiverpMHC performs only slightly below TransformerpMHC while being over 4 times faster in both training and evaluation on a single RTX 3090 GPU, whereas ReformerpMHC performs significantly worse, as seen in supplementary Fig. S1. It is worth mentioning that in drug discovery a wide range of peptides is evaluated; moreover, before evaluation, the peptide sequences are divided into sub-sequences of different lengths (8–14-mers) and each sub-sequence is evaluated against the MHC to identify binding candidates, which makes evaluation speed crucial for practical reasons. This motivates our choice of PerceiverpMHC as our base model.

Effect of the number of mutations for robust training: Our training pipeline aims to make pMHC representations robust to a small amount of mutation in the MHC and peptide sequences. We hypothesize that a small amount of mutation, which can be seen as noise in the data, on average makes the neural network robust; for large amounts of mutation, however, the properties of the allele and peptide change, resulting in different binding probabilities. To evaluate this hypothesis (i.e. answer RQ4), we train RobustpMHC with a small amount of mutation, $m \sim \mathcal{U}(0, 0.15)$, and compare it with one trained with a large amount of mutation, $m \sim \mathcal{U}(0, 0.35)$. As supplementary Fig. S4 shows, small mutation rates improve the efficacy of the approach while large mutation rates degrade performance.

Effect of different peptide lengths: While several existing approaches are able to work with different peptide lengths, their performance varies noticeably as the peptide length changes. This is due to the bias of the training data, since the networks expect similar distributions. Here we explicitly study the effect of peptide length on the performance of RobustpMHC on the Anthem external dataset. As supplementary Fig. S5 shows, RobustpMHC's performance drops for peptides with lengths greater than 9, since most of the training data consists of peptides of lengths 8 and 9. This is a common issue of learning based approaches, and similar observations have been reported by TransPHLA and CapsNet.

Method

Datasets

Anthem dataset: The Anthem dataset [28] comprises three subsets: the training set, used for model training and model selection, and the independent and external test sets, used for model evaluation and method comparison. The data sources for the training and independent test sets are the same: (i) four public HLA binder databases [42, 48–50], and (ii) allotype-specific HLA ligands identified by mass spectrometry in previously published studies, as described in TransPHLA [21]. The training dataset covers 112 types of HLA alleles, including 359,166 EL and 1,795,830 negative instances. The peptides of the negative data are sequence segments randomly chosen from the source proteins of IEDB HLA immunopeptidomes. The independent test set covers 112 types of HLA alleles with 85,876 positive and 85,562 negative instances. Because the independent test set and the training set share the same sources, their data distributions are very similar; TransPHLA [21] therefore used an additional external set, consisting of 5 types of HLA alleles with 51,984 positive and 51,881 negative instances, to perform a fairer evaluation.

Neoantigen dataset: The Neoantigen dataset presented by Chu et al. [21] consists of 250 neoantigen samples compiled from experimentally verified non-small-cell lung cancer, melanoma, ovarian cancer, and pancreatic cancer pHLA binders.

HPV dataset: The HPV dataset presented by Bonsack et al. [29] studies one of the most common sexually transmitted diseases. It comprises 278 experimentally verified pHLA binders from HPV16 proteins E6 and E7, consisting of 8–11-mer peptides. Following [29], we consider a peptide a 'binder' if IC50 $< 100\,\mu$M for the HPV vaccine data.

COVID dataset: Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, first appeared in Wuhan, China, in December 2019 and was declared a pandemic by the World Health Organization on 11 March 2020 [51], going on to affect millions of people worldwide [52]. Several efforts have been made to identify HLA-related susceptibility to SARS-CoV-1, which belongs to the same family [53]; understanding how HLA variation affects both susceptibility and severity of COVID-19 could help identify risk and may support future vaccination strategies. The experimental dataset, collected from the IMGT/3Dstructure-DB structure database [40], contains 42 experimentally verified crystallographic samples.

CrystalIMGT dataset: We compiled CrystalIMGT from the IMGT/3Dstructure-DB database [18], a unique resource providing detailed expert annotations on structural data of IG, TR, MHC, and RPI from human and other vertebrate species, extracted from the Protein Data Bank (PDB) [19]. IMGT/3Dstructure-DB is a publicly available structure database [40] whose experimentally verified 3D data contain 15,395 pHLA binders. It integrates data from sequence and structural sources and identifies the IMGT genes and alleles expressed in the IG, TR, and MHC entries with known 3D structures.

Neoepitope dataset: For evaluating the transfer learning capability of our approach, we follow BigMHC [23]. We use the provided training (positive = 1407; negative = 4778) and validation (positive = 173; negative = 515) splits for transfer learning of our projection block while keeping the weights of the Perceiver IO block frozen. The transfer learning evaluation set consists of 198 immunogenic and 739 non-immunogenic neoepitopes, obtained after removing the intersection with all other pMHC data and compiled from sources such as NEPdb [54], Neopepsee [55], TESLA [56], and data collected from 16 cancer patients using the MANAFEST assay [23]. Only peptides of length at least 8 and at most 11 were kept, and peptides containing the dummy amino acid 'X' were removed.

MHC-I database: The IPD-IMGT/HLA database [57] is the official repository for the WHO Nomenclature Committee for Factors of the HLA System; it receives submissions from laboratories in over 46 countries and has active website users in over 150 countries. A total of 36,036 MHC-I sequences across 38 species from IPD-IMGT/HLA version 3.53.0 were used for self-supervised learning.

PerceiverpMHC network architecture

PerceiverpMHC leverages a cross-attention mechanism [46] in latent space to extract features from both peptide and MHC sequences. To ensure computational efficiency, especially with longer sequences, we employ an efficient transformer variant, specifically Perceiver IO [26]. The peptide and full HLA sequences are concatenated and padded to form a sequence of length 400. This joint sequence is transformed into feature embeddings using a fixed-size dictionary, similar to [21], to which positional embeddings are added. The result is passed to the Perceiver IO [26] block, which extracts latent features from these embeddings (as explained below). The extracted features are then passed through the Projection and Prediction blocks to obtain the binding probabilities. The entire network is trained in a supervised fashion using ground truth labels from the datasets by minimising a cross-entropy loss. The Perceiver IO, Projection, and Prediction blocks are described below.
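The following PyTorch sketch outlines this pipeline under stated assumptions (vocabulary size, embedding width, and module names are ours; the encoder stands in for the Perceiver IO block described next):

```python
import torch
import torch.nn as nn

VOCAB, MAX_LEN, D = 25, 400, 64  # vocabulary and embedding sizes are assumptions

class PerceiverpMHCSketch(nn.Module):
    """Schematic of the PerceiverpMHC pipeline, not the exact implementation."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)                 # fixed-size dictionary
        self.pos = nn.Parameter(torch.zeros(MAX_LEN, D))  # positional embedding
        self.encoder = encoder                            # Perceiver IO block
        self.head = nn.Sequential(                        # projection block (3 layers)
            nn.Flatten(1), nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, tokens):                 # tokens: (B, 400), peptide + padded HLA
        x = self.tok(tokens) + self.pos        # token plus positional embeddings
        z = self.encoder(x)                    # latent features
        return self.head(z).softmax(-1)[:, 1]  # binding probability
```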

Perceiver IO block: In this work, we aim to work with entire sequences and showcase that, using attention mechanisms, deep neural networks can learn representations of full sequences efficiently, as can be seen from Fig. 3. To this end, we base our architecture on efficient transformer architectures such as Perceiver IO [26]. Since standard transformers scale poorly in both compute and memory, they require significantly more training time to process full sequences. Perceiver IO instead uses cross-attention to map large input sequences to a smaller number of latent features; processing is performed entirely on these latent tokens, and finally they are decoded to an output space, as shown in Fig. 8(a). Therefore, Perceiver IO has no quadratic dependence on the input or output size: the encoder and decoder attention modules depend linearly on the input and output size, respectively, while the latent attention is independent of both [26].
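A minimal sketch of the latent cross-attention idea (a simplification of Perceiver IO; layer counts, widths, and the omission of the output decoder are our choices) is shown below. It can be plugged in as the `encoder` of the pipeline sketch above.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Simplified Perceiver-style encoder: a small set of learned latents
    cross-attends to the long input, so cost grows linearly with input length."""

    def __init__(self, dim=64, n_latents=32, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                                  # x: (B, L, dim), L ~ 400
        z = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        z, _ = self.cross_attn(z, x, x)  # latents (queries) attend to the input
        z, _ = self.self_attn(z, z, z)   # processing entirely in latent space
        return z                         # (B, n_latents, dim)
```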

Figure 8. Building blocks of the proposed RobustpMHC approach. (a) Perceiver IO block: it projects the input sequence with a large number of tokens to a smaller number of latent tokens using cross-attention, and all processing is done in the latent space. (b) Projection block: the projection block is a 3-layer neural network that projects the Perceiver features to the classifier space.

Projection and Prediction blocks: To obtain binding scores from the Perceiver IO features, we flatten the latent features from the decoder of Perceiver IO. These features are then propagated through a 3-layer shallow neural network with dense layers and rectified linear unit (ReLU) activations, similar to TransPHLA [21], as shown in Fig. 8(b). For prediction, we take the softmax of the output score of the projection block to obtain the binding probability. Similar to previous approaches [21, 24], we threshold the binding probability at $0.5$ to convert it into class labels.

RobustpMHC network architecture

Similar to PerceiverpMHC, RobustpMHC employs Perceiver IO [26] to extract features from both peptide and MHC sequences. Additionally, to train our model to capture features robust to minor mutations, we introduce small random mutations into the peptide and MHC sequences and train a student model to be resilient to these perturbations in a self-supervised manner. To predict binding probabilities from the learned representations, we incorporate a non-linear projection block followed by a classifier. The overall architecture of RobustpMHC is depicted in Fig. 7; the mutation block is described below, while the other components, i.e. the Perceiver IO, Projection, and Prediction blocks, are the same as in PerceiverpMHC (Section 2.2).

Mutation block: The mutation block is inspired by masked language modelling and contrastive self-supervised learning in natural language processing [58], where an input token is randomly masked; such designs have also shown significant robustness improvements in other domains such as computer vision [59]. In this work, we instead mutate a token representing the embedding of an amino acid. For each batch, we sample a mutation probability uniformly, $m \sim \mathcal{U}(0, 0.15)$ for the peptide and $m \sim \mathcal{U}(0, 0.05)$ for the MHC. Then, for each amino acid in the sequence, we independently sample a random variable $p_{i}$ and mutate amino acid $i$ if $p_{i} < m$ (a Bernoulli trial per residue). Note that if we kept the mutation probability $m$ fixed, this could lead to a distribution shift, since at test time peptides are not mutated; we therefore randomly sample the mutation probability $m$ from a uniform distribution.
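A sketch of this mutation scheme on integer-encoded sequences (our implementation guess; the alphabet size and the substitution-by-a-random-amino-acid rule are assumptions):

```python
import torch

N_AMINO = 20  # amino acid alphabet size (token ids 0..19 assumed)

def mutate(tokens: torch.Tensor, m_max: float) -> torch.Tensor:
    """Sample a per-batch mutation rate m ~ U(0, m_max), then mutate each
    position independently with probability m (one Bernoulli trial per residue)."""
    m = torch.rand(()) * m_max                            # batch-level mutation rate
    mask = torch.rand_like(tokens, dtype=torch.float) < m  # p_i < m => mutate residue i
    random_aa = torch.randint_like(tokens, N_AMINO)        # substitute a random amino acid
    return torch.where(mask, random_aa, tokens)

# The peptide and MHC parts use different maximum rates (m_max = 0.15 and 0.05).
```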

Algorithm 1. Self-supervised training of RobustpMHC.

Self-supervision for robust feature learning

In the ablation studies (Section 1.4), we show that simply training the network with mutations or using a contrastive loss results in inferior performance. Therefore, we present a self-supervision strategy inspired by knowledge distillation to learn robust features with the help of mutations. Our strategy comprises training two identical networks together. The first network, which we refer to as the teacher ($\phi(x)$), has access to unmasked inputs ($x$) and true class labels ($y$). The second network, which we call the student ($\phi^{\prime}(x)$), can only observe the mutated input $x^{\prime}$ and has to regress to the latent features $[o, z]$ of the teacher network, as shown in Fig. 7(a). Therefore, our objective function ($\mathcal{L} = \mathcal{L}_{p} + \mathcal{L}_{c}$) consists of two losses, the prediction loss ($\mathcal{L}_{p}$) and the consistency loss ($\mathcal{L}_{c}$),

$$\mathcal{L} \;=\; \underbrace{\mathrm{CE}\big(\phi(x),\, y\big)}_{\mathcal{L}_{p}} \;+\; \underbrace{\big\lVert\, [o^{\prime}, z^{\prime}] - sg\big([o, z]\big)\, \big\rVert^{2}}_{\mathcal{L}_{c}} \tag{1}$$

where CE is the categorical cross-entropy loss and $sg(\cdot)$ is the stop-gradient operator that prevents gradient flow to the teacher during backpropagation, so that only the student is trained on the gradients of $\mathcal{L}_{c}$. During the evaluation phase, we only consider the predictions from the student network, i.e. $\hat{y}^{\prime} = \phi^{\prime}(x)$; note that no mutations are applied to the input sequence at evaluation time. This self-supervision is summarized in Algorithm 1.
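A condensed sketch of one teacher-student training step implementing Equation (1) (the tuple returned by the networks and the squared-error form of the consistency loss are our assumptions):

```python
import torch
import torch.nn.functional as F

def training_step(teacher, student, x, x_mut, y, optimizer):
    """One step of the self-supervision in Algorithm 1 (sketch).
    Both networks are assumed to return (o, z, logits)."""
    o, z, logits = teacher(x)        # teacher sees the original sequence x
    o_s, z_s, _ = student(x_mut)     # student sees the mutated sequence x'

    loss_p = F.cross_entropy(logits, y)       # prediction loss L_p (teacher vs labels)
    loss_c = (F.mse_loss(o_s, o.detach())     # consistency loss L_c; detach()
              + F.mse_loss(z_s, z.detach()))  # acts as the stop-gradient sg(.)

    loss = loss_p + loss_c                    # Equation (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At evaluation time, only the student is queried on unmutated inputs, matching $\hat{y}^{\prime} = \phi^{\prime}(x)$ above.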

Transfer learning

In our study, transfer learning is applied by freezing the Perceiver IO module of the pretrained student model to retain the general features learned from the original dataset, while the projection block is fine-tuned at a lower learning rate ($10^{-4}$). This approach allows us to adapt the model to new data distributions while maintaining the robustness of the original feature extraction.
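In PyTorch terms, this amounts to disabling gradients for the feature extractor and building the optimizer over the projection parameters only (attribute names follow our pipeline sketch above and are assumptions):

```python
import torch
import torch.nn as nn

def prepare_transfer_learning(model: nn.Module) -> torch.optim.Optimizer:
    """Freeze the Perceiver IO feature extractor and fine-tune only the
    projection head at a smaller learning rate (attribute names assumed)."""
    for p in model.encoder.parameters():
        p.requires_grad = False                # keep the learned general features
    return torch.optim.Adam(model.head.parameters(), lr=1e-4)
```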

Implementation details

We train RobustpMHC on a single GeForce RTX 3080 GPU with 24 GB of memory. The model was implemented in PyTorch 1.10.0 with CUDA 11.7. We train RobustpMHC for 100 epochs with early stopping, retaining the checkpoint with the best validation performance. Following TransPHLA [21], for fair comparison with previous approaches we also train five models, dividing the negative samples of the training set into five parts for five-fold cross-validation; however, we report the mean performance, unlike TransPHLA, which reports the performance of the best model. We use a batch size of 1024 for training, the largest we could fit on a single GPU. We employ the Adam optimiser with a learning rate of 0.001 and reduce the learning rate to one-tenth when the validation performance has not improved for 4 epochs.
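This schedule maps directly onto standard PyTorch utilities; a sketch is given below (the `train_one_epoch` and `validate` helpers are hypothetical placeholders).

```python
import torch

def fit(model, train_loader, val_loader, train_one_epoch, validate, epochs=100):
    """Training loop sketch: Adam at lr 1e-3, LR divided by 10 on a 4-epoch
    validation plateau, and checkpointing of the best model (early stopping)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.1, patience=4)
    best = -float("inf")
    for _ in range(epochs):
        train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
        score = validate(model, val_loader)              # hypothetical helper
        scheduler.step(score)                            # driven by validation metric
        if score > best:                                 # keep the best checkpoint
            best = score
            torch.save(model.state_dict(), "best.pt")
    return best
```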

Compared methods

We exhaustively evaluate our approach through two types of case studies: on the one hand, we compare our approach on six different datasets to validate robustness; on the other hand, we evaluate the transfer learning capability of our approach on two datasets. In these studies, we compare against state-of-the-art methods such as TransPHLA, CapsNet, the IEDB-recommended method (NetMHCpan-4.1), over fifteen IEDB baseline methods, and the recently published transfer learning based method BigMHC; RobustpMHC achieves the most consistent performance across all experiments. We briefly describe a few selected recent approaches below and highlight their contrast with RobustpMHC. It is worth noting that, apart from our approach, only DeepAttentionPan can take full sequences; the other approaches require pseudo-sequences.

NetMHCpan-4.1 [20] is a popular tool for simultaneously predicting BA (binding affinity, typically derived from assays that measure the strength of interaction between peptides and MHC molecules [60]) and EL. This method consists of an ensemble of 100 single-layer networks, each of which consumes a peptide 9-mer binding core and a sub-sequence of the MHC molecule. The 9-mer core is extracted by the model, whereas the MHC representation, termed a 'pseudo-sequence,' is a predetermined 34-mer core extracted from the MHC molecule sequence. The 34 residues were selected based on their estimated proximity to bound peptides, so that only residues within 4Å were included.

MHCflurry-2.0 [61] is an ensemble of neural networks that predicts BA and EL for MHC-I. BA prediction is the output of a neural network ensemble, where each member is a two- or three-layer feed-forward neural network. Then, an antigen processing convolutional network is trained on a subset of the BA predictions, along with the regions flanking the N-terminus and C-terminus, to capture antigen processing information that is missed by the BA predictor. EL prediction is the result of logistically regressing BA and antigen processing outputs.

TransPHLA [21] is a transformer-based model that adopts NetMHCpan pseudo-sequences for MHC encoding. This model encodes peptides and MHC pseudo-sequences using the original transformer encoding procedure before inferring the encodings via ReLU-activated fully connected layers. TransPHLA was trained on the Anthem data for binary binding prediction.

MHCnuggets [62] comprises many allele-specific LSTM networks to handle arbitrary-length peptides for MHC-I and MHC-II. Transfer learning was used across the alleles to address data scarcity. MHCnuggets was trained on qualitative BA, quantitative BA, and EL data, using up to two orders of magnitude less data than the other methods.

HLAthena [63] uses three single-layer neural networks trained on mass spectrometry data to predict presentation of short peptides with lengths in the range [8, 11]. Each of the three networks was trained on a separate peptide encoding: one-hot, BLOSUM62 [64], and PMBEC. In addition, the networks consumed peptide-level characteristics as well as amino acid physicochemical properties. The outputs of these networks were used to train logistic regression models that also accounted for proteasomal cleavage, gene expression, and presentation bias. HLAthena also saw performance gains when considering RNA-seq as a proxy for peptide abundance.

CapsNet [22] proposes a capsule neural network architecture for learning representations on Anthem's datasets. The CapsNet-MHC architecture is built upon four major units: (a) a data encoder, (b) a feature extractor, (c) a binding dependency extractor, and (d) a binding predictor. It uses CNNs with attention mechanisms for feature extraction from the encoded peptide and MHC sequences, using the BLOSUM62 [64] input encoding.

BigMHC [23] presents a transfer learning method for the prediction of immunogenic neoepitopes. Initially, the model is pre-trained on publicly available data of MHC class I peptide presentation; subsequently, it undergoes a transfer learning process, leveraging this pre-existing knowledge to fine-tune its predictions for immunogenic neoepitopes.

DeepAttentionPan [25] proposes an improved pan-specific model based on convolutional neural networks and attention mechanisms for more flexible, stable, and interpretable MHC-I binding prediction.

NetMHCpan, HLAthena, and MHCflurry rely on ensembles of shallow neural networks for binding prediction. Ensembling helps reduce variance; however, shallow neural networks are not able to capture the fine nuances in pMHC sequences that are crucial for binding prediction. MHCnuggets, on the other hand, considers pMHC pseudo-sequences and employs an LSTM to extract latent features; however, LSTMs struggle to capture long-range sequence dependencies. Therefore, TransPHLA employs a single-layer transformer and CapsNet utilises a capsule network architecture to extract better features. While these approaches rely on hand-crafted pseudo-sequences, we show that deep attention networks are able to capture similar features from full sequences. DeepAttentionPan incorporated attention layers between convolution layers to extract features from full sequences, but its performance is significantly inferior to recent pseudo-sequence methods. We use efficient transformer models based on latent cross-attention to extract better features, and we further propose self-supervision to make the models generalize better across diverse datasets. BigMHC presents a transfer learning paradigm to predict binding on immunogenicity datasets; we show that the features learned by RobustpMHC also transfer well to such data (with better performance than BigMHC on one of the two metrics), while on the other datasets BigMHC's performance is far inferior to ours. Hence, our approach can tackle full sequences and shows consistently better results across diverse datasets.

Discussion

Conclusion: In this research, we aim to shift the reliance of previous approaches from the 34 amino acid pseudo-sequence to a broader understanding. We have discovered that MHC-peptide interactions involve 147 interacting positions, far surpassing the limited scope of the conventional 34 amino acid approach. RobustpMHC achieves an overall 6% improvement over the state of the art. This emphasizes the need for a more comprehensive perspective to enhance our understanding and prediction accuracy in this critical area of study. We have demonstrated the efficacy of the IMGT/RobustpMHC model in predicting pMHC class I binding probabilities, particularly when considering full MHC sequences. IMGT/RobustpMHC belongs to the category of generalized pan-specific models that are not restricted by MHC alleles or peptide length. Our model employs cross-attention mechanisms within deep neural networks and therefore exhibits the capability to learn comprehensive representations of MHC sequences, underscoring the potential of efficient transformer architectures like Perceiver IO in computational immunology. Furthermore, we incorporated self-supervised learning by training the network with mutations, which empowered our model to capture subtle inter-dependencies between peptide and MHC sequences, enabling it to generalize effectively and resulting in significant improvements in binding probability predictions. Notably, our model maintains robust performance even in the presence of mutations, which is vital for real-world applications. Collectively, the combination of these techniques propels our Perceiver IO-based model to the forefront of pMHC binding prediction, offering a promising tool for immunotherapy and vaccine design. We believe that our model highlights the importance of harnessing advanced deep learning techniques in tackling complex biological problems.

Limitations: While IMGT/RobustpMHC consistently demonstrates improved performance compared to state-of-the-art approaches across a wide range of datasets, it still relies on additional transfer learning data for the CrystalIMGT and Neopeptide datasets. Notably, the performance of all approaches remains sub-optimal on these specific datasets. Consequently, we are convinced that the field of pMHC prediction could benefit from extensive pre-training and domain adaptation techniques to bridge this performance gap, and IMGT/RobustpMHC represents a crucial step in this direction. Finally, similar to existing approaches, our approach could benefit from a theoretical framework that provides guarantees on binding prediction.

Furthermore, a common issue we encounter across all datasets is the generation of negative samples for training these approaches, which is typically done randomly without experimental verification. Consequently, these negative samples may contain false negatives. We acknowledge that this is one of the shortcomings of existing approaches, including our own, and we intend to address this issue in our future research endeavors.

Key Points
  • Previous approaches utilized 34 amino acid pseudo-sequences, whereas our findings identified 147 positions involved in MHC-peptide interactions (Fig. 1 and Fig. 2). We overcame these limitations by employing transformer architectures to learn from the entire sequence.

  • RobustpMHC was evaluated on eight datasets, including two novel datasets, CrystalIMGT and COVID, and demonstrated superior performance across diverse data.

  • Unlike existing deep learning methods like BERT, which struggle with out-of-distribution generalization, RobustpMHC consistently performs well in all cases, outperforming 18 methods with a 6% improvement in binding prediction accuracy (Fig. 4).

  • Data augmentation and contrastive learning methods did not enhance performance due to distribution shifts in pMHC data. Nevertheless, RobustpMHC consistently outperformed other transformer models (supplementary Figs S2 and S3).

  • In contrast to prior methods focused on 9-amino acid peptides, RobustpMHC generalizes across peptides of varying lengths and uses mutation-invariant learning to improve robustness (Fig. 6, supplementary Figs S4 and S5).

Acknowledgments

We express our gratitude to the entire IMGT® team for their ongoing dedication and unwavering enthusiasm.

Funding

IMGT® is currently supported by the Centre National de la Recherche Scientifique (CNRS) and the University of Montpellier. IMGT® is a member of the French Infrastructure 'Institut Français de Bioinformatique' (IFB) as well as a member of BioCampus, MAbImprove, and IBiSA. AK was co-funded by the 'Direction de la Recherche, du Transfert Technologique et de l'Enseignement Supérieur' of the Occitanie Region under contract N° 20007399 / ALDOCT-001023 as well as by IMGT® proper resources. This work was granted access to the High Performance Computing (HPC) resources of Meso@LR and of the 'Centre Informatique National de l'Enseignement Supérieur' (CINES). We acknowledge the support of the Immun4Cure IHU 'Institute for innovative immunotherapies in autoimmune diseases' (France 2030 / ANR-23-IHUA-0009) and of the Key Collaborative Research Program of the 'Alliance of International Science Organizations' (ANSO-CR-KP-2022-09).

Data availability

The data are available in the data section at https://www.imgt.org/RobustpMHC/ on the IMGT website. This includes the training data, independent test data, external test data, and the Anthem, Neoantigen, HPV, COVID, CrystalIMGT, and Neoepitope datasets.

Author contributions statement

Anjana Kushwaha conceived and conducted the experiments under the supervision of Patrice Duroux, Véronique Giudicelli, Konstantin Todorov, and Sofia Kossida. All authors reviewed and approved the final manuscript for publication.

Webserver availability

The webserver is available at https://www.imgt.org/RobustpMHC/

Code availability

The implementation code and trained models are available in the download section of the IMGT® site (https://www.imgt.org/RobustpMHC/). PyTorch 1.10.0 and CUDA 11.7 were used to compute performance metrics. Pandas v2.1.1 and NumPy v1.22.3 were used for data processing. Matplotlib v3.5.1 was used to generate figures.


Supplementary data