Feng Jiang, Yuzhi Guo, Hehuan Ma, Saiyang Na, Wenliang Zhong, Yi Han, Tao Wang, Junzhou Huang, GTE: a graph learning framework for prediction of T-cell receptors and epitopes binding specificity, Briefings in Bioinformatics, Volume 25, Issue 4, July 2024, bbae343, https://doi.org/10.1093/bib/bbae343
Abstract
The interaction between T-cell receptors (TCRs) and peptides (epitopes) presented by major histocompatibility complex molecules (MHC) is fundamental to the immune response. Accurate prediction of TCR–epitope interactions is crucial for advancing the understanding of various diseases and their prevention and treatment. Existing methods primarily rely on sequence-based approaches, overlooking the inherent topology structure of TCR–epitope interaction networks. In this study, we present |$GTE$|, a novel heterogeneous Graph neural network model based on inductive learning to capture the topological structure between TCRs and Epitopes. Furthermore, we address the challenge of constructing negative samples within the graph by proposing a dynamic edge update strategy, enhancing model learning with the nonbinding TCR–epitope pairs. Additionally, to overcome data imbalance, we adapt the Deep AUC Maximization strategy to the graph domain. Extensive experiments are conducted on four public datasets to demonstrate the superiority of exploring underlying topological structures in predicting TCR–epitope interactions, illustrating the benefits of delving into complex molecular networks. The implementation code and data are available at https://github.com/uta-smile/GTE.
Introduction
The interaction between T-cell receptors (TCRs) and peptides (epitopes) presented by major histocompatibility complex molecules (MHC) forms the foundation of T-cell immunity [1, 2]. The interaction between TCRs and epitopes aids in the activation of T cells, enabling them to effectively combat pathogens within the body, such as bacteria, viruses, or tumor cells [3, 4]. Therefore, the precise recognition of TCRs with epitopes significantly contributes to the understanding and technological advancements in preventing and treating various diseases. However, due to the inherent complexity of this recognition mechanism, experimental detection and validation of TCR–epitope interactions are often time-consuming and costly [5]. To address these challenges, numerous computational methods have been developed to simulate TCR–epitope interactions [6–11].
TCR and epitope sequences are inherently short, making it challenging to collect a sufficient amount of Multiple Sequence Alignment (MSA) data. Consequently, it becomes difficult to effectively harness MSA-based protein models [12–14], especially when directly utilizing single sequence-based protein language models [15]. Mainstream methods typically involve sequence information for TCRs and epitopes to construct sequence-level classification models through transfer learning [8, 16–18]. Specifically, TCR and epitope sequence data are gathered from databases such as VDJdb [19], IEDB [20], and McPAS-TCR [21]. These sequences are then input into pre-trained models to obtain high-quality embeddings for TCRs and epitopes, which are subsequently used in deep learning models to predict whether TCRs and epitopes will bind.
While previous methods have obtained success, they still face three challenges in TCR–epitope prediction. (i) The interaction between TCRs and epitopes is not an isolated event but part of a complex and extensive interaction network. Networks formed by protein interactions are crucial as they depict the cellular framework and signal pathways while also influencing signal transmission within cells [22, 23]. This draws our attention to the need to effectively utilize the topological structure between TCRs and epitopes. However, how to efficiently exploit such topological structures to predict TCR–epitope interactions remains unexplored. The major challenge is that unlike protein–protein interactions (PPIs), where all nodes are treated as a single type in a simple homogeneous graph [24], TCRs and epitopes must be treated as two distinct node types due to their biological nature. Therefore, constructing a topological structure graph for TCRs and epitopes is even more complex. (ii) TCR and epitope interaction datasets typically contain both positive and negative samples. Effectively utilizing negative sample data helps the model better understand and differentiate between positive and negative samples [25]. Current approaches often require the preparation of fixed negative sample data before training [8, 16, 17, 26], which may exceed physical memory limitations or result in high time complexity during the training process. Therefore, how to design a strategy for learning negative samples effectively is another crucial challenge. (iii) TCR–epitope prediction often encounters a severe data imbalance issue since the number of positive instances is normally much lower than the number of negatives. This imbalance could cause the model to lean toward predicting the majority class, presenting another challenge that needs to be addressed.
In order to tackle the above challenges, we propose GTE, a graph learning framework for the prediction of TCR–epitope binding specificity. To the best of our knowledge, GTE is the first attempt to capture the topological structure of the TCR–epitope interaction graph for enhancing the prediction performance. Specifically, we introduce a heterogeneous graph neural network (GNN) model based on inductive learning to form TCR–epitope binding prediction tasks as edge classification problems. Through graph learning, our model offers a more comprehensive insight into the natural environment’s topological structure of TCRs and epitopes, effectively revealing previously unseen interactions. Furthermore, we propose a novel dynamic graph learning method that allows the model to encounter a more diverse set of negative samples during the training process, without being constrained by physical memory limitations. Nevertheless, the efficient negative sampling method could potentially exacerbate data imbalance. We then employ the AUC Maximization [27] to effectively address the challenge of data imbalance.
We have conducted experiments on four public datasets under various settings to evaluate the performance of TCR–epitope interaction prediction tasks, and achieved significant improvement over the current state-of-the-art (SOTA) methods consistently. The overall contributions of our proposed GTE are summarized as follows:
Our proposed GTE is the first method to explore the topological structure to enhance TCR–epitope binding prediction tasks by introducing an inductive heterogeneous GNN.
An advanced dynamic graph learning mechanism is developed to adaptively sample negative data during training, allowing efficient learning of negative samples without being constrained by physical memory limitations.
AUC maximization is adapted in the graph domain to ensure the model maintains a more balanced and robust predictive capability, thereby overcoming the data imbalance problem.
Related work
Predicting the binding of any given TCR–epitope pair benefits many advances in healthcare, such as aiding diagnostics, vaccine development, and cancer therapies [1, 28]. Researchers have been developing models for predicting peptide immunogenicity for decades. However, compared with the theoretical diversity of TCRs, the available data are still limited, posing challenges for the development of computational prediction methods [7, 28–30].
Prior research has indicated that the CDR3 |$\beta $| sequences likely play a more significant role in determining TCR binding specificity to epitopes [31, 32]. Therefore, most sequence-based methods primarily rely on the CDR3 |$\beta $| sequence. For instance, TEIM [17] combines sequence and residue contact prediction, utilizing a two-phase training approach. TEINet [16] employs transfer learning, transforming TCR and epitope sequences into numerical vectors using separate pre-trained encoders, and uses a fully connected neural network to predict their binding specificity. pMTnet [8, 33] is a significant model for TCR–pMHC binding prediction, utilizing TCR, epitope, and MHC sequence data, setting it apart from the previous two models.
However, current methods primarily focus on learning sample similarities, often neglecting the intrinsic molecular network formed by natural interactions and overlooking the utilization of graph topological structure. In contrast, our proposed model explores a novel direction to uncover the underlying topological relationship within TCRs and epitopes, yielding superior outcomes. Furthermore, we employ a dynamic graph learning mechanism and AUC maximization strategy to efficiently sample negative data and address data imbalance issues.
Methods
The overall framework of GTE is illustrated in Fig. 1, which demonstrates that TCR–epitope binding prediction tasks are transformed into edge classification problems. Specifically, we take TCR–epitope pair data as the input to build a heterogeneous graph. TCRs and epitopes are considered nodes with different types, and the edge is established for each TCR–epitope pair. Next, we employ a GNN model on the constructed graph to predict the edge class. During the training process, we propose a novel dynamic edge update approach that adaptively samples more negative pairs, thereby allowing the model to learn from a more diverse set. Additionally, an innovative AUC loss function is designed to handle the potential data imbalance issue during the training. Detailed descriptions of each core component are presented in the following subsections.
Figure 1. GTE framework: the training and testing datasets are labeled with two classes: positive (binding) or negative samples; the blue nodes represent TCRs (CDR3.|$\beta $|), and the orange nodes represent epitopes; the solid edges are established when TCRs and epitopes are binding, and the dashed edges represent negative samples; during the training process, we employ a GNN with two layers of GraphSAGE and a dynamic graph, along with AUC maximization; ultimately, we concatenate the node features of TCRs and epitopes, and use a Multi-Layer Perceptron to generate binding scores, which represent the probability of edges; the details of the inference phase can be found in Appendix F.
Constructing graph-based topology for TCR-Epitope
The network of direct physical interactions between proteins plays a pivotal role in facilitating information transfer among proteins [22–24, 34]. However, prior methods have predominantly concentrated on learning sample embeddings and distances, neglecting the inherent wealth of topology information.
In this study, we introduce graph-based topological structures into TCR–epitope binding prediction. As shown in Fig. 1, we first create a heterogeneous graph, which can be represented as |$G = (V, E, X)$|, where |$V$| is the set of nodes, including TCR nodes and epitope nodes. |$E$| is the set of edges representing relationships between TCR and epitope nodes; the number of edges equals the total number of positive and negative TCR–epitope pairs. We use only one type of edge to represent both positive and negative samples when building the graph. The goal is to classify the established edges, where label 0 represents negative and label 1 represents positive (binding). |$X$| is the feature matrix of nodes, where each node |$v_{i}$| corresponds to a feature vector |$x_{i}$| obtained by inputting the respective sequence into the pre-trained model TCRpeg [35], i.e. |$x_{i} = \mathrm{TCRpeg}(v_{i})$|. In the constructed graph, an edge |$e_{ij}$| represents the connection between node |$v_{i}$| and node |$v_{j}$|. The label |$y_{ij}$| on |$e_{ij}$| is either 1 or 0, indicating whether the pair binds or not. Therefore, we can represent the edge set |$E$| and label set |$Y$| as follows:
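Writing |$V_{\mathrm{TCR}}$| and |$V_{\mathrm{epi}}$| for the two node types (notation introduced here for readability),
$$E=\left\{\, e_{ij} \mid v_{i}\in V_{\mathrm{TCR}},\ v_{j}\in V_{\mathrm{epi}} \,\right\}, \qquad Y=\left\{\, y_{ij}\in \{0,1\} \mid e_{ij}\in E \,\right\}.$$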
The process of signal propagation entails utilizing SAGEConv [36] and a fully connected network to compute the probability output for nodes. For each node |$v_{i}$|, the signal propagation is computed using the SAGEConv layer:
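$$h_{i}^{(k)}=\mathrm{SAGEConv}^{(k)}\!\left(h_{i}^{(k-1)},\ \left\{\, h_{j}^{(k-1)} : v_{j}\in \mathrm{neighbors}(v_{i}) \,\right\}\right), \qquad h_{i}^{(0)}=x_{i},$$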
where |$\mathrm{SAGEConv}^{(k)}$| represents the |$k$|-th layer of SAGEConv operation, and |$h_{i}^{(k)}$| is the representation of node |$v_{i}$| at the |$k$|-th layer, and |$\mathrm{neighbors}(v_{i})$| represents the set of neighboring nodes of node |$v_{i}$|. The features of TCR node |$v_{i}$| and epitope node |$v_{j}$| are concatenated as follows: |$h_{\mathrm{concat}}=\left [h_{i}, h_{j}\right ]$|. These concatenated features are then passed through a fully connected network represented by |$\mathrm{FC}$|, followed by a sigmoid activation function to yield the prediction probability |$p_{i j}$|:
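$$p_{ij}=\sigma\!\left(\mathrm{FC}\!\left(h_{\mathrm{concat}}\right)\right)=\sigma\!\left(\mathrm{FC}\!\left(\left[h_{i}, h_{j}\right]\right)\right),$$
where |$\sigma (\cdot )$| denotes the sigmoid activation.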
For the loss function, we employ the BCE loss, which is commonly used for binary classification problems:
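$$\mathcal{L}_{\mathrm{BCE}}=-\frac{1}{N}\sum _{e_{ij}\in E}\left[\, y_{ij}\log p_{ij}+\left(1-y_{ij}\right)\log \left(1-p_{ij}\right)\,\right],$$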
where |$N$| is the number of edges, |$y_{ij}$| is the true label (0 or 1) of edge |$e_{ij}$|, and |$p_{ij}$| is the predicted probability of edge |$e_{ij}$| being a positive (binding) event. Thus, the binding probability between TCR and epitope can be obtained while using the BCE loss function for training.
Our method constructs a heterogeneous TCR–epitope graph to leverage the topological structure, which captures correlations between sample pairs and maps the relationships between different TCRs and epitopes. At the same time, our model employs inductive learning to improve its ability to predict interactions between unknown TCRs and epitopes.
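To make this concrete, the following is a minimal sketch of the graph construction and edge-classification steps described above, assuming PyTorch Geometric as the graph library; sizes, names, and the random toy data are placeholders, and the reference implementation in the GTE repository may differ in detail.

```python
import torch
import torch.nn as nn
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData
from torch_geometric.nn import SAGEConv, to_hetero

# Toy heterogeneous TCR-epitope graph; node features stand in for
# pre-trained TCRpeg embeddings (768-dim).
data = HeteroData()
data['tcr'].x = torch.randn(100, 768)
data['epitope'].x = torch.randn(20, 768)
pair_index = torch.stack([torch.randint(0, 100, (300,)),
                          torch.randint(0, 20, (300,))])
pair_label = torch.randint(0, 2, (300,)).float()          # 1 = binding, 0 = negative
data['tcr', 'pairs', 'epitope'].edge_index = pair_index
data = T.ToUndirected()(data)                              # add reverse edges for message passing

class GNNEncoder(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        # Two GraphSAGE layers, mirroring the two-layer design in Fig. 1.
        self.conv1 = SAGEConv((-1, -1), hidden)
        self.conv2 = SAGEConv((-1, -1), hidden)

    def forward(self, x, edge_index):
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

class EdgeClassifier(nn.Module):
    def __init__(self, metadata, hidden=128):
        super().__init__()
        self.encoder = to_hetero(GNNEncoder(hidden), metadata)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, data, pair_index):
        h = self.encoder(data.x_dict, data.edge_index_dict)
        h_cat = torch.cat([h['tcr'][pair_index[0]],
                           h['epitope'][pair_index[1]]], dim=-1)
        return torch.sigmoid(self.mlp(h_cat)).squeeze(-1)   # binding probability p_ij

model = EdgeClassifier(data.metadata())
p_ij = model(data, pair_index)
loss = nn.functional.binary_cross_entropy(p_ij, pair_label)
```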
Dynamic edge updating strategy
During the graph construction process, how to establish edges for negative samples within the graph presents another challenge. We propose to use dynamic graph learning to provide the model with as many negative samples as possible at a minimal cost. At the beginning of training, negative sample edges are introduced into the initial positive-only training graph. Then, negative samples are re-sampled based on the prediction results of each iteration during the training stage. For instance, incorrectly predicted negative samples are retained, and from those correctly predicted, a proportion (|$M\%$|) is selected for re-sampling.
For the |$M\%$| of samples that require re-sampling, the same strategy is applied as used for the initial negative sample collection. For each epitope, a random selection of |$N$| (defaulting to 10) TCRs is made to create new negative samples. It is also important to ensure that these randomly generated new negative samples do not conflict with positive samples. If |$N$| negative samples cannot be found within this |$M\%$|, then as many of the remaining negative samples as possible are utilized. To ensure that the knowledge from negative samples is thoroughly learned after each dynamic negative sampling, the model is evaluated following every such operation. If it surpasses the previous best result, dynamic negative sampling is performed again in the next epoch. As shown in the dynamic graph module of Fig. 1, the light-colored edges represent dynamically re-sampled negative samples. Under the same conditions, adopting a dynamic edge updating strategy allows the model to encounter a more diverse set of negative samples at minimal cost, thus improving the efficiency of model training.
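A simplified Python sketch of one such update round is shown below; the |$N$|-TCRs-per-epitope bookkeeping and the evaluation-gated re-sampling schedule described above are omitted, and all names are illustrative.

```python
import random

def update_negative_edges(neg_pairs, scores, positives, tcr_pool, M=0.1, threshold=0.5):
    """One round of dynamic negative-edge updating (sketch).

    neg_pairs: list of (tcr, epitope) negative edges currently in the graph
    scores:    model predictions for those edges, in the same order
    positives: set of known binding (tcr, epitope) pairs that must be avoided
    tcr_pool:  candidate TCRs to draw replacements from
    M:         proportion of correctly predicted negatives to re-sample
    """
    # Negatives the model still predicts as binding (score >= threshold) are kept,
    # since they have not yet been learned.
    hard = [p for p, s in zip(neg_pairs, scores) if s >= threshold]
    easy = [p for p, s in zip(neg_pairs, scores) if s < threshold]

    n_resample = int(len(easy) * M)
    random.shuffle(easy)
    to_replace, survivors = easy[:n_resample], easy[n_resample:]

    existing = set(hard) | set(survivors)
    replacements = []
    for _, epitope in to_replace:
        # Draw a fresh TCR for this epitope that does not clash with a known
        # positive pair or with an edge already present in the graph.
        for _ in range(100):                         # bounded retry
            tcr = random.choice(tcr_pool)
            if (tcr, epitope) not in positives and (tcr, epitope) not in existing:
                replacements.append((tcr, epitope))
                existing.add((tcr, epitope))
                break
    return hard + survivors + replacements
```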
Deep AUC maximization
In the field of TCR–epitope binding prediction, the quantity of positive samples is often limited. To enable the model to learn from a larger dataset, it is common practice to use a greater number of negative samples [26]. This leads to the issue of data imbalance, which can be further exacerbated by the dynamic edge updating strategy. The Area under the ROC Curve (AUC) is conventionally used as the evaluation metric in our tasks since it evaluates the model’s ability to rank positive samples higher than negative samples. Given this situation, a loss function designed specifically for optimizing AUC is not only better suited to address the challenges of imbalanced data but can also significantly enhance model performance [27].
Specifically, consider the input space |$X \in \mathbb{R}^{d}$| and the output space |$Y=\{-1,+1\}$|. The training data are drawn independently and identically from an unknown distribution |$\rho $| on |$Z=X \times Y$|. The ROC curve is the plot of the True Positive Rate against the False Positive Rate. For any scoring function |$f: X \rightarrow \mathbb{R}$|, its AUC equals the probability that a positive sample ranks higher than a negative one. It is defined as follows: |$\operatorname{AUC}(f)=\operatorname{Pr}\left (f(x) \geq f\left (x^{\prime }\right ) \mid y=+1, y^{\prime }=-1\right )$|, where |$f(x)$| and |$f(x^{\prime })$| are the decision function scores, |$x$| denotes a positive sample, |$x^{\prime }$| denotes a negative sample, and |$y=+1$| and |$y^{\prime }=-1$| are their corresponding labels. The goal is to find the optimal decision function |$f$| that satisfies
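$$f^{\ast }=\arg \max _{f}\ \operatorname{AUC}(f)=\arg \max _{f}\ \mathbb{E}\left[\,\mathbb{I}\left(f(x)\geq f\left(x^{\prime }\right)\right)\mid y=+1,\ y^{\prime }=-1\,\right].$$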
The function |$\mathbb{I}(\cdot )$| takes the value 1 if the argument is true and 0 otherwise. Since |$\mathbb{I}(\cdot )$| is not continuous, it is often replaced with a convex surrogate. In this work, we use the squared (|$\ell _{2}$|) loss |$\left (1-\left (f(x)-f\left (x^{\prime }\right )\right )\right )^{2}$| as the convex surrogate, where |$f(x)=w^{T} x$| and |$w$| denotes the parameters of the deep neural network to be learned. Overall, the maximization of AUC can be formalized as
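$$\min _{w}\ \mathbb{E}\left[\left (1-\left (f(x)-f\left (x^{\prime }\right )\right )\right )^{2}\mid y=+1,\ y^{\prime }=-1\right].$$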
This can be reformulated as an equivalent stochastic saddle point problem [37] with any |$w \in \mathbb{R}^{d}, a, b, \alpha \in \mathbb{R}$| and |$z=(x, y) \in Z$|, where |$a$| and |$b$| can be interpreted as the mean prediction scores on positive and negative data, and |$\alpha $| is a hyperparameter that controls the trade-off. Let |$p$| denote |$\operatorname{Pr}(y=1)$|, the probability of a sample belonging to the positive class. Finally, we define a function |$F: \mathbb{R}^{d} \times \mathbb{R}^{3} \times Z \rightarrow \mathbb{R}$|, the AUC loss (|$L_{AUC}$|), as follows:
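One widely used form of this saddle-point objective, following the min–max reformulation of [37], is
$$\min _{w,a,b}\ \max _{\alpha }\ \mathbb{E}_{z}\left[F(w,a,b,\alpha ;z)\right], \quad F=(1-p)\left(f(x)-a\right)^{2}\mathbb{I}_{[y=1]}+p\left(f(x)-b\right)^{2}\mathbb{I}_{[y=-1]}+2(1+\alpha )\, f(x)\left(p\,\mathbb{I}_{[y=-1]}-(1-p)\,\mathbb{I}_{[y=1]}\right)-p(1-p)\alpha ^{2}.$$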
The final loss function is defined as follows, where |$\beta $| is a hyperparameter used to control the weights.
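$$\mathcal{L}=\beta \, L_{AUC}+(1-\beta )\,\mathcal{L}_{\mathrm{BCE}}.$$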
Overall, our proposed GTE redefines TCR–epitope binding prediction as an edge prediction task by exploring the topological structure. Moreover, an advanced dynamic graph learning mechanism is proposed to sample negative data adaptively, and AUC maximization is employed to achieve a more balanced and robust prediction in response to data imbalance issues.
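As a deliberately simplified illustration, the objective can be written as a pairwise squared AUC surrogate mixed with the BCE term; in practice, Deep AUC Maximization optimizes the equivalent min–max reformulation (e.g. via the LibAUC library) rather than enumerating all positive–negative pairs.

```python
import torch
import torch.nn.functional as F

def auc_surrogate(scores, labels):
    """Mean of (1 - (f(x+) - f(x-)))^2 over all positive/negative pairs in a batch."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())
    gaps = pos.unsqueeze(1) - neg.unsqueeze(0)     # every positive-negative score gap
    return ((1.0 - gaps) ** 2).mean()

def combined_loss(p_ij, labels, beta=0.5):
    """beta-weighted mix of the AUC surrogate and the standard BCE loss."""
    l_auc = auc_surrogate(p_ij, labels)
    l_bce = F.binary_cross_entropy(p_ij, labels.float())
    return beta * l_auc + (1.0 - beta) * l_bce
```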
Pretrained encoder
Due to the limited data on TCRs and epitopes, accurately representing TCRs or epitopes is challenging. Transfer learning has been widely used to generate embeddings for TCRs or epitopes [8, 16, 17] since it is capable of reducing the dependency on vast amounts of target domain data for constructing target learners [38–41]. In our study, we employ two types of pre-trained models to generate embeddings for TCRs and epitopes: one trained on general protein sequences, and the other trained specifically on TCR or epitope sequences. Specifically, ESM2 [15] is an advanced pre-trained protein model based on the Transformer architecture; it learns the evolutionary relationships and structural–functional features of proteins from large-scale protein sequence data and produces embeddings with a feature size of 1280. In contrast, TCRpeg [16, 35], trained on |$10^{6}$| TCR sequences or 362 456 unique epitope sequences, is a probabilistic deep autoregressive model with gated recurrent unit layers that captures key features of the input sequences through unsupervised learning and produces embeddings with a feature dimension of 768.
Using protein language models or TCR-epitope models to generate embeddings demonstrates the flexibility of the GTE method. It highlights that the graph-based topological structure is not limited to a single type of embedding, proving its broad applicability. In subsequent experiments, unless specifically noted, TCRpeg will be used by default.
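For reference, fixed-size ESM2 (650M) embeddings can be extracted roughly as follows, assuming the fair-esm package; mean pooling over residue representations is one simple choice of sequence-level summary, and TCRpeg embeddings would be produced analogously with its own tooling.

```python
import torch
import esm  # fair-esm package

# Load the 650M-parameter ESM2 model (embedding dimension 1280).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequences = [("tcr_example", "CASSLGQAYEQYF"),      # a CDR3-beta sequence
             ("epitope_example", "GILGFVFTL")]      # an epitope sequence
_, _, tokens = batch_converter(sequences)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]

# Average residue representations (excluding BOS/EOS/padding) per sequence.
lengths = (tokens != alphabet.padding_idx).sum(1)
embeddings = torch.stack([reps[i, 1:int(lengths[i]) - 1].mean(0)
                          for i in range(len(sequences))])   # shape: (2, 1280)
```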
Experiments
In this section, we briefly introduce the dataset and splitting method. Details of the experimental settings are provided in Appendix A. Extensive experiments demonstrate that methods based on graph topological structure outperform all SOTA models in the field.
Experimental setup
Training and testing set splitting
In TCR–epitope studies, it is common to use five-fold cross-validation to evaluate the model’s performance [8, 16, 17]. In our study, we refer to this approach of randomly splitting data as “RandomTCR.” However, we notice that RandomTCR carries the risk of potential data leakage. Specifically, random splitting may result in overlapping TCRs between the training and validation sets. This overlap can lead to unreliable results, as the model may have already been exposed to TCRs from the validation set during training.
To mitigate this issue that exists in current studies, we first introduce the “StrictTCR” splitting approach to ensure a rigorous experimental setting. With this approach, data are divided into five folds based on unique TCRs, guaranteeing the absence of duplicate TCRs in both the training and validation sets. Furthermore, when negative samples are generated within each fold, the possibility of having identical negatives in the training and validation sets is eliminated. Therefore, StrictTCR is more scientifically rigorous as it can assess the model’s ability to handle previously unseen TCR data.
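One simple way to realize this split is to group the folds by the CDR3|$\beta $| sequence, for example with scikit-learn's GroupKFold; the sketch below assumes the pairs are stored in a pandas DataFrame with placeholder column names.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

def strict_tcr_folds(pairs: pd.DataFrame, n_splits: int = 5):
    """Yield (train, val) splits in which no CDR3-beta sequence is shared
    between the training and validation sets.

    pairs: DataFrame with columns 'cdr3b', 'epitope' and 'label'.
    """
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, val_idx in gkf.split(pairs, groups=pairs['cdr3b']):
        yield pairs.iloc[train_idx], pairs.iloc[val_idx]
```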
Datasets
We follow the experimental settings of SOTA TCR–epitope prediction methods [8, 16, 17], and conduct experiments on four publicly available datasets to demonstrate the effectiveness of our method.
VDJdb Dataset [19]: The VDJdb dataset consists of a curated set of 89 847 pairs of CDR3 sequences, which include both the |$\alpha $| and |$\beta $| chains of TCRs, covering three different species: humans, monkeys, and mice. We perform a series of filtering steps, retaining only entries for MHC class I and keeping only the |$\beta $| chain. Additionally, we constrain the length of CDR3 to be between 5 and 30 amino acids and the epitope length to be between 7 and 15 amino acids, and we include only human data. After all these filtering steps, the dataset is reduced to 38 533 unique CDR3|$\beta $|-epitope pairs, with 36 033 TCRs assigned to 1002 epitopes.
McPAS-TCR Dataset [21]: It is a manually curated catalog of TCR sequences, comprising 40 033 pairs of sequences, covering two species: mice and humans. We apply a similar filtering process to McPAS-TCR as we performed for VDJdb. In the end, we obtain 5102 unique CDR3|$\beta $|-epitope pairs, with 4975 TCRs assigned to 244 different epitopes.
TEINet Dataset [16]: TEINet is a dataset provided by the baseline. It includes 44 682 pairs of TCRs and epitopes, with 41 610 TCRs associated with 180 epitopes. TEINet provides a five-fold partition using the RandomTCR method; thus, we directly use their pre-processed data as the RandomTCR dataset. We re-sample the original data provided by TEINet to form the StrictTCR setting.
pMTnet Dataset [8]: pMTnet relies on quality metrics provided by certain databases and uses them to filter records [7, 19, 21, 31, 42–46], retaining only high-confidence pairs. For example, in VDJdb and TCRGP [7] data, they include only records with vdj.scores greater than zero. This enhances the confidence in specific TCR–epitope interactions. This process ultimately yields 31 203 pairs of TCRs and epitopes, with 28 604 TCRs associated with 426 epitopes.
Negative data
In the field of machine learning, especially in classification problems, negative samples are essential. They help the algorithm learn what features not to associate with a positive class, hence enabling more accurate and generalized model performance. This distinction is crucial for effectively training a model to differentiate between various classes [47, 48]. However, the publicly available TCR–epitope interaction datasets often only include positive samples, which presents a challenge for training and evaluating models. Therefore, we adopt a commonly used method for generating negative samples [8, 16, 17, 26]. Within each set, negative examples are generated by mismatching the positive data, i.e. combining an epitope sequence with a TCR different from its cognate target. To more comprehensively evaluate our model, we generate 1x, 5x, and 10x negative samples for each dataset, with 10x being the default setting.
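A sketch of this mismatching procedure is given below, assuming positives are stored as (TCR, epitope) tuples; the ratio argument controls how many negatives are drawn per positive pair.

```python
import random

def generate_negatives(positive_pairs, ratio=10, seed=0):
    """Create negatives by pairing each epitope with TCRs it is not known to bind."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    tcrs = sorted({tcr for tcr, _ in positives})
    negatives, seen = [], set()
    for _, epitope in positive_pairs:
        drawn, attempts = 0, 0
        # Bounded retries; assumes far more TCRs than cognate binders per epitope.
        while drawn < ratio and attempts < 100 * ratio:
            attempts += 1
            pair = (rng.choice(tcrs), epitope)
            if pair not in positives and pair not in seen:
                seen.add(pair)
                negatives.append(pair)
                drawn += 1
    return negatives
```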
Experimental results
Performance of GTE
We compare GTE against three baseline methods, TEINet [16], TEIM [17], and pMTnet [8], under the StrictTCR data splitting approach. All methods are evaluated using two key metrics, AUC and Area under the Precision-Recall Curve (AUPR), with results averaged over five-fold cross-validation. Furthermore, to ensure a fair comparison with the baselines, the RandomTCR results, along with the per-fold scores and standard deviations, are presented in Appendix B.
Figure 2 shows the results for four datasets using the StrictTCR split method. Our results are significantly better than the baselines. It is noteworthy that the AUC score of the baseline model TEINet is 0.607 under StrictTCR, slightly lower than their reported score of 0.610|$\pm $|0.002 under RandomTCR. This difference may be due to StrictTCR’s elimination of potential data redundancy, which could make training more challenging and thus reduce TEINet’s performance. However, our GTE method achieves an AUC of 0.863 on the TEINet dataset. This firmly attests to the advantage of utilizing the topological structures of the graph when predicting TCR–epitope interactions. Moreover, the baseline TEINet also reports an AUC of 0.603|$\pm $|0.004 using the pMTnet method on their dataset, and similarly, we obtain an AUC of 0.616 for pMTnet under the TEINet dataset. The decrease in pMTnet’s performance compared with its originally reported results is due to the omission of MHC information. MHC information is excluded to maintain consistency and ensure a fair comparison, as it is not included in any of the other experiments. Since the baseline methods treat each TCR–epitope pair as an independent input and fail to consider the correlations among different TCRs and epitopes, their performance decreases slightly under the StrictTCR data splitting approach. Unlike the baseline methods, our method constructs a heterogeneous graph of TCR–epitope interactions, which not only exploits the correlations between sample pairs but also explores the relationships between different TCRs and epitopes. Inductive representation learning has been shown to efficiently produce high-quality embeddings for previously unobserved data using node feature information [36], which enhances our model’s ability to explore interactions between previously unseen TCRs and epitopes. Therefore, our method remains competitive under the StrictTCR splitting approach, further illustrating the benefits of graph-based topological methods.
Figure 2. Comparison results under the StrictTCR setting; experiments are conducted on four datasets using three baseline models; the first row demonstrates the visualizations of AUC scores and the second row illustrates the AUPR scores.
Furthermore, the application of our GTE method to the pMTnet dataset yields an impressive AUC of 0.910. This score not only greatly surpasses the baseline but also stands out as the highest score among all four datasets. This result might be attributed not only to our successful leverage of the topological structure of the graph but also to the fact that the pMTnet dataset combines data from various sources, carefully curated to retain only high-confidence data [7, 19, 21, 31, 42–46], while the other datasets may contain some lower-confidence data. However, even on the other three datasets with potentially noisy data, our results still surpass all baselines. This suggests that the baseline methods might face challenges when adapting to diverse sources of data, and they exhibit particular sensitivity when dealing with low-confidence data. In contrast, our method is capable of capturing the variations between different datasets, and it demonstrates robust performance even when faced with low-confidence data. This confirms the superiority of our graph-based topological structure method.
Moreover, we conduct a thorough analysis of the model complexity, as illustrated in Table 1. The results demonstrate the efficiency of graph-based topological methods in maintaining high prediction accuracy while conserving computational resources.
Single sequence protein model evaluation
Inspired by recent advancements in single-sequence protein language models [15], we explore the suitability of large-scale protein language models for generating embeddings for TCRs and epitopes. We select ESM2 [15] as a comparative baseline. ESM2 is a SOTA protein language model that is extensively trained and optimized, demonstrating outstanding performance in various protein-related tasks, including structural prediction and functional annotation. In our study, we utilize two different variants of ESM2, with 650M and 3B parameters, respectively. On the other hand, we employ the TCRpeg [35] model provided by TEINet [16]. TCRpeg is an autoregressive model that utilizes recurrent neural networks with GRU layers to capture the features of TCRs and epitopes. TCRpeg is trained on a dataset comprising |$10^{6}$| TCR sequences and 362 456 unique epitopes, resulting in pre-trained models for epitopes and TCRs.
The performance of these three different embedding representations is evaluated in our GTE method and a comparative analysis is conducted on four datasets. To ensure a fair comparison, we keep the model parameters and data consistent, only changing the sequence embeddings. The results and conclusions of our study are as follows. Table 2 displays the results for the same four datasets under the StrictTCR split. Each result represents the average of five-fold cross-validation. Appendix C displays the RandomTCR split results.
Table 2. Comparison experiments are conducted to evaluate different input embeddings using the proposed GTE model on four datasets with the StrictTCR setting; both AUC and AUPR scores are reported; each result derives from five-fold cross-validation and includes standard deviations; the best scores are marked in bold

| Embeddings | TEINet AUC | TEINet AUPR | pMTnet AUC | pMTnet AUPR | VDJdb AUC | VDJdb AUPR | McPAS AUC | McPAS AUPR |
|---|---|---|---|---|---|---|---|---|
| ESM2 (650M) | 0.824 ± 0.008 | 0.459 ± 0.017 | 0.899 ± 0.003 | 0.632 ± 0.008 | 0.829 ± 0.005 | 0.466 ± 0.009 | 0.742 ± 0.023 | 0.305 ± 0.032 |
| ESM2 (3B) | 0.814 ± 0.015 | 0.441 ± 0.029 | 0.897 ± 0.003 | 0.625 ± 0.007 | 0.827 ± 0.005 | 0.462 ± 0.009 | 0.746 ± 0.019 | 0.308 ± 0.030 |
| TCRpeg | **0.839 ± 0.005** | **0.486 ± 0.012** | **0.911 ± 0.003** | **0.655 ± 0.007** | **0.847 ± 0.005** | **0.500 ± 0.010** | **0.813 ± 0.014** | **0.390 ± 0.031** |
It can be observed that using pre-trained models based on TCR and epitope sequences yields slightly better results than those generated by large-scale protein language models. One possible reason could be the relatively shorter nature of TCR and epitope sequences, which could lead to a loss in performance when using embeddings generated by large protein language models. However, this also prompts us to consider whether using large-scale epitope language models or large-scale TCR-epitope models could yield even better results. Furthermore, the results also underscore the effectiveness of our graph-based topology and its strong generalization. Whether based on protein language models or TCR-epitope language models, our approach consistently outperforms the current SOTA results.
Figure 3. Illustration of prediction results for positive samples from the McPAS dataset using the FoldDock predictor; the x-axis represents the number of MSAs, and the y-axis represents the log-transformed sample counts; the red color signifies correct predictions, and the blue color indicates incorrect predictions.
Figure 4. Ablation experiments are conducted on four datasets; the y-axis represents AUC scores, and the x-axis shows two partitioning methods; in the figures, the crosses (X) represent the average AUC scores obtained from five-fold cross-validation, while the lines indicate the median; Panel A displays results for the McPAS Dataset, Panel B for the TEINet Dataset, Panel C for the pMTnet Dataset, and Panel D for the VDJdb Dataset.
MSA-based protein model evaluation
Recently, significant progress has been made with large-scale protein models based on MSA in the field of protein prediction [13, 14, 49]. FoldDock [12] aims to enhance PPI predictions using AlphaFold2 [14]. Inspired by FoldDock, we assess the feasibility and limitations of applying AlphaFold2 to simulate human TCR–epitope interactions. Specifically, we input all the positive samples of TCR and epitope sequences to the FoldDock predictor. The process continues with MSA searches and evaluations, culminating in the quantification of interface contacts. Due to the extensive time required for MSA extraction and the demanding GPU resource requirements of the complex AlphaFold2 model structure, we only report the performance on the smallest McPAS dataset in Fig. 3. We input all the positive samples (binding) from McPAS into the baseline FoldDock and calculate the accuracy based on the predicted interface contacts. Since we consider the short lengths of the TCR and epitope sequences, defining a prediction as correct based on just one interface contact represents a relaxation of the baseline, yielding an accuracy of 0.4292. Under more stringent constraints, such as defining predictions with more than two interface contacts as correct, the result is 0.3701. When predictions with more than three interface contacts are considered correct, the result is 0.3267 (or 0.6733 if the class assignment is flipped). In contrast, our method achieves an accuracy of 0.7740. More details can be found in Appendix D, with the MSA count and prediction results for all positive samples reported. Such results indicate a challenge when applying MSA-based protein language models directly to TCR–epitope predictions. The limitation may be due to the naturally shorter lengths of TCRs and epitopes, making it difficult to find a substantial number of MSAs. Figure 3 presents a statistical analysis of the impact of the number of MSAs on the results. To enhance clarity, we take the logarithm of the sample counts, with the y-axis representing the log-transformed sample counts and the x-axis showing the number of MSAs. The red color indicates correct predictions, while the blue color signifies incorrect predictions. A trend is evident, with a slight improvement in accuracy as the number of MSAs increases (larger red fraction). This supports our assertion that the direct application of MSA-based protein language models to TCR–epitope interactions is challenging due to the limited availability of MSAs for TCRs and epitopes. On the other hand, it reaffirms the superiority of our graph-based topological approach.
Ablation study
In this section, we present the results of our ablation experiments using box plots, as depicted in Fig. 4. We report findings for the four datasets: TEINet, pMTnet, VDJdb, and McPAS. Here, “|$GTE$|” denotes the use of dynamic graph training along with the inclusion of AUC loss, while “|$w/o\ Both$|” signifies the absence of both the AUC loss and the dynamic graph strategy.
The results indicate that each component of the proposed method consistently enhances predictive performance. Even in the absence of AUC loss and dynamic graph strategies, our model still outperforms all baseline methods on the AUC scores, also with a smaller standard deviation compared with other conditions. This suggests that the graph-based topological approach is inherently robust and efficient. Incorporating dynamic graph learning and AUC maximization leads to a significant improvement in model performance, which proves that dynamic graph learning promotes the acquisition of a more diverse set of negative samples, and AUC maximization effectively addresses the challenge of data imbalance.
Different negative sample settings
To verify the robustness of our model under different negative sample settings, Tables 3 and 4 report the results under positive-to-negative sample ratios of 1:1 and 1:5, respectively. Specifically, we keep the parameters and structure of the model unchanged, only replacing the training and testing datasets. The results clearly demonstrate that our model consistently achieves the best results, regardless of the negative sample setting. Moving from a 1:1 to a 1:5 positive-to-negative ratio, the AUPR of all methods decreases as the number of negative samples increases, which may be due to an increase in false positives. Furthermore, the AUC of all methods increases as the number of negative samples increases, which might be attributed to the larger training set enabling better learning of negative samples, so that these samples receive lower scores during prediction. This also supports our point that richer information can be learned from a large number of negative samples, which can effectively improve the model.
Table 3. Comparison of model performance with a 1:1 positive-to-negative sample ratio; results are obtained from five-fold cross-validation; AUC and AUPR scores, along with their standard deviations, are reported; the best scores are highlighted in bold

| Models | TEINet AUC | TEINet AUPR | pMTnet AUC | pMTnet AUPR | VDJdb AUC | VDJdb AUPR | McPAS AUC | McPAS AUPR |
|---|---|---|---|---|---|---|---|---|
| TEINet | 0.608 ± 0.008 | 0.615 ± 0.008 | 0.619 ± 0.005 | 0.612 ± 0.006 | 0.588 ± 0.005 | 0.597 ± 0.004 | 0.687 ± 0.021 | 0.697 ± 0.021 |
| TEIM | 0.606 ± 0.004 | 0.617 ± 0.004 | 0.618 ± 0.004 | 0.617 ± 0.002 | 0.590 ± 0.003 | 0.602 ± 0.005 | 0.641 ± 0.024 | 0.659 ± 0.026 |
| pMTnet | 0.578 ± 0.004 | 0.577 ± 0.005 | 0.579 ± 0.027 | 0.583 ± 0.030 | 0.571 ± 0.003 | 0.574 ± 0.003 | 0.562 ± 0.007 | 0.562 ± 0.008 |
| GTE (Ours) | **0.734 ± 0.003** | **0.781 ± 0.003** | **0.813 ± 0.002** | **0.839 ± 0.003** | **0.775 ± 0.004** | **0.811 ± 0.004** | **0.743 ± 0.012** | **0.788 ± 0.017** |
Table 4. Comparison of model performance with a 1:5 positive-to-negative sample ratio; results are obtained from five-fold cross-validation; AUC and AUPR scores, along with their standard deviations, are reported; the best scores are highlighted in bold

| Models | TEINet AUC | TEINet AUPR | pMTnet AUC | pMTnet AUPR | VDJdb AUC | VDJdb AUPR | McPAS AUC | McPAS AUPR |
|---|---|---|---|---|---|---|---|---|
| TEINet | 0.614 ± 0.006 | 0.267 ± 0.016 | 0.621 ± 0.002 | 0.243 ± 0.014 | 0.588 ± 0.004 | 0.251 ± 0.007 | 0.691 ± 0.017 | 0.372 ± 0.026 |
| TEIM | 0.616 ± 0.004 | 0.285 ± 0.008 | 0.622 ± 0.003 | 0.263 ± 0.003 | 0.599 ± 0.005 | 0.273 ± 0.005 | 0.647 ± 0.016 | 0.354 ± 0.022 |
| pMTnet | 0.608 ± 0.006 | 0.262 ± 0.005 | 0.580 ± 0.006 | 0.218 ± 0.005 | 0.592 ± 0.002 | 0.251 ± 0.004 | 0.623 ± 0.009 | 0.281 ± 0.015 |
| GTE (Ours) | **0.760 ± 0.008** | **0.419 ± 0.009** | **0.855 ± 0.005** | **0.598 ± 0.008** | **0.786 ± 0.002** | **0.447 ± 0.005** | **0.753 ± 0.014** | **0.392 ± 0.019** |
However, as the ratio changes from 1:5 to 1:10 (see Fig. 2), the baseline methods show varying performance in AUC across four datasets. This may indicate that baseline methods are unable to effectively handle severe data imbalances. In contrast, the AUC scores of our method continue to increase, which convincingly demonstrates that maximizing AUC can effectively address data imbalance. Moreover, employing a dynamic edge update strategy allows the model to learn more comprehensive information from more negative samples, thus enhancing model performance.
Figure 5. Parameter sensitivity analysis on the McPAS dataset; |$\beta $| represents the weight for AUC loss, while |$1-\beta $| is the weight for BCE loss; |$M$| controls the sampling probability for dynamic edge updates; the color varies based on the AUC scores; the darker color indicates higher AUC value.
Parameter sensitivity analysis
To further validate the robustness of our model, we conduct a parameter sensitivity analysis on the McPAS dataset. Specifically, we randomly select four folds for training and one fold for validation. Two key parameters, |$\beta $| and |$M$|, are selected for analysis. |$\beta $| represents the weight for the AUC loss, while |$1-\beta $| is the weight for the BCE loss. |$M$| controls the sampling probability for dynamic edge updates. We run 50 trials with different combinations of |$\beta $| and |$M$|, where |$\beta $| is selected from 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and |$M$| is selected from 0.01, 0.02, 0.1, 0.2, 0.3. Keeping the other parameters and input data constant, Fig. 5 displays the heatmap of the results from these 50 combinations of |$\beta $| and |$M$|. The results show that within the range of |$\beta $| from 0 to 0.7, the scores do not vary much, which demonstrates that the AUC loss is robust and not sensitive to this parameter. As |$\beta $| approaches 1 (|$\beta =1$| would mean the BCE loss is not used at all), the AUC score begins to drop. This emphasizes the importance of the conventional BCE loss for the model. For different values of |$M$| ranging from 0.01 to 0.3, the results remain robust, which once again highlights the superiority of methods based on graph topology.
Figure 6. Comparison of model performance on the PDB independent test set; the crosses (X) represent average scores from five-fold cross-validation, and the lines indicate the median; the y-axis shows scores, while the x-axis displays different models; the results indicate that the AUC of the GTE model is higher than other baselines, further highlighting the superiority of our method compared with the baseline approaches.
Figure 7. Confidence density on positive data from the four datasets; Panel A displays results for the TEINet Dataset, Panel B for the VDJdb Dataset, Panel C for the pMTnet Dataset, and Panel D for the McPAS Dataset.
Evaluation on independent dataset
To further compare the performance of each model, we use the independent test set provided by the baseline TEINet [16], referred to as PDB. This test set collects 105 TCR–epitope complex structures from the public RCSB Protein Data Bank (PDB) [50]. After filtering, it contains 63 positive samples. The PDB independent test set is a roughly balanced dataset, with each epitope binding to one or two TCRs. For a fair comparison, the TEINet dataset is used as the training set. Specifically, we generate a 1:1 ratio of negative samples following the StrictTCR partitioning. The same approach applies to the PDB independent test set. During training, five-fold cross-validation is employed, where one fold is used for validation and model selection, and the results from the five resulting models on the independent test set are reported. As shown in Fig. 6, the AUC of the GTE model is higher than that of the baselines, demonstrating the powerful capability of utilizing graph topological structures.
Confidence density
To further bolster the reliability of our findings, we conduct a confidence analysis of the predictions across the four datasets. We employ Kernel Density Estimation (KDE) [51], a nonparametric technique for estimating probability density functions. This method smooths the distribution of confidence scores without assuming any specific distribution shape, a flexibility that makes KDE a popular choice for data drawn from complicated distributions. The prediction results are the model’s output probabilities, distributed between 0 and 1. Figure 7 displays the confidence density distributions of positive samples within these datasets. A distribution closer to 1 signifies higher confidence. However, TEINet and TEIM tend to predict most positive samples below 0.2, misclassifying them as negative samples. Our method exhibits higher confidence for positive sample predictions compared with all baseline methods, indicating that our model can easily distinguish positive samples and achieve high-confidence predictions, highlighting the superiority of our graph-based topological structure.
An interesting observation is that pMTnet predicts almost all data to be between 0.4 and 0.6. This behavior results from the “Differential loss function” used in pMTnet [8], which aims to ensure that positive samples score higher than negative ones but does not push positive samples toward 1 and negative samples toward 0. Conversely, the other methods apply the Cross Entropy loss [52], and therefore their predictions work toward making positive samples close to 1 and negative samples close to 0.
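For reference, density curves of this kind can be produced with a standard Gaussian KDE, for example using SciPy; the sketch below assumes the prediction scores lie in [0, 1].

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_confidence_density(scores, label):
    """Plot the KDE of prediction scores for positive samples."""
    kde = gaussian_kde(np.asarray(scores))
    grid = np.linspace(0.0, 1.0, 200)
    plt.plot(grid, kde(grid), label=label)

# Example: plot_confidence_density(gte_scores_on_positives, "GTE"); plt.legend(); plt.show()
```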
Limitations
TCR–epitope prediction holds great potential for advancing immunotherapy development [25]. Public databases are accumulating increasingly more data on TCR–epitope interactions, and benefiting from these data, researchers have developed various prediction frameworks based on deep learning algorithms. GTE is a novel approach that uses graph topological structures to capture the potential structural information between TCRs and epitopes, providing a new perspective for TCR–epitope prediction. However, one potential limitation is that GTE follows other SOTA methods in using only the CDR3 region of the TCR’s beta chain, while previous work has shown that data containing only CDR3.|$\beta $| information are of lower quality compared with data that include paired CDRs [6, 53]. Incorporating the full TCR information presents a challenge, which will be the focus of our ongoing work. We plan to design a more comprehensive heterogeneous graph that includes information from both chains of the TCR to achieve accurate TCR specificity prediction.
Conclusion
We present GTE, an innovative framework that leverages graph topological structures to improve the prediction of TCR–epitope binding. Our method tackles the challenges of negative sample utilization and data imbalance through dynamic graph learning and AUC maximization strategies. Extensive experiments on four datasets demonstrate the promise of graph-based topological methods in predicting TCR–epitope interactions and offer valuable perspectives on the complex molecular networks underlying immune responses.
Key Points
We introduce GTE, which is the first method leveraging topological structure to enhance TCR–epitope binding predictions through an inductive heterogeneous GNN.
We develop an advanced dynamic graph learning mechanism for adaptive negative data sampling during training, allowing efficient learning without memory limitations.
We adapt AUC maximization in the graph domain to ensure our model maintains a balanced and robust predictive capability, overcoming the challenge of data imbalance.
Funding
This work was partially supported by US National Science Foundation IIS-2412195, CCF-2400785, the Cancer Prevention and Research Institute of Texas (CPRIT) award (RP230363), the National Institutes of Health (NIH) [R01CA258584/TW], and Cancer Prevention Research Institute of Texas [RP230363/TW].
Data and code availability
GTE is implemented in Python using Pytorch. All the data and code are available at https://github.com/uta-smile/GTE.
References