Abstract

Motivation

Knowledge of protein stability change upon residue variation is an important step for functional protein design and for understanding how protein variants can promote disease onset. Computational methods are important complements to experimental approaches, allowing fast screening of large datasets of variations.

Results

In this work, we present DDGemb, a novel method combining protein language model embeddings and transformer architectures to predict protein ΔΔG upon both single- and multi-point variations. DDGemb has been trained on a high-quality dataset derived from literature and tested on available benchmark datasets of single- and multi-point variations. DDGemb performs at the state of the art in both single- and multi-point variations.

Availability and implementation

DDGemb is available as a web server at https://ddgemb.biocomp.unibo.it. Datasets used in this study are available at https://ddgemb.biocomp.unibo.it/datasets.

1 Introduction

Computational methods for predicting the effect of variations on protein thermodynamic stability play a fundamental role in computational protein design (Notin et al. 2024), in the functional characterization of protein variants (Vihinen 2021), and in their relation to disease onset (Puglisi 2022, Pandey et al. 2023). In recent years, several methods have been presented for the prediction of protein stability change upon variation (ΔΔG).

Available tools can be roughly classified according to the type of information they rely on (protein structure and/or sequence) and the type of method that carries out the prediction. Structure-based methods rely on the availability of the protein structure as an input. Different structure-based predictive approaches have been presented, including methods based on force fields and energy functions (Schymkowitz et al. 2005, Kellogg et al. 2011, Worth et al. 2011), conventional machine-learning methods (Capriotti et al. 2005, Dehouck et al. 2011, Pires et al. 2014b, Laimer et al. 2015, Savojardo et al. 2016, Montanucci et al. 2019, Chen et al. 2020), deep-learning approaches (Li et al. 2020, Benevenuta et al. 2021), and consensus methods (Pires et al. 2014a, Rodrigues et al. 2018, 2021).

Sequence-based methods only use features that can be extracted from the protein sequence. So far, the vast majority of available methods are based on canonical features such as evolutionary information and physicochemical properties, processed by conventional machine-learning methods (Capriotti et al. 2005, Cheng et al. 2006, Fariselli et al. 2015, Montanucci et al. 2019, Li et al. 2021). ACDC-NN-Seq introduced deep-learning methods (convolutional networks) to process sequence profiles extracted from multiple-sequence alignments (Pancotti et al. 2021). More recently, PROSTATA (Umerenkov et al. 2023) adopted protein language models for encoding the wild-type and mutated protein sequences; the protein language model output is then processed in PROSTATA by a simple neural network with a single hidden layer. The sequence-based THPLM adopts pretrained protein language models and a simple convolutional neural network (Gong et al. 2023). Finally, ThermoMPNN (Dieckhaus et al. 2024) also adopts a pretrained pLM, ProteinMPNN (Dauparas et al. 2022), in combination with a deep network to predict ΔΔG upon single-point variations.

One of the major challenges in the field of protein stability prediction is the ability to predict ΔΔG upon multi-point variations, i.e. how protein stability is affected when variations occur at multiple residue positions. So far, only a few methods support multi-point variations as an input: four structure-based methods [FoldX (Schymkowitz et al. 2005), MAESTRO (Laimer et al. 2015, 2016), DDGun3D (Montanucci et al. 2019) and Dynamut2 (Rodrigues et al. 2021)], and one sequence-based method, DDGunSeq (Montanucci et al. 2019). Overall, the performance of methods for predicting ΔΔG upon multi-point variations is generally lower than that obtained for single-point variations.

In this work, we present a novel method called DDGemb for the prediction of protein ΔΔG upon both single- and multi-point variations. DDGemb exploits the ESM2 protein language model (Lin et al. 2023) for protein and variant representation, in combination with a deep-learning architecture based on a Transformer encoder (Vaswani et al. 2017), to predict the ΔΔG.

We train DDGemb using full-length protein sequences and single-point variations from the S2648 dataset (Dehouck et al. 2011), previously adopted to train different state-of-the-art approaches (Dehouck et al. 2011, Fariselli et al. 2015, Savojardo et al. 2016).

The performance of DDGemb is evaluated on ΔΔG prediction upon both single- and multi-point variations. For single-point variations, we adopted the S669 dataset recently presented in the literature and already adopted for benchmarking a large set of tools (Pancotti et al. 2022). For multi-point variations, we adopted a dataset derived from the PTmul dataset (Montanucci et al. 2019). In both benchmarks, DDGemb reports state-of-the-art performance, surpassing both sequence- and structure-based methods.

2 Materials and methods

2.1 Datasets

2.1.1 The S669 blind test set

For a fair and comprehensive evaluation of DDGemb performance and for comparison with other state-of-the-art approaches, we take advantage of an independent dataset adopted in the literature to score a large set of available tools for predicting protein stability change upon variation (Pancotti et al. 2022).

The dataset, named S669, comprises 1338 direct and reverse single-site variations occurring in 95 protein chains. ΔΔG values were retrieved from ThermoMutDB (Xavier et al. 2021) and manually checked by the authors. In this paper, we adopt the convention by which negative ΔΔG values indicate destabilizing variations. Importantly, the dataset was built to be nonredundant at 25% sequence identity with respect to datasets routinely used for training tools available in the literature, including S2648 (Dehouck et al. 2011) and VariBench (Nair and Vihinen 2013). This enables a fair comparison with most state-of-the-art tools. Variations included in S669 are provided in relation to PDB chains. In this work, since DDGemb adopts protein language models for input encoding, we mapped all variations onto full-length UniProt (https://www.uniprot.org/) sequences using SIFTS (Dana et al. 2019).
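As an illustration, the following minimal sketch retrieves a PDB-chain-to-UniProt residue mapping from the public PDBe SIFTS REST service. It is not the exact pipeline used in this work: the endpoint and JSON field names follow the PDBe API, but multi-segment mappings and insertion codes are ignored and error handling is omitted.

```python
# Hypothetical helper: map PDB residue numbers to UniProt positions via the
# PDBe SIFTS "mappings" endpoint. Simplified to the first matching segment.
import requests

def pdb_to_uniprot_offset(pdb_id: str, chain: str):
    """Return (uniprot_accession, offset) such that
    uniprot_pos = pdb_residue_number + offset within the mapped segment."""
    url = f"https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{pdb_id.lower()}"
    data = requests.get(url, timeout=30).json()
    for acc, entry in data[pdb_id.lower()]["UniProt"].items():
        for seg in entry["mappings"]:
            if seg["chain_id"] == chain:
                offset = seg["unp_start"] - seg["start"]["residue_number"]
                return acc, offset
    raise ValueError(f"No SIFTS mapping for {pdb_id} chain {chain}")
```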

2.1.2 Training set: the S2450 dataset

To build our training set, we started from the well-known and widely adopted S2648 dataset (Dehouck et al. 2011), containing 2648 single-point variations in 131 different proteins. The associated experimental ΔΔG values were retrieved from the ProTherm database (Bava et al. 2004) and were manually checked and corrected to avoid inconsistencies. Unlike previous works adopting the same dataset (Dehouck et al. 2011, Fariselli et al. 2015), in which variations are mapped directly onto PDB chain sequences, in this work we adopted full-length protein sequences from UniProt. To this aim, we used SIFTS (Dana et al. 2019) to map PDB chains and variant positions onto the corresponding UniProt sequences.

Homology reduction of the S669 dataset against S2648 was originally performed in Pancotti et al. (2022) considering only the PDB-covered portions of the sequences. This procedure does not guarantee detection of all sequence similarity over full-length sequences. For this reason, in this work we compared the UniProt sequences in S669 and S2648, removing from the training set those having >25% sequence identity with any sequence in the test set (S669). Overall, 18 sequences were removed from S2648, accounting for 198 single-point variations. The reduced dataset is referred to as S2450 throughout the paper.

The S2450 dataset was adopted here to perform 5-fold cross-validation. To this aim, we implemented the stringent data-split procedure described in Fariselli et al. (2015), by which all variations occurring on the same protein are put in the same cross-validation subset, and proteins are divided among subsets taking pairwise sequence identity into consideration (with a 25% threshold). In this way, during cross-validation, no redundancy is present between the protein sequences included in the training and validation sets.
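A minimal sketch of such a protein-level split is shown below. It assumes sequences have already been clustered at 25% identity with an external tool (e.g. MMseqs2); `clusters` (cluster identifier to protein IDs) and `n_vars` (variations per protein) are illustrative names, not part of the released code.

```python
# Sketch: assign whole 25%-identity clusters to folds so that similar proteins
# (and all variations of a protein) never span two folds. Clusters are
# processed from largest to smallest and greedily added to the least-loaded
# fold to keep the number of variations per fold balanced.
def split_folds(clusters, n_vars, k=5):
    fold_of = {}                    # protein ID -> fold index
    load = [0] * k                  # variations assigned to each fold
    ordered = sorted(clusters.items(),
                     key=lambda kv: -sum(n_vars[p] for p in kv[1]))
    for cid, prots in ordered:
        f = min(range(k), key=load.__getitem__)
        for p in prots:
            fold_of[p] = f
            load[f] += n_vars[p]
    return fold_of
```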

By construction, the S2450 dataset is unbalanced toward destabilizing variations (i.e. negative ΔΔG values). To balance the dataset, reduce the bias toward destabilizing ΔΔG values, and improve the model's capability of predicting stabilizing variations, we exploited the thermodynamic reversibility of variations, by which ΔΔG(AB) = −ΔΔG(BA) (Capriotti et al. 2008). Using the reversibility property, the set of variations can be artificially doubled to include reverse variations, switching the sign of the experimental ΔΔG values.
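This augmentation is a one-line transformation; a minimal sketch, assuming variation records stored as (wild-type sequence, variant sequence, ΔΔG) tuples, is:

```python
# Sketch of the reverse-variation augmentation: for each record, the reverse
# record swaps wild-type and variant sequences and flips the sign of ddG.
def add_reverse_variations(records):
    augmented = list(records)
    for wt_seq, var_seq, ddg in records:
        augmented.append((var_seq, wt_seq, -ddg))
    return augmented
```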

2.1.3 Multiple variations: the reduced PTmul dataset

We also adopted a dataset for testing DDGemb on the prediction of ΔΔG upon multi-point variations. The dataset, referred to as PTmul, was introduced in Montanucci et al. (2019): it comprises 914 multi-point variations in 91 proteins. However, the original PTmul dataset shares a high level of sequence similarity with our S2450 training dataset. To perform a fair evaluation, we excluded from PTmul all proteins that are similar to any protein in our S2450 training set. After this reduction step, we retained 82 multi-point variants occurring in 14 proteins. Although the number of variants is significantly reduced compared to the original dataset, this homology-reduction procedure ensures a fair evaluation of the different methods. The reduced dataset is referred to as PTmul-NR.

2.2 The DDGemb method

An overview of the DDGemb deep-learning model is shown in Fig. 1. The architecture comprises two components: (i) the input encoding and (ii) the ΔΔG prediction model. In the following sections, we describe both components in detail.

Figure 1. The DDGemb model architecture. For input encoding details refer to Section 2.2.1; for ΔΔG prediction see Sections 2.2.2 and 2.2.3.

2.2.1 Input encoding

For encoding a variation, we start from the wild-type and variant protein sequences, the latter derived from the former upon either a single- or a multi-point variation. In the first step, the two sequences, both of length L, are encoded using the ESM2 protein language model (pLM) (Rives et al. 2021, Lin et al. 2023). Among the different models available, and after input-encoding optimization (see Section 3), we adopted the medium-size, 33-layer model with 650M parameters trained on the UniRef50 database. This model provides residue-level embeddings of dimension d = 1280 and represents a good trade-off between representation expressivity and computational requirements.

For generating embeddings, we adopted the ESM2 package available at https://github.com/facebookresearch/esm.

The application of the ESM2 pLM provides two L×d matrices, $E_{wt}$ and $E_{vt}$, representing the residue-level embeddings of the wild-type and variant sequences (derived from either a single- or multi-point variation), respectively. A single L×d matrix D encoding the variation is then generated by computing the element-wise difference of $E_{wt}$ and $E_{vt}$:

$$ D = E_{wt} - E_{vt} \tag{1} $$

The matrix D is used as input for the downstream ΔΔG prediction architecture.
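A minimal sketch of this encoding step with the fair-esm package is shown below; the model loader and field names follow the package's documented usage, and `wt_seq`/`var_seq` are plain amino acid strings of equal length L.

```python
# Sketch: encode a (wild-type, variant) pair with ESM2 (650M, 33 layers) and
# compute the difference matrix D of Eq. (1).
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def encode_variation(wt_seq: str, var_seq: str) -> torch.Tensor:
    _, _, tokens = batch_converter([("wt", wt_seq), ("var", var_seq)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]      # shape (2, L+2, 1280), with BOS/EOS
    L = len(wt_seq)
    e_wt, e_vt = reps[0, 1:L + 1], reps[1, 1:L + 1]
    return e_wt - e_vt                     # the L x 1280 matrix D
```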

2.2.2 The Transformer-based ΔΔG prediction network

The remaining part of the DDGemb architecture is devised to predict a ΔΔG value starting from the input matrix D encoding the protein variant. The hyperparameters of the final model were optimized in cross-validation, according to the different configurations reported in Table 1 (see Section 3). After optimization, the final selected model is Model4 (Table 1), described in the following.

Table 1. Five-fold cross-validation results of different ΔΔG prediction architectures.

Model    Configuration           PCC           RMSE          MAE
Model0   m=32,  h=2, s=128       0.68 ± 0.01   1.27 ± 0.10   0.97 ± 0.09
Model1   m=64,  h=2, s=256       0.70 ± 0.01   1.23 ± 0.11   0.94 ± 0.09
Model2   m=64,  h=4, s=256       0.70 ± 0.01   1.24 ± 0.11   0.96 ± 0.09
Model3   m=128, h=4, s=512       0.70 ± 0.01   1.24 ± 0.11   0.96 ± 0.09
Model4   m=128, h=8, s=512       0.71 ± 0.01   1.23 ± 0.10   0.94 ± 0.08
Model5   m=256, h=8, s=1024      0.71 ± 0.02   1.23 ± 0.12   0.95 ± 0.10

m: number of 1D-convolution filters; h: number of attention heads; s: hidden size of the Transformer encoder FFN. PCC, RMSE, and MAE are averaged over the five cross-validation folds (± standard deviation).

The input matrix D is first processed by a 1D-convolution layer comprising m = 128 filters of size w = 15, with ReLU activation functions. The 1D-convolution layer projects the higher-dimensional input data into a lower-dimensional space of size m, extracting local contextual information through a series of sliding filters of width w. The output of the 1D-convolution is a matrix C of dimension L×m.

The matrix C is then passed through a Transformer encoder layer (Vaswani et al. 2017), consisting of a multi-head attention layer with h = 8 attention heads, residual connections, and a position-wise feed-forward network (FFN). The Transformer encoder computes self-attention across the input sequence, producing in output a representation of the input that takes into account the relations among the different positions of the input sequence.

The architecture adopted here is directly derived from the original Transformer definition (Vaswani et al. 2017). Formally, given the input sequence C of dimension L×m, each head i of the multi-head attention layer adopts three matrices of learnable weights, $A_Q^i$, $A_K^i$, and $A_V^i$, each of dimension m×r, where r = m/h (equal to 16 in our case) and h is the number of attention heads (here set to 8). The input matrix C is first projected using $A_Q^i$, $A_K^i$, and $A_V^i$ as follows:

$$ Q_i = C \cdot A_Q^i \tag{2} $$

$$ K_i = C \cdot A_K^i \tag{3} $$

$$ V_i = C \cdot A_V^i \tag{4} $$

where $\cdot$ denotes the matrix product.
Then, for each head i, an attention output $Z_i$ of dimension L×r is computed as follows:

$$ Z_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{r}}\right) V_i \tag{5} $$
The $Z_i$ outputs of the different attention heads are then concatenated, multiplied by an output weight matrix $A_O$ of dimension m×m, and the result is added to the input matrix C through a residual connection:

$$ Z = [Z_1, \ldots, Z_h] \cdot A_O + C \tag{6} $$

where [ ] denotes the concatenation operator (by rows) and the output matrix Z has dimension L×m (as the input matrix C).
The final Transformer encoder output F of dimension L×m is computed by independently applying a position-wise feed-forward network (FFN) to each position 1 ≤ j ≤ L of Z, followed by a residual connection. In other words, each row $f_j$ of F is computed from the corresponding row $z_j$ of Z as follows:

$$ f_j = \mathrm{ReLU}(z_j W_1 + b_1) W_2 + b_2 + z_j \tag{7} $$

where ReLU is the activation function g(x) = max(0, x), and $W_1$, $b_1$, $W_2$, $b_2$ are position-independent weights and biases of the FFN, having dimensions m×s, 1×s, s×m, and 1×m, respectively. Here, we set the dimensionality s of the hidden layer of the position-wise FFN to 512.
The output matrix F of the Transformer encoder is then collapsed into two one-dimensional vectors by means of Global Average and Global Max Pooling layers, denoted $p_{ave}$ and $p_{max}$, respectively, acting on the first dimension L of the matrix F:

$$ p_{ave} = \frac{1}{L} \sum_{j=1}^{L} f_j \tag{8} $$

$$ p_{max} = \max_{1 \le j \le L} f_j \tag{9} $$

where the maximum in Equation (9) is taken element-wise.
The pooled vectors are then concatenated into a single vector P of size 2m:

$$ P = [p_{ave}, p_{max}] \tag{10} $$

where [ ] denotes the concatenation operator.
P is finally processed by a linear output layer parametrized by a weight vector $w_O$ and bias $b_O$, producing in output the predicted ΔΔG value $\hat{y}$:

$$ \hat{y} = P \cdot w_O + b_O \tag{11} $$
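Putting the pieces together, a minimal PyTorch sketch of the selected Model4 configuration (m = 128, w = 15, h = 8, s = 512) follows. It uses the library's `nn.TransformerEncoderLayer` in place of Equations (2)-(7) and omits training details such as dropout; it is an illustration, not the released implementation.

```python
# Sketch of the DDGemb prediction head: 1D convolution, one Transformer
# encoder layer, global average/max pooling, and a linear output layer.
import torch
import torch.nn as nn

class DDGembNet(nn.Module):
    def __init__(self, d=1280, m=128, w=15, h=8, s=512):
        super().__init__()
        # m filters of width w; padding keeps the sequence length L unchanged
        self.conv = nn.Conv1d(d, m, kernel_size=w, padding=w // 2)
        # multi-head self-attention + position-wise FFN with residuals
        self.encoder = nn.TransformerEncoderLayer(
            d_model=m, nhead=h, dim_feedforward=s, batch_first=True)
        self.out = nn.Linear(2 * m, 1)

    def forward(self, D):                    # D: (batch, L, d), Eq. (1)
        C = torch.relu(self.conv(D.transpose(1, 2))).transpose(1, 2)
        F = self.encoder(C)                  # (batch, L, m)
        p = torch.cat([F.mean(dim=1),        # global average pooling, Eq. (8)
                       F.max(dim=1).values], # global max pooling, Eq. (9)
                      dim=-1)                # concatenation, Eq. (10)
        return self.out(p).squeeze(-1)       # predicted ddG, Eq. (11)
```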

2.2.3 Model training and implementation

Given a dataset of N protein variations $\{D_1, D_2, \ldots, D_N\}$, corresponding target ΔΔG values $\{y_1, y_2, \ldots, y_N\}$, and model predictions $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N\}$, training is carried out by minimizing the Mean Squared Error on the training data:

$$ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \tag{12} $$

The optimization is carried out with gradient descent using the Adam optimizer (Kingma and Ba 2017). The training data are split into mini-batches of size 128. Training is performed for at most 500 epochs and stopped when the error starts increasing on a subset of the training set used as validation data (early stopping).

The training procedure and the model itself were implemented using the PyTorch Python package (https://pytorch.org). All experiments were carried out on a single machine equipped with two AMD EPYC 7413 CPUs with 48/96 CPU cores/threads and 768 GB RAM.
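For illustration, a minimal sketch of this training loop is given below, assuming `train_loader` and `val_loader` yield (D, ΔΔG) batches; the patience value is an illustrative choice, not a reported hyperparameter.

```python
# Sketch: Adam + MSE training with early stopping on a validation split.
import copy
import torch

def train(model, train_loader, val_loader, epochs=500, patience=20):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.MSELoss()
    best, best_state, wait = float("inf"), None, 0
    for _ in range(epochs):
        model.train()
        for D, y in train_loader:
            opt.zero_grad()
            loss_fn(model(D), y).backward()  # Eq. (12) on the mini-batch
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(D), y).item() for D, y in val_loader)
        if val < best:                       # validation error still improving
            best, best_state, wait = val, copy.deepcopy(model.state_dict()), 0
        else:                                # stop once it starts increasing
            wait += 1
            if wait >= patience:
                break
    model.load_state_dict(best_state)
    return model
```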

2.2.4 DDGemb web server

We release DDGemb as a web server at https://ddgemb.biocomp.unibo.it. The server provides a user-friendly web interface with both interactive and batch submission modes.

In the interactive mode, the user can predict ΔΔG for up to 100 variations on a single input protein sequence. Results of interactive jobs can be directly visualized on the DDGemb website and downloaded in JSON or TSV format. Both single- and multi-point variations are supported.

Batch submission mode is dedicated to larger prediction jobs. The user can submit up to 2000 variations occurring on at most 500 proteins per job. Results of batch jobs can be downloaded in JSON and/or TSV formats.

In both cases, user jobs are maintained for one month after completion. The user can retrieve job results using the job identifier assigned upon submission.

The web application is implemented using Django (version 4.0.4), Bootstrap (version 5.3.0), jQuery (version 3.6.0), and the neXtProt FeatureViewer (version 1.3.0-beta6) for graphical visualization of predicted variants and their ΔΔG along the sequence.

2.3 Scoring performance

To score the performance of the different approaches, we use the following well-established scoring indexes. In what follows, e and p are experimental and predicted ΔΔG values, respectively, while $p_{dir}$ and $p_{inv}$ are predicted ΔΔG values for direct and corresponding reverse variations, respectively.

The Pearson correlation coefficient (PCC) between e and p is defined as:

$$ \mathrm{PCC} = \frac{\sum_{i=1}^{N} (e_i - \bar{e})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{N} (e_i - \bar{e})^2 \sum_{i=1}^{N} (p_i - \bar{p})^2}} \tag{13} $$

where $\bar{e}$ and $\bar{p}$ are the average experimental and predicted ΔΔG values, respectively.

The Root Mean Square Error (RMSE) between e and p is defined as:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (e_i - p_i)^2} \tag{14} $$

The Mean Absolute Error (MAE) between e and p is defined as:

$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |e_i - p_i| \tag{15} $$

To score the anti-symmetry properties of the different tools, we adopted two additional measures defined in the literature (Pucci et al. 2018).

The Pearson correlation between $p_{dir}$ and $p_{inv}$, referred to as $r_{d\text{-}r}$, is defined as:

$$ r_{d\text{-}r} = \frac{\sum_{i=1}^{N} (p_{dir,i} - \bar{p}_{dir})(p_{inv,i} - \bar{p}_{inv})}{\sqrt{\sum_{i=1}^{N} (p_{dir,i} - \bar{p}_{dir})^2 \sum_{i=1}^{N} (p_{inv,i} - \bar{p}_{inv})^2}} \tag{16} $$

The anti-symmetry bias $\langle\delta\rangle$ is the average sum of the predictions for each direct variation and its reverse:

$$ \langle\delta\rangle = \frac{1}{N} \sum_{i=1}^{N} (p_{dir,i} + p_{inv,i}) \tag{17} $$
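A minimal NumPy sketch of these scoring indexes, with `e`, `p`, `p_dir`, and `p_inv` as 1D arrays, is:

```python
# Sketch of the scoring indexes of Eqs. (13)-(17).
import numpy as np

def pcc(e, p):
    return np.corrcoef(e, p)[0, 1]            # Eq. (13)

def rmse(e, p):
    return np.sqrt(np.mean((e - p) ** 2))     # Eq. (14)

def mae(e, p):
    return np.mean(np.abs(e - p))             # Eq. (15)

def antisymmetry(p_dir, p_inv):
    r_dr = np.corrcoef(p_dir, p_inv)[0, 1]    # Eq. (16)
    bias = np.mean(p_dir + p_inv)             # Eq. (17)
    return r_dr, bias
```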

3 Results

3.1 Cross-validation results on the S2450 dataset

In a first experiment, we performed 5-fold cross-validation on the S2450 dataset. To this aim, we adopted the most stringent data-split procedure proposed in Fariselli et al. (2015), which retains all variations occurring in the same protein within the same cross-validation subset and confines proteins sharing >25% sequence identity to the same subset. Sequence comparison was performed using full-length UniProt sequences.

Considering the average PCC, RMSE, and MAE values and the corresponding standard deviations, the highest performance is obtained with the architecture Model4, comprising 128 1D-convolution filters, 8 Transformer encoder attention heads, and 512 hidden units in the Transformer encoder FFN (Table 1). This configuration was therefore chosen as the final model.

3.2 Prediction of single-point variations on the S669 dataset

We compared DDGemb with several state-of-the-art methods introduced in recent years, using the common benchmark dataset S669.

Results for 21 different methods were taken from Pancotti et al. (2022); results for DDGemb (presented in this work), PROSTATA (Umerenkov et al. 2023), THPLM (Gong et al. 2023), and ThermoMPNN (Dieckhaus et al. 2024) were extracted from the respective papers. Scored methods include nine sequence-based predictors, namely INPS (Fariselli et al. 2015), ACDC-NN-Seq (Pancotti et al. 2022), DDGun (Montanucci et al. 2019), I-Mutant3.0-Seq (Capriotti et al. 2005), SAAFEC-SEQ (Li et al. 2021), MUpro (Cheng et al. 2006), PROSTATA (Umerenkov et al. 2023), THPLM (Gong et al. 2023), and ThermoMPNN (Dieckhaus et al. 2024), and fifteen structure-based methods, namely ACDC-NN (Benevenuta et al. 2021), PremPS (Chen et al. 2020), DDGun3D (Montanucci et al. 2019), INPS-3D (Savojardo et al. 2016), ThermoNet (Li et al. 2020), MAESTRO (Laimer et al. 2015, 2016), Dynamut (Rodrigues et al. 2018), PoPMuSiC (Dehouck et al. 2011), DUET (Pires et al. 2014a), SDM (Worth et al. 2011), mCSM (Pires et al. 2014b), Dynamut2 (Rodrigues et al. 2021), I-Mutant3.0 (Capriotti et al. 2005), Rosetta (Kellogg et al. 2011), and FoldX (Schymkowitz et al. 2005). Results are listed in Table 2.

Table 2. Comparative benchmark of different sequence- and structure-based methods on the S669 independent test set of single-point variations.

Method           Input   Total               Direct              Reverse             Symmetry
                         PCC   RMSE  MAE     PCC   RMSE  MAE     PCC   RMSE  MAE     r_d-r   ⟨δ⟩
DDGemb           SEQ     0.68  1.40  0.99    0.53  1.40  0.99    0.52  1.40  0.99    −0.97   0.01
PROSTATA         SEQ     0.65  1.45  1.00    0.49  1.45  1.00    0.49  1.45  0.99    −0.99  −0.01
ACDC-NN          3D      0.61  1.50  1.05    0.46  1.49  1.05    0.45  1.50  1.06    −0.98   0.02
INPS-Seq         SEQ     0.61  1.52  1.10    0.43  1.52  1.09    0.43  1.53  1.10     1.00   0.00
PremPS           3D      0.62  1.49  1.07    0.41  1.50  1.08    0.42  1.49  1.05    −0.85   0.09
ACDC-NN-Seq      SEQ     0.59  1.53  1.08    0.42  1.53  1.08    0.42  1.53  1.08     1.00   0.00
DDGun3D          3D      0.57  1.61  1.13    0.43  1.60  1.11    0.41  1.62  1.14    −0.97   0.05
INPS3D           3D      0.55  1.64  1.19    0.43  1.50  1.07    0.33  1.77  1.31    −0.50   0.38
THPLM            SEQ     0.53  1.63  –       0.39  1.60  –       0.35  1.66  –       −0.96  −0.01
ThermoNet        3D      0.51  1.64  1.20    0.39  1.62  1.17    0.38  1.66  1.23    −0.85   0.05
DDGun            SEQ     0.57  1.74  1.25    0.41  1.72  1.25    0.38  1.75  1.25    −0.96   0.05
MAESTRO          3D      0.44  1.80  1.30    0.50  1.44  1.06    0.20  2.10  1.66    −0.22   0.57
ThermoMPNN       SEQ     –     –     –       0.43  1.52  –       –     –     –        –      –
Dynamut          3D      0.50  1.65  1.21    0.41  1.60  1.19    0.34  1.69  1.24    −0.58   0.06
PoPMuSiC         3D      0.46  1.82  1.37    0.41  1.51  1.09    0.24  2.09  1.64    −0.32   0.69
DUET             3D      0.41  1.86  1.39    0.41  1.52  1.10    0.23  2.14  1.68    −0.12   0.67
I-Mutant3.0-Seq  SEQ     0.37  1.91  1.47    0.34  1.54  1.15    0.22  2.22  1.79    −0.48   0.76
SDM              3D      0.32  1.93  1.45    0.41  1.67  1.26    0.13  2.16  1.64    −0.40   0.40
mCSM             3D      0.37  1.96  1.49    0.36  1.54  1.13    0.22  2.30  1.86    −0.05   0.85
Dynamut2         3D      0.36  1.90  1.42    0.34  1.58  1.15    0.17  2.16  1.69     0.03   0.64
I-Mutant3.0      3D      0.32  1.96  1.49    0.36  1.52  1.12    0.15  2.32  1.87    −0.06   0.81
Rosetta          3D      0.47  2.69  2.05    0.39  2.70  2.08    0.40  2.68  2.02    −0.72   0.61
FoldX            3D      0.31  2.39  1.53    0.22  2.30  1.56    0.22  2.48  1.50    −0.20   0.34
SAAFEC-SEQ       SEQ     0.26  2.02  1.54    0.36  1.54  1.13   −0.01  2.40  1.94    −0.03   0.83
MUpro            SEQ     0.32  2.03  1.58    0.25  1.61  1.21    0.20  2.38  1.96    −0.32   0.95

Results for all methods except DDGemb, THPLM, ThermoMPNN, and PROSTATA were taken from Pancotti et al. (2022). For PROSTATA and THPLM, direct and reverse PCC, RMSE, and MAE were taken from the reference papers (Umerenkov et al. 2023, Gong et al. 2023). ThermoMPNN results were taken from Dieckhaus et al. (2024). PROSTATA total PCC, RMSE, MAE, r_d-r, and ⟨δ⟩ were computed using the predictions available in the method's GitHub repository. Dashes (–) denote values not reported.

For each method, we report the PCC, RMSE, and MAE computed considering (i) all variations (both direct and reverse) in the dataset (columns under "Total"), (ii) only direct variations (columns under "Direct"), and (iii) only reverse variations (columns under "Reverse"). In addition, we computed $r_{d\text{-}r}$ and ⟨δ⟩.

On the S669 dataset, DDGemb reports the best PCC, RMSE, and MAE values (Total, Direct, and Reverse), i.e. the highest correlations and the lowest errors. Overall, DDGemb scores as the top-performing tool in this benchmark, surpassing both structure- and sequence-based methods.

3.3 Prediction of multi-point variations

Finally, we tested DDGemb on the prediction of multi-point variations using the PTmul-NR dataset. This allowed us to compare directly with other methods supporting multiple variations, namely DDGun/DDGun3D (Montanucci et al. 2019), MAESTRO (Laimer et al. 2015), and FoldX (Schymkowitz et al. 2005). Results are listed in Table 3.

Table 3. Comparative benchmark of different methods on multi-point variations from the PTmul-NR dataset.

Method     PCC    RMSE   MAE
DDGemb     0.59   2.16   1.59
FoldX      0.36   5.51   3.66
MAESTRO    0.28   2.55   1.88
DDGun      0.23   2.55   2.10
DDGun3D    0.17   2.57   2.08

On the PTmul-NR dataset, DDGemb significantly outperforms DDGun, DDGun3D, FoldX, and MAESTRO on the prediction of multi-point variations, achieving the highest correlation coefficient of 0.59, and the lowest RMSE and MAE values of 2.16 and 1.59, respectively.

These results suggest that DDGemb can be effectively used to assess the impact of multi-point variations on protein stability with high accuracy. Remarkably, our method has been trained using only single-point variants, suggesting the ability of the proposed approach to generalize to multi-point variants as well.

4 Conclusion

In this work, we present DDGemb, a novel method based on protein language models and Transformers to predict protein stability change (ΔΔG) upon single- and multi-point variations. Our method has been trained on a high-quality dataset derived from the literature and tested on recently introduced benchmark datasets of thermodynamic data for single- and multi-point variations. In all benchmarks, DDGemb reports performance superior to the state of the art, outperforming both sequence- and structure-based methods and achieving an overall PCC of 0.68 on single-point variations. Moreover, on multi-point variations, our method reports a PCC of 0.59, significantly higher than that of the second-best approach, FoldX (PCC = 0.36). Our study highlights the value of a Transformer architecture specifically trained to predict the ΔΔG upon variation, in combination with the numerical representations provided by protein language models.

Conflict of interest: None declared.

Funding

The work was supported by the European Union NextGenerationEU through the Italian Ministry of University and Research under the projects "Consolidation of the Italian Infrastructure for Omics Data and Bioinformatics" (ElixirxNextGenIT) (Investment PNRR-M4C2-I3.1, Project IR_0000010, CUP B53C22001800006), "HEAL ITALIA" (Investment PNRR-M4C2-I1.3, Project PE_00000019, CUP J33C22002920006), and "National Centre for HPC, Big Data and Quantum Computing" (Investment PNRR-M4C2-I1.4, Project CN_00000013).

Data availability

The data underlying this article are available in the article and in its online supplementary material.

References

Bava KA, Gromiha MM, Uedaira H et al. ProTherm, version 4.0: thermodynamic database for proteins and mutants. Nucleic Acids Res 2004;32:D120–1.

Benevenuta S, Pancotti C, Fariselli P et al. An antisymmetric neural network to predict free energy changes in protein variants. J Phys D Appl Phys 2021;54:245403.

Capriotti E, Fariselli P, Rossi I et al. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008;9:S6.

Capriotti E, Fariselli P, Casadio R et al. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 2005;33:W306–10.

Chen Y, Lu H, Zhang N et al. PremPS: predicting the impact of missense mutations on protein stability. PLoS Comput Biol 2020;16:e1008543.

Cheng J, Randall A, Baldi P et al. Prediction of protein stability changes for single-site mutations using support vector machines. Proteins 2006;62:1125–32.

Dana JM, Gutmanas A, Tyagi N et al. SIFTS: updated structure integration with function, taxonomy and sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res 2019;47:D482–9.

Dauparas J, Anishchenko I, Bennett N et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022;378:49–56.

Dehouck Y, Kwasigroch JM, Gilis D et al. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics 2011;12:151.

Dieckhaus H, Brocidiacono M, Randolph NZ et al. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci USA 2024;121:e2314853121.

Fariselli P, Martelli PL, Savojardo C et al. INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics 2015;31:2816–21.

Gong J, Wang J, Zong X et al. Prediction of protein stability changes upon single-point variant using 3D structure profile. Comput Struct Biotechnol J 2023;21:354–64.

Kellogg EH, Leaver-Fay A, Baker D et al. Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins 2011;79:830–8.

Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv, arXiv:1412.6980v9, 2017, preprint: not peer reviewed.

Laimer J, Hofer H, Fritz M et al. MAESTRO—multi agent stability prediction upon point mutations. BMC Bioinformatics 2015;16:116.

Laimer J, Hiebl-Flach J, Lengauer D et al. MAESTROweb: a web server for structure-based protein stability prediction. Bioinformatics 2016;32:1414–6.

Li B, Yang YT, Capra JA et al. Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput Biol 2020;16:e1008291.

Li G, Panday SK, Alexov E et al. SAAFEC-SEQ: a sequence-based method for predicting the effect of single point mutations on protein thermodynamic stability. Int J Mol Sci 2021;22:606.

Lin Z, Akin H, Rao R et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30.

Montanucci L, Capriotti E, Frank Y et al. DDGun: an untrained method for the prediction of protein stability changes upon single and multiple point variations. BMC Bioinformatics 2019;20:335.

Nair PS, Vihinen M. VariBench: a benchmark database for variations. Hum Mutat 2013;34:42–9.

Notin P, Rollins N, Gal Y et al. Machine learning for functional protein design. Nat Biotechnol 2024;42:216–28.

Pancotti C, Benevenuta S, Repetto V et al. A deep-learning sequence-based method to predict protein stability changes upon genetic variations. Genes (Basel) 2021;12:911.

Pancotti C, Benevenuta S, Birolo G et al. Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset. Brief Bioinform 2022;23:bbab555.

Pandey P, Ghimire S, Wu B et al. On the linkage of thermodynamics and pathogenicity. Curr Opin Struct Biol 2023;80:102572.

Pires DEV, Ascher DB, Blundell TL et al. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res 2014a;42:W314–9.

Pires DEV, Ascher DB, Blundell TL et al. mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 2014b;30:335–42.

Pucci F, Bernaerts KV, Kwasigroch JM et al. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics 2018;34:3659–65.

Puglisi R. Protein mutations and stability, a link with disease: the case study of frataxin. Biomedicines 2022;10:425.

Rives A, Meier J, Sercu T et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021;118:e2016239118.

Rodrigues CH, Pires DE, Ascher DB et al. DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability. Nucleic Acids Res 2018;46:W350–5.

Rodrigues CHM, Pires DEV, Ascher DB et al. DynaMut2: assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein Sci 2021;30:60–9.

Savojardo C, Fariselli P, Martelli PL et al. INPS-MD: a web server to predict stability of protein variants from sequence and structure. Bioinformatics 2016;32:2542–4.

Schymkowitz J, Borg J, Stricher F et al. The FoldX web server: an online force field. Nucleic Acids Res 2005;33:W382–8.

Umerenkov D, Nikolaev F, Shashkova TI et al. PROSTATA: a framework for protein stability assessment using transformers. Bioinformatics 2023;39:btad671.

Vaswani A, Shazeer N, Parmar N et al. Attention is all you need. arXiv, arXiv:1706.03762v7, 2017, preprint: not peer reviewed.

Vihinen M. Functional effects of protein variants. Biochimie 2021;180:104–20.

Worth CL, Preissner R, Blundell TL et al. SDM—a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res 2011;39:W215–22.

Xavier JS, Nguyen T-B, Karmarkar M et al. ThermoMutDB: a thermodynamic database for missense mutations. Nucleic Acids Res 2021;49:D475–9.

Author notes

Castrense Savojardo and Matteo Manfredi contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Jianlin Cheng