-
PDF
- Split View
-
Views
-
Cite
Cite
Matteo Tiberti, Thilde Terkelsen, Kristine Degn, Ludovica Beltrame, Tycho Canter Cremers, Isabelle da Piedade, Miriam Di Marco, Emiliano Maiani, Elena Papaleo, MutateX: an automated pipeline for in silico saturation mutagenesis of protein structures and structural ensembles, Briefings in Bioinformatics, Volume 23, Issue 3, May 2022, bbac074, https://doi.org/10.1093/bib/bbac074
- Share Icon Share
Abstract
Mutations, which result in amino acid substitutions, influence the stability of proteins and their binding to biomolecules. A molecular understanding of the effects of protein mutations is both of biotechnological and medical relevance. Empirical free energy functions that quickly estimate the free energy change upon mutation (ΔΔG) can be exploited for systematic screenings of proteins and protein complexes. In silico saturation mutagenesis can guide the design of new experiments or rationalize the consequences of known mutations. Often software such as FoldX, while fast and reliable, lack the necessary automation features to apply them in a high-throughput manner. We introduce MutateX, a software to automate the prediction of ΔΔGs associated with the systematic mutation of each residue within a protein, or protein complex to all other possible residue types, using the FoldX energy function. MutateX also supports ΔΔG calculations over protein ensembles, upon post-translational modifications and in multimeric assemblies. At the heart of MutateX lies an automated pipeline engine that handles input preparation, parallelization and outputs publication-ready figures. We illustrate the MutateX protocol applied to different case studies. The results of the high-throughput scan provided by our tools can help in different applications, such as the analysis of disease-associated mutations, to complement experimental deep mutational scans, or assist the design of variants for industrial applications. MutateX is a collection of Python tools that relies on open-source libraries. It is available free of charge under the GNU General Public License from https://github.com/ELELAB/mutatex.
Introduction
Advances in proteomics are now providing a massive amount of data on protein–protein interaction or post-translationally modified proteins [1–5] that benefit from structural studies to be rationalized [6, 7]. On the other hand, genomic initiatives allow the identification of missense mutations [8–10] in the coding region of genes that need to be understood at a structural and functional level [11–13]. Indeed, single amino acid substitutions (i.e. mutations) or post-translational modifications (PTMs) in proteins may alter structural stability and intermolecular interactions, impacting protein activity, function and cellular signaling. An accurate and systematic prediction of changes in stability and binding upon mutations or PTMs is thus crucial to understand protein variants at the molecular level [14–16]. Moreover, the possibility to predict mutational effects also strongly impacts protein engineering for biotechnological and industrial applications [17, 18]. It has also been shown that disease-causing mutations are characterized by changes in structural stability or binding affinity [19–21]. Most prediction tools are designed to work on a single conformation at a time. However, proteins are not static entities, and they can attain multiple states in solutions that are tightly related to their function [22, 23]. Some proteins are conformationally heterogeneous and undergo a myriad of structural changes, including sparsely populated states that are stabilized upon interaction with a biological partner [24, 25]. It is thus important to account for the conformational ensembles of proteins to achieve a proper understanding of the impact of mutations and PTMs. Protein ensembles can be obtained by experimental techniques such as NMR [26, 27], simulations [28–30] or an integration of the two [31–33].
Several different methods have been developed to estimate changes in folding and binding free energies upon amino-acidic substitutions from a structure, many of which are extensively summarized in a recent review article [34]. Synthetic benchmarks of these methods have been performed over time, along with comparisons with experimental data. Structure-based methods were generally considered to be moderately accurate in terms of classification between stabilizing, neutral or destabilizing mutations [35, 36] and in terms of correlation between experimental and computed ΔΔG values [37, 38]. FoldX [39] ranked among the best-performing methods [40] and has been applied with success to many biological questions. Experimental validation of FoldX predictions, along with a comparison to existing experimental data, also supports the notion that this tool is valuable for mutation classification, design and prediction [34, 40–45].
The FoldX method predicts changes in free energy of folding upon mutation as the difference between the estimated free energy of folding of the mutant and the reference wild-type (WT) variants. This is done by using an empirical free energy function, which includes different energetic terms for main-chain and side-chain contributions. Some of these terms are weighted by a coefficient, which was determined using a fitting procedure over a database of free energy differences from single-point mutants, meaning that FoldX free energies are not necessarily physical. However, the calculated differences are comparable to those identified experimentally. The energy function is implemented in the software suite, FoldX, which is free for academic users. The suggested FoldX mutation protocol includes a repair step, in which an energy optimization is performed that removes interatomic clashes or optimizes the initial conformation of side chains. Next, a mutation step is performed (BuildModel) in which the model of the mutant variant and corresponding WT models are generated. Finally, the respective energies are calculated and subtracted to obtain the final difference in free energy upon mutation. FoldX can also estimate the free energy of interaction between pairs of molecules in the structure through the AnalyseComplex command. FoldX supports the twenty canonical amino acids, a few post-translationally modified ones and other chemical moieties such as nucleic acids and metal ions.
The accuracy of FoldX has been independently investigated with SDs reported in the range of 1.0–1.78 kcal mol−1 between the FoldX ΔΔGs and experimental data, and an average accuracy of 0.69 [40]. Recent benchmarking and thorough analysis of FoldX and other prediction methods highlighted possible sources of bias for the predictions, most likely due to the limited or unbalanced dataset used to parametrize such methods. One of the most discussed is the bias toward identifying destabilizing mutations [46–48]. Nonetheless, among the methods tested for systematic bias, FoldX was in the top five least biased methods [48]. The bias can be alleviated through a backbone relaxation step [47]. Other factors such as the difference in volume between WT and mutant residue types can be an additional source of bias [48]. A recent investigation comparing several prediction methods, including FoldX, pointed out that mutation type (i.e. the type of the specific WT and mutant residue considered) is also a significant source of bias. In addition, the same study pointed out that such methods perform better in predicting neutral or destabilizing mutations than for stabilizing or strongly destabilizing mutations [49].
Moreover, FoldX is sensitive to the input structure. Therefore, as a solution, the calculations could be done on multiple conformations of the same protein [49]. In this regard, a modified protocol in which ensemble averages of free energies predictions calculated on conformers from molecular dynamics (MD) simulations has been proposed [44], with the observation that, while predictions performed on the single MD conformation had relatively poor correlation with experimental ΔΔG, using ensemble average predictions restored the correlation and reduced the spread of the ΔΔG values. This indicates that ensemble averaging in FoldX constitutes a viable line of research that could improve the results obtainable by the software.
The energy functions mentioned above, and FoldX in the first place, are, in general, promising approaches since they, in principle, allow for a fast but yet semiquantitative estimate of the changes in stability and at the interaction interface for other biological partners of all possible mutations in a protein structure. Despite their shortcomings, they still capture a correct trend in their prediction and are useful to obtain a large-scale picture of a protein’s mutational landscape. Nevertheless, elegant computational solutions to design flexible and comprehensive pipelines to carry out in silico deep mutational scans in a high-throughput manner are still missing, along with the extension of these functions to a systematic study of protein ensembles taking advantage of parallelization on modern multicore hardware.
Results
Design and implementation
Overview on MutateX
MutateX includes a wrapper in Python for the FoldX suite accompanied by plotting and post-processing Python scripts. The main program implements an in silico saturation mutagenesis protocol for proteins and protein complexes that uses the popular FoldX program to predict the effect of mutations. It allows calculating the free energy change associated with the systematic substitution of each protein residue to any of the natural amino acids or post-translationally modified residues, such as phospho-residues. MutateX automates and streamlines the mutation process, making running a full mutational scan straightforward, even for several conformers of the same protein. From a single three-dimensional (3D) protein structure or an ensemble of structures, MutateX provides the researcher with textual and graphical representations of the mutational data to explore and make sense of the large amount of information derived from such an exhaustive scan. This overview is possible through visualization tools, which produce publication-ready figures.
At the heart of the MutateX software lies a pipeline engine that handles input preparation, parallel runs of available programs able to estimate the free energy, coupled with data gathering and visualization tools. Currently, the supported software for calculating free energies is the FoldX Suite to determine the energetic effect of point mutations, along with the binding energies associated with the formation of multimeric protein complexes. MutateX is compatible with FoldX Suite versions 4 and 5 [50].
The MutateX protocol
In its most straightforward applications, the MutateX main wrapper requires as an input a single file with the coordinates of a protein 3D structure in the protein data bank (PDB) format. Moreover, the user should provide the FoldX running input files, for which templated are included by default in the MutateX distribution. The tool also provides templates for the other calculations are used by FoldX, such as Repair, BuildModel and (optionally) AnalyseComplex, as detailed in the user guide. Advanced users are invited to create their own templates or modify the existing ones as they see fit.
The MutateX software implements the protocol summarized in Figure 1. First, the main MutateX program checks that the input PDB files can be parsed correctly (i.e. they are in the proper format) and saves one file per model in the PDB format. Next, MutateX prepares and performs a Repair FoldX run on each model. For each repaired model, MutateX generates all the directories and files required to perform the mutational scan, including structure files, run files for FoldX and lists of mutations. Before attempting mutation, the software checks that all the input protein models have the same primary sequence. All substitutions on a given site are treated as a single independent FoldX run, making it possible to run several instances of the FoldX software simultaneously, which is necessary to take advantage of modern multicore processors and effectively reduce the execution time of the whole pipeline. MutateX takes care of scheduling the FoldX runs, keeping track of their status, and checking the output files. During the preparatory step, the software automatically recognizes if two or more identical protein chains are present in the input structure files and prepare the mutation runs accordingly, by mutating all the chains having the same sequence at the same time. This is the only instance in which multiple mutations are performed, meaning in all other cases only one mutation at the time is considered. It also supports setting the number of runs per mutation, so that cases in which more local structural variability is expected can be taken into account. Once all the mutation runs have been performed, MutateX reads the output of the FoldX runs and summarizes the estimated average free energy differences, the associated SDs and minimum and maximum values in separate text files for each residue. The MutateX processing tools can be used on these files to obtain representations of the effects of the mutations, such as probability density plots of ΔΔG values calculated with Kernel density estimation (ddg2density), heatmaps (ddg2heatmap), histograms relative to single mutation sites (ddg2histo), logo plots (ddg2logo) and per-site distributions of ΔΔGs in different representations (ddg2distribution). The user can specify custom labels by filling in a template generated using the pdb2labels tool for plotting. The ddg2dg tool applies systematic linear corrections to the calculated ΔΔG values. The ddg2excel tool converts the dataset to Microsoft Excel-compatible files. PDB files with relevant information about the mutagenesis results can be stored in the B-factor field (ddg2pdb). These files can be further visualized and processed using protein structure viewers, such as PyMOL (https://pymol.org). The same data can also be written as the diagonal of a residue matrix file, compatible with the xPyder plugin [51]. The ddg2summary tools generate ad hoc reports on specific mutations. Most of these tools support extensive plotting options, the most important of which is the range of ΔΔG values to be considered. In fact, according to our experience, the ΔΔGs predicted by FoldX can go up a few tens kcal mol−1. Therefore, if the entire range of predicted ΔΔGs is considered, identifying smaller but still meaningful differences becomes difficult. We suggest limiting the range of ΔΔG values between −3.0 and 5.0 kcal mol−1, as this is the range in which most mutations were experimentally found (see Results for details). ΔΔG values below or above this are likely to be overestimated by the method.

Scheme of the workflow implemented in MutateX, as described in the Methods section.
Self-mutations, i.e. those mutations that replace a residue with itself, can also be performed independently with respect to the rest of the protocol and are especially useful as a sanity check of the method for every single mutation site, in each structure. In fact, the free energy difference calculated when mutating a residue to itself acts as a lower bound of the prediction error, as we would generally expect ΔΔG close to 0 for such a mutation.
MutateX allows customizing some aspects of the calculations. For instance, it is possible to provide the list of target residue types that should be considered in the scan, and to include multiple models from one or more PDB files. Indeed, MutateX supports the saturation scan and the free energy calculations over structural ensembles, an approach that in principle might mitigate the inaccuracies associated with the limited conformational variability that FoldX allows for during modeling of the mutant variants. Furthermore, the software can introduce simultaneously each mutation on more than one chain in the case homo-dimers or homo-multimers are present in the PDB file. Finally, MutateX is designed to handle, calculate and plot the changes of free energies of binding between complexes upon mutation, both considering homo-multimers and hetero-multimers.
Performance
Most of the run time for the MutateX scan is taken by the runs of the ΔΔG estimation engine, as the operations of preparation and post-processing performed by MutateX are computationally inexpensive. Therefore, the performance of MutateX is tied to that of the underlying mutation engine - FoldX in this case. MutateX performs a complete mutational scan of a 300-residues protein to all the 20 natural variants in less than a day using four cores for parallel execution, which can routinely be done on a desktop class machine (Figure 2). As the FoldX runs performed by MutateX are completely independent of one another, the execution time scales roughly linearly with the number of cores as well as with the number of input structures per run. For instance, considering the p53 MD ensembles below (199 residues), the execution time using 10 processor cores was 51.14, 103.40 and 203.6 h for 25, 50 and 100 frames, respectively. With 32 processor cores, 100 frames can also be processed in approximately 72.5 h.

Scaling behavior of MutateX with increasing number of residues. These run times have been obtained using MutateX in combination with FoldX Suite version 4.
Results
Selection of a cutoff for truncation of ΔΔGs values for plotting purposes
The predicted FoldX ΔΔGs could reach up a few tens of kcal mol−1 in some cases, making visualizing a complete mutational scan challenging due to scale effects, for instance, when using heatmaps. Using large scales would mask smaller but meaningful differences, giving more importance to very large energy differences that are likely to be overestimations. Therefore, we, sought to identify a range of free energy differences in which small differences are considered meaningful, while smaller or larger values than that can be flattened out as largely destabilizing or stabilizing. To this extent, we have used the ProTherm database [52] to check how experimentally derived ΔΔG values are distributed, after filtering out those values for which an energy unit was not specified in the database. Figure 3 shows that most mutations fall within the −3.0 to 5.0 kcal mol−1 range, which can be used as a cutoff for truncation of ΔΔG values when visualizing MutateX mutation results, which correspond roughly to the second and 98th percentile, respectively. We recommend using these values as range limits for visualization and plotting purposes.
![Distribution of experimental ΔΔG values, upon sign change, as found in the ProTherm database [52] for which a unit was specified.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bib/23/3/10.1093_bib_bbac074/1/m_bbac074f3.jpeg?Expires=1749174564&Signature=stBz9VBYWa6PnbyMEeUt-TnxYzbdtaAD83dgIzYo1R0Xpf7B3NTmK~jAhEULL4mTaM1UpVz1X8vAUGwsYmPShxfiq-XlyDET4G7pTLBQ2Q7dCBhy5YNBt9NfcGB0W7ielAzvHF5KdGQXx~RXUqe5foBXWjP9mly1~aHhXnuhn-9-bH591tJkDtXf~lUcMNLDfvqZ6I5UPHsvyQuh9WEhgUffAJPVfR8hJA0OggcR1L4K4oYbAJBpTwGmRFuhe551X~qDwZdgL6anJRHc-mOilh8bEfJlZHFSB3oSfo-xxNv1~wvJaKGZVAcwEc2Vj2PTmy1CBuIWzA7E127qT~vgcA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Distribution of experimental ΔΔG values, upon sign change, as found in the ProTherm database [52] for which a unit was specified.
Saturation mutation scan of the cancer-related protein phosphatase and tensin homolog (PTEN): relationship between protein stability and disease
PTEN is a tumor suppressor with homology to tyrosine phosphatases and the cytoskeletal protein tensin [53]. The PTEN monomer consists of two major functional domains, i.e. the phosphatase domain (protein tyrosine phosphatase (PTP), residues 14–185) and the C2 domain (residues 190–285). PTEN is involved in regulating cellular growth and suppresses cell migration and invasion [54]. Mutations in PTEN have been identified in various human cancers [55]. Most reported missense mutations of PTEN are located within the exon encoding the PTP domain.
PTEN has been used in a study in which other methods have been used to predict changes in free energy upon mutations and then compared to multiplexed experiments [56]. The authors found that the loss of PTEN stability was a driving factor for disease-causing mutations, with 60% of the pathogenic variants causing the loss of function because the protein is destabilized and degraded. PTEN is thus a useful case study to illustrate the possible application of the scan for changes in stability with the MutateX protocol. Indeed, we performed a saturation mutagenesis scan of PTEN to identify the most damaging mutations and relate them with cancer data publicly available.
We employed the most complete structure of PTEN available at the time of analyses (PDB ID 1D5R [57]). Notably, the PDB entry lacks atomic coordinates for residues 286–309 in a disordered region—however, given the local sampling carried out by FoldX, the missing residues are likely to affect only residues in the proximity of the loop. We did not model the missing region since this calculation serves only for illustrative purposes.

Saturation mutagenesis scan of the tumor suppressor PTEN by MutateX. The heatmap (A) and box plot (B) of the mutational scan corresponding to the residues 64–113 is shown as an example. It includes the folding ΔΔGs estimated for the mutation of every mutation site (x-axis) to every natural amino acid (y-axis) using FoldX. The values of free energy difference in A have been limited to the range −3 to 5 kcal mol−1, meaning that values above 5 or below −3 kcal mol−1 are shown as the extremes of the range. In panel D, the average ΔΔG for the mutation of each residue to all the 20 natural residue types is shown; then the ten sites with the largest average ΔΔG are labeled. The same mutations are shown on the structure of PTEN in panel C, produced with PyMOL.
The results from the scan were visualized exploiting different post-processing tools available within MutateX. We first analyzed the results using heatmaps (produced by ddg2heatmap) and boxplots (ddg2distribution), which provide a comprehensive overview of the free energy changes upon mutations along the protein sequence. In particular, the ddg2distribution tool has been designed to produce different representations of the obtained ΔΔGs on a per-residue distribution basis. The heatmap and box plots are shown in Figure 4A and B with changes of ΔΔGs in the region 64–113 of PTEN, corresponding to the middle part of the phosphatase domain. The results may be classified into the following scenarios:
(i) the ΔΔG is not influenced by the substitution of the WT amino acid. An overall lack of change in ΔΔGs at a certain site indicates that the WT residue is not essential for protein stability. This is often the case when residues are located at the protein surface, or in the flexible region (such as D77) or if they lack interactions with neighboring residues.
(ii) The ΔΔG increases upon mutations of the WT residue for a certain subset of other residues. This subset is likely to have something in common, e.g. residues are hydrophobic, hydrophilic, charged, small/large or have an aromatic side chain. Substituting a small uncharged residue with another similar one is unlikely to alter protein stability, whereas substitutions to large aromatic side chains could have a marked effect (such as in the case of N69).
(iii) The ΔΔG increases regardless of what the WT amino acid is mutated to, indicating that the native residue may play an essential/specific role in preserving the protein structure (such as the example of P95 and P96).
(iv) The ΔΔG decreases when the WT residue is mutated to one or a subset of other residues—e.g. the substitutions result in a more stable protein variant. An observed decrease in ΔΔG is expected to be significantly more modest in size and observed less often as protein structure/stability is evolutionarily optimized (for example the substitution of residue Q with Y or F at position 87).
Residues with largest average ΔΔG annotated with information from cBioPortal
Residue position . | Wild-type amino acid . | Annotated mutant amino acid . | Number of missense annotations . | Cancer type . | Location within PTEN structure . |
---|---|---|---|---|---|
83 | C | NA | NA | NA | Phosphatase domain |
95 | P | S, L | 5 | Glioblastoma multiforme, uterine endometrioid carcinoma, colorectal adenocarcinoma and mixed germ cell tumor | Phosphatase domain |
96 | P | A, L, S, T | 4 | Uterine endometrioid carcinoma, chromophobe renal cell carcinoma, glioblastoma multiforme and uterine endometrioid carcinoma | Phosphatase domain |
127 | G | E, R, V | 15 | Lung adenocarcinoma, glioblastoma multiforme, hepatocellular carcinoma, cutaneous melanoma, uterine endometrioid carcinoma, colorectal adenocarcinoma, ampullary carcinoma, teratoma, uterine clear cell carcinoma and uterine serous carcinoma | Phosphatase domain |
132 | G | A, C, D, S, V | 23 | Solitary fibrous tumor, uterine carcinosarcoma, glioblastoma multiforme, cutaneous melanoma, colorectal adenocarcinoma, lung adenocarcinoma, astrocytoma, uterine endometrioid carcinoma, esophageal squamous cell carcinoma, medulloblastoma, prostate adenocarcinoma and rectal adenocarcino | Phosphatase domain |
151 | A | G, T, D, V | 4 | Breast invasive lobular carcinoma, colorectal adenocarcinoma, uterine carcinosarcoma and glioblastoma multiforme | Phosphatase domain |
165 | G | E, R | 15 | Small cell lung cancer, renal non-clear cell carcinoma, medulloblastoma, cutaneous melanoma, glioblastoma multiforme, rectal adenocarcinoma and uterine endometrioid carcinoma | Phosphatase domain |
204 | P | L, R, S | 7 | Melanoma, glioblastoma multiforme, uterine endometrioid carcinoma, uterine carcinosarcoma, colorectal adenocarcinoma and cutaneous melanoma | C2 domain |
213 | P | R | 1 | Prostate adenocarcinoma | C2 domain |
328 | A | NA | NA | NA | C2 domain |
Residue position . | Wild-type amino acid . | Annotated mutant amino acid . | Number of missense annotations . | Cancer type . | Location within PTEN structure . |
---|---|---|---|---|---|
83 | C | NA | NA | NA | Phosphatase domain |
95 | P | S, L | 5 | Glioblastoma multiforme, uterine endometrioid carcinoma, colorectal adenocarcinoma and mixed germ cell tumor | Phosphatase domain |
96 | P | A, L, S, T | 4 | Uterine endometrioid carcinoma, chromophobe renal cell carcinoma, glioblastoma multiforme and uterine endometrioid carcinoma | Phosphatase domain |
127 | G | E, R, V | 15 | Lung adenocarcinoma, glioblastoma multiforme, hepatocellular carcinoma, cutaneous melanoma, uterine endometrioid carcinoma, colorectal adenocarcinoma, ampullary carcinoma, teratoma, uterine clear cell carcinoma and uterine serous carcinoma | Phosphatase domain |
132 | G | A, C, D, S, V | 23 | Solitary fibrous tumor, uterine carcinosarcoma, glioblastoma multiforme, cutaneous melanoma, colorectal adenocarcinoma, lung adenocarcinoma, astrocytoma, uterine endometrioid carcinoma, esophageal squamous cell carcinoma, medulloblastoma, prostate adenocarcinoma and rectal adenocarcino | Phosphatase domain |
151 | A | G, T, D, V | 4 | Breast invasive lobular carcinoma, colorectal adenocarcinoma, uterine carcinosarcoma and glioblastoma multiforme | Phosphatase domain |
165 | G | E, R | 15 | Small cell lung cancer, renal non-clear cell carcinoma, medulloblastoma, cutaneous melanoma, glioblastoma multiforme, rectal adenocarcinoma and uterine endometrioid carcinoma | Phosphatase domain |
204 | P | L, R, S | 7 | Melanoma, glioblastoma multiforme, uterine endometrioid carcinoma, uterine carcinosarcoma, colorectal adenocarcinoma and cutaneous melanoma | C2 domain |
213 | P | R | 1 | Prostate adenocarcinoma | C2 domain |
328 | A | NA | NA | NA | C2 domain |
Residues with largest average ΔΔG annotated with information from cBioPortal
Residue position . | Wild-type amino acid . | Annotated mutant amino acid . | Number of missense annotations . | Cancer type . | Location within PTEN structure . |
---|---|---|---|---|---|
83 | C | NA | NA | NA | Phosphatase domain |
95 | P | S, L | 5 | Glioblastoma multiforme, uterine endometrioid carcinoma, colorectal adenocarcinoma and mixed germ cell tumor | Phosphatase domain |
96 | P | A, L, S, T | 4 | Uterine endometrioid carcinoma, chromophobe renal cell carcinoma, glioblastoma multiforme and uterine endometrioid carcinoma | Phosphatase domain |
127 | G | E, R, V | 15 | Lung adenocarcinoma, glioblastoma multiforme, hepatocellular carcinoma, cutaneous melanoma, uterine endometrioid carcinoma, colorectal adenocarcinoma, ampullary carcinoma, teratoma, uterine clear cell carcinoma and uterine serous carcinoma | Phosphatase domain |
132 | G | A, C, D, S, V | 23 | Solitary fibrous tumor, uterine carcinosarcoma, glioblastoma multiforme, cutaneous melanoma, colorectal adenocarcinoma, lung adenocarcinoma, astrocytoma, uterine endometrioid carcinoma, esophageal squamous cell carcinoma, medulloblastoma, prostate adenocarcinoma and rectal adenocarcino | Phosphatase domain |
151 | A | G, T, D, V | 4 | Breast invasive lobular carcinoma, colorectal adenocarcinoma, uterine carcinosarcoma and glioblastoma multiforme | Phosphatase domain |
165 | G | E, R | 15 | Small cell lung cancer, renal non-clear cell carcinoma, medulloblastoma, cutaneous melanoma, glioblastoma multiforme, rectal adenocarcinoma and uterine endometrioid carcinoma | Phosphatase domain |
204 | P | L, R, S | 7 | Melanoma, glioblastoma multiforme, uterine endometrioid carcinoma, uterine carcinosarcoma, colorectal adenocarcinoma and cutaneous melanoma | C2 domain |
213 | P | R | 1 | Prostate adenocarcinoma | C2 domain |
328 | A | NA | NA | NA | C2 domain |
Residue position . | Wild-type amino acid . | Annotated mutant amino acid . | Number of missense annotations . | Cancer type . | Location within PTEN structure . |
---|---|---|---|---|---|
83 | C | NA | NA | NA | Phosphatase domain |
95 | P | S, L | 5 | Glioblastoma multiforme, uterine endometrioid carcinoma, colorectal adenocarcinoma and mixed germ cell tumor | Phosphatase domain |
96 | P | A, L, S, T | 4 | Uterine endometrioid carcinoma, chromophobe renal cell carcinoma, glioblastoma multiforme and uterine endometrioid carcinoma | Phosphatase domain |
127 | G | E, R, V | 15 | Lung adenocarcinoma, glioblastoma multiforme, hepatocellular carcinoma, cutaneous melanoma, uterine endometrioid carcinoma, colorectal adenocarcinoma, ampullary carcinoma, teratoma, uterine clear cell carcinoma and uterine serous carcinoma | Phosphatase domain |
132 | G | A, C, D, S, V | 23 | Solitary fibrous tumor, uterine carcinosarcoma, glioblastoma multiforme, cutaneous melanoma, colorectal adenocarcinoma, lung adenocarcinoma, astrocytoma, uterine endometrioid carcinoma, esophageal squamous cell carcinoma, medulloblastoma, prostate adenocarcinoma and rectal adenocarcino | Phosphatase domain |
151 | A | G, T, D, V | 4 | Breast invasive lobular carcinoma, colorectal adenocarcinoma, uterine carcinosarcoma and glioblastoma multiforme | Phosphatase domain |
165 | G | E, R | 15 | Small cell lung cancer, renal non-clear cell carcinoma, medulloblastoma, cutaneous melanoma, glioblastoma multiforme, rectal adenocarcinoma and uterine endometrioid carcinoma | Phosphatase domain |
204 | P | L, R, S | 7 | Melanoma, glioblastoma multiforme, uterine endometrioid carcinoma, uterine carcinosarcoma, colorectal adenocarcinoma and cutaneous melanoma | C2 domain |
213 | P | R | 1 | Prostate adenocarcinoma | C2 domain |
328 | A | NA | NA | NA | C2 domain |
To determine which of the 351 PTEN amino acids were most important for protein stability, we have used ddg2distribution to plot the per-site average ΔΔG over all the residues of PTEN. We have further annotated with residue names the ten positions with the largest average ΔΔG (Figure 4C and D). On average, positions 83, 95, 96, 127, 132, 151, 165, 204, 213 and 328 featured the more pronounced effects on PTEN stability upon mutation. The top 10 residues are highlighted with dots/sticks on the 3D structure of the protein reported in Figure 4C. The mutation predicted to be most sensitive to substitution is the glycine residue at position 127, which is in a loop right before the third α-helix in PTEN. Glycine residues located in disordered regions such as loops provide a certain degree of flexibility, which is important for the proper functional breathing motions of the protein [29]. It should also be noted that, since FoldX does not relax the protein backbone upon mutation, replacing a residue without side chain with larger ones can result in steric clashes that artificially increase the obtained ΔΔG values.
PTEN functions as a tumor suppressor in many cancer types. Thus, we extracted 1116 missense cancer-associated mutations of PTEN using the cBioPortal database [9] (August 2019)—a comprehensive collection of cancer mutations identified by genomic initiatives of high-throughput profiling of samples from cancer patients. Intriguingly, eight out of ten of the top ten candidates identified in the MutateX scan were annotated in CBioPortal and are reported in Table 1. Furthermore, G127, G132 and G165 were among the most frequent mutation sites of PTEN in cancer patients.

Saturation mutagenesis for the p53 DBD tetramer. (panels A and C) Heatmaps of interaction ΔΔG for p53 generated using MutateX. Only the residues that take part in the interaction between the different chains are shown. On the left hand side (panels A and B), the difference interaction energies are between chains A and B are shown, while the right hand side panels (panels C and D) are for chains A and D. Heatmaps for both cases (panels A and C) have been generated using the ddg2heatmap tool, white a violin plot (panel B) and a stem plot (panel D) have been generated using ddg2distribution. Each point in the stem plot shows the average difference in interaction energy of that mutation site on the 20 residue types it has been mutated to, with the SD shown as a vertical bar.
Saturation mutagenesis to identify hotspot residues at the binding interface for tetramerization of p53
The tumor suppressor p53 is known as the guardian of the human genome. It is a stress response transcription factor involved in different cellular functions such as cell cycle arrest, senescence and apoptosis [58]. Mutations in the p53 gene occur at high frequency in many cancers [59–61]. p53 exerts its function as a transcription factor by binding to DNA through its central folded DNA-binding domain (DBD, residues 91–289), which is also able to bind other proteins and acts as a signal integration hub [58, 62, 63]. The activity of p53 depends on its conformation. P53 is active in a tetrameric (dimer of dimers) conformation required for its ability to bind with high affinity to DNA and interact with various proteins [58]. Formation of the tetramer happens through a specialized tetramerization (TET) domain; nonetheless, the DBD of the four p53 units in the tetramer still interact with one another [64]. Many of the cancer-related p53 mutations are in the DBD and have destabilizing effects, but there are also mutations that do not fall within this class. To properly understand the effects of mutations on p53 and their roles in cancer, it is thus essential to identify hotspot mutations that could induce structural changes at the binding interface for TET of p53.
Here, we used MutateX to study the effects of mutations on the binding interfaces between DBDs in the p53 tetramer. We performed a saturation mutagenesis scan of p53 tetramer, mutating each residue to all the 20 natural amino acids, using the X-ray structure of the p53 DBD as a self-assembled tetramer (PDB ID 3KMD) [65] and calculated the differences in binding energies upon mutation. As the structure contains four protein chains with identical sequences, MutateX automatically recognized it as a tetramer and performed the same mutation of each chain for every run. The software then calculated differences in binding energy considering every possible combination of protein chains (A–B, B–C, C–D, A–D, A–C and B–D). The p53 dimer of dimer has a planar structure shaped like a parallelogram, with no contacts between monomers A–C and B–D. Interface A–B connects two p53 monomers into a dimer (dimer interface), while interface A–D connects two dimers into the tetramer (dimer–dimer interface). We here considered only differences in free energy of binding between pairs A–B and A–D, as the other two interfaces are identical to these. We first plotted a heatmap of binding ΔΔGs using ddg2heatmap, which provides a comprehensive visual representation of the free energy changes, limiting the plotted values between −3.0 and 5 kcal mol−1 (Figure 5A and C). We also plotted the per-site ΔΔG distributions using ddg2distribution, obtaining either a violin or a stem plot (Figure 5B and D). The sites with mutations causing marked changes in binding free energy between chains A and B are in the regions C176-R181 and S241-C245 (Figure 5A). These correspond to the dimerization interface of the two p53 chains A and B, which include part of L2 and L3, close to where the zinc-binding site is located. Residues that were previously identified as important for the interaction interface, such as P177, H178, E180, M243 and G244 and surrounding residues, featured the most destabilizing effects in terms of binding ΔΔG when mutated. This is the case of mutations of residue C176 to bulky aromatic residue (F, W and Y) or charged ones (K, R and H) or of mutation of P177 to S,T, G and R. Replacing key residues on the other side of the binding interface is predicted to be damaging in general, given the high ΔΔG values for mutations at M243 and G244.
Next, we analyzed the interaction surface between chains A and D, which corresponds to the dimer–dimer interface. The interface is extended and includes two patches of residues, formed by loops connecting the β-stands, the DBD N-terminal tail and loop L2. One patch of this interface (patch I) is described as formed by L93, S94, S95, S166, Q167, T170 and F212 of monomer A and L201, and T140, E198, G199, M200, R202 and H233 of monomer D. Of these residues, L93, Q167, G199, L201 and H233 included the most destabilizing mutations according to our scan. G199 is deeply buried within patch 1 and faces T170 directly; its mutation to bulky amino acids (such as F, W, Y, H, but also D and E) is found to be highly destabilizing. It should be noted that FoldX does not rearrange the protein backbone after side-chain replacement and remodeling, meaning the mutation might be predicted as more destabilizing than it would be if the whole structure could be fully relaxed. Patch 2 is smaller and composed of residue Q100 and K101 on monomer A and E224 and V225 on chain D. Of these potential mutation sites, E224 is predicted to be the most sensitive, with most mutations predicted to be destabilizing.
Saturation mutagenesis of protein-peptide complexes: identifying binding specificities of SH2 domains
Protein PTMs play a critical role in cellular transduction. Phosphorylation is the most well-characterized PTM and consists of transferring the terminal phosphate group from adenosine triphosphate (ATP) to the hydroxyl group of serine, threonine or tyrosine residues. Protein kinases and phosphatases control the levels of protein phosphorylation, regulating several processes such as proliferation, DNA repair, programmed cell death, differentiation and immune responses [66]. Indeed, the physiological phosphorylation pattern is frequently altered in pathological conditions such as cancer [67]. Phosphorylation of short linear motifs creates docking sites recognized by binding domains, such as Src homology 2 (SH2) and others [68]. SH2 domains specifically bind pTyr-containing linear motifs and represent the largest class of pTyr-recognition domains [69]. SH2 domains have been studied with biochemical assays and high-throughput approaches, clarifying the determinants of binding specificity for different members of the SH2 family [69–72].
We used MutateX to evaluate how phosphorylations can modulate the interaction between SH2 domains and peptides, aiming at predicting the change of free energy of binding upon mutation of phosphorylated and non-phosphorylated peptides. This allowed us to understand which mutations influence the binding and to correlate our results with the experimentally known binding specificities of phosphopeptides to SH2 domains.
We performed a MutateX saturation mutagenesis scan of different SH2 containing proteins complexed with phospho-tyrosine peptides: CDPK-related protein kinase (CRK) protooncogene bound to a CRK-derived phosphopeptide (PDB ID 1JU5 [73]), GRB2 SH2 domain bound to a phosphorylated peptide (PDB ID 1JYR [74]) and SHP2 phosphatase complexed with a with RLNpYAQLWHR peptide (PDB ID 3TL0 [75]). We selected these complexes because of their known heterogeneous binding specificities [69, 71] and because 3D experimental structures were available.
We compared our results with the target consensus described in Tinti et al. [71], in which high-density peptide chip technology has been used to define the binding specificity of 70 different SH2 domains to pTyr peptides. Results from this article indicate that: (i) CRK SH2 domain has a strong preference for Proline in position +3 (from phospho-tyrosine), (ii) GRB2 SH2 domain has a strong preference for Asparagine in position +2 and (iii) N-terminal SH2 domain of SHP2 has a preference for hydrophobic residues at positions −2, +1 and +3.
Following the MutateX run, we analyzed the results by using the post-processing scripts from the MutateX suite of tools. At first, we used ddg2distribution to analyze the site averages of all the calculated ΔΔGs focusing our attention on the phosphopeptides. This tool was used to identify the residue positions that if mutated were impacting more on the binding free energy states of the complex. (Figure 6, center). Next, we used the ddg2heatmap tool to generate heatmaps of binding interface ΔΔGs, including all the 20 naturally occurring amino acids (Figure 6, left). This tool provided an immediate graphical representation of the residues, which were either more or less perturbing the interaction between the SH2 domains and the phosphopeptides, and gave more details on which substitutions are more likely to affect the interaction. We also used the ddg2histo tool to plot in graph bars the binding interface ΔΔGs for each residue in the phosphopeptides from positions −2 to +3 (Figure 6, right).

Changes in binding free energy in complexes of SH2-peptide. For each of the SH2 domain complexes, we have plotted the calculated ΔΔG for the mutation of key residues of the phosphopeptide. These are shown as heatmaps, generated by ddg2heatmap, box plots, generated by ddg2distribution. The ΔΔG of single key mutation sites are then shown more in detail as histograms, generated using the ddg2histo tool.
As it is clearly shown in Figure 6, in all the SH2/phosphopeptide complexes, mutations of the pTyr residues are associated with high ΔΔGs, indicating a strong impact on the binding affinity. The substitution tyrosine/pTyr shows elevated ΔΔGs, in agreement with experimental data according to which the tyrosine phosphorylation is a prerequisite for the binding. Interestingly, the ΔΔGs were higher upon mutations in the residue positions that were shown to be mostly conserved in the different SH2 domain target consensus.
More in detail, the analysis showed that (i) CRK SH2 domain displays a strong preference for Proline residue in position +3, while all the other mutations affect the interaction, except for branched-chain amino acids and Met, which present lower ΔΔGs; (ii) GRB2 SH2 domain shows a strong preference in position +2 for Asn, with aromatic and charged amino acids being the most disrupting substitutions (iii) SHP2 N terminal SH2 shows a moderate preference of hydrophobic residues in position −2 and +1 and a more pronounced preference in position +3 for Leu and Met.

Performance of MutateX predictions when ran on p53 DNA-binding domain molecular dynamics ensembles. The plots illustrate the correlation between available experimental ΔΔG values and MutateX predictions. We performed MutateX scans on the original X-ray experimental structure, the starting structure of an MD simulation and MD simulation ensembles of 25, 50 and 100 frames extracted from the same MD trajectory. For the ensemble runs, we show the average ΔΔG values calculated over independent FoldX runs performed on each ensemble structure. The horizontal lines show the SD of experimental measurements when multiple values were found for any given mutation, while vertical lines account for the SD of the predicted values on the ensemble.
Our predictions were consistent with experimental data which defined the SH2 domains consensus. Summarizing, this example clearly indicates that MutateX is a useful tool for the prediction of binding specificities and protein–peptide interaction analysis.
Comparison of MutateX-predicted changes in free energy for the p53 DBD in experimental structures and ensembles from molecular dynamics
As discussed in the introduction, FoldX estimates changes of free energy upon mutation by side-chain replacement, keeping the protein main-chain rigidly fixed. This, joined with the observation that the algorithm is quite sensitive to the input structure, led to the suggestion that performing mutational scans on ensembles of conformations, either from experiments or simulations, might be useful to improve the predictions. Indeed, MutateX supports performing mutational scans on multiple structures as a single job. It also provides the average ΔΔG values independently predicted in each structural scan. Here, we have used the p53 DBD as a case study and performed saturation mutagenesis scans on conformational ensembles from an MD simulation as well as the starting structure of such a simulation and the corresponding crystal structure (PDB ID 2XWR [76]). The ensembles were extracted as PDB files from the same 500 ns MD trajectory of p53 DBD [77], equally spaced as to obtain ensembles of 100, 50 and 25 frames, meaning the 100-frames ensemble is a perfect superset of the other two. Each PDB file was then used as input for MutateX directly. We then selected a set of mutations of the p53 DBD whose experimentally determined ΔΔG values were available in ThermoMutDB [78] and computed the Pearson correlation coefficient r between this dataset and the corresponding ensemble-averaged ΔΔG values from our mutational scans.
As shown in Figure 7, we obtain a good correlation (r = 0.71) between experimental and predicted free energy changes when using the experimental structure. Using just a single structure from MD we obtain a significantly worse correlation (r = 0.47), while using our MD ensembles results in a small improvement in the correlation (r = 0.75–0.77), with 50 frames being the best choice for this case. It should be noted that the difference in correlation is mostly due to a single outlier (R175H, ΔΔG = 9.9 kcal mol−1 for the X-ray structure scan); removing it from the dataset makes r very similar for all the datasets (r = 0.75–0.76) except for the single-MD frame one (r = 0.58). These results suggest that using MD ensembles of multiple structures was able to improve our prediction and make the analysis more robust.

ΔΔG of interaction calculated by MutateX saturation scans of the Spike protein belonging to the WT and Omicron variants. Panels on the left and right refer to the WT and Omicron structures, respectively. We report values for residues of the Spike protein within 6 Å from ACE2 in both structures. Positive ΔΔG values indicate a detrimental effect for binding. Boxplots (top) show the distribution of the predicted ΔΔG values for each site, while heatmaps (bottom) illustrate the predicted ΔΔG of interaction in kcal mol−1 for the spike residues for each mutation.
Identification of variants in the Coronavirus Disease-19 (COVID-19) spike-ACE2 complex
Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) is the pathogenic agent of the current COVID-19 pandemic. The viral spike protein mediates cell entry by first contacting molecules on the cell surface. Specifically, spike is characterized by a receptor binding domain (RBD) capable of interacting with the human receptor Angiotensin-Converting Enzyme 2 (ACE2). The recently reported Omicron (B.1.1.529) variant, which emerged in November 2021 in South Africa, has quickly become a variant of concern because of its high infectivity and immune evasion capability. Omicron spike protein is characterized by 37 mutations with respect to the original Wuhan-Hu-1 strain (WT variant here on), with 15 of them located in the RBD [79]. Omicron spike has an overall increased apparent affinity for ACE2 with respect to the WT spike and a comparable affinity with respect to the Delta variant [79]. Among the 15 mutations of the Omicron variant, previous studies including a high-throughput surface plasmon resonance (SPR) assay [80] of the WT variant has shown that some of the Omicron variant mutations mildly decrease (S371L, S373P, S375F, K417N, G446S, E484A, Q493R, G496S, Q498R and Y505H) or increase (G339D, N440K, S477N, T478K and N501Y) the binding with the RBD. A recently published cryoelectron microscopy 3D structure of the Omicron spike complexed with ACE2 has shed light on the structural details of the structural differences imparted by the mutations [79].
Here we used MutateX on the spike RBD in complex with ACE2 receptor as a case study aiming at identifying hotspot residues involved in the spike-receptor binding as well as evaluating the effect of the Omicron variant mutations. We performed both a MutateX saturation scan and a MutateX self-scan on the most complete (at the best of our knowledge) WT spike RBD in complex with ACE2 (PDB ID 6LZG [81]) and on the only currently available cryo-EM structure of the Omicron variant RBD interacting with the receptor (PDB ID 7T9L [79]). As we were only interested in the residues of the spike–ACE2 interaction surface, we performed partial mutational scans by specifying a list of mutation sites of interest as supported by the software. For this purpose, we selected in both cases all the residues of the spike RBD chain and the ACE2 residues that were at a distance lower than 12 Å from spike. We then considered for further analysis those residues of spike that were in close contact with the ACE2 (i.e. within 6 Å in both the WT and Omicron structures). We also performed a self-scan, mutating all residues to their own WT variants as quality control, finding that all the residues folding ΔΔGs were low (|ΔΔG| < 1.2 kcal mol−1) except for E406 and C432 in the Omicron scan.
Subsequently, we visualized the results of the saturation scan with different post-processing tools provided by MutateX. To gain a general overview of the effect of the spike mutations on the ΔΔG of interaction, we focused on the spike residues placed at the interface with ACE2 and we plotted heatmaps (ddg2heatmap) restricting the plotted values to a range of −3.0 to 5 kcal mol−1 and boxplots (ddg2distribution) for both the WT and the Omicron variant (Figure 8).
Our MutateX scan showed no effect on mutation on either WT or Omicron variant for a few sites of interest (339, 371, 373 and 375; see dataset on GitHub). These sites were indeed not in contact with the ACE2 protein in our structures. Overall, our scan identified several sites that were highly sensitive to mutations in terms of interaction free energy, most of them identified both in the WT and the Omicron scans (G447, Y449, L455, F456, A475, F486, N487, Y489, G496, N501, G502 and Y505). Positions 496, 501 and 505 coincide with mutation sites identified in the Omicron variant.
We have then used the ddg2summary tool to extract specifically the predicted effect on the binding of the Omicron variant mutations. Most of them have a small effect on the ΔΔG of interaction, with only a few values with |ΔΔG| > 1.0 kcal mol−1. This is consistent with SPR data that also showed small effects for most of the mutations [80]. The scan of our WT structure highlighted the slight destabilizing effect of mutations K417N (ΔΔG = 0.67 kcal mol−1) and Y505H (ΔΔG = 1.77 kcal mol−1) in agreement with experimental data. However, it unexpectedly predicted mutation G496S to be stabilizing (ΔΔG = −1.04) [80]. This is probably due to the bias inherent in replacing a small residue with a larger one. Similarly, N501Y was predicted to be highly destabilizing (ΔΔG = 5.49 kcal mol−1) while it is experimentally known to increase binding [80]. The latter result is not entirely surprising given the method’s bias toward destabilizing mutations. In addition, we found that the inverse Y501N mutation from the Omicron variant scan was also predicted to be destabilizing, in agreement with the experimental data [80]. Moreover, the stabilizing effect of this mutation was proposed to be due to pi-stacking interactions which we do not expect to be fully represented in the FoldX force field. Furthermore, it should be noted that mutations can have cooperative effects. For example, Q498R increases binding affinity to ACE2 when in combination with N501Y [82], which cannot be adequately captured by our scan.
Limitations, future directions and availability
MutateX is a collection of Python-based scripts and requires Python 2.7 and 3 or higher to be available on the system together with open-source Python libraries, whose installation is straightforward on macOS and Linux distributions. MutateX comes with a standard installation script that takes care of installing all the necessary requirements as well. It is available at https://github.com/ELELAB/mutatex. The package also includes templates for input preparation, in-depth documentation and test cases. It should be noted however that to perform the mutagenesis calculation, the most recent version of the FoldX Suite needs to be available in the system. The FoldX Suite is free of charge for academic use, and it is currently available athttp://foldxsuite.crg.eu. While MutateX currently supports the FoldX Suite only, we plan on adding support for other software packages or web servers in the future, along with providing customized support to other energy functions available in the literature and of potential interest for the user community.
The potential of MutateX is attested by recent publications where a preliminary suite of tools designed by us was employed to perform saturation mutational scans for several systems of biological interest, such as enzymes of industrial and pharmacological interest [45, 83], the Parkinson-related DJ1 protein [84], transcription factors as MZF1 [85], proteins involved in mismatch repair [41, 42] and other disease-related targets [41]. The aforementioned studies, along with the examples reported in this manuscript, provide insight into the mutational landscape of known cancer- or disease-related proteins. This was in fact an important driving force behind the development of MutateX. The wide-scale and high-throughput prediction of the effects on stability and binding of disease-related targets can help predict the effect of cancer-related missense variants found in mutational data or cancer mutation initiatives (such as catalogue of somatic mutations in cancer (COSMIC) [86]). The scans that MutateX provides can contribute to identifying mutational hotspots in cancer targets. Even in this context, however, the limitations of the tool should be taken into consideration. FoldX relies on local side-chain rearrangement, and we thus don’t expect predictions to reliably represent long-range allosteric effects, which are important in the context of different diseases [87, 88]. For similar reasons, we don’t expect predictions to fully incorporate epistatic effects due to distant mutations when estimating the combined effect of multiple mutations (e.g. concerning ‘latent driver’ mutations [89]). Moreover, predictions of changes in stability only give a partial view of the effect of the amino acid changes disregarding important functional consequences. The use of other computational tools, either using structural data or not, can complement these predictions and obtain a broader picture of the effect of cancer mutations and of the mutational landscape overall [90, 91]. For instance, tools designed to investigate protein structure networks [30, 92, 93] or to quantify allosteric modulation [94] can complement MutateX data and partially address its shortcomings.
MutateX also has a relevant technological interest as it can be used in the context of protein engineering. It is a promising tool for designing enzymes with different thermal stabilities, as FoldX has been previously used in different protein engineering studies assessing protein (de)stabilization upon mutagenesis [41–44, 95–97]. However, given the aforementioned bias of FoldX toward destabilizing mutations, we advise carefully considering its results and combining them with experiments or other more accurate predictions of free energy changes on interesting candidates. Lastly, the FoldX Suite allows runs on structures containing DNA and RNA molecules as well as other molecules such ATP and GTP, and adding their support will be one of the further developments of MutateX.
MutateX provides a powerful complementary tool in experimental biochemistry, biotechnology and molecular/cellular biology, along with the molecular understanding and helping classification of disease-causing mutations.
Automatization of in silico structure-based mutational scans for high-throughput applications
Possibility to estimate changes in both stability and binding (i.e. function) upon mutations
Possibility to apply the method to conformational ensembles of proteins and protein complexes
Different tools for visualization and analyses of the results
Author contributions
Conceived and designed the project: M.T., E.P. perform calculations and data visualization: T.T. (PTEN), M.D.M. (PTEN), I.P. (p53), E.M. (SH2), K.D. (MD ensembles) and L.B. (spike). Analyzed the data: T.T. (PTEN), M.T. (all), E.P. (all), I.P. (p53), E.M. (SH2, spike), K.D. (p53 ensembles) and L.B. (spike). Contributed analysis tools, programming: T.T., M.T., T.C. and I.P. Wrote the paper: M.T. and E.P. with inputs from all the coauthors.
Acknowledgements
Space limitation precludes the citations of many excellent works by colleagues in the fields discussed. We acknowledge these efforts here.
The authors are also extremely grateful to all the current and former Computational Biology Laboratory (now Cancer Structural Biology) members at the Danish Cancer Society Research Center for extensive testing and usage of the software, which allowed us to optimize it. We also would like to thank Kresten Lindorff-Larsen, Amelie Stein and Rasmus Hartmann-Petersen for the interest and support that their groups have demonstrated for our project, and for having used the tool for their proteins of interest. We are also grateful for the interest demonstrated by the FoldX developers in Luis Serrano’s group.
Data availability
To use for the Data Availability Section Software and data used in this publication is available at our GitHub repository https://github.com/ELELAB/mutatex.
Funding
The result of this research has been achieved using the DECI-PRACE15th and 16th HPC Grants on Archer. The research has been also carried out thanks to the access to the Danish HPC Infrastructure Computerome2. The research has been supported by grants from LEO Foundation (LF17006, LF17024), Carlsberg Distinguished Fellowship (CF18-0314), Danmarks Grundforskningsfond (DNRF125), NovoNordisk Fonden in Bioscience and Basic Biomedicine (0065262) and Biotechnology-based Synthesis and Production (17OC0027588) and Hartmann Foundation (A33877). This work is part of Interregional Childhood Oncology Precision Medicine Exploration, a cross-Oresund collaboration between University Hospital Copenhagen, Rigshospitalet, Lund University, Region Skåne and Technical University Denmark, supported by the European Regional Development Fund.
Author Biographies
Matteo Tiberti is a staff scientist at the Cancer Structural Biology group (Danish Cancer Society Research Center, DCRC, Copenhagen, Denmark) and his research focuses on software development for structural bioinformatics and annotations of cancer mutations.
Thilde Terkelsen worked as PhD student at the Cancer Structural Biology group at DCRC and she is now Data Scientist at the Center for Health and Data Science at the University of Copenhagen. At DCRC, she worked on the implementation of MutateX functions for downstream analyses and the analyses of -omics data.
Kristine Degn is a PhD student in the Cancer Systems Biology group (Department of Health and Technology, Technical University of Denmark, DTU, Lyngby, Denmark) and her research focuses on structure-based methods to characterize and classify variants found in cancer samples with impact on the protein products.
Ludovica Beltrame is an Erasmus exchange student at the Cancer Structural Biology group at DCRC. Her research is focusing on free energy calculations and molecular simulations to study the effect of mutations on protein structures.
Tycho Canter Cremers carried out his Master Studies in the Cancer Structural Biology group at DCRC, focusing on implementing new functions for MutateX and he is currently working as Bioinformatician at the Center of Medical Genetics, Antwerp.
Isabelle da Piedade worked as a Post-Doctoral Researcher at the Cancer Structural Biology Group at DCRC with a focus on analyses of genomics and transcriptomics data. She is currently working at Senior Bioinformatician at the Danish National Genome Center (Copenhagen, Denmark).
Miriam Di Marco worked as a pre-graduate fellow and master's student at the Cancer Structural Biology group at DCRC. She applied free energy calculations to characterize disease-related proteins and the impact of mutations on protein stability.
Emiliano Maiani worked as a Post-Doctoral Researcher and Senior Scientist at the Cancer Structural Biology group of DCRC. His main research focus in the group has been in the area of experimental structural and cellular studies of short linear motifs. He is currently an Adjunct Professor at UniCamillus–Saint Camillus International University of Health Sciences, Rome, Italy. His current research focuses on DNA damage response and autophagy.
Elena Papaleo is Associate Professor and leader of the Cancer Systems Biology group at DTU and group leader of the Cancer Structural Biology group at DCRC. Her research focuses on applying -omics bioinformatics and structural methods, accompanied by experimental validation with in vitro and cellular assays. These methodologies are used to unravel the effects of disease-related mutations and post-translational modifications on protein structures and to characterize protein-protein interactions mediated by disordered regions.