Abstract

We report the structure-based pathogenicity relationship identifier (SPRI), a novel computational tool for accurate evaluation of pathological effects of missense single mutations and prediction of higher-order spatially organized units of mutational clusters. SPRI can effectively extract properties determining pathogenicity encoded in protein structures, and can identify deleterious missense mutations of germ line origin associated with Mendelian diseases, as well as mutations of somatic origin associated with cancer drivers. It compares favorably to other methods in predicting deleterious mutations. Furthermore, SPRI can discover spatially organized pathogenic higher-order spatial clusters (patHOS) of deleterious mutations, including those of low recurrence, and can be used for discovery of candidate cancer driver genes and driver mutations. We further demonstrate that SPRI can take advantage of AlphaFold2 predicted structures and can be deployed for saturation mutation analysis of the whole human proteome.

INTRODUCTION

Whole-genome sequencing and whole-exome sequencing provide powerful means to assess genetic diversity among individuals [1, 2]. Among hundreds of variants in the coding regions [3–5], missense mutations have the highest frequency of occurrence and can potentially affect molecular functions upon residue substitutions [5–8]. However, majority of the variants are incidental and have limited effects on protein functions. In addition, missense mutations usually have low recurrence [9–12], and it is difficult to identify pathological mutations over a noisy background. It is also challenging to recognize the collective effects of multiple mutations of low recurrences. It is therefore important to determine whether a missense mutation leads to deleterious phenotypic changes, and whether multiple low recurrence mutations act cooperatively to affect protein function as a higher-order spatial unit.

As functions follow forms, protein structures provide rich information on how missense mutations may affect proteins functions. With rapid progress in structural biology [13–16] and breakthroughs of AlphaFold2 and RoseTTAFold in structure prediction [17, 18], over 200 million proteins now have their structures available [17, 19], allowing analysis of the whole human proteome.

However, most current methods to evaluate potential pathological effects of missense mutations are based on sequence alignment and conservation analysis [20–28], and do not take advantage of structural information [20, 21, 23, 25–28]. Some require annotated knowledge of functional domains, which is not uniformly available [20] (Table 1), others exhibit uneven performance on different datasets, indicating lack of transferability [21, 29]. In addition, it is difficult to identify individual mutational events that will lead to diseases: over 70% of mutations collected from cancer samples are predicted as deleterious by Polyphen-2 [26], while only a few mutations are believed to drive tumorigenesis and most cancer mutations are passenger mutations [9–12]. A recently developed transformer-based model, MutFormer, is able to learn novel sequence features, including long-range and short-range dependencies between amino acid mutations [22], and outperforms many other sequence-based methods.

Table 1

Performance of predictions on the UnifyPDBFull 5-fold test datasets

SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7400.9870.9870.9900.9870.9870.930
Recall0.9190.9080.8680.7550.7320.8750.8420.826
Specificity0.8260.7670.3850.7440.7030.8130.8160.926
Precision0.8890.8610.6770.8140.7880.8750.8720.947
Accuracy0.8820.8540.6740.750.7210.850.8320.864
F-1 score0.9040.8840.7610.7830.7590.8750.8570.882
MCC score0.7520.6880.2930.4920.4290.6880.6540.732
SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7400.9870.9870.9900.9870.9870.930
Recall0.9190.9080.8680.7550.7320.8750.8420.826
Specificity0.8260.7670.3850.7440.7030.8130.8160.926
Precision0.8890.8610.6770.8140.7880.8750.8720.947
Accuracy0.8820.8540.6740.750.7210.850.8320.864
F-1 score0.9040.8840.7610.7830.7590.8750.8570.882
MCC score0.7520.6880.2930.4920.4290.6880.6540.732

Bold indicates the best metric.

Table 1

Performance of predictions on the UnifyPDBFull 5-fold test datasets

SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7400.9870.9870.9900.9870.9870.930
Recall0.9190.9080.8680.7550.7320.8750.8420.826
Specificity0.8260.7670.3850.7440.7030.8130.8160.926
Precision0.8890.8610.6770.8140.7880.8750.8720.947
Accuracy0.8820.8540.6740.750.7210.850.8320.864
F-1 score0.9040.8840.7610.7830.7590.8750.8570.882
MCC score0.7520.6880.2930.4920.4290.6880.6540.732
SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7400.9870.9870.9900.9870.9870.930
Recall0.9190.9080.8680.7550.7320.8750.8420.826
Specificity0.8260.7670.3850.7440.7030.8130.8160.926
Precision0.8890.8610.6770.8140.7880.8750.8720.947
Accuracy0.8820.8540.6740.750.7210.850.8320.864
F-1 score0.9040.8840.7610.7830.7590.8750.8570.882
MCC score0.7520.6880.2930.4920.4290.6880.6540.732

Bold indicates the best metric.

Recent methods incorporating protein structural information have improved predictions of deleterious missense mutations [30–33]. Methods have also been developed that provide mapping of the spatial positions of mutation sites, facilitating identification of clusters of mutation sites close to known functional regions [34–39]. However, how to extract information encoded in protein structure to improve assessment of pathological effects of missense mutations remains problematic. It is challenging to determine the relevant factors affecting mutation effects from the vast amount of information contained in protein structures. It is also difficult to discover de novo a higher-order spatial unit of multiple deleterious mutations of low recurrence in a principled fashion without any a priori information.

Here we present a new method called SPRI (structure-based pathogenicity relationship identifier), which provides quantitative assessment of pathological effects of missense mutations, identifies deleterious mutations, and uncovers higher-order spatial units of deleterious mutational clusters. Our approach is based on the fact that biochemical functions require specific spatial arrangement of multiple residues and atoms to maintain the structural scaffold and to present a binding or functional surface with the required physico-chemical textures [40]. Mutations altering this microenvironment may affect the execution of biochemical functions. With computation of protein surface pockets [38, 41–44], novel profiles of interaction at different range, and solvent accessible surface area using alpha shapes [41, 45, 46], we integrate the structural information with evolutionary signals of the mutation sites derived from multiple sequence alignment. With further incorporation of changes in biophysical properties upon mutation, a random forest predictor is trained to provide quantitative estimation of pathological effects [47]. The geometric analysis of protein structure and pathogenicity estimation further allow us to discover rigorously defined higher-order spatial units with enriched accumulation of pathological mutations.

SPRI is among the best in predicting pathogenic missense mutations of Mendelian diseases. Using benchmark data with reduced inherent flaws of uneven representations of neutral and deleterious mutations, our method outperforms several state-of-the-art methods using a number of quality metrics (AUROC, AUPRC, MCC and F-1 score ) [20, 21, 23, 25–27, 31].

SPRI can also predict cancer driver mutations of somatic origin with high sensitivity. As our method can capture essential properties required for proper biochemical functions, its effectiveness in identifying disease-causing mutations is transferable. We show that SPRI can effectively evaluate pathogenicity of mutations of somatic origin and can identify cancer driver mutations. These predictions are made without re-training using different datasets or additional parameter adjustments. This is in contrast to current methods, which make inconsistent predictions with higher rates of false negative predictions [21, 31].

SPRI can discover structurally organized pathogenic higher-order spatial (patHOS) units of deleterious mutations, including those of low recurrence. In a large scale analysis of 239,164 cancer mutations on 2,712 human proteins, we discovered 567 patHOS units in 189 genes, each harboring a significant number (⁠|$\ge $|5) of deleterious mutations, incidictingtheir high susceptibility to cancer. Among these, 83.7 % of the mutations have |$<3$| recurrence count per each mutation. Furthermore, we uncovered 161 genes not listed in Cancer Gene Census (CGC) containing patHOS clusters, 29 of which are strong candidates of cancer driver genes as they harbor two individual patHOSs. In addition, 22 out of these 29 genes have literature evidence on their roles and relevance in cancer.

SPRI can also be used for in silico saturation mutagenesis and can predict pathological effects of systematic mutations at each site of a protein. Finally, SPRI can be deployed for global studies of mutation effects of the whole human proteome. Using a set of proteins with AlphaFold2-predicted structures, we show that SPRI can make reliable predictions on the pathological effects of missense mutations at sites whose structures are predicted at high or moderate confidence level.

Overall, we expect that SPRI can be applied broadly to assess pathological effects for mutations regardless of disease types. It can also be used to discover higher-order mutational clusters, and to provide in silico saturation mutagenesis analysis. It can be applied proteome-wide scale studies using known and predicted protein structures.

MATERIAL AND METHODS

Benchmark dataset with reduced flaws

For reliable prediction of pathological effects and deleterious mutations, a dataset that accurately reflects the natural landscape of mutations is critical. Existing benchmark datasets such as HumVar, HumDiv, ExoVar, VariBench and PredictSNP [23, 26, 48–50] have issues, as deleterious mutations and neutral mutations are often from different proteins, which may have different degree of deleteriousness depending on the number of functional domains, size of the functional surfaces, and other factors (see details in Supplemental Information (SI), Section 2). Furthermore, many proteins were not fully investigated; thus, variants lacking annotation of pathological effects may be simply a reflection of lack of knowledge, rather than positive information about these mutations being neutral. As only a small proportion of proteins possess both deleterious and neutral variants (see SI) in these datasets, inconsistency in predictions when applied to different data may result, suggesting that improved training data are needed for robustness and transferability.

We construct two data sets based on the HumDiv, HumVar and PredictSNP datasets as benchmarks [26, 50]. To ensure the same level of knowledge of both types of mutations, we select only proteins with annotations of both deleterious and neutral mutations, and have high-quality structures. The UnifyPDBFull dataset contains 4231 deleterious variants and 2791 neutral variants, which are derived from 252 single-chain proteins with full structural coverage. The UnifyPDBAcceptable dataset contains 5999 deleterious variants and 3485 neutral variants, which are derived from 377 proteins with 444 polypeptide chains with full and partial structural coverage, including the UnifyPDBFull set as a proper subset. Our datasets are smaller but have fewer flaws, providing more accurate representations of the landscape of deleterious and neutral mutations in proteins.

Structural, biophysical and evolutionary properties

Protein surface pocket and geometric location of mutated residues

Surface pockets on proteins provide the local microenvironment for ligand binding and biochemical reactions [40, 51], while interior residues may contribute to folding stability [52]. We therefore classify each residue into the categories of surface pockets residues, interior buried residues and other residues on other surface region [42, 46]. We also calculate the solvent accessible surface area of each residue. Pocket construction, surface and buried core residues, as well as solvent accessible surface areas are computed using the alpha shape method and are accessible from the Computer Atlas of Surface Topography of Proteins (CASTp) server [38, 53].

Contact profiles of short range atomic interactions

A functional amino acid residue plays its biochemical roles in conjunction with its spatial neighbors as part of a favorable microenvironment for reactions. While residue distances have been used to describe the local environment [33, 54], complex patterns in atomic interactions and side-chain packing arrangements are difficult to capture by distance-based profiles.

To capture immediate atomic interactions at the mutation site, we construct its atomic contact profiles. We obtain the contact network from the weighted alpha shape, which corresponds to the Voronoi diagram of the protein structure with non-empty nearest neighbor atomic contacts. As only those with physical contacts are found in alpha shapes [55, 56], this allows us to focus on exact atomic contacts without the distraction of noise inherent in the contacts defined by distances.

Specifically, we first computed the weighted Delanuay triangulation of the atoms of the protein structure. We then obtain its alpha shape by generating a filtration of the Delaunay simplicial complex [57, 58]. We further select alpha edges connecting different residues and obtain inter-residue atomic contact network, where nodes are atoms and alpha edges connect nearest-neighboring atoms whose volume overlaps (See SI Figure 1 in Section 3).

We consider only four atom types of C, N, O and S, distinguishing atoms provided by the mutated residue and its neighboring residue. Altogether, we have 16 types of element pairs for atomic interactions.

Interaction profiles of residues of intermediate and long range

Intermediate and long range interactions can influence the microenvironment of the mutation site residue via electrostatic interactions and indirect steric effects [59–61]. We construct interaction profiles of intermediate- and long-range at the residue level to account for such effects. Interacting residues are again identified by the atomic connection network derived from alpha shape.

Specifically, we start with residues in direct atom contacts with the mutation site, as recorded in the short-range contact profile. From these first layer contact residues. We use a Breadth-First Search (BFS) algorithm to identify the next layer of residues in atomic interactions with them [62]. These are residues of intermediate-range interactions. We apply BFS again to identify the next layer of partner residues through atomic contacts. These newly identified residues are in long-range interactions with the mutation site. We construct the intermediate- and long-range interaction profiles, each a 20-dimensional integer vector recording the number and type of residues interacting with the mutation site.

Higher-order spatial cluster

Here we define higher-order spatial (HOS) cluster rigorously based on the computed alpha shape. We take the mutation site residue as the anchor, and incorporate all residues with short-, intermediate-, and long-range interactions from the anchor as its HOS cluster. HOS clusters provide a priori defined spatial units from structures for analysis of higher-order effects of mutations.

Salt bridge interactions

Salt bridges formed between ion pairs can stabilize protein conformation and provide protons for catalytic reactions [63]. We count the total number of salt bridges that an ionizable residue participates in. The criterion is that the side-chain oxygen atoms in an acidic residue and the side-chain nitrogen atoms in a basic residue have a distance |$\le 3.2 \overset{\circ}{A}$|⁠.

Changes in biophysical properties in variants

Different substitutions occurring at the same site may have dramatically different effects. It is therefore important to consider changes in biophysical properties upon a substitution to obtain finer distinction of mutation effects of different substitution patterns at a given mutation site. We record differences in each atomic chemical type upon mutation. If Gly is involved, we record the change of |$C_{\beta }$|⁠. Here |$C_{\beta }$| is the beta-carbon atom appearing as the first atom of the side-chain in an amino acid, which is present in 19 types of amino acid residues, while missing in the Gly residue. We further record changes in charges upon mutations.

Robustness of structural properties

The structure-derived properties are robust. Using five different PDB structures of the same protein uroporphyrinogen decarboxylase, about 85% of structure-derived features are of the same value or have only small deviations, leading to the same prediction results (see SI, Section 9 for more details).

Evolutionary signals

We use Blosum62 matrix to capture general patterns of substitutions [64]. To incorporate protein-specific evolutionary information, we construct multiple sequence alignments (MSA) using the full protein sequence instead of domains [65]. We take homologs with full-length sequence identity |$\geq $| 30% in the reference proteomes of other eukaryotic species, from which we construct the MSA using CLUSTAL OMEGA [65]. Overall, 94.4% protein-specific MSAs analyzed in this study (356 out of the 377 UnifyPDBAcceptable proteins) have adequate alignment depth of |$\geq $| 100, ensuring that reliable evolutionary signals can be detected. We then calculate the entropy value of the mutation site using the aligned MSA [65]. In addition, frequencies of occurrence of the wild-type and the mutated-type residues in the MSA for each mutation site are also recorded. More details can be found in SI, Section 5.

Predicting pathological effects

Pathogenicity of missense mutation

We integrate all information extracted from structure and sequence and generate a probability score measuring the pathological effects of a substitution. We construct a random forest classifier [66] as it is robust against overfitting [47]. The prediction score |$\pi _{\text{del}}$| is calculated from the ratio of the number of trees supporting the hypothesis that the variant is pathological against the total number of trees in the forest [47]. The overall architecture of our method is shown in Figure 1. If missing prediction occurs, we generate a random score based on the uniform distribution of |$[0,1]$| for imputation (see SI, Section 1.5 for details.)

Overview of the SPRI method. (A) SPRI utilizes three types of features based on protein structures, biophysical properties and evolutionary signals, respectively. Structural features include the atomic and residue interaction profiles computed using alpha shapes, which capture the site-specific micro-environment of the mutation site, solvent accessible surface area and geometric shape at the mutation site. Features of biophysical properties include possible salt bridges formed by ionizable residues, changes in charge upon substitutions, as well as changes in side-chain and backbone atoms. To obtain features of evolutionary signals, a knowledge database containing $\sim $24 millions protein sequences from reference proteomes of eukaryotic species in UniProt is built where homologs are identified, so multiple sequence alignment can be constructed to obtain site-specific metrics of entropy, frequencies of wild-type and mutated residues. (B) These features, except geometric location, are vectorized into numerical values. A random forest classifier is then trained to generate probability measure of the likelihood of pathological effect, from which a binary classification is made. (C) SPRI can discover pathogenic higher-order spatial clusters (patHOS) of deleterious mutations. For each observed mutation site, SPRI constructs its higher-order spatial unit (HOS) based on geometric computation, and evaluate the overall pathogenecity of all mutations located in the HOS. After normalization by the size of HOS, those clusters with ($\ge $5) deleterious mutations and pathogenecity score $\ge \theta = 0.05$ are classified as pathogenic HOS (patHOS) and are distinguished from background mutational HOS.
Figure 1

Overview of the SPRI method. (A) SPRI utilizes three types of features based on protein structures, biophysical properties and evolutionary signals, respectively. Structural features include the atomic and residue interaction profiles computed using alpha shapes, which capture the site-specific micro-environment of the mutation site, solvent accessible surface area and geometric shape at the mutation site. Features of biophysical properties include possible salt bridges formed by ionizable residues, changes in charge upon substitutions, as well as changes in side-chain and backbone atoms. To obtain features of evolutionary signals, a knowledge database containing |$\sim $|24 millions protein sequences from reference proteomes of eukaryotic species in UniProt is built where homologs are identified, so multiple sequence alignment can be constructed to obtain site-specific metrics of entropy, frequencies of wild-type and mutated residues. (B) These features, except geometric location, are vectorized into numerical values. A random forest classifier is then trained to generate probability measure of the likelihood of pathological effect, from which a binary classification is made. (C) SPRI can discover pathogenic higher-order spatial clusters (patHOS) of deleterious mutations. For each observed mutation site, SPRI constructs its higher-order spatial unit (HOS) based on geometric computation, and evaluate the overall pathogenecity of all mutations located in the HOS. After normalization by the size of HOS, those clusters with (⁠|$\ge $|5) deleterious mutations and pathogenecity score |$\ge \theta = 0.05$| are classified as pathogenic HOS (patHOS) and are distinguished from background mutational HOS.

Pathogenicity of higher-order spatial cluster

The degree of pathogenicity of a HOS cluster is calculated by summing up the pathology probability scores of its residues, which is then normalized by the cluster size. We take the site of each mutation as the anchor and calculate the associated cluster pathogenicity score. Those above a threshold of |$\theta =0.05$| and harbor |$\ge $|5 deleterious mutations are considered to be pathogenic higher-order spatial cluster (patHOS). These thresholds are determined based on statistical analysis of HOSs on the cancer mutation data. Details can be found in Section 11 of SI.

RESULTS

SPRI can capture functional spatial patterns without prior information

We use human glutathione synthetase (GSS, UniProt: P48637, PDB: 2HGS chain:A) for illustration. This protein has 524 residues. There are four known deleterious variants occurring at three mutation sites. Among these, the variant R283C leads to glutathione synthetase deficiency symptom [67]. There are 10 neutral variants occurring at 10 different mutation sites. Our method provides reliable predictions on these variants without specification of the catalytic residues or any other prior annotations as shown in Figure 2: All four deleterious variants (L188P, Y270C, Y270H, R283C) are correctly predicted, 8 out of 10 neutral variants (K95E, A134T, P202T, H290C, V343M, R418Q, Q435H, E353K) are correctly predicted, and only two neutral variants (S80N, I401T) are incorrectly predicted to be deleterious.

Illustration of different structural regions and mutation sites of the human glutathione synthetase, which are color-coded: the catalytic region (orange), its neighboring environment (light orange), outside region (gray), the correctly predicted deleterious mutations (pink), the correctly predicted neutral mutations (dark blue) and the false predictions of deleterious mutations (light blue). All four deleterious mutation site residues (L188P, Y270C, Y270H, R283C) are located within the neighboring environment of the catalytic region, and are correctly predicted. Eight correctly predicted neutral predictions are located in the outside region, and the two false deleterious predictions are located at the boundary of the catalytic neighboring environment.
Figure 2

Illustration of different structural regions and mutation sites of the human glutathione synthetase, which are color-coded: the catalytic region (orange), its neighboring environment (light orange), outside region (gray), the correctly predicted deleterious mutations (pink), the correctly predicted neutral mutations (dark blue) and the false predictions of deleterious mutations (light blue). All four deleterious mutation site residues (L188P, Y270C, Y270H, R283C) are located within the neighboring environment of the catalytic region, and are correctly predicted. Eight correctly predicted neutral predictions are located in the outside region, and the two false deleterious predictions are located at the boundary of the catalytic neighboring environment.

Post-prediction analysis of the results show that SPRI can extract complex spatial patterns contributing to functions without a priori knowledge. The catalytic region of GSS has nine residues, with E144, N146 and E368 binding to magnesium, R125, S151 stabilizing the tetrahedral intermediate, K305, K364, G369 and R450 stabilizing the phosphate substrate and the leaving groups electrostatically [68]. We take these nine residues and compute the three-layer interaction profiles using alpha shapes. Residues identified form the neighbor environment of the catalytic region. The rest of the residues are regarded as outside residues. We found that all deleterious variants are located within the catalytic region and its neighbor environment, and the eight neutral variants with correct predictions are all outside residues. The two incorrectly predicted neutral variants are at the boundary of the neighbor environment.

SPRI’s ability in extracting spatial patterns important for functions from structures is general. We have further examined a total of 19 enzymes whose biochemistry are well understood and catalytic residues well annotated in the M-CSA enzyme database [69]. Statistical test over these enzymes shows that deleterious variants are more preferably located at the catalytic or its neighboring region, while neutral variants are more likely to occur at the outside region (P|$ = 4.2 \times 10^{-14}$| by Fisher’s exact test). It is important to note that SPRI does not require any prior knowledge on enzyme catalytic residues for predictions.

SPRI is highly effective in predicting pathogenic effects of Mendelian mutations

To evaluate the effectiveness of SPRI in predicting deleterious Mendelian mutation, we use stratified k-fold cross-validations on both the UnifyPDBFull and the UnifyPDBAcceptable datasets [70]. We focus on the former as it contains proteins whose structures cover the full sequences, ensuring accurate representations of structural properties. We compare our results with those from the methods of EVmutation, FATHMM, LIST, PMut, Polyphen-2, and Rhapsody [20, 21, 23, 25–27, 31]. We report completeness, recall, specificity, precision, accuracy, F-1 score and MCC for comprehensiveness. Among these metrics, MCC is an informative and balanced metric to symmetrically evaluate both groups of positives and negatives with adjustment of ratio of subgroups [71]. Here completeness is the ratio of mutations that can be given by a specific method against the whole set.

The 5-fold average performance of our method on UnifyPDBFull stratified test data has an F-1 score and Matthews correlation coefficient (MCC) of 0.904 and of 0.752, respectively, which are the highest among all 8 methods compared (Table 1). Our method also has the highest accuracy (0.882) and highest recall (0.919), indicating it is highly sensitive in identifying deleterious mutations. In addition, our method has an excellent specificity of 0.826 in identifying neutral mutations correctly, which are better than all other methods except Rhapsody (0.926). Our methods outperforms Rhapsody in several important regards: Rhapsody has a lower recall of 0.826, and it can make predictions on fewer variants in UnifyPDBFull (0.930), whereas our SPRI method has a higher recall of 0.919 and makes predictions on all variants. Polyphen-2 and PROVEAN achieve similar specificity as SPRI. EVmutation, and PMUT exhibits specificity on par with each other, but lower than SPRI. FATHMM exhibits significantly lower specificity than other methods. SPRI exhibits 10% lower specificity than Rhapsody. As Rhapsody incorporates protein dynamics and co-evolution information, additional consideration in these aspects may lead to improved specificity of SPRI. These would be interesting topics to explore for future studies.

The ROC and the PR curves of the overall performance of prediction at different thresholds are shown in Figure 3. On the stratified UnifyPDBFull test dataset, our method has an Area Under Receiver Operating Characteristic Curve (AUROC) of 0.946 and an Area Under Precision-Recall Curve (AUPRC) of 0.966, which are the best among all methods. These results show that our method is robust and performs well at different thresholds. ROC and PR curves of each fold on 5-fold cross validation test sets can be viewed in SI, Section 6. Results using cross-validation data of (UnifyPDBAcceptable) are similar to that of UnifyPDBFull, where our method has the best performance when measured by the metrics of F1-score, MCC value, accuracy, and AUCs of ROC and PR curves (Table 2 and SI, Section 7).

Receiver operating characteristic (ROC) curves and Precision–Recall (PR) curves of testing results using SPRI and other methods on the UnifyPDBFull and UnifyPDBAcceptable test datasets. (A) The ROC curves and their AUROCs using the UnifyPDBFull test dataset. SPRI has the highest AUROC of 0.949. (B) The PR curves and their AUPRCs using the UnifyPDBFull test dataset. SPRI has the highest AUPRC of 0.966. (C) The ROC curves and their AUROCs using the UnifyPDBAcceptable test dataset. SPRI has the highest AUROC of 0.943. (D) The PR curves and their AUPRCs using the UnifyPDBAcceptable test dataset. SPRI has the highest AUPRC of 0.960.
Figure 3

Receiver operating characteristic (ROC) curves and Precision–Recall (PR) curves of testing results using SPRI and other methods on the UnifyPDBFull and UnifyPDBAcceptable test datasets. (A) The ROC curves and their AUROCs using the UnifyPDBFull test dataset. SPRI has the highest AUROC of 0.949. (B) The PR curves and their AUPRCs using the UnifyPDBFull test dataset. SPRI has the highest AUPRC of 0.966. (C) The ROC curves and their AUROCs using the UnifyPDBAcceptable test dataset. SPRI has the highest AUROC of 0.943. (D) The PR curves and their AUPRCs using the UnifyPDBAcceptable test dataset. SPRI has the highest AUPRC of 0.960.

Table 2

Performance on the UnifyPDBAcceptable 5-fold test datasets

SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7110.9900.9900.9870.9900.9900.941
Recall0.9130.9020.8460.7530.7470.8810.8240.822
Specificity0.8280.7660.4050.7660.7320.8090.8170.925
Precision0.9020.8780.7070.8460.8290.8880.8840.952
Accuracy0.8820.8540.6830.7580.7420.8550.8210.859
F-1 score0.9070.8900.7710.7970.7860.8840.8530.882
MCC score0.7440.6760.2810.5050.4660.6890.6280.721
SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7110.9900.9900.9870.9900.9900.941
Recall0.9130.9020.8460.7530.7470.8810.8240.822
Specificity0.8280.7660.4050.7660.7320.8090.8170.925
Precision0.9020.8780.7070.8460.8290.8880.8840.952
Accuracy0.8820.8540.6830.7580.7420.8550.8210.859
F-1 score0.9070.8900.7710.7970.7860.8840.8530.882
MCC score0.7440.6760.2810.5050.4660.6890.6280.721

Bold indicates the best metric.

Table 2

Performance on the UnifyPDBAcceptable 5-fold test datasets

SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7110.9900.9900.9870.9900.9900.941
Recall0.9130.9020.8460.7530.7470.8810.8240.822
Specificity0.8280.7660.4050.7660.7320.8090.8170.925
Precision0.9020.8780.7070.8460.8290.8880.8840.952
Accuracy0.8820.8540.6830.7580.7420.8550.8210.859
F-1 score0.9070.8900.7710.7970.7860.8840.8530.882
MCC score0.7440.6760.2810.5050.4660.6890.6280.721
SPRIEVmutationFATHMMLISTPMUTPolyphen-2PROVEANRhapsody
Completeness1.00.7110.9900.9900.9870.9900.9900.941
Recall0.9130.9020.8460.7530.7470.8810.8240.822
Specificity0.8280.7660.4050.7660.7320.8090.8170.925
Precision0.9020.8780.7070.8460.8290.8880.8840.952
Accuracy0.8820.8540.6830.7580.7420.8550.8210.859
F-1 score0.9070.8900.7710.7970.7860.8840.8530.882
MCC score0.7440.6760.2810.5050.4660.6890.6280.721

Bold indicates the best metric.

Overall, SPRI has the best performance by most measurements among the eight methods compared, and is robust regardless of the thresholds, despite the fact that SPRI is derived on a training data set of much smaller size (see SI, Table 3.).

All features collectively contribute to model performance

To understand the relative contributions to effectiveness of predictions from different types of features, we compare the model trained on the full set of features with models trained on different subsets of features. Specifically, we trained additional models with evolutionary only, structural+biophysical only (non-evolutionary), structural only, and biophysical features only.

We compare the average values of AUROCs and AUPRCs of models based on 5-fold cross-validation of the UnifyPDBFull test datasets (SI Table 4 and SI Figures 3–7). Among these, the model trained with structural+biophysical (non-evolutionary) features has the best AUPRC (0.917) and the second-best AUROC (0.879). The model trained on evolutionary features has comparable AUPRC (0.913) and AUROC (0.891).

These results show that structural+biophsycial features contribute equivalently as evolutionary signals to the model performance, even though each individual structural and biophysical feature has lower feature-wise importance according to mean decrease in the Gini coefficient (SI Figure 2). It is notable that none of the subsets of features are capable of achieving comparable performance with the full set of features, suggesting that structural, biophysical, and evolutionary complement each other, and collectively contribute to the overall performance of the full model. Detailed information can be found in the SI, section 4 Analysis of feature importance..

SPRI can predict cancer driver mutations of somatic origin

Somatic mutations provide one of the most important means to trigger abnormal cell growth that may lead to tumorigenesis [72–74]. Identifying somatic mutations that drive cancer development can facilitate development of therapeutics targeting these mutated proteins [75]. However, there are millions of missense mutations accumulated in cancer patients, the majority of which exhibit low recurrence [6]. Current methods based on frequency counting can provide gene-level predictions, but are challenged in identifying local structure regions with higher pathogenicity, or in pinpointing individual cancer driver mutations [9].

While SPRI has been developed and evaluated using mutations of Mendelian disorders, we hypothesize that SPRI captures the essential characteristics of pathological mutations and the predictive model is transferable to predict cancer somatic mutations. To test this, we take the recently available annotations of tier 1 cancer census mutation from the COSMIC v92 as the ground truth [6], as there is strong evidence for their roles in tumorigenesis from experimental, clinical, and in silico analysis [6]. This dataset contains 201 missense mutations with both wild-type residues and substituted amino acids. Among these, 122 mutations can be mapped to 50 mutated residues on 26 proteins with known structures. We build a separate predictor following the same procedure but with the 26 cancer relevant proteins removed from the training process.

Our method has excellent performance in identifying cancer driver mutations shown in Figure 4. It can correctly classify 110 mutations out of the 122 structure-mapped mutations as deleterious. We also tested the methods of PROVEAN, Polyphen-2 and PMut, which have comparable performance, identifying 109, 105, and 103 deleterious mutations, respectively. EVmutation identifies 94 deleterious mutations, with 9 incorrect predictions and 19 missing predictions. Rhapsody identifies 78 out of the 122 deleterious mutations, at a significantly reduced level of effectiveness than Mendelian disorders. FATHMM identifies 65 deleterious mutations, with 54 incorrect predictions and 3 missing predictions. Overall, our method has excellent performance in identifying confirmed cancer driver mutations.

Predictions of cancer driver mutations among the set of cancer census mutations (cmc) tier-1 mutations. For each method, the numbers of correct prediction (pink), failed prediction (light blue), and missing predictions (gray) are listed. Overall, our SPRI method has the highest number of 110 correct predictions out of 122 driver mutations.
Figure 4

Predictions of cancer driver mutations among the set of cancer census mutations (cmc) tier-1 mutations. For each method, the numbers of correct prediction (pink), failed prediction (light blue), and missing predictions (gray) are listed. Overall, our SPRI method has the highest number of 110 correct predictions out of 122 driver mutations.

As currently there is no clearly defined and widely agreed upon neutral data set of passenger mutations in tumorigenesis, we develop a new data set of neutral passenger mutations consisting of 117 mutations on 31 proteins (see SI Section 13). Using this customized negative neutral passenger mutations, SPRI correctly identifies the largest number of neutral mutations (73 out of 117) on passenger mutations (see SI, Section 13. for more details).

SPRI can identify higher-order spatial clusters of low recurrence mutations but strong cancer susceptibility

In addition to identifying deleterious single mutations with binary classification, discovery of pathological HOS or patHOS is a unique feature of our method. SPRI can quantitatively evaluate pathogenicity of a cluster of residues in an objectively defined HOS unit. Here the pathogenicity of HOS is determined by both the number of deleterious mutations in the HOS, and the normalized pathogenicity score of HOS, the latter is for all residues in the HOS cluster and is different from the pathogenicity of a single mutation as in a binary prediction.

By evaluating the pathogenicities of the HOS clusters for each known mutation site taken as the anchor, we can distinguish those with strong disease susceptibility from background clusters. As an illustration, two HOS clusters with different levels of pathogenicities exist on the protein XPD (encoded by ERCC2, Figure 5B). The HOS cluster centering at I668 has 77 residues in total, harboring 8 deleterious mutations (T460A, T480A, T484A, T484R, G607A, G665A, G665R and I668F), all are of low recurrence (1 for 7 mutations and 3 for 1 mutation). With a cluster pathogenicity score of 0.078 > |$\theta $|= 0.05 and |$\ge $|5 deleterious mutations, this cluster is classified as a pathogenic HOS (patHOS) cluster with strong disease susceptibility. Indeed, this patHOS cluster contains the binding sites to ATP and a portion of the functional domain mediating interaction with MMS19. The accumulation of multiple low recurrence mutations with deleterious effects in this cluster likely strong affect the function of this protein.

Discovery of pathogenic higher-order spatial clusters (patHOS) enriched with low-recurrence deleterious mutations. (top) The polar plots illustrate the spatial relationships between co-clustering deleterious mutation sites and the anchor residue in the same higher-order spatial (HOS) cluster. Mutation sites are clockwise ordered according to their linear sequence index. Light-gray rings with different radii reflect relative distances to the anchor, with S, I and L for short-, intermediate- and long-range distance, respectively. The size of the colored disks are proportional to the recurrence counts in the pan-cancer data, except R387 of FBXW7. The recurrence count numbers are listed in the gray disks (bottom). The corresponding protein structures with the mutational patHOS clusters highlighted in light red are also shown. The mutation sites with deleterious effects are shown as red spheres. For comparison, background spatial structural clusters containing only neutral mutations are shown in light blue, with the mutation sites shown as blue spheres. (A) The patHOS cluster anchoring at R387 in the confirmed cancer gene FBXW7 and a background cluster anchored at E353 shown in the struture. (B) The patHOS cluster anchoring at I668 in the confirmed cancer gene ERCC2 and a background cluster anchored at Q187. (C) The patHOS cluster anchoring at R247 in the newly identified cancer gene GALNT2 and a background cluster anchored at R113. (D) The patHOS cluster anchoring at N74 in the newly identified cancer gene SLC14A1 and a background cluster anchored at W171.
Figure 5

Discovery of pathogenic higher-order spatial clusters (patHOS) enriched with low-recurrence deleterious mutations. (top) The polar plots illustrate the spatial relationships between co-clustering deleterious mutation sites and the anchor residue in the same higher-order spatial (HOS) cluster. Mutation sites are clockwise ordered according to their linear sequence index. Light-gray rings with different radii reflect relative distances to the anchor, with S, I and L for short-, intermediate- and long-range distance, respectively. The size of the colored disks are proportional to the recurrence counts in the pan-cancer data, except R387 of FBXW7. The recurrence count numbers are listed in the gray disks (bottom). The corresponding protein structures with the mutational patHOS clusters highlighted in light red are also shown. The mutation sites with deleterious effects are shown as red spheres. For comparison, background spatial structural clusters containing only neutral mutations are shown in light blue, with the mutation sites shown as blue spheres. (A) The patHOS cluster anchoring at R387 in the confirmed cancer gene FBXW7 and a background cluster anchored at E353 shown in the struture. (B) The patHOS cluster anchoring at I668 in the confirmed cancer gene ERCC2 and a background cluster anchored at Q187. (C) The patHOS cluster anchoring at R247 in the newly identified cancer gene GALNT2 and a background cluster anchored at R113. (D) The patHOS cluster anchoring at N74 in the newly identified cancer gene SLC14A1 and a background cluster anchored at W171.

In contrast, the HOS cluster centering at Q187 has 37 residues in total, but harboring only 2 neutral mutations (Q187R and Q187K), with a pathogenicity score of 0.006, and is therefore classified as a background cluster. This cluster of residues is likely of low cancer susceptibility.

SPRI can discover novel candidate cancer driver genes

We applied SPRI to a large scale study to search for candidate cancer driver genes. Using the pan-cancer mutation data of whole genome and whole exome sequencing from COSMIC database (v95), we analyze 239,164 mutations from 200,402 unique mutation site residues on 2712 proteins with full-coverage PDB structures. We exhaustively take each unique mutation site as the anchor, and calculate the pathogenicity of its HOS cluster.

Over 96.4% of the HOS clusters are found to harbor no more than 3 mutations. Using the threshold of pathogenicity score |$\geq \theta = 0.05$| and |$\ge $|5 deleterious mutations, we are able to identify 567 patHOS clusters, which likely have strong cancer susceptibility.

These clusters are concentrated on cancer driver genes: 200 high-susceptibility patHOS clusters are found on 28 of the 105 known cancer driver genes according to Cancer Gene Census (CGC) [6]. Another 367 clusters are found on 161 of the 2607 genes not known to be cancer genes (i.e., those are part of the non-CGC genes). Among these 161 genes, we predict that 29 genes to have high likelihood to be cancer driver genes, as each has at least 2 almost non-overlapping patHOS mutational clusters (overlap ≤ 20%). We recognize that it is challenging to validate the 29 predicted novel cancer driver genes as this requires detailed experimental analysis. To gain some validation, we have unearthed from literature varying levels of evidence supporting the roles of these genes in tumorigenesis. Among the 29 genes, 22 genes have at least two independent research publications indicating their potential roles in cancer, such as tumor suppressor genes (TSG), oncogenes, up/down-regulation genes in tumorigenesis pathways, significantly differential expression in cancer samples, diagnostic markers or clinical biomarkers relevant with patients’ outcome (see SI, Section 12.)

SPRI for in silico saturation mutation analysis

As SPRI can predict deleterious mutations contributing to both Mendelian disorders and complex diseases such as cancer, it can be used to assess general effects of substitutions of arbitrary sites from the wild type to any of the other 19 amino acid types. Figure 6 depicts an example, where we show that the heatmap of pathological effects of all different substitutions at all positions along the full protein sequence of the copper transport protein. The color intensity of each square in the heatmap represents the level of pathological effect of a substitution. As SPRI is scalable, such heatmaps can be computed exome/proteome wide for all proteins with structures.

The heatmap of predicted pathological effects of mutation saturation of copper transport protein ATOX1 (Uniprot: O00244). The horizontal axis is the residue index along the sequence, and the vertical axis represents the substituted amino acid types. Each square is color coded by the effects of mutation, where the color intensity encodes the likelihood of pathological effect. A red substitution represents a mutation predicted to be likely a pathological mutation, whereas a blue substitution likely a neutral mutation.
Figure 6

The heatmap of predicted pathological effects of mutation saturation of copper transport protein ATOX1 (Uniprot: O00244). The horizontal axis is the residue index along the sequence, and the vertical axis represents the substituted amino acid types. Each square is color coded by the effects of mutation, where the color intensity encodes the likelihood of pathological effect. A red substitution represents a mutation predicted to be likely a pathological mutation, whereas a blue substitution likely a neutral mutation.

SPRI can make reliable predictions using AlphaFold2 structures

SPRI is a structure-based method. As about 200 million proteins structures can be predicted by AlphaFold2 [17, 19], we apply SPRI to a set of proteins with AlphaFold2 structures but no experimental structures and evaluate how well pathological effects of missense mutations can be predicted.

We adopt the same dataset construction procedure as UnifyPDBFull, and take the subset of proteins from HumDiv, HumVar and PredictSNP that have AlphaFold2 structures but no PDB structures, and generate a new dataset UnifyAF2. It contains 4980 deleterious and 2600 neutral variants from 422 proteins. We use exactly the same model trained on UnifyPDBFull to predict pathological effects of missense mutations in UnifyAF2, which has no overlap with UnifyPDBFull. We stratify mutation site residues into three categories of high, medium and low-confidence level of their predicted structures using the pLDDT score from AlphaFold2 (see SI).

We report the ROC curves and PR curves of prediction results in Figure 7. For the high-, moderate- and low-confidence mutation site residues, our method achieves AUROCs of 0.908, 0.905 and 0.854, respectively. The corresponding AUPRCs are 0.956, 0.950 and 0.810, respectively. Overall, except residues of low-confidence, SPRI can provide reliable assessment of mutation effects using AlphaFold2 structures. We also compare SPRI with other available methods on the dataset of UnifyAF2. SPRI provides the best MCC score, AUROC, accuracy and completeness. SPRI also achieves the second-best AUPRC, F-1 score and recall on mutations with high-confidence AlphaFold2 structural scores. Note that the EVmutation method with the best AUPRC, F-1 score and precision has only 46.5% coverage, which is far less than all other methods (all with of the 4980+2600 mutations |$\geq 96\%$|⁠). More details can be found in SI, Section 14.

ROC curves and PR curves of prediction results using SPRI on the UnifyAF2 dataset of AlphaFold2 predicted structures. The performance is evaluated separately by the AlphaFold2 confidence level of the mutation site residues. (A) The ROC curves and their AUROCs using the UnifyAF2 dataset. (B) The PR curves and their AUPRCs using the UnifyAF2 dataset. Red, blue and orange curves represent residues with high-, moderate- and low-confidence pLDDT prediction quality, respectively.
Figure 7

ROC curves and PR curves of prediction results using SPRI on the UnifyAF2 dataset of AlphaFold2 predicted structures. The performance is evaluated separately by the AlphaFold2 confidence level of the mutation site residues. (A) The ROC curves and their AUROCs using the UnifyAF2 dataset. (B) The PR curves and their AUPRCs using the UnifyAF2 dataset. Red, blue and orange curves represent residues with high-, moderate- and low-confidence pLDDT prediction quality, respectively.

These results indicate that our method can be applied broadly to AlphaFold2 predicted structures with reliable performance. As the high- and moderate-confidence residues constitute 61.7% of the 14 850 403 residues among 23 391 AlaphFold2 predicted human protein structures, our method can be applied at whole-exome or whole-proteome scale.

DISCUSSION

Biochemical functions require specific spatial arrangement of multiple residues and atoms to collectively provide the structural scaffold and reaction surfaces [40–42]. SPRI can recognize such spatial arrangements through explicit computations of surface pockets, residue and atom-level interaction profiles, and further incorporation of changes in biophysical properties upon mutations. As shown in the case of glutathione synthetase, deleterious variants are likely to occur in the catalytic region and its neighborhood, while neutral variants are likely to be outside these regions. With exact geometric computation, these regions can be precisely defined using surface analysis and atomic-residue interactions profiles.

Furthermore, SPRI takes advantage of information contained in full structures, and infers pathogenicity automatically without requiring a priori knowledge of annotated functional domains or residues. This is different from existing methods where analysis is carried out at the domain level, requiring prior knowledge on clearly defined functional domains, as well as demanding significant alignment depth [20]. As a result, SPRI can be applied to a much larger number of mutation sites that have no domain annotations. In addition, there is no need of ad-hoc re-weighting or other post-processings, unlike the practice of several other methods [23, 26].

SPRI can identify pathogenic higher-order spatially organized units (patHOS) of local regions on protein structures. These units are enriched with strongly deleterious mutations that are well-separated in sequence (as shown in Fig. 5). As the structural scaffold and reaction surfaces of biological functions consist of many residues [56, 76], there are higher-order spatially organized clusters of functionally important residues. Our results show that when these are enriched with deleterious mutations, they can exhibit strong pathogenicity, even though individual mutations may be of low recurrence and these patHOS clusters can elude detection by frequency analysis of sequences. With this approach, we reported the discovery of candidate cancer driver genes and associated patHOS clusters of mutations.

SPRI has an important advantage over other methods in identifying structural clusters of mutations [35, 39]. SPRI makes the important distinction between a pathogenic higher-order spatially organized unit (patHOS) harboring strongly enriched deleterious mutations and a background structural higher-order spatial (HOS) cluster. No such distinctions are made in methods such as Mutation3D and AlphaCluster [35, 39]. SPRI can then predict cancer driver genes based on the detection of patHOS. While both SPRI and mutPanning [9], a frequency-based method, can identify all 28 CGC genes as cancer driver genes, no predictions on 160 out of 161 non-CGC genes with strong patHOS identified by SPRI are predicted to have cancer susceptibility by mutPanning [9]. Another important advantage of SPRI is it can pinpoint individual residues where predicted cancer pathogenecity concentrates. No information at residue level was provided by mutPanning, as its prediction is based on simulated mutation rate of the whole gene. Overall, our results show that SPRI can discover higher-order pathological clusters of low-recurrence mutations and distinguish them from otherwise noisy background. Such a task is not possible with prediction methods that works at the gene-level [9, 77].

Our results also shed some light on an important question, namely, whether pathological missense mutations of different varieties can be effectively predicted using the same set of features, regardless of whether it is of germline or somatic origin, and whether it manifests as Mendelian diseases or complex diseases. Some studies have built pathogenicity models that are disease-specific or dataset-specific. For example, FATHMM has different predictors for inherited disease, cancer, and other specific pathologies [21]. PMut allows user-input to customize training datasets so different predictors for different diseases can be constructed [25]. There are drawbacks with these approaches. FATHMM has very different answers in evaluating pathogenic mutations, depending on whether a predictor for inherited diseases or a predictor for cancer is used. This is likely due to the different level of available annotated information and the necessary adjustment of the model parameter [21]. While Rhapsody works well in predicting Mendelian disease related mutations, its predictions of cancer driver mutations lags behind significantly [31]; therefore, it is unclear if overall biochemical principles behind different disease mutations exist. Results presented here suggest that structural+biophysical and evolutionary features are equally informative for different types of pathological missense mutations. In addition, both protein structure and sequence encode a large amount of the relevant information. Furthermore, they can be effectively extracted by SPRI. We show that accurate predictions can be made on cancer driver mutations, even though SPRI was trained using Mendelian disease mutations, requiring no parameter adjustment or re-training using different data.

In addition to structure-derived features, features of evolutionary signals, and biophysical properties are also found to be important for distinguishing deleterious mutations from neutral mutations (see Supplemental Information). Together, they capture essential biological properties whose perturbations contribute to pathogenicity.

A limitation of SPRI is that it is based on analysis of protein structure of monomeric units. Structures of protein complexes contain valuable information on protein-protein interactions. As experimental and reliable predicted structures of protein-complexes continue to increase [78–80] and technical issues [81] are resolved, we anticipate that incorporating knowledge of protein–protein interaction interfaces will further improve models for predictions of pathogenicity of missense mutations.

CONCLUSION

Overall, SPRI is highly effective in extracting properties encoded in protein structures to provide accurate evaluation of pathological effects of missense mutations and to identify deleterious mutations. SPRI can provide accurate predictions of deleterious missense single mutations of germ line origin associated with Mendelian diseases, and performs favorably when compared to current state-of-the-art methods, as measured by several performance metrics.

The effectiveness of SPRI is transferable and SPRI can be used to identify pathological missense mutations of somatic origin as manifested in complex diseases. This is demonstrated by the success in predicting cancer driver mutations.

In addition, SPRI can discover higher-order spatial units of multiple low recurrence missense mutations, which are strongly indicative of high susceptibility to cancer. Furthermore, SPRI can also predict mutation effects at saturation level, as demonstrated by the example of the copper transport protein.

As SPRI can take advantage of AlphaFold2 predicted structures, it can be deployed for saturation mutation analysis the whole human proteome wide and can be used to construct an atlas of effects of saturation mutations. Such an atlas can provide a global view of protein fitness landscape and mutation effects of the human proteome.

Key Points
  • SPRI is highly effective in extracting properties determining pathogenicity encoded in protein structures. It can accurately evaluate pathological effects of missense mutations and identify deleterious mutations, with more favorable prediction performance compared to other current methods by metrics such as AUROC, AUPRC, MCC and F-1 score.

  • SPRI provides the unique capability of discovering pathogenic higher-order spatially organized units (patHOS) of deleterious mutations of low recurrence, a difficult task currently with no good solutions.

  • SPRI is highly transferable and it can effectively evaluate pathogenicity of mutations in either Mendelian disorders of germline origin or cancer driver mutations of somatic origin.

  • SPRI makes accurate predictions using high- or moderate-confidence AlphaFold2 predicted structures, and can be deployed to the whole human proteome-wide mutation saturation analysis.

FUNDING

The National Institutes of Health (NIH), Grants R35 GM 127084.

DATA AVAILABILITY STATEMENT

The pre-computed results on human proteome with known PDB structures can be found at http://sts.bioe.uic.edu/spri/, these benchmark datasets generated and analyzed during the current study are available is accessible at https://bitbucket.org/lianglabuic/spri/.

AUTHORS’ CONTRIBUTIONS

Conceptualization: BW and JL, with input from YYT. Methodology: BW and JL. Dataset and implementation: BW. Analysis: BW and JL. Additional analysis tools: XL, WT, APR, YYT. Original draft: BW and JL. Review and editing: XL, WT, APR and YYT. Supervision: JL. The authors read and approved the final manuscript.

Author Biographies

Boshen Wang is a PhD candidate at the University of Illinois at Chicago. His research interests are bioinformatics, cancer driver mutations and protein functions.

Xue Lei was at the University of Illinois at Chicago. She is currently at the Sanford Burnham Institute. Her research interests are systems biology and bioinformatics.

Wei Tian was at the University of Illinois at Chicago. He is currently at the Salk Institute. His research interests are bioinformatics, membrane protein design, and computation geometry.

Alan Perez-Rathke is an MD/PhD formerly at the University of Illinois at Chicago and now at Vanderbilt University Medical Center. His research interests are computational biology, Monte Carlo simulation, and artificial intelligence in medicine.

Yan-Yuan Tseng is an associate professor at Wayne State University. His research interests are protein structure, function, classification, and evolution.

Jie Liang is a Richard and Loan Hill Professor of Biomedical Engineering at the University of Illinois at Chicago. His research interests are systems biology, bioinformatics, protein biophysics, and 3D chromatin structure.

REFERENCES

1.

Yang
 
Y
,
Muzny
 
DM
,
Xia
 
F
, et al.
Molecular findings among patients referred for clinical whole-exome sequencing
.
JAMA
 
2014
;
312
(
18
):
1870
9
.

2.

Ng
 
PC
,
Kirkness
 
EF
.
Whole genome sequencing
.
Genetic Variation: Methods Protocols
 
2010
;
215
26
.

3.

Belkadi
 
A
,
Bolze
 
A
,
Itan
 
Y
, et al.
Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants
.
Proc Natl Acad Sci
 
2015
;
112
(
17
):
5473
8
.

4.

Lek
 
M
,
Karczewski
 
KJ
,
Minikel
 
EV
, et al.
Analysis of protein-coding genetic variation in 60,706 humans
.
Nature
 
2016
;
536
(
7616
):
285
91
.

5.

1000 Genomes Project Consortium
, et al.
A map of human genome variation from population scale sequencing
.
Nature
 
2010
;
467
(
7319
):
1061
73
.

6.

Tate
 
JG
,
Bamford
 
S
,
Jubb
 
HC
, et al.
Cosmic: the catalogue of somatic mutations in cancer
.
Nucleic Acids Res
 
2019
;
47
(
D1
):
D941
7
.

7.

Labelle
 
Y
,
Phaneuf
 
D
,
Leclerc
 
B
,
Tanguay
 
RM
.
Characterization of the human fumarylacetoacetate hydrolase gene and identification of a missense mutation abolishing enzymatic activity
.
Hum Mol Genet
 
1993
;
2
(
7
):
941
6
.

8.

Kawaguchi
 
T
,
Kato
 
S
,
Otsuka
 
K
, et al.
The relationship among p53 oligomer formation, structure and transcriptional activity using a comprehensive missense mutation library
.
Oncogene
 
2005
;
24
(
46
):
6976
81
.

9.

Dietlein
 
F
,
Weghorn
 
D
,
Taylor-Weiner
 
A
, et al.
Identification of cancer driver genes based on nucleotide context
.
Nat Genet
 
2020
;
52
(
2
):
208
18
.

10.

Stratton
 
MR
,
Campbell
 
PJ
,
Andrew
 
P
,
Futreal.
The cancer genome
.
Nature
 
2009
;
458
(
7239
):
719
24
.

11.

Vogelstein
 
B
,
Papadopoulos
 
N
,
Velculescu
 
VE
, et al.
Cancer genome landscapes
.
Science
 
2013
;
339
(
6127
):
1546
58
.

12.

Stephens
 
PJ
,
Tarpey
 
PS
,
Davies
 
H
, et al.
The landscape of cancer genes and mutational processes in breast cancer
.
Nature
 
2012
;
486
(
7403
):
400
4
.

13.

Rose
 
Y
,
Duarte
 
JM
,
Lowe
 
R
, et al.
Rcsb protein data bank: architectural advances towards integrated searching and efficient access to macromolecular structure data from the pdb archive
.
J Mol Biol
 
2021
;
433
(
11
):
166704
.

14.

Ilari
 
A
,
Savino
 
C
.
Protein structure determination by x-ray crystallography
.
Bioinformatics: Data Sequence Anal Evol
 
2008
;
63
87
.

15.

Wüthrich
 
K
.
Protein structure determination in solution by nmr spectroscopy
.
J Biol Chem
 
1990
;
265
(
36
):
22059
62
.

16.

Yip
 
KM
,
Fischer
 
N
,
Paknia
 
E
, et al.
Atomic-resolution protein structure determination by cryo-em
.
Nature
 
2020
;
587
(
7832
):
157
61
.

17.

Jumper
 
J
,
Evans
 
R
,
Pritzel
 
A
, et al.
Highly accurate protein structure prediction with alphafold
.
Nature
 
2021
;
596
(
7873
):
583
9
.

18.

Baek
 
M
,
DiMaio
 
F
,
Anishchenko
 
I
, et al.
Accurate prediction of protein structures and interactions using a three-track neural network
.
Science
 
2021
;
373
(
6557
):
871
6
.

19.

Varadi
 
M
,
Anyango
 
S
,
Deshpande
 
M
, et al.
Alphafold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models
.
Nucleic Acids Res
 
2022
;
50
(
D1
):
D439
44
.

20.

Hopf
 
TA
,
Ingraham
 
JB
,
Poelwijk
 
FJ
, et al.
Mutation effects predicted from sequence co-variation
.
Nat Biotechnol
 
2017
;
35
(
2
):
128
35
.

21.

Shihab
 
HA
,
Gough
 
J
,
Cooper
 
DN
, et al.
Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden markov models
.
Hum Mutat
 
2013
;
34
(
1
):
57
65
.

22.

Jiang
 
T
,
Fang
 
L
,
Wang
 
K
.
Mutformer: a context-dependent transformer-based model to predict deleterious missense mutations from protein sequences in the human genome
 
arXiv preprint arXiv:2110.14746
.
2021
.

23.

Malhis
 
N
,
Jones
 
SJM
,
Gsponer
 
J
.
Improved measures for evolutionary conservation that exploit taxonomy distances
.
Nat Commun
 
2019
;
10
(
1
):
1556
.

24.

Malhis
 
N
,
Jacobson
 
M
,
Jones
 
SJM
,
Gsponer
 
J
.
List-s2: taxonomy based sorting of deleterious missense mutations across species
.
Nucleic Acids Res
 
2020
;
48
(
W1
):
W154
61
.

25.

López-Ferrando
 
V
,
Gazzo
 
A
,
De La Cruz
 
X
, et al.
Pmut: a web-based tool for the annotation of pathological variants on proteins, 2017 update
.
Nucleic Acids Res
 
2017
;
45
(
W1
):
W222
8
.

26.

Adzhubei
 
IA
,
Schmidt
 
S
,
Peshkin
 
L
, et al.
A method and server for predicting damaging missense mutations
.
Nat Methods
 
2010
;
7
(
4
):
248
9
.

27.

Choi
 
Y
,
Chan
 
AP
.
Provean web server: a tool to predict the functional effect of amino acid substitutions and indels
.
Bioinformatics
 
2015
;
31
(
16
):
2745
7
.

28.

Vaser
 
R
,
Adusumalli
 
S
,
Leng
 
SN
, et al.
Sift missense predictions for genomes
.
Nat Protoc
 
2016
;
11
(
1
):
1
9
.

29.

Wang
 
B
,
Tian
 
W
,
Lei
 
X
, et al.
Structure-based method for predicting deleterious missense snps
. In:
2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI)
.
IEEE
,
2019
,
1
4
.

30.

Ponzoni
 
L
,
Bahar
 
I
.
Structural dynamics is a determinant of the functional significance of missense variants
.
Proc Natl Acad Sci
 
2018
;
115
(
16
):
4164
9
.

31.

Ponzoni
 
L
,
Peñaherrera
 
DA
,
Oltvai
 
ZN
,
Bahar
 
I
.
Rhapsody: predicting the pathogenicity of human missense variants
.
Bioinformatics
 
2020
;
36
(
10
):
3084
92
.

32.

Bakan
 
A
,
Meireles
 
LM
,
Bahar
 
I
.
Prody: protein dynamics inferred from theory and experiments
.
Bioinformatics
 
2011
;
27
(
11
):
1575
7
.

33.

Quan
 
L
,
Hongjie
 
W
,
Lyu
 
Q
,
Zhang
 
Y
.
Dampred: recognizing disease-associated nssnps through bayes-guided neural-network model built on low-resolution structure prediction of proteins and protein–protein interactions
.
J Mol Biol
 
2019
;
431
(
13
):
2449
59
.

34.

Sedova
 
M
,
Iyer
 
M
,
Li
 
Z
, et al.
Cancer3d 2.0: interactive analysis of 3d patterns of cancer mutations in cancer subsets
.
Nucleic Acids Res
 
2019
;
47
(
D1
):
D895
9
.

35.

Meyer
 
MJ
,
Lapcevic
 
R
,
Romero
 
AE
, et al.
mutation3d: cancer gene prediction through atomic clustering of coding variants in the structural proteome
.
Hum Mutat
 
2016
;
37
(
5
):
447
56
.

36.

Gao
 
J
,
Chang
 
MT
,
Johnsen
 
HC
, et al.
3d clusters of somatic mutations in cancer reveal numerous rare mutations as functional targets
.
Genome Med
 
2017
;
9
:
1
13
.

37.

Chen
 
S
,
He
 
X
,
Li
 
R
, et al.
Hotspot3d web server: an integrated resource for mutation analysis in protein 3d structures
.
Bioinformatics
 
2020
;
36
(
12
):
3944
6
.

38.

Stitziel
 
NO
,
Andrew Binkowski
 
T
,
Tseng
 
YY
, et al.
Toposnp: a topographic database of non-synonymous single nucleotide polymorphisms with and without known disease association
.
Nucleic Acids Res
 
2004
;
32
(
suppl_1
):
D520
2
.

39.

Obiajulu
 
J
,
Kuang
 
R
,
Zhong
 
G
, et al.
Alphacluster: Coevolutionary driven residue-residue interaction models enable quantifiable clustering analysis of de novo variants to enhance predictions of pathogenicity
.
2022
.

40.

English
 
AC
,
Groom
 
CR
,
Hubbard
 
RE
.
Experimental and computational mapping of the binding surface of a crystalline protein
.
Protein Eng
 
2001
;
14
(
1
):
47
59
.

41.

Liang
 
J
,
Woodward
 
C
,
Edelsbrunner
 
H
.
Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design
.
Protein Sci
 
1998a
;
7
(
9
):
1884
97
.

42.

Tseng
 
YY
,
Dundas
 
J
,
Liang
 
J
.
Predicting protein function and binding profile via matching of local evolutionary and geometric surface patterns
.
J Mol Biol
 
2009
;
387
(
2
):
451
64
.

43.

Tseng
 
YY
,
Liang
 
J
.
Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a bayesian Monte Carlo approach
.
Mol Biol Evol
 
2006
;
23
(
2
):
421
36
.

44.

Dundas
 
J
,
Adamian
 
L
,
Liang
 
J
.
Structural signatures of enzyme binding pockets from order-independent surface alignment: a study of metalloendopeptidase and nad binding proteins
.
J Mol Biol
 
2011a
;
406
(
5
):
713
29
.

45.

Edelsbrunner
 
H
,
Mücke
 
EP
.
Three-dimensional alpha shapes
.
ACM Trans Graphics (TOG)
 
1994
;
13
(
1
):
43
72
.

46.

Liang
 
J
,
Edelsbrunner
 
H
,
Ping
 
F
, et al.
Analytical shape computation of macromolecules: I. molecular area and volume through alpha shape
.
Proteins: Struct Funct Bioinformatics
 
1998b
;
33
(
1
):
1
17
.

47.

Breiman
 
L
.
Random forests
.
Mach Learn
 
2001
;
45
:
5
32
.

48.

Karczewski
 
KJ
,
Weisburd
 
B
,
Thomas
 
B
, et al.
The exac browser: displaying reference data information from over 60 000 exomes
.
Nucleic Acids Res
 
2017
;
45
(
D1
):
D840
5
.

49.

Nair
 
PS
,
Vihinen
 
M
.
V ari b ench: a benchmark database for variations
.
Hum Mutat
 
2013
;
34
(
1
):
42
9
.

50.

Bendl
 
J
,
Stourac
 
J
,
Salanda
 
O
, et al.
Predictsnp: robust and accurate consensus classifier for prediction of disease-related mutations
.
PLoS Comput Biol
 
2014
;
10
(
1
):
e1003440
.

51.

Billas
 
IML
,
Iwema
 
T
,
Garnier
 
J-M
, et al.
Structural adaptability in the ligand-binding pocket of the ecdysone hormone receptor
.
Nature
 
2003
;
426
(
6962
):
91
6
.

52.

Michael Gromiha
 
M
,
Oobatake
 
M
,
Kono
 
H
, et al.
Relationship between amino acid properties and protein stability: buried mutations
.
J Protein Chem
 
1999
;
18
:
565
78
.

53.

Tian
 
W
,
Chen
 
C
,
Lei
 
X
, et al.
Castp 3.0: computed atlas of surface topography of proteins
.
Nucleic Acids Res
 
2018
;
46
(
W1
):
W363
7
.

54.

Pires
 
DEV
,
Ascher
 
DB
,
Blundell
 
TL
.
Mcsm: predicting the effects of mutations in proteins using graph-based signatures
.
Bioinformatics
 
2014
;
30
(
3
):
335
42
.

55.

Li
 
X
,
Changyu
 
H
,
Liang
 
J
.
Simplicial edge representation of protein structures and alpha contact potential with confidence measure
.
Proteins: Struct Funct Bioinformatics
 
2003
;
53
(
4
):
792
805
.

56.

Li
 
X
,
Liang
 
J
.
Geometric cooperativity and anticooperativity of three-body interactions in native proteins
.
Proteins: Struct Funct and Bioinformatics
 
2005
;
60
(
1
):
46
65
.

57.

Liang
 
J
,
Edelsbrunner
 
H
,
Ping
 
F
, et al.
Analytical shape computation of macromolecules: I. molecular area and volume through alpha shape
.
Proteins: Struct Funct Bioinformatics
 
1998c
;
33
(
1
):
1
17
.

58.

Liang
 
J
,
Edelsbrunner
 
H
,
Ping
 
F
, et al.
Analytical shape computation of macromolecules: ii. Inaccessible cavities in proteins
.
Proteins: Struct Funct Bioinformatics
 
1998d
;
33
(
1
):
18
29
.

59.

Nakamura
 
H
.
Roles of electrostatic interaction in proteins
.
Q Rev Biophys
 
1996
;
29
(
1
):
1
90
.

60.

Shoulders
 
MD
,
Hodges
 
JA
,
Raines
 
RT
.
Reciprocity of steric and stereoelectronic effects in the collagen triple helix
.
J Am Chem Soc
 
2006
;
128
(
25
):
8112
3
.

61.

Baldridge
 
A
,
Samanta
 
SR
,
Nithyanandhan Jayaraj
 
V
,
Ramamurthy
, and
Laren
 
M Tolbert.
Steric and electronic effects in capsule-confined green fluorescent protein chromophores
.
J Am Chem Soc
 
2011
;
133
(
4
):
712
5
.

62.

Bundy
 
A
,
Wallen
 
L
.
Breadth-first search
.
Catalogue Artif Intell Tools
 
1984
;
13
3
.

63.

Kumar
 
S
,
Nussinov
 
R
.
Salt bridge stability in monomeric proteins
.
J Mol Biol
 
1999
;
293
(
5
):
1241
55
.

64.

Steven Henikoff and Jorja G Henikoff
.
Amino acid substitution matrices from protein blocks
.
Proc Natl Acad Sci
 
1992
;
89
(
22
):
10915
9
.

65.

Fabian
 
S
,
Higgins
 
G
,
Desmond.
Clustal omega
.
Curr Protoc Bioinformatics
 
2014
;
48
(
1
):
3
13
.

66.

Liaw
 
A
,
Wiener
 
M
, et al.
Classification and regression by randomforest
.
R News
 
2002
;
2
(
3
):
18
22
.

67.

Ingle
 
BL
,
Shrestha
 
B
,
De Jesus
 
MC
, et al.
Genetic mutations in the s-loop of human glutathione synthetase: links between substrate binding, active site structure and allostery. Computational and structural
.
Biotechnol J
 
2019
;
17
:
31
8
.

68.

Dinescu
 
A
,
Cundari
 
TR
,
Bhansali
 
VS
, et al.
Function of conserved residues of human glutathione synthetase: implications for the atp-grasp enzymes
.
J Biol Chem
 
2004
;
279
(
21
):
22412
21
.

69.

Ribeiro
 
AJM
,
Holliday
 
GL
,
Furnham
 
N
, et al.
Mechanism and catalytic site atlas (m-csa): a database of enzyme reaction mechanisms and active sites
.
Nucleic Acids Res
 
2018
;
46
(
D1
):
D618
23
.

70.

Zeng
 
X
,
Martinez
 
TR
.
Distribution-balanced stratified cross-validation for accuracy estimation
.
J Exp Theor Artif Intell
 
2000
;
12
(
1
):
1
12
.

71.

Chicco
 
D
,
Jurman
 
G
.
The advantages of the Matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation
.
BMC Genomics
 
2020
;
21
:
1
13
.

72.

Greenman
 
C
,
Stephens
 
P
,
Smith
 
R
, et al.
Patterns of somatic mutation in human cancer genomes
.
Nature
 
2007
;
446
(
7132
):
153
8
.

73.

Ramroop
 
JR
,
Gerber
 
MM
,
Toland
 
AE
.
Germline variants impact somatic events during tumorigenesis
.
Trends Genet
 
2019
;
35
(
7
):
515
26
.

74.

Wang
 
M
,
Zhao
 
J
,
Zhang
 
L
, et al.
Role of tumor microenvironment in tumorigenesis
.
J Cancer
 
2017
;
8
(
5
):
761
73
.

75.

Hong
 
DS
,
Fakih
 
MG
,
Strickler
 
JH
, et al.
Krasg12c inhibition with sotorasib in advanced solid tumors
.
New Engl J Med
 
2020
;
383
(
13
):
1207
17
.

76.

Dundas
 
J
,
Adamian
 
L
,
Liang
 
J
.
Structural signatures of enzyme binding pockets from order-independent surface alignment: a study of metalloendopeptidase and nad binding proteins
.
J Mol Biol
 
2011b
;
406
(
5
):
713
29
.

77.

Liu
 
C
,
Dai
 
Y
,
Keping
 
Y
,
Zhang
 
Z-K
.
Enhancing cancer driver gene prediction by protein–protein interaction network
.
IEEE/ACM Trans Comput Biol Bioinform
 
2021
;
19
(
4
):
2231
40
.

78.

Kundrotas
 
PJ
,
Kotthoff
 
I
,
Choi
 
SW
, et al.
Dockground tool for development and benchmarking of protein docking procedures
.
Protein Struct Predict
 
2020
;
289
300
.

79.

Evans
 
R
,
O’Neill
 
M
,
Pritzel
 
A
, et al.
Protein complex prediction with alphafold-multimer
.
BioRxiv
 
2021
;
2021
10
.

80.

Pierce
 
BG
,
Wiehe
 
K
,
Hwang
 
H
, et al.
Zdock server: interactive docking prediction of protein–protein complexes and symmetric multimers
.
Bioinformatics
 
2014
;
30
(
12
):
1771
3
.

81.

Sprinzak
 
E
,
Sattath
 
S
,
Margalit
 
H
.
How reliable are experimental protein–protein interaction data?
 
J Mol Biol
 
2003
;
327
(
5
):
919
23
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)

Supplementary data