VIPpred: a novel model for predicting variant impact on phosphorylation events driving carcinogenesis

Author Notes

Abstract

Disrupted protein phosphorylation due to genetic variation is a widespread phenomenon that triggers oncogenic transformation of healthy cells. However, few relevant phosphorylation disruption events have been verified due to limited biological experimental methods. Because of the lack of reliable benchmark datasets, current bioinformatics methods primarily use sequence-based traits to study variant impact on phosphorylation (VIP). Here, we increased the number of experimentally supported VIP events from less than 30 to 740 by manually curating and reanalyzing multi-omics data from 916 patients provided by the Clinical Proteomic Tumor Analysis Consortium. To predict VIP events in cancer cells, we developed VIPpred, a machine learning method characterized by multidimensional features that exhibits robust performance across different cancer types. Our method provided a pan-cancer landscape of VIP events, which are enriched in cancer-related pathways and cancer driver genes. We found that variant-induced increases in phosphorylation events tend to inhibit the protein degradation of oncogenes and promote tumor suppressor protein degradation. Our work provides new insights into phosphorylation-related cancer biology as well as novel avenues for precision therapy.

cancer driver, multi-omics data, pan-cancer landscape, protein degradation, variant impact on phosphorylation

INTRODUCTION

Cancer is a disease in which cells have acquired the capacity to divide and grow uncontrollably [1–3], usually through genetic changes in specific genes. Identifying genetic variants responsible for driving tumor progression (drivers) and distinguishing them from the enormous number of randomly occurring variants (passengers) is a critical step in understanding tumorigenesis biology and precision oncology [4, 5]. Protein phosphorylation is the most common post-translational modification (PTM) that is involved in chronic proliferation, anti-apoptosis, tissue invasion and metastasis characteristic of tumorigenesis [6]. Previous studies have demonstrated that a number of cancer drivers affect the status of phosphorylation and trigger oncogenic transformation of healthy cells. For example, a systematic analysis of lung tissue found that the mutation P1019L within the epidermal growth factor receptor (EGFR) protein causes abnormal activation of downstream extracellular-regulated kinase, namely by up-regulating EGFR Y1016 phosphorylation levels [7]. Likewise, the mutation R213Q within tumor protein p53 (TP53) disrupts phosphorylation at S215, which ultimately suppresses the degradation of TP53 [8]. Thus, investigation of variant impact on phosphorylation (VIP) events is an effective strategy for annotating novel cancer drivers.

Low-throughput experiments such as co-immunoprecipitations and phosphorylation site-specific antibody assays have only confirmed a few VIP events [8, 9], as these studies can only target limited sites with corresponding antibodies. Due to a lack of experimentally confirmed VIP events, databases and prediction models concerning VIP events mainly rely on sequence-based features. One is the sequential distance of non-synonymous single nucleotide variants (nsSNVs) and phosphosites [10–14]. While characterizing the modification status alterations of phosphosites occurring at the same location of nsSNVs can be accurate, these predictors may produce false positives when facing nsSNVs occurring near phosphosites. Another feature of VIP prediction is the kinase-substrate binding patterns that are disrupted or introduced by nsSNVs [15, 16]. Nevertheless, this feature cannot fully reflect the overall mechanism of VIP events in cancerous cells because (1) most kinase-substrate binding patterns remain elusive and (2) such feature is only one of the effective traits in VIP events prediction. Therefore, constructing an accurate and experimentally proven VIP event dataset is of significant importance. Introducing features other than sequence-based information and developing precise bioinformatics models for investigating pan-cancer landscape and mechanisms of VIP events remain a major concern. In this study, we designed a new pipeline where we reprocessed phosphoproteomics and proteomics data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) to generate a VIP benchmark dataset. Based on the VIP benchmark dataset, we also developed a high-accuracy prediction model named VIPpred, which integrated four distinct categories of multidimensional features compared to existing sequence-based models as shown in Table 1. We provided a pan-cancer landscape of VIP events and found that these events affect the degradation of cancer driver proteins, which brings novel insights into carcinogenesis mechanisms.

Table 1

Open in new tab

Comparison with existing bioinformatics methods for VIP prediction

Model	Date	Data sources	VIP in training data	Classification features	Classification method	Input for prediction	Output
MIMP [16]	2015	PhosphoSitePlus, PhosphoELM, HPRD	/	Kinase sequence specificity	Gaussian mixture models	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PhosphoPICK-SNP [15]	2017	Phospho.ELM, HPRD, BioGRID, cell cycle data obtained from the experiments by Olsen et al. [17]	/	PPI, cell cycle, kinase sequence specificity	Bayesian network model	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PTMsnp [10]	2020	dbPTM, iPTMnet, TCGA, UniProt	/	The differences in mutation rates across various positions	Bayesian hierarchical model	Genetic mutation	PTM-related mutations
PhosSNP [11]	2015	Exp. and STRING PPIs, dbSNP, GPS 2.1 human kinase predictors [18]	/	PPI, kinase sequence specificity	BLOSUM62 similarity matrix	dbSNP mutations and experimentally known phosphorylation sites	All PhosSNP events
DeltaScansite [17]	2019	Disease-associated mutations from TCGA, UniProt and dbSNP with PTM sites from PhosphoSitePlus	/	/	Scansite and MIMP [16, 19]	Disease-associated mutations and PTM sites	All VIP events
VIPpred	2023	CPTAC, TCGA, UniProt	Experimentally supported VIP events from mass spectrometry data	Multidimensional features grouped into four categories	XGBoost	nsSNV–phosphosite pairs	Gain/Loss/No impact labels

Model	Date	Data sources	VIP in training data	Classification features	Classification method	Input for prediction	Output
MIMP [16]	2015	PhosphoSitePlus, PhosphoELM, HPRD	/	Kinase sequence specificity	Gaussian mixture models	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PhosphoPICK-SNP [15]	2017	Phospho.ELM, HPRD, BioGRID, cell cycle data obtained from the experiments by Olsen et al. [17]	/	PPI, cell cycle, kinase sequence specificity	Bayesian network model	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PTMsnp [10]	2020	dbPTM, iPTMnet, TCGA, UniProt	/	The differences in mutation rates across various positions	Bayesian hierarchical model	Genetic mutation	PTM-related mutations
PhosSNP [11]	2015	Exp. and STRING PPIs, dbSNP, GPS 2.1 human kinase predictors [18]	/	PPI, kinase sequence specificity	BLOSUM62 similarity matrix	dbSNP mutations and experimentally known phosphorylation sites	All PhosSNP events
DeltaScansite [17]	2019	Disease-associated mutations from TCGA, UniProt and dbSNP with PTM sites from PhosphoSitePlus	/	/	Scansite and MIMP [16, 19]	Disease-associated mutations and PTM sites	All VIP events
VIPpred	2023	CPTAC, TCGA, UniProt	Experimentally supported VIP events from mass spectrometry data	Multidimensional features grouped into four categories	XGBoost	nsSNV–phosphosite pairs	Gain/Loss/No impact labels

Table 1

Open in new tab

Comparison with existing bioinformatics methods for VIP prediction

Model	Date	Data sources	VIP in training data	Classification features	Classification method	Input for prediction	Output
MIMP [16]	2015	PhosphoSitePlus, PhosphoELM, HPRD	/	Kinase sequence specificity	Gaussian mixture models	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PhosphoPICK-SNP [15]	2017	Phospho.ELM, HPRD, BioGRID, cell cycle data obtained from the experiments by Olsen et al. [17]	/	PPI, cell cycle, kinase sequence specificity	Bayesian network model	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PTMsnp [10]	2020	dbPTM, iPTMnet, TCGA, UniProt	/	The differences in mutation rates across various positions	Bayesian hierarchical model	Genetic mutation	PTM-related mutations
PhosSNP [11]	2015	Exp. and STRING PPIs, dbSNP, GPS 2.1 human kinase predictors [18]	/	PPI, kinase sequence specificity	BLOSUM62 similarity matrix	dbSNP mutations and experimentally known phosphorylation sites	All PhosSNP events
DeltaScansite [17]	2019	Disease-associated mutations from TCGA, UniProt and dbSNP with PTM sites from PhosphoSitePlus	/	/	Scansite and MIMP [16, 19]	Disease-associated mutations and PTM sites	All VIP events
VIPpred	2023	CPTAC, TCGA, UniProt	Experimentally supported VIP events from mass spectrometry data	Multidimensional features grouped into four categories	XGBoost	nsSNV–phosphosite pairs	Gain/Loss/No impact labels

Model	Date	Data sources	VIP in training data	Classification features	Classification method	Input for prediction	Output
MIMP [16]	2015	PhosphoSitePlus, PhosphoELM, HPRD	/	Kinase sequence specificity	Gaussian mixture models	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PhosphoPICK-SNP [15]	2017	Phospho.ELM, HPRD, BioGRID, cell cycle data obtained from the experiments by Olsen et al. [17]	/	PPI, cell cycle, kinase sequence specificity	Bayesian network model	Phosphorylation–mutation sites pairs on specific proteins	VIP event prediction results and probabilities
PTMsnp [10]	2020	dbPTM, iPTMnet, TCGA, UniProt	/	The differences in mutation rates across various positions	Bayesian hierarchical model	Genetic mutation	PTM-related mutations
PhosSNP [11]	2015	Exp. and STRING PPIs, dbSNP, GPS 2.1 human kinase predictors [18]	/	PPI, kinase sequence specificity	BLOSUM62 similarity matrix	dbSNP mutations and experimentally known phosphorylation sites	All PhosSNP events
DeltaScansite [17]	2019	Disease-associated mutations from TCGA, UniProt and dbSNP with PTM sites from PhosphoSitePlus	/	/	Scansite and MIMP [16, 19]	Disease-associated mutations and PTM sites	All VIP events
VIPpred	2023	CPTAC, TCGA, UniProt	Experimentally supported VIP events from mass spectrometry data	Multidimensional features grouped into four categories	XGBoost	nsSNV–phosphosite pairs	Gain/Loss/No impact labels

RESULTS

An overview of the study

Figure 1 provides a general overview of our study design. To investigate the impact of nsSNV on phosphorylation status in high-throughput manner, we quantified phosphorylation-level changes before and after the occurrence of nsSNV with phosphoproteomics data based on tandem mass spectrometry (MS/MS). Initially, we performed a comprehensive data collection and reprocessing procedure using (phosphor)proteomics and proteomics data from 916 cancer patients with tumor/normal samples across eight cancer types. To validate the accuracy of nsSNV search, we integrated proteomics data from the same cohort of patients into our analysis (Methods). We employed our systematic approach, referred to as VIPpipeline, to the extraction and reanalysis of nsSNV–phosphosite pairs derived from the CPTAC MS/MS data of CPTAC (Supplementary Table S1 available online at http://bib.oxfordjournals.org/, Method). We analyzed more than 244 million MS/MS spectra using Philosopher (v4.1.1) [20, 21], and all database search results went through quality control at peptide-spectrum match (PSM) level by false discovery rate (FDR) estimated using target-decoy strategy. The impact of variants on phosphorylation status, determined by comparing phosphorylation states on reference PSMs and variant PSMs, was categorized and annotated as Gain, Loss and No Impact. These curated VIP events formed our VIP dataset, which is the first experimentally supported VIP benchmark dataset based on high-throughput data. Furthermore, to predict candidate VIP events accurately and efficiently, we compared multidimensional features across different categories of VIP events and developed a highly accurate classification model, referred to as the VIPpred model. Moreover, utilizing the VIPpred model, we predicted VIP events at pan-cancer level and conducted detailed analyses of their close associations with carcinogenesis.

Figure 1

An overview of the study design. Our study includes the following key steps: data collection, VIPpipeline for MS/MS data reprocessing, VIP events annotating, prediction model construction, the validation and application of our VIPpred model. VIP, variant impact on phosphorylation. DB, database. CPTAC, Clinical Proteomic Tumor Analysis Consortium. TCGA, The Cancer Genome Atlas.

Open in new tab Download slide

VIPpipeline identified reliable nsSNV–phosphosite pairs

To improve the accuracy of the identified PSMs at the amino acid level, we applied PepQuery (v1.6.2) and PTMProphet (v6.0) within the pipeline [22, 23]. These steps helped in reducing false positives of detected nsSNVs and phosphosites, resulting in a reliable and high-quality dataset of nsSNV–phosphosite pairs. In total, we identified 24 574 PSMs with nsSNVs (varPSMs), corresponding to 11 576 nsSNVs, and 4812 varPSMs harbored concomitant phosphosites, resulting in 831 non-redundant nsSNV–phosphosite pairs (Figure 2A). We observed that a majority of the phosphorylation modifications occurred at serine (68.0%) or threonine (24.9%) residues, with an additional 7.2% occurring at tyrosine residues (Supplementary Figure S1A available online at http://bib.oxfordjournals.org/). In addition, we obtained more novel phosphosites on varPSMs compared with reference PSMs (refPSMs), possibly due to the occurrence of nsSNVs (Supplementary Figure S1B available online at http://bib.oxfordjournals.org/). Our analysis also revealed that the detected variant peptides were not biased toward highly expressed proteins or specific tumor types (Supplementary Figure S1C and D available online at http://bib.oxfordjournals.org/). Together, we reannotated multi-omics data of 916 patients across eight cancer types and identified 831 nsSNV–phosphosite pairs distributed across 478 proteins.

Figure 2

Identification, evaluation and annotation of variant impact on phosphorylation events of CPTAC cancer phosphoproteome and proteome datasets. (A) PSM counts from reanalyzed eight CPTAC cancer datasets. These PSMs were subjected to sequential quality control by applying PepQuery (v1.6.2) and PTMProphet (v6.0). Numbers in brackets stand for counts of PSMs searched from each cancer MS/MS data. PSM, peptide spectral match. MS/MS, tandem mass spectrometry. (B) Distribution of variant peptides’ abundance for all nsSNV–patient match pairs (case 1), nsSNV–patient mismatch pairs of tumor samples (case 2) and normal samples (case 3) in the same Tandem Mass Tag (TMT) multiplex. Match pairs represent a higher detection rate than mismatch pairs as expected. nsSNV, non-synonymous single nucleotide variant. (C) Distribution of VIP events. (Upper panel) X-axis stands for protein abundance FC between tumor samples and normal samples; y-axis stands for phosphorylation level FC triggered by nsSNVs. Each dot stands for a VIP event. (Lower panel) Number of different types of VIP events. FC, fold change. (D) Decrease in phosphorylation level of S303 due to the nsSNV S303L on SRSF6 protein. Depiction of the phosphorylation level FC of S303 (red points) and phosphorylation level FC of other phosphosites (green points) between paired tumor and normal sample on SRSF6 along the entire protein sequence. Phosphorylation-level FC is correlated with its protein abundance FC. UCEC, uterine corpus endometrial carcinoma. SRSF6, serine/arginine-rich splicing factor 6.

Open in new tab Download slide

Considering that the nsSNV–phosphosite pairs determine the annotation of impacts on phosphorylation status, we should take measures to secure the reliability of nsSNV–phosphosite pairs identified. Firstly, we hypothesized that nsSNVs detected by VIPpipeline should originate primarily from tumor samples with an abundance larger than corresponding normal samples. Our analysis confirmed this hypothesis, demonstrating significant enrichment in the variant peptides’ abundance within tumor samples compared to normal samples across both phosphoproteome and proteome datasets (P-values < 10⁻⁶, two-tailed Mann–Whitney test, Supplementary Figure S2A and B available online at http://bib.oxfordjournals.org/). Next, we validated the reliability of our collected nsSNVs data by comparing abundance differences of variant peptides in three groups of samples within each Tandem Mass Tag (TMT) [24] multiplex: nsSNVs corresponding tumor samples (green), nsSNVs non-corresponding tumor samples (pink) and nsSNVs non-corresponding normal samples (blue) (Figure 2B). The results showed that the abundance distribution of variant peptides in corresponding samples was much higher than in non-corresponding samples, supporting the robustness of our nsSNV detection. Moreover, PTMProphet enhanced the reliability of phosphorylation site detection, as evidenced by higher correlation coefficients of sites passing PTMProphet with publicly available CPTAC phosphoproteomics matrix (from LinkedOmics Portal) [25] (Supplementary Figure S2C available online at http://bib.oxfordjournals.org/). These results emphasized the reliability of identified nsSNV–phosphosite pairs with thorough quality control of MS data.

Annotation of Gain or Loss effects of nsSNVs on phosphorylation levels

To acquire an experimentally supported benchmark dataset, we annotated the effects of nsSNVs on phosphorylation levels according to the nsSNV–phosphosite pairs obtained in the previous section. We defined the VIP event where phosphosites display increased (fold change (FC) ≥2) phosphorylation levels triggered by nsSNVs as Gain event, while the Loss event is defined as a decrease in nsSNV-triggered phosphorylation levels (FC ≤ 1/2). To eliminate the effect of protein abundance on the evaluation of phosphorylation-level alteration, we required the FC in phosphorylation levels to be disproportionate to the magnitude of the FC of protein abundance (phosphosite FC/protein FC ≥ 2 for Gain events and vice versa for Loss events) (Supplementary Figure S3A and B available online at http://bib.oxfordjournals.org/, Methods). For nsSNVs not interfering with phosphosites, we determined such events as No Impact. Moreover, we defined the situation where nsSNVs occur at the identical location with phosphosite as On and different sites as Near. As listed in Supplementary Table S2 available online at http://bib.oxfordjournals.org/, we obtained 740 VIP events (Figure 2C), including 40 On events, 100 Gain events, 107 Loss events and 493 No Impact events, several of which play important roles in certain tumor processes (Figure 2D, Supplementary Figure S3C and D, Supplementary Results available online at http://bib.oxfordjournals.org/). We also identified On/Gain/Loss events in various cancer types, a number of which were found in lung squamous cell carcinoma (LSCC), uterine corpus endometrial carcinoma (UCEC) and lung adenocarcinoma (LUAD) (56, 52 and 46, respectively), which are distributed across 37, 23 and 23 proteins, respectively (Supplementary Figure S3E available online at http://bib.oxfordjournals.org/). Together, these observations strongly suggest that the effects of nsSNVs on phosphosites are widespread in different tumor types.

Multidimensional feature-based prediction model predicts VIP events at high accuracy

Previous studies investigating VIP events were hampered by the lack of experimentally validated benchmark datasets, limiting the features used in their predictions of the effects of nsSNVs on phosphorylation sites to sequence-based features primarily. We fed our VIP dataset to MIMP and PhosphoPICK-SNP as the testing dataset, failing to predict the effects of nsSNVs on phosphosites accurately (Figure 3A, AUC of MIMP close to 0.5, AUC of PhosphoPICK-SNP ≤ 0.6). To further evaluate the feature performance of alterations in kinase binding specificity, we compared the distribution of position weight matrix (PWM) [16] similarity scores FC between Gain, Loss and No Impact sets with no significant differences (Figure 3B). These results indicated that limited knowledge of kinase–substrate relationships constrained the predicting abilities of existing models [11, 15, 16, 26, 27]. To achieve better predictions, it is critical to investigate more aspects of features to reflect complicated molecular mechanisms of VIP events.

Figure 3

Construction of machine learning models to predict VIP events. (A) ROC curve of two classical computational predictors, MIMP and PhosphoPICK-SNP, on the Gain and Loss sets. The AUCs are indicators of the poor performance of these methods in screening VIP events. ROC, receiver operating characteristic. AUC, area under curve. (B) PWM similarity score FC caused by the nsSNV was calculated based on the sequence in the range of ±7 AAs at the phosphosite. Boxes represent the first to the third quantile with the center on the median value. PWM, position weight matrix. FC, fold change. AA, amino acid. (C) Comparison of distinct features between the Gain or Loss set and the No Impact set. All P-values were calculated by a two-tailed Mann–Whitney test (*P < 0.05; **P < 0.01; ***P < 0.001). (D) AUCs (bars) of 29 features in four groups (public data evidence, structural environment, evolutionary conservation, and site regulation) after 5-fold cross-validation between Gain/Loss and No Impact sets. The averaged SHAP values (dots) were calculated to represent the weight of each feature to the Gain model and Loss model. (E, F) Ten iterations of a stratified 5-fold cross-validation ROC curve of the Gain (E) and Loss (F) model. The mean ROC curve corresponds to the curve, with the shaded region corresponding to ±1 SD.

Open in new tab Download slide

To improve prediction accuracy, we explored multidimensional features grouped into four categories: public data evidence, structural environment, evolutionary conservation and site regulation (Supplementary Table S3 available online at http://bib.oxfordjournals.org/). To capture effective features, we compared the differences of the above features between the three sets. Apart from sequence-based features, we found that a number of other features including site conservation, residue co-evolution score and 3D distance can also provide information in cataloging VIP events (Figure 3C, Supplementary Figure S4A available online at http://bib.oxfordjournals.org/). Compared with No Impact set, we found that nsSNV–phosphosite pairs in Gain and Loss sets possess higher conservation scores, higher co-evolution scores and shorter 3D distances. The performance of each feature and its separate AUC to the trained models are pictured in Figure 3D (bars). Finally, we selected 29 effective features to predict the impact of nsSNVs on phosphosites.

The selected features were then used to construct XGBoost-based [28] prediction models for Gain and Loss events. The AUC values obtained for the Gain model and Loss model were 0.82 and 0.80, respectively (Figure 3E and F, see Methods for details). The SHAP values measuring property contributions to the predictive ability of each feature incorporated in our models are detailed in Figure 3D (dots). As a result, our selected properties provided effective information to distinguish Gain/Loss events from No Impact events.

To validate that our model can distinguish Gain and Loss events, we applied the Gain model to the Loss dataset and the Loss model to the Gain dataset. As shown in Supplementary Figure S4B available online at http://bib.oxfordjournals.org/, the aforementioned models mainly provided near-zero scores for non-corresponding datasets. This result indicated that the two models obtained sufficient information to correctly distinguish Gain sets from Loss sets. Therefore, we combined the Gain model with the Loss model as the final VIPpred model, with the highest score of the two models representing the impact type on phosphorylation status. The tripartite classification model yielded an AUC of 0.8 (Supplementary Figure S4C available online at http://bib.oxfordjournals.org/). We also validated VIPpred’s classification performances with precision–recall curves (Supplementary Figure S4D and E available online at http://bib.oxfordjournals.org/) and they also indicated the robustness of Gain and Loss events prediction of VIPpred.

Unlike existing sequence-based model, our VIPpred model stands out for its meticulously curated high-quality training dataset, comprehensive feature analysis and the selection of an efficient classification model, collectively ensuring enhanced performance (Table 1). However, the lack of experimental data makes model performance comparison between VIPpred and previous tools comprehensively unapproachable. Therefore, we performed such comparison on our dataset among VIPpred, MIMP and Phosphopick-SNP (Supplementary Figure S4D and E available online at http://bib.oxfordjournals.org/). Detailed numbers of accuracy, AUCPR and F1-score are illustrated in Supplementary Figure S4F available online at http://bib.oxfordjournals.org/. Together, these results clearly show that selected properties can provide information to distinguish Gain/Loss events from No Impact events. Our VIPpred model achieves better performance than previously existing methods in predicting VIP events.

Prediction model robustly and reliably predicts VIP events in independent datasets

To assess whether the VIPpred accurately predicts Gain and Loss events across various types of cancer, we employed a cross-validation strategy for eight independent subsets of the VIP dataset. In order to reduce model performance bias caused by unbalanced positive/negative samples in subsets, we skipped three Gain events subsets (CCRCC, OV and PDA) and four Loss events subsets (CCRCC, HNSCC, OV and PDA) containing less than 10 positive samples when picking the independent testing set. As shown in Figure 4A and B, the AUC of cancer subsets were all greater than 0.7, showing that our model was able to robustly predict VIP events across different cancer types. In addition, we merged eight subsets with relevant cancer types into four subgroups and repeated the cross-validation processes. As shown in Supplementary Figure S5A and B available online at http://bib.oxfordjournals.org/, the VIPpred robustly predicted Gain and Loss events across multiple cancer types.

Figure 4

Prediction performance of VIPpred model in independent datasets. (A, B) ROC curves showing the cross-validation results of distinct cancer types based on the Gain model (A) and the Loss model (B). One subset was used as the independent testing set and the seven remaining subsets were used to train the model. Cancer subsets containing less than 10 positive samples were skipped when picking the independent testing set. (C) Bar plot showing the percentages of the VIP events (Gain, Loss and No Impact) in the PTMvar set. (D) Bar plot showing the percentages of the VIP events in the cancer subset and the non-cancer diseases subset.

Open in new tab Download slide

To compare the predicted results by VIPpred with existing databases, we examined our model on PTMvar, which was a database of VIP events provided by PhosphoSitePlus [12]. PTMvar mainly screened VIP events with the sequential distance between nsSNV and phosphosite being ±5 amino acids (AAs) as a threshold. To evaluate over-representation of nsSNVs affecting phosphorylation status, we used VIPpred to assess 20 207 filtered pairs of the PTMvar set. As shown in Figure 4C, only 383 pairs were predicted as Gain events (1.90%) and 145 pairs as Loss events (0.72%), and most effects of disease-related nsSNVs on adjacent phosphosites were predicted to be No Impact events (97.38%). Due to the enhanced predictive performance of our model, we reduced false positives and identified fewer positive samples compared to previous study. Next, we divided the PTMvar set into two subsets: a cancer subset and a non-cancer diseases subset, according to disease annotations. As shown in Figure 4D, VIPpred is prone to predicting cancer-associated VIP events (P-value = 1.38e−3, Fisher exact test), presumably because our training set originates from CPTAC MS/MS datasets.

Together, these findings show that VIPpred is able to robustly predict VIP events in various tumors with higher sensitivity and a low false-positive rate. These results clearly demonstrate that our model effectively meets the requirements for large-scale pan-cancer predictions.

VIP events are closely associated with cancer drivers and protein degradation

To further elucidate the pan-cancer landscape of the nsSNVs impact on phosphosites, we employed the VIPpred model on nsSNV–potential phosphosite (S, T and Y residues) pairs across 33 tumor types from the TCGA and in 10 tumor types from the CPTAC. We identified 39 632 Gain events and 17 086 Loss events. Although only 2.5% of all predicted nsSNVs affected phosphorylation status, these events were distributed on 5139 proteins and accounted for 27.8% of all predicted proteins (Figure 5A). These results suggested that VIP events were widespread in tumor cells.

Figure 5

Pan-cancer landscape of VIP events and their association with the cancer driver genes. (A) The percentages of variants and proteins identified as Gain/Loss events using the VIPpred model in the TCGA and CPTAC dataset. (B) The percentages of cancer driver genes in the Gain, Loss and No Impact sets predicted by the original VIPpred model and the w/o driver VIPpred model. (C) The boxplot comparing the half-lives of the proteins in the distinct VIP classification from the pan-cancer sets and the cancer driver gene sets. (D) GO enrichment of cancer driver genes with Gain and Loss events. The top panel shows the difference of the significantly enriched gene sets between Gain events and Loss events. The bottom panel represents the enriched significance (P-value) of top GO terms specifically enriched in Gain sets. GO, Gene Ontology.

Open in new tab Download slide

We hypothesized that VIP events play an essential role in carcinogenesis. To test this hypothesis, we analyzed the percentages of Gain, Loss and No Impact events in known cancer driver genes. We collected the cancer driver genes from the COSMIC Cancer Gene Census [29], IntOGen [30, 31] and the study of Bailey et al. [32]. As shown in Figure 5B, the percentages of cancer driver genes in both Gain and Loss sets were significantly higher than in No Impact sets (P_Gain = 2.01e−24, P_Loss = 8.12e−21, Fisher exact test). To avoid the potential preference of cancer driver genes in the training set, we removed the cancer driver genes in the training set and built a new w/o driver VIPpred model. Our analysis also showed that cancer driver genes were significantly enriched in the Gain and Loss sets (Figure 5B, P_Gain = 1.30e−22, P_Loss = 3.74e−21, Fisher exact test). We further assessed the biological functions that are enriched among Gain events and Loss events predicted by our VIPpred model. We found that the proteins with Gain and Loss events were enriched in cancer-related signaling pathways in multiple cancer types, including transcription factor binding, protein kinase binding and cell division for Gain events, and kinase activity and cell–cell adhesion pathway for Loss events (Supplementary Figure S6 available online at http://bib.oxfordjournals.org/).

Previous studies showed that rapid protein degradation enables cells to quickly modulate protein abundance [33], while dysregulation of short-lived proteins plays essential roles in carcinogenesis [34]. We collected human protein half-life data from previous studies [35] and investigated protein half-life differences for proteins with Gain, Loss and No Impact events. As shown in Figure 5C, Gain events tend to occur in proteins with short half-lives compared to Loss and No Impact events. This trend is also significant in cancer drivers. Considering that a previous study showed that degron-related driver mutations were enriched in short half-life proteins [36], we hypothesized that Gain events are associated with protein degradation. To test our hypothesis, we identified and compared enriched functional sets of Gain and Loss events using cancer drivers as background. We found that Gain events were specifically enriched in ubiquitin-dependent protein catabolic process (Figure 5D).

Together, these results show that VIP events constitute a widespread phenomenon in cancer cells. Our study on protein half-life and functional analysis suggests that Gain events associate with the degradation of cancer driver proteins in cancer cells.

Protein stability changes caused by Gain events drive carcinogenesis

Next, we investigated how Gain events on cancer driver genes trigger carcinogenesis. As Gain events are enriched in ubiquitin-dependent protein catabolic process and proteins with short half-lives, we hypothesized that an nsSNV-induced increase in phosphorylation-level influences cancer driver protein stability, which in turn leads to their dysfunction. To test this hypothesis, we first analyzed the percentages of Gain events associated with the degron region on cancer driver genes. We collected the degron region data of the whole genome provided by previous works [36, 37]. Compared to No Impact events, Gain events are significantly enriched in degron regions (P = 4.01e−43, Figure 6A). Previous studies found that a majority of mutations in degron region can directly affect the binding of E3 ubiquitin ligase [37]. Thus, we excluded nsSNVs in degron regions and repeated the comparison of Gain events and No Impact events. As shown in Figure 6A, Gain events are still significantly enriched (P_PsiteIn = 2.16e−18; P_{PsiteFlanking} = 2.76e−53), regardless of whether phosphosites lie in degron regions (PsiteIn) or adjacent to degron regions (10AA upstream or downstream, PsiteFlanking).

Figure 6

Gain events affect cancer driver protein stability. (A) Bar plot showing the percentages of three types of degron-related Gain and No Impact events (left to right): (1) nsSNVs or Psites located in degron regions; (2) Psites located in degron regions while nsSNVs not; (3) Psites located 10AA upstream or downstream of the degron regions while nsSNVs not located in degron regions. Psite, phosphosite. (B) Correlation between the phosphorylation level and the changes in stability of proteins. The x-axis represents the stability-phosphorylation correlation (STABcorr) coefficients and the y-axis represents the correlation P-values. The horizontal line in the y-axis represents the point of statistical significance in the plot (correlation P-values < 0.05). (C) Box plot comparing VIPpred scores between the STABgain with different correlation P-values. The x-axis represents the P-values of STABcorr. STABgain with P-values > 0.05 were considered as the non-significant group. VIPpred scores of two significant groups were compared with the non-significant group using a two-tailed Mann–Whitney test. (D–F) Protein abundances of the oncogenes CTNNB1 (D) and BRD4 (E) were higher in mutant samples than the mean values of the non-mutant (WT) samples. D32X refers to D32H/Y/A/G/V. In contrast, protein abundance of the tumor suppressor gene TSC1 (F) was lower in mutant samples than the mean values of the non-mutant (WT) samples.

Open in new tab Download slide

We next examined how the stability of cancer driver proteins with Gain events and the expression abundance of phosphosites correlate. To this end, we obtained genomic, transcriptomic and proteomic profiles from the CPTAC project. We calculated the protein stability alterations of Gain events with a robust regression between the levels of each protein and the corresponding mRNA levels [37] (Methods). To ensure that phosphorylation changes are not primarily caused by differences in protein abundance, we adjusted the phosphorylation levels in our datasets according to the corresponding protein abundance ratios. Next, we calculated the correlations (STABcorr) between stability changes and phosphorylation levels in at least 30 samples of each cancer type. As shown in Figure 6B, only pairs with P < 0.05 were identified as stability-associated Gain events (STABgain). The positive correlation of STABgain indicated that the protein stability was increased because of higher phosphorylation levels induced by nsSNVs and vice versa. For instance, VIPpred model identified CTNNB1 pS33 as a STABgain, consistent with previous studies that pS33 induces abnormal protein abundance of the oncogene CTNNB1 in multiple cancer types. We also found that pS191 contained a higher STABcorr than pS33, which suggested that the Gain event of pS191 promoted protein stability. As shown in Figure 6C, we found that proteins with higher STABcorr were associated with higher VIPpred scores, providing evidence that the VIPpred model is able to identify Gain events that affect protein stability in cancer driver genes.

Finally, we validated that Gain events affected the degradation of cancer driver genes using the TCGA reverse-phase protein assay (RPPA) data. As shown in Figure 6D and E, we found that in the samples predicted to contain a STABgain on the oncogenes CTNNB1 and BRD4, the corresponding protein abundances are higher than the mean values of the non-mutant (WT) samples. In contrast, we found that the tumor suppressor gene TSC1 contained decreased protein abundance (Figure 6F). Together, these results are consistent with the established fact that the abnormal increase and activation of proteins encoded by oncogenes and the decrease and loss of function of proteins encoded by tumor suppressor genes are recognized as one of the mechanisms of carcinogenesis.

DISCUSSION

In this study, we systematically identified the effects of nsSNVs on phosphorylation across multiple cancer types by reprocessing MS/MS data to produce the first high-throughput VIP dataset with spectral evidence. Based on the obtained VIP dataset, we further integrated multidimensional features and constructed a novel machine learning model VIPpred, which displays better performance than existing sequence-based predicting methods. These VIP predicting methods are limited by known kinase–substrate binding patterns [15, 16]. An example is the increase of thyroid hormone receptor-associated protein 3 (THRAP3) pS672 in UCEC due to nsSNV E674D observed in MS/MS data, yet MIMP and PhosphoPICK-SNP failed to predict S672 phosphorylation-level changes without known specific kinases. Another example is that SRSF4 R441P disrupts key amino acid arginine (R) of kinase–substrate binding patterns (RXXpS) in LSCC. However, SRSF4 pS444 displayed stable phosphorylation abundances in MS/MS data (phosphosite FC/protein FC = 1.17) before and after the occurrence of the nsSNV SRSF4 R441P. This example suggests that alterations in kinase binding specificity is not sufficient to predict VIP events, which indicated the necessity to incorporate multidimensional features in our VIPpred. Previous studies show that evolutionary conservation and site regulation contribute to evaluating functional annotations of genetic variants and phosphosites [6, 38, 39]; our study suggests that such features can also effectively distinguish Gain, Loss and No Impact events. Another study discovered 3D co-clustering of phosphosites and cancer mutations on protein structures [40]; our study suggested that 3D distance is not only a useful predictor in isolation, but its integration facilitated the interpretation of other features within VIPpred.

Our method identified pan-cancer landscape of VIP events enriched in cancer-related pathways and cancer driver genes, indicating the widespread participation of VIP events in carcinogenesis. Phosphorylation dysregulation is known to cause abnormal protein degradation of several cancer drivers [8, 41]. Previous studies suggested that TP53 mutations R282W and R213Q increased TP53 expression levels by disrupting TP53 degradation with pT284 and pS215 phosphorylation alteration [16]. Our results also indicate that Gain events affect the protein degradation of both oncogenes and tumor suppressor genes. Based on these findings, we propose a common oncogenic pathway in which nsSNVs in cancer driver genes increase protein phosphorylation levels, thus resulting in dysregulation of protein degradation. For example, phosphorylation of S76 on programmed cell death protein 4 (PDCD4) leads to increased protein degradation, thus promoting the proliferation of cancer cells [42]. Our VIPpred model predicted a Gain event for PDCD4 pS76 affected by the nsSNV A84T. We propose that PDCD4 with A84T mutation constitutes a potential target for precision medicine.

As our training sets originate from CPTAC MS/MS data, our VIPpred model is more suitable for predicting VIP events in cancer cells. Future studies should include MS/MS data from other diseases to broaden the usage of the VIPpred model. Our VIPpipeline can also expand to other PTMs like acetylation and glycosylation with relevant MS/MS data provided. In this regard, researchers are encouraged to collect MS/MS data enriching other PTMs so as to apply VIPpipeline in investigating variants impact on other possible PTMs. While our study established a novel systematic pipeline to investigate PTM status affected by genetic variants, a number of limitations need to be addressed in future studies. First, although our model has extended the existing predicted range by ±7 AAs to ±14 AAs, the predicted range is still limited to nsSNV-adjacent phosphosites at sequential distances. More complex spatial structural features need to be introduced to study VIP events in intra-protein or inter-protein 3D spatial distance. Second, most degrons used in our research are predicted since very few degrons are experimentally validated. Further research concerning VIP events on protein degradation should depend on more detailed experimental evidence for better understanding.

In summary, the strength of our analysis is based on high-throughput re-annotation on (phospho)proteomics data publicly available. Our collected datasets and VIPpred model constitute valuable resources to researchers in carcinogenesis mechanisms and anti-cancer therapy. Critically, our newly developed VIP analysis process has the potential to be extended to other non-cancer diseases and predict changes of other PTM statuses in cell regulatory mechanism.

METHODS

Collection of CPTAC multi-omics data and clinical information

The original MS/MS data (including tumor samples and adjacent normal samples, ‘mzML’ format) of proteomics and phosphoproteomics were collected from 916 cancer patients including eight cancer types as follows: lung adenocarcinoma (LUAD) [43], lung squamous cell carcinoma (LSCC) [44], head and neck squamous cell carcinoma (HNSCC) [45], colon adenocarcinoma (COAD) [46], clear cell renal cell carcinoma (CCRCC) [47], uterine corpus endometrial carcinoma (UCEC) [48], pancreatic ductal adenocarcinoma (PDA) [49] and ovarian serous cystadenocarcinoma (OV) [50]. The information on somatic mutations of all patients with eight cancer types was downloaded from TCGA Data Portal (https://portal.gdc.cancer.gov/) [51]. The downloaded somatic mutations data were further filtered by Variant Effect Predictor version 101.0 for variant type ‘missense_variant’. A total of 176 765 nsSNVs were retained.

Reprocessing MS data

Construct custom protein sequence databases

To create the reference sequence database, 75 117 human canonical protein sequences from the UniProt Proteomes database (UP000005640, version November 2020) [52] were digested in a fully tryptic manner in silico. For the variant database construction, amino acid substitutions from collected CPTAC nsSNVs were mapped to reference sequences and digested in silico to generate variant sequences. In the variant sequence database, each variant sequence harbors only one nsSNV. The condition that one variant peptide is harboring more than one nsSNVs was ignored. When multiple nsSNVs were located in the interval between two adjacent tryptic sites, they were separately mapped to the reference sequence to generate variant sequences, each with a single nsSNV. After in silico digestion, sequences ranging in length from 7 to 50 were retained in accordance with the default parameter of the following search engine. Each sequence was extended by two fully tryptic digested reference peptides on both sides. Prolonged variant sequences and the corresponding reference sequences form the entire customized database. A non-redundant customized database was built by splitting out the entire sequence produced above for each TMT multiplexed sample.

MS/MS searching

MSFragger version 3.4 [20] was used to search customized databases appended with an equal number of decoy sequences. MS/MS spectra were searched using a precursor-ion mass tolerance of 20 ppm, a fragment mass tolerance of 0.5 Da and allowing for C12/C13 isotope errors (−1/0/1/2/3 for global, and 0/1/2/3 for phosphopeptide-enriched). Cysteine carbamidomethylation (+57.0215) and lysine TMT labeling (+229.1629) were specified as fixed modifications, and methionine oxidation (+15.9949), N-terminal protein acetylation (+42.0106) and TMT labeling of the peptide N terminus. The phosphorylation modifications of serine, threonine and tyrosine residues (+79.9663) and TMT labeling on serine residues were specified as variable modifications. The search was restricted to semi-tryptic peptides, allowing for up to two missed cleavage sites. The search results were then processed using the Philosopher toolkit version v4.1.1 [21], including PeptideProphet [53], ProteinProphet [54] and PTMProphet [22]. The data were filtered to 5% ion-level, 5% peptide-level and 5% PSM-level FDR using Philosopher filter command (Supplementary Table S1 available online at http://bib.oxfordjournals.org/).

PepQuery validation

The identified variant peptides were further validated using PepQuery (http://pepquery.org/) [23]. For each TMT multiplexed dataset, the database searching parameters described above were used. Hyperscore was used for PSM scoring, unrestricted modification searching-based filtering was enabled and only PSMs with a P-value ≤ 0.01 were considered valid identifications.

Annotation of VIP types

Labelquant method included in Philosopher toolkit was used to generate quantification matrices from phosphoproteome. Proteome is used for variant peptide verification, particularly in the case of Loss events. Loss events include variant-driven phosphosite loss, which makes it difficult to detect the corresponding nsSNVs in phosphoproteomics since the variant without phosphorylation cannot be enriched. We added proteomics data to verify these mutations in order to prove their authenticity. Case ID was matched to TMT channels and parameters were adjusted, with a minimum peptide probability of 0.5 for quantification and a minimum site localization probability of 0.5 (phosphoproteomics datasets only). The quantification results (log2 ratios) were summarized at the phosphosite (Psite) level. Next, the quantification results of the tumor samples were divided into the Psite levels of variant peptides (TumorVarPsite) and Psite levels of reference peptides (TumorRefPsite). Similarly, the quantification results of the normal samples were divided into NormalVarPsite and NormalRefPsite.

Abnormal events where NormalVarPsite was significantly larger than TumorVarPsite were removed. ‘On’ events were defined as nsSNVs occurring at the same location of the phosphosites that directly alter the modification sites. ‘No Impact’ events were defined as having no significant differences between paired RefPsite (both TumorRefPsite and NormalRefPsite) and VarPsite. ‘Gain/Loss’ events were defined as VarPsite being significantly up-regulated (FC ≥ 2) or down-regulated (FC ≤ 1/2) compared to RefPsite due to the corresponding nsSNV. In addition, FC in phosphorylation levels is required to be disproportionate to the magnitude of protein abundance FC (phosphosite FC/protein FC ≥2 for Gain events and vice versa for Loss events). The complete thresholds of each event and all VIP events are shown in Supplementary Figure S3B available online at http://bib.oxfordjournals.org/.

Feature extraction and model construction

Four categories of characteristics (public data evidence, structural environment, evolutionary conservation and site regulation) are listed in Supplementary Table S3 available online at http://bib.oxfordjournals.org/. Changes in amino acid properties were calculated using SNVbox [55], including AA residue volume, AA hydrophobicity, AA polarity and AA Grantham distance. Protein secondary structure, solvent accessibility, IDR score in intrinsic disorder region and LCD score in low complexity region were downloaded from the DESCRIBEPROT database (http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/main.php) [56]. The kinase PWM similarity score was computed to assess the kinase–substrate recognition using the MATCH algorithm [57]. The upstream and downstream PTMs of the phosphosites were collected from the PhosphoSitePlus database [12] and the dataset published by Pedro Laboratory [6]. The site co-evolution information was based on the mammalian MSA results from the eggNOG5 database [58] and the Python package was used to calculate the mutual information between sites. The 3D distance was calculated based on Alphafold2 prediction results of the linear distance between α-carbon atoms [55]. When comparing the distribution of 3D distance features among the Gain, Loss and No Impact sets, the 3D distance subsets are downsampled according to the 1D distribution.

The above 29 features were used to construct the classification model using the Python package XGBoost [28], a gradient-boosting decision tree algorithm with high efficiency and exemplary performance in handling tabular data. Since the different tools introduced in Supplementary Table S3 available online at http://bib.oxfordjournals.org/ have different restrictions on the input data, there are some missing values when calculating features. The XGBoost algorithm includes a mechanism for dealing with missing values. As a result, our models are more forgiving of the input data. The training set was obtained from re-annotation of mass spectrum data by the above procedure, divided into Gain, Loss and No Impact subsets. The Gain/Loss model was trained based on No Impact subsets as negative samples and Gain/Loss subsets as positive samples, respectively. Since the number of positive samples in both training sets was about one-fifth the number of negative samples, parameter scale_pos_weight was set to 5, and other parameters were set as default. The model performance was assessed based on a 5-fold cross-validation of 10 iterations (random starting values were set from 1 to 10, five rounds of training were performed for each). The assessment results were displayed by the receiver operator characteristic (ROC) curve, area under the curve (AUC) score and area under precision recall curve (AUCPR). The SHAP (Shapley Additive Explanation) value in the Python package was used to quantify the contribution of features [59]. Furthermore, the Gain and Loss models were combined as the VIPpred model. The highest score was taken as the prediction score of the combined model. The ROC curve of tripartite classification was plotted using the one class versus the rest methodology.

To assess the robustness of the model across different tumor types, our VIP dataset was divided into eight subsets by cancer type and cross-validation is performed. Each subset was selected as the test set and other cancer types subsets were used as the training set. Specifically, cancer subsets with at least 10 positive samples were selected. To further exclude bias of the result caused by the different number of samples, eight cancer types were regrouped into four subsets by cancer type relevance for the same cross-validation.

Prediction on the TCGA Pan-Cancer Atlas set

The TCGA Pan-Cancer Atlas data were collected from the MC3 project (https://gdc.cancer.gov/about-/publications/MC3-2017), and only nsSNVs were preserved [51]. Totally, 1 331 796 nsSNVs from CPTAC and TCGA were collected. All potential phosphosites (S, T, Y) within the range of ±14 AAs were combined with nsSNVs. Finally, 6 412 519 candidate pairs were obtained for model prediction.

Functional enrichment analysis

To assess enriched significance in the Gene Ontology (GO) functional annotation dataset, proteins of Gain events and Loss events were functionally annotated. To assess the enriched functions of VIP datasets in specific cancer types, proteins with VIP events were divided based on the cancer type annotation. GO enrichment was conducted using hypergeometric test for Gain or Loss subsets. The color and size of the nodes in the enrichment graph represent the significance of functional enrichment. For each cancer type, the significance of gene sets enriched by Gain events or Loss events was defined according to the adjusted P-value <0.01.

Calculation of protein stability and its correlation to phosphorylation abundance

Previously predicted degron regions [36, 37] were used in this study. The mean values for the half-life of proteins used for further analysis in this study were derived from previously published data [35].

The calculation of protein stability changes was based on the method used by Martínez-Jiménez et al. [37]. Robust regression between the levels of each protein and the corresponding mRNA across non-mutated samples were computed as expected protein levels from the observed mRNA level. The distance between the observed protein level and the expected value was re-scaled to account for the standard deviation of the protein and mRNA levels, referred to as the stability change of the protein. The CPTAC transcriptome expression profiles were all collected from TCGA Data Portal. CPTAC matched proteome abundances derived from MS/MS data were collected from CPTAC Data Portal. TCGA proteome abundances were collected from RPPA. Phosphorylation level was corrected by protein abundance, which was referred to as relative phosphorylation abundance. Relative phosphorylation abundance of Gain events phosphosites was used to calculate Spearman correlation with protein stability. Finally, the phosphosites with more than 30 samples were considered. Next, the sites with significant correlation (P-value <0.05) were defined as stability-associated Gain events (STABgain). The correlation was denoted as STABcorr. STABcorr coefficients and their P-values were illustrated in the scatter plot.

Key Points

We identified 740 experimentally supported VIP events by reprocessing MS/MS data provided by CPTAC.
We developed a robust and accurate machine learning model VIPpred, based on multidimensional features, to predict different types of VIP events.
Pan-cancer analysis with VIPpred shows that Gain/Loss events are enriched in cancer-related pathways and cancer driver genes.
We found that Gain events tend to inhibit the protein degradation of oncogenes and promote tumor suppressor proteins degradation.

ACKNOWLEDGEMENTS

The authors want to thank Dr. Bo Wen, the developer of PepQuery from University of Michigan, as well as Dr. Yan Fu and Mr. Zhiyuan Cheng from Chinese Academy of Sciences, for their helpful discussions during our work. This work was supported by the High-Performance Computing Platform of Peking University.

AUTHOR CONTRIBUTIONS

T.L. and X.X. designed the research; X.X. and Y.L. performed the research; X.X., Y.L., C.H. and Y.Z. analyzed data; X.X., Y.L., T.C., L.Y., P.Z., Y.Z., C.H. and T.L. wrote the paper.

FUNDING

National Key Research and Development Program of China (Grant Nos. 2021YFF1200900, 2018YFA0507504) , the National Natural Science Foundation of China (Grant No. 32070666), and the Shanxi Province Science Foundation for Youths (Grant No. 20210302123092).

DATA AVAILABILITY

We established a website (http://bioinfolilab.phasep.pro/vippred) including (1) MS data reprocessing results by Philosopher and PepQuery, (2) codes for VIPpred model and (3) pan-cancer landscape predicted by VIPpred. VIPpred GitHub repository is also available at https://github.com/TingtingLiGroup/VIPpred/. The computational analyses were performed in Python 3.8.8.

Author Biographies

Xiaofeng Xu is currently pursuing a PhD degree in bioinformatics at the School of Basic Medical Sciences, Peking University Health Science Center. His research interests involve bioinformatics and protein post-translational modification.

Ying Li is a lecturer at the Department of Computer Science and Technology, Taiyuan University of Technology. Her research interests involve bioinformatics, machine learning, tumor heterogeneity and protein post-translational modification.

Taoyu Chen is currently pursuing a PhD degree in bioinformatics at the School of Basic Medical Sciences, Peking University Health Science Center. His research interests involve bioinformatics and liquid–liquid phase separation.

Chao Hou is currently pursuing a PhD degree in bioinformatics at the School of Basic Medical Sciences, Peking University Health Science Center.

Liang Yang is currently pursuing a Master’s degree in bioinformatics at the School of Basic Medical Sciences, Peking University Health Science Center.

Peiyu Zhu is currently pursuing a Master’s degree in bioinformatics at the School of Basic Medical Sciences, Peking University Health Science Center.

Yi Zhang is currently pursuing a PhD degree in bioinformatics at the School of Basic Medical Sciences, Peking University Health Science Center.

Tingting Li is a Principal Investigator at the School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China. Her research interests involve bioinformatics, machine learning, protein post-translational modification and liquid–liquid phase separation.

References

Hanahan

Weinberg

The hallmarks of cancer

Cell

2000

;

100

–

Hanahan

Weinberg

Hallmarks of cancer: the next generation

Cell

2011

;

144

646

–

Sanchez-Vega

Mina

Armenia

, et al.

Oncogenic signaling pathways in the Cancer Genome Atlas

Cell

2018

;

173

321

–

337.e310

Raimondi

Passemiers

Fariselli

Moreau

Current cancer driver variant predictors learn to recognize driver genes instead of functional variants

BMC Biol

2021

;

Kumar

Warrell

, et al.

Passenger mutations in more than 2,500 cancer genomes: overall molecular functional impact and consequences

Cell

2020

;

180

915

–

927.e16

Ochoa

Jarnuczak

Vieitez

, et al.

The functional landscape of the human phosphoproteome

Nat Biotechnol

2020

;

365

–

Lundby

Franciosa

Emdal

, et al.

Oncogenic mutations rewire signaling pathways by switching protein recruitment to phosphotyrosine sites

Cell

2019

;

179

543

–

560.e26

Gully

Velazquez-Torres

Shin

, et al.

Aurora B kinase phosphorylates and instigates degradation of p53

Proc Natl Acad Sci U S A

2012

;

109

E1513

–

Frankenreiter

Gahr

Schmid

, et al.

Phospho-site mutations in transcription factor suppressor of hairless impact notch signaling activity during hematopoiesis in Drosophila

Front Cell Dev Biol

2021

;

:658820.

Google Scholar

OpenURL Placeholder Text

WorldCat

10.

Peng

, et al.

PTMsnp: a web server for the identification of driver mutations that affect protein post-translational modification

Front Cell Dev Biol

2020

;

:593661.

Google Scholar

OpenURL Placeholder Text

WorldCat

11.

Ren

Jiang

Gao

, et al.

PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation

Mol Cell Proteomics

2010

;

623

–

12.

Hornbeck

Zhang

Murray

, et al.

PhosphoSitePlus, 2014: mutations, PTMs and recalibrations

Nucleic Acids Res

2015

;

D512

–

13.

Krassowski

Paczkowska

Cullion

, et al.

ActiveDriverDB: human disease mutations and genome variation in post-translational modification sites of proteins

Nucleic Acids Res

2018

;

D901

–

14.

Luo

, et al.

dbPTM in 2022: an updated database for exploring regulatory networks and functional associations of protein post-translational modifications

Nucleic Acids Res

2022

;

D471

–

15.

Patrick

Kobe

Le Cao

, et al.

PhosphoPICK-SNP: quantifying the effect of amino acid variants on protein phosphorylation

Bioinformatics

2017

;

1773

–

16.

Wagih

Reimand

Bader

MIMP: predicting the impact of mutations on kinase-substrate phosphorylation

Nat Methods

2015

;

531

–

17.

Olsen

Vermeulen

Santamaria

, et al. .

Quantitative Phosphoproteomics Reveals Widespread Full Phosphorylation Site Occupancy During Mitosis

Sci. Signal

2010

;

Google Scholar

OpenURL Placeholder Text

WorldCat

18.

Song

Liu

, et al.

Systematic Analysis of Protein Phosphorylation Networks From Phosphoproteomic Data

Molecular & Cellular Proteomics

2012

;

1070

–

Google Scholar

Crossref

WorldCat

19.

Obenauer

Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs

Nucleic Acids Research

2003

;

3635

–

20.

Kong

Leprevost

Avtonomov

, et al.

MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics

Nat Methods

2017

;

513

–

21.

Veiga Leprevost

Haynes

Avtonomov

, et al.

Philosopher: a versatile toolkit for shotgun proteomics data analysis

Nat Methods

2020

;

(

869

–

22.

Shteynberg

Deutsch

Campbell

, et al.

PTMProphet: fast and accurate mass modification localization for the trans-proteomic pipeline

J Proteome Res

2019

;

4262

–

23.

Wen

Wang

Zhang

PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations

Genome Res

2019

;

485

–

24.

Thompson

Schafer

Kuhn

, et al.

Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS

Anal Chem

2003

;

1895

–

904

25.

Vasaikar

Straub

Wang

Zhang

LinkedOmics: analyzing multi-omics data within and across 32 cancer types

Nucleic Acids Res

2018

;

D956

–

26.

Patrick

Le Cao

Kobe

, et al.

PhosphoPICK: modelling cellular context to map kinase-substrate phosphorylation events

Bioinformatics

2015

;

382

–

27.

Simpson

Zhang

Hornbeck

Gnad

Systematic analysis of the intersection of disease mutations with protein modifications

BMC Med Genomics

2019

;

109

28.

Chen

Guestrin

. Xgboost: a scalable tree boosting system.

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

2016

785

–

29.

Tate

Bamford

Jubb

, et al.

COSMIC: the catalogue of somatic mutations in cancer

Nucleic Acids Res

2019

;

D941

–

30.

Martinez-Jimenez

Muinos

Sentis

, et al.

A compendium of mutational cancer driver genes

Nat Rev Cancer

2020

;

555

–

31.

Gonzalez-Perez

Perez-Llamas

Deu-Pons

, et al.

IntOGen-mutations identifies cancer drivers across tumor types

Nat Methods

2013

;

1081

–

32.

Bailey

Tokheim

Porta-Pardo

, et al.

Comprehensive characterization of cancer driver genes and mutations

Cell

2018

;

173

371

–

385.e318

33.

Christiano

Arlt

Kabatnik

, et al.

A systematic protein turnover map for decoding protein degradation

Cell Rep

2020

;

:108378.

Google Scholar

OpenURL Placeholder Text

WorldCat

34.

Cai

Vaites

, et al.

Proteome-wide mapping of short-lived proteins in human cells

Mol Cell

2021

;

4722

–

4735.e5

35.

Mathieson

Franken

Kosinski

, et al.

Systematic analysis of protein turnover in primary cells

Nat Commun

2018

;

689

36.

Hou

Wang

, et al.

Systematic prediction of degrons and E3 ubiquitin ligase binding via deep learning

BMC Biol

2022

;

162

37.

Martinez-Jimenez

Muinos

Lopez-Arribillaga

, et al.

Systematic analysis of alterations in the ubiquitin proteolysis system reveals its contribution to driver mutations in cancer

Nat Cancer

2020

;

122

–

38.

Liu

Jian

Boerwinkle

dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions

Hum Mutat

2011

;

894

–

39.

Liu

Mou

, et al.

dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs

Genome Med

2020

;

103

40.

Huang

Scott

Zhou

, et al.

Spatially interacting phosphorylation sites and mutations in cancer

Nat Commun

2021

;

2313

41.

Liu

Semenov

, et al.

Control of beta-catenin phosphorylation/degradation by a dual-kinase mechanism

Cell

2002

;

108

837

–

42.

Dorrello

Peschiaroli

Guardavaccaro

, et al.

S6K1- and betaTRCP-mediated degradation of PDCD4 promotes protein translation and cell growth

Science

2006

;

314

467

–

43.

Gillette

Satpathy

Cao

, et al.

Proteogenomic characterization reveals therapeutic vulnerabilities in lung adenocarcinoma

Cell

2020

;

182

200

–

225.e235

44.

Satpathy

Krug

Jean Beltran

, et al.

A proteogenomic portrait of lung squamous cell carcinoma

Cell

2021

;

184

4348

–

4371.e4340

45.

Huang

Chen

Savage

, et al.

Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma

Cancer Cell

2021

;

361

–

379.e316

46.

Vasaikar

Huang

Wang

, et al.

Proteogenomic analysis of human colon cancer reveals new therapeutic opportunities

Cell

2019

;

177

1035

–

1049.e1019

47.

Clark

Dhanasekaran

Petralia

, et al.

Integrated proteogenomic characterization of clear cell renal cell carcinoma

Cell

2019

;

179

964

–

983.e931

48.

Dou

Kawaler

Cui Zhou

, et al.

Proteogenomic characterization of endometrial carcinoma

Cell

2020

;

180

729

–

748.e726

49.

Cao

Huang

Cui Zhou

, et al.

Proteogenomic characterization of pancreatic ductal adenocarcinoma

Cell

2021

;

184

5031

–

5052.e5026

50.

Pan

Shah

, et al.

Integrated proteomic and glycoproteomic characterization of human high-grade serous ovarian carcinoma

Cell Rep

2020

;

:108276.

Google Scholar

OpenURL Placeholder Text

WorldCat

51.

Cancer Genome Atlas Research N

Weinstein

Collisson

, et al.

The Cancer Genome Atlas pan-cancer analysis project

Nat Genet

2013

;

1113

–

52.

UniProt C

UniProt: the Universal Protein Knowledgebase in 2023

Nucleic Acids Res

2023

;

D523

–

Crossref

PubMed

WorldCat

53.

Keller

Nesvizhskii

Kolker

Aebersold

Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search

Anal Chem

2002

;

5383

–

54.

Nesvizhskii

Keller

Kolker

Aebersold

A statistical model for identifying proteins by tandem mass spectrometry

Anal Chem

2003

;

4646

–

55.

Jumper

Evans

Pritzel

, et al.

Highly accurate protein structure prediction with AlphaFold

Nature

2021

;

596

583

–

56.

Zhao

Katuwawala

Oldfield

, et al.

DescribePROT: database of amino acid-level protein structure and function predictions

Nucleic Acids Res

2021

;

D298

–

308

57.

Kel

Gossling

Reuter

, et al.

MATCH: a tool for searching transcription factor binding sites in DNA sequences

Nucleic Acids Res

2003

;

3576

–

58.

Huerta-Cepas

Szklarczyk

Heller

, et al.

eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses

Nucleic Acids Res

2019

;

D309

–

59.

Lundberg

Lee

S-I

. A unified approach to interpreting model predictions.

Advances in Neural Information Processing Systems

2017

Author notes

Xiaofeng Xu and Ying Li contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Download all slides

Month:	Total Views:
December 2023	29
January 2024	504
February 2024	71
March 2024	100
April 2024	96
May 2024	102
June 2024	106
July 2024	40
August 2024	75
September 2024	109
October 2024	27
November 2024	61
December 2024	53
January 2025	45
February 2025	57
March 2025	66
April 2025	43
May 2025	4

Article Contents

VIPpred: a novel model for predicting variant impact on phosphorylation events driving carcinogenesis

Abstract

INTRODUCTION

RESULTS

An overview of the study

VIPpipeline identified reliable nsSNV–phosphosite pairs

Annotation of Gain or Loss effects of nsSNVs on phosphorylation levels

Multidimensional feature-based prediction model predicts VIP events at high accuracy

Prediction model robustly and reliably predicts VIP events in independent datasets

VIP events are closely associated with cancer drivers and protein degradation

Protein stability changes caused by Gain events drive carcinogenesis

DISCUSSION

METHODS

Collection of CPTAC multi-omics data and clinical information

Reprocessing MS data

Construct custom protein sequence databases

MS/MS searching

PepQuery validation

Annotation of VIP types

Feature extraction and model construction

Prediction on the TCGA Pan-Cancer Atlas set

Functional enrichment analysis

Calculation of protein stability and its correlation to phosphorylation abundance

ACKNOWLEDGEMENTS

AUTHOR CONTRIBUTIONS

FUNDING

DATA AVAILABILITY

Author Biographies

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

VIPpred: a novel model for predicting variant impact on phosphorylation events driving carcinogenesis

Abstract

INTRODUCTION

RESULTS

An overview of the study

VIPpipeline identified reliable nsSNV–phosphosite pairs

Annotation of Gain or Loss effects of nsSNVs on phosphorylation levels

Multidimensional feature-based prediction model predicts VIP events at high accuracy

Prediction model robustly and reliably predicts VIP events in independent datasets

VIP events are closely associated with cancer drivers and protein degradation

Protein stability changes caused by Gain events drive carcinogenesis

DISCUSSION

METHODS

Collection of CPTAC multi-omics data and clinical information

Reprocessing MS data

Construct custom protein sequence databases

MS/MS searching

PepQuery validation

Annotation of VIP types

Feature extraction and model construction

Prediction on the TCGA Pan-Cancer Atlas set

Functional enrichment analysis

Calculation of protein stability and its correlation to phosphorylation abundance

ACKNOWLEDGEMENTS

AUTHOR CONTRIBUTIONS

FUNDING

DATA AVAILABILITY

Author Biographies

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only