Abstract

Proteomic studies characterize the protein composition of complex biological samples. Despite recent advancements in mass spectrometry instrumentation and computational tools, low proteome coverage and interpretability remains a challenge. To address this, we developed Proteome Support Vector Enrichment (PROSE), a fast, scalable and lightweight pipeline for scoring proteins based on orthogonal gene co-expression network matrices. PROSE utilizes simple protein lists as input, generating a standard enrichment score for all proteins, including undetected ones. In our benchmark with 7 other candidate prioritization techniques, PROSE shows high accuracy in missing protein prediction, with scores correlating strongly to corresponding gene expression data. As a further proof-of-concept, we applied PROSE to a reanalysis of the Cancer Cell Line Encyclopedia proteomics dataset, where it captures key phenotypic features, including gene dependency. We lastly demonstrated its applicability on a breast cancer clinical dataset, showing clustering by annotated molecular subtype and identification of putative drivers of triple-negative breast cancer. PROSE is available as a user-friendly Python module from https://github.com/bwbio/PROSE.

Introduction

Practical challenges in proteomic analyses and interpretation

Proteomics, the high-throughput analysis of proteins in a sample, enables deep study of functional factors underlying a biological condition or phenotype [1, 2]. Liquid chromatography–tandem mass spectrometry (LC–MS/MS) is the most widely used technology in proteomics, supporting rapid and parallelizable analysis of complex biological samples [3]. In a typical proteomic experiment, sample proteins are first digested using a site-specific protease such as trypsin, followed by column-based separation of the resultant peptides, and spectral profiling of ionized peptides/fragments at the MS1 and MS2 (or higher) levels. Acquired spectra are then compared against theoretical spectra to obtain peptide-spectrum matches (PSMs). Confident PSMs, corresponding to component peptides of full-length proteins, are then used to determine and quantify present proteins in each sample.

Recent advances in proteomics have achieved improved ion discrimination [4–6], spectral capture [7, 8] and multiplexing capabilities [9–11]. Nevertheless, missing proteins are still common in many published proteomic datasets [12–15], hindering proper downstream analysis. This arises from a number of reasons, depending on the underlying paradigm. In Data-Dependent Acquisition (DDA), this is largely due to selection of only the highest intensity ions for fragmentation and analysis under MS2 [16]. While Data-Independent Acquisition (DIA) approaches (e.g. via isolation windows for ions [7, 8]) can greatly improve resolution, technical implementation and accurate deconvolution of multiplexed spectra in MS2+ is complicated [17]. Moreover, accurate imputation of these missing values, notably for proteins missing in all replicates, is especially challenging [12, 13]. Certain peptides are also inherently difficult to identify from their spectra, possibly due to extensive post-translational or other modifications (PTMs) [1, 18, 19]. Identifying modified peptides entails a large increase in search space and computational expense, though tools such as DIA-NN [20] are being explored to address this.

Nevertheless, there remains no clear consensus on how to approach these issues of peptide detectability (and hence, proteome coverage) beyond improving spectral capture. Furthermore, developing and deploying more sophisticated LC–MS(n) instrumentation is not always feasible due to high capital and operational costs. Another key consideration is how to handle already-published datasets; there is considerable interest in developing techniques that provide new insights from data generated under older technologies and acquisition paradigms.

Applying candidate prioritization approaches to proteomic data

Candidate gene/protein prioritization approaches show promise in overcoming these challenges. These methods generally use a set of seed elements to score and identify related partners based on similar biological context [21]. Some particularly useful examples of biological context include functional annotation and biological networks [22, 23]. In a biological network, distance or clustering metrics on co-expression or protein–protein interaction networks intuitively allow the identification of related pairs or groups of elements (e.g. functionally related proteins). Candidate prioritization tools that leverage such context, including the ToppGene suite [24] and MaxLink [25, 26], have previously been successful in identifying cancer-linked genes based on rudimentary seed lists. More recently, a benchmark of topology-based methods [27] highlighted the exceptional performance of network propagation techniques such as Random Walk with Restart (on heterogenous networks; RWRH) [28], as well as the network embedding approach Node2Vec [29] in identifying key genes in cerebral small vessel disease. Other approaches include functional class scoring, where individual sets of related network elements (or modules) are scored. For instance, our recent algorithm PROTREC [30], which uses the CORUM protein complexome [31] as functional annotations, was able to successfully recover missing proteins across replicates of HeLa lysates and solid cancer samples, using only protein lists as input.

We propose that candidate prioritization scores can represent stable and biologically meaningful (i.e. phenotype-specific) proteomic signatures [32, 33]. In principle, candidate prioritization techniques present certain key advantages when applied to proteomic analyses: they can be adapted to any instrumentation, are independent of spectra and can be applied to re-analysis of public proteomic datasets, including low-coverage samples. Given such potential, developing more powerful and flexible techniques that leverage biological context may prove highly valuable. However, to our knowledge, no candidate prioritization method explicitly incorporates information about missing proteins in the dataset, despite their informative value [14]. Furthermore, supervised learning approaches on co-expression networks have yet to be explored for either missing protein recovery or enrichment scoring of individual proteins.

In this vein, we developed Proteome Support Vector Enrichment (PROSE), an ensemble machine learning pipeline based on support vector machines (SVMs), for individual protein enrichment scoring against an orthogonal gene co-expression matrix. The resultant enrichment scores are analogous to those obtained via a network-based candidate prioritization approach. A simplified schematic of the analysis pipeline is provided in Figure 1. PROSE takes a simple list of identified (and optionally, unidentified) proteins in a sample as seed sets, before applying a supervised learning approach on a separate gene co-expression matrix, derived from public RNA-Seq data. PROSE outputs an enrichment score for each protein based on first-neighbor correlations (i.e. correlations between all elements in the matrix) [25]. We propose that these enrichment scores can be used as a useful heuristic for missing protein inference, functional analysis or prediction of expression levels. The input protein lists can be derived from a variety of sources, including peptide/protein identifications, relative quantification or differential expression data, allowing for flexible analyses. PROSE is made available as a user-friendly Python module (https://github.com/bwbio/PROSE).

Simple schematic workflow of PROSE. Using simple protein lists, we train several weak SVMs on subsampled data from the corresponding gene co-expression matrix, using the lists as class labels. The distance between learned hyperplanes and each gene in the matrix is averaged and normalized to generate an enrichment score for each corresponding protein. These scores represent a network-based signature for the underlying proteome.
Figure 1

Simple schematic workflow of PROSE. Using simple protein lists, we train several weak SVMs on subsampled data from the corresponding gene co-expression matrix, using the lists as class labels. The distance between learned hyperplanes and each gene in the matrix is averaged and normalized to generate an enrichment score for each corresponding protein. These scores represent a network-based signature for the underlying proteome.

Results

PROSE accurately discriminates between observed and unobserved proteins

While specific correlations between gene and protein expression remain contentious [34], we reasoned that broad mRNA-protein correlations within tissues and cell lines are sufficient to capture conserved patterns of up/downregulation, especially for differentially expressed mRNAs [35]. This may reasonably correspond to the functional expression of individual proteins. As such, our basic implementation of PROSE works on a gene co-expression matrix, constructed from a RNA-seq atlas of 675 cancer cell lines by Klijn et al. [36].

To evaluate PROSE, we applied it to the Mehta et al. HeLa [37] and Li et al. THP1 macrophage [38] DDA datasets, where unique protein groups were reported with an FDR threshold of 0.01. For each replicate in both the HeLa and THP1 DDA datasets, ~3700 protein groups comprising a single protein were observed, whereas ~4300 were observed for the HeLa DIA dataset (Supplementary Figure 1). We excluded known housekeeper genes/proteins [39] from our analyses, reasoning that due to their high abundance, they are readily detected in most proteomic samples [40], and hence unlikely to be specific to a particular phenotype. We benchmarked PROSE against other candidate prioritization implementations that utilize only gene/protein lists, including a UMAP-KNN model, PROTREC, MaxLink, ToppGene, RWRH and Node2Vec (summarized in Table 1; Figure 2A). Further details on their implementation are included in Supplementary Methods. Across all datasets, all methods performed better than random. We observed that PROSE generally outperforms other methods in terms of auROC (mean auROC across replicates >0.8) and correlation to gene expression (mean Spearman correlation coefficient (ρ) across replicates ~0.6) (Figure 2B–D). However, correlation to protein expression is considerably lower (mean ρ ~ 0.3), while MaxLink performs better (mean ρ ~ 0.4).

Table 1

Summary of benchmarked prioritization approaches

MethodApproachOriginal feature spaceFeature space formatBiological features
PROSENearest neighborKlijn et al. RNA-seqWeighted adjacency matrixGene–gene co-expression
UMAP-KNNNetwork embeddingKlijn et al. RNA-seqUMAP embeddingGene–gene co-expression
MaxLinkNearest neighborFunCoup 5NetworkVarious, aggregated
RWRHaNetwork propagationVariousbHet. networkVarious
Node2VeccNetwork embeddingVariousbHet. networkVarious
ToppGenedFunctional annotationToppGene portaldFunctional annotatonVarious
PROTRECModule scoringCORUMProtein setsProtein complexes
MethodApproachOriginal feature spaceFeature space formatBiological features
PROSENearest neighborKlijn et al. RNA-seqWeighted adjacency matrixGene–gene co-expression
UMAP-KNNNetwork embeddingKlijn et al. RNA-seqUMAP embeddingGene–gene co-expression
MaxLinkNearest neighborFunCoup 5NetworkVarious, aggregated
RWRHaNetwork propagationVariousbHet. networkVarious
Node2VeccNetwork embeddingVariousbHet. networkVarious
ToppGenedFunctional annotationToppGene portaldFunctional annotatonVarious
PROTRECModule scoringCORUMProtein setsProtein complexes

The feature space is derived from their original implementations, where possible.

aRWRH was applied as-is from the Zhang et al. benchmarking implementation.

bWe used the curated heterogenous network constructed by Zhang et al.

cNode2Vec embeddings on the Zhang et al. heterogenous network were generated with PecanPy and scored with cosine distance.

dToppGene is implemented through the eponymous webtool, using either default functional annotation groups or co-expression networks only.

Table 1

Summary of benchmarked prioritization approaches

MethodApproachOriginal feature spaceFeature space formatBiological features
PROSENearest neighborKlijn et al. RNA-seqWeighted adjacency matrixGene–gene co-expression
UMAP-KNNNetwork embeddingKlijn et al. RNA-seqUMAP embeddingGene–gene co-expression
MaxLinkNearest neighborFunCoup 5NetworkVarious, aggregated
RWRHaNetwork propagationVariousbHet. networkVarious
Node2VeccNetwork embeddingVariousbHet. networkVarious
ToppGenedFunctional annotationToppGene portaldFunctional annotatonVarious
PROTRECModule scoringCORUMProtein setsProtein complexes
MethodApproachOriginal feature spaceFeature space formatBiological features
PROSENearest neighborKlijn et al. RNA-seqWeighted adjacency matrixGene–gene co-expression
UMAP-KNNNetwork embeddingKlijn et al. RNA-seqUMAP embeddingGene–gene co-expression
MaxLinkNearest neighborFunCoup 5NetworkVarious, aggregated
RWRHaNetwork propagationVariousbHet. networkVarious
Node2VeccNetwork embeddingVariousbHet. networkVarious
ToppGenedFunctional annotationToppGene portaldFunctional annotatonVarious
PROTRECModule scoringCORUMProtein setsProtein complexes

The feature space is derived from their original implementations, where possible.

aRWRH was applied as-is from the Zhang et al. benchmarking implementation.

bWe used the curated heterogenous network constructed by Zhang et al.

cNode2Vec embeddings on the Zhang et al. heterogenous network were generated with PecanPy and scored with cosine distance.

dToppGene is implemented through the eponymous webtool, using either default functional annotation groups or co-expression networks only.

General performance of network approaches applied to DDA and DIA data. (A) Graphical summary of different approaches compared here. (B–D) For each dataset, models were supplied with a random sample of 2000 observed proteins (and 2000 random unobserved proteins for PROSE/KNN); 200 additional holdout proteins from each group were used to determine the area under the receiver operating characteristic curve (auROC) and correlation metrics. Proteins corresponding to annotated human housekeeper genes were excluded. For correlation to gene expression in HeLa, the transcripts per million (TPM) data from the Klijn et al. (2015) dataset were used. Values shown as mean ± SD. (B) Performance for the HeLa DDA dataset (n = 4 replicates). (C) Performance for the THP1-M0 macrophage dataset (n = 3 replicates). (D) Performance for the HeLa directDIA dataset (n = 4 replicates).
Figure 2

General performance of network approaches applied to DDA and DIA data. (A) Graphical summary of different approaches compared here. (BD) For each dataset, models were supplied with a random sample of 2000 observed proteins (and 2000 random unobserved proteins for PROSE/KNN); 200 additional holdout proteins from each group were used to determine the area under the receiver operating characteristic curve (auROC) and correlation metrics. Proteins corresponding to annotated human housekeeper genes were excluded. For correlation to gene expression in HeLa, the transcripts per million (TPM) data from the Klijn et al. (2015) dataset were used. Values shown as mean ± SD. (B) Performance for the HeLa DDA dataset (n = 4 replicates). (C) Performance for the THP1-M0 macrophage dataset (n = 3 replicates). (D) Performance for the HeLa directDIA dataset (n = 4 replicates).

Node2Vec demonstrated the lowest overall performance, which may indicate that this specific embedding is not suitable for our task, despite its previous success [27]. PROSE shows a slight improvement over ToppGene (default implementation) and RWRH, and substantial improvement over ToppGene (co-expression features only), PROTREC and the UMAP-KNN model. Interestingly, PROSE performs comparably with MaxLink, despite its rudimentary approach (where it simply counts the number of edges between a candidate protein and its seed neighbors in the FunCoup network [41]). It was likewise found to outperform more sophisticated methods in detecting disease-associated candidates in another benchmark [21]. However, this may be a result of overrepresentation of highly connected proteins. To investigate further, we re-evaluated MaxLink by adjusting the score of each protein by divided by its respective degree in the FunCoup network, obtaining observed/expected ratios [41]. The resultant decrease in performance (mean auROC from ~0.8 to >0.7 following adjustment) indeed suggests some level of bias for highly connected proteins, which may be addressed by filtering out highly connected nodes, such as those in previous implementations [21, 25]. However, this conservative measure would lead to a considerable decrease in evaluated proteins and, hence, was not applied here.

Repeating the analysis by including known housekeepers did not result in a substantial change in performance for any method (Supplementary Figure 2). PROSE scores are also stable across runs within and across technical replicates (for k = 200 bootstraps, Pearson’s r > 0.96 between replicates, >0.98 within a replicate; see Supplementary Figure 3), supporting their use in generalizable phenotypic characterization. To understand how PROSE performs under various protein dropout scenarios, we repeated the analysis using 1500, 1000, 500, 200 and 100 training proteins. Under simulated random dropout (Supplementary Figure 4), PROSE outperforms other methods in terms of auROC, for 500 or more training proteins, except MaxLink in the THP-1 dataset, where they perform comparably. Another scenario to consider is left-censored data, where low-abundance proteins are not detected, which generally results in detection of the top n proteins by MS signal. Under this simulation (Supplementary Figure 5), all methods except UMAP-KNN appear to perform better by all metrics when supplied with fewer training proteins. Furthermore, performance is broadly improved compared with when trained on randomly sampled proteins, suggesting that important phenotype-specific information may be captured by a few highly expressed non-housekeeper proteins. In all cases, PROSE performs comparably with RWRH and MaxLink (mean auROC >0.85) and has higher correlation to gene expression (mean ρ > 0.7).

PROSE scores capture phenotype-specific features

To visualize the specificity of PROSE scores for the source phenotype, we visualized the correlation of HeLa PROSE scores with corresponding HeLa gene expression (as TPM) and protein expression (as iBAQ) data (Figure 3A) previously reported by Bekker-Jensen et al. [42]. Observed proteins have high PROSE scores, corresponding TPM counts, and iBAQ levels compared with unobserved proteins, most notably producing a positive correlation between PROSE scores and TPM. In contrast, high PROSE scores in HeLa do not correlate strongly to gene expression in a morphologically unrelated NCI-H345 small cell lung carcinoma line. Hence, proteins with high PROSE scores have good transcriptomic and proteomic support for their existence, even if they are not explicitly observed in a given proteomic screen. Generally, we observe that correlations between PROSE score and TPM are highest for cell lines originating from tissues of similar morphology to the uterine cervix epithelium (Figure 3B) and lowest for blood/lung due to their distinct transcriptomic niche [43]. The relatively high rank correlation across all tissues (~>0.5) likely represents a basal proteome/transcriptome composition across human cells. Overall, these results suggest that PROSE accurately captures these phenotypic features.

Further visualization of proteome/transcriptome correlations with PROSE scores. HeLa DDA replicate R1 (from Figure 2) is used as a representative example. (A) Correlation between HeLa PROSE score and underlying transcriptome from HeLa versus the unrelated NCI-H345 small cell lung carcinoma transcriptome, as well as independent iBAQ levels from the Bekker-Jensen et al. (2017) HeLa proteome. Linear (Pearson’s r) and rank (Spearman’s ρ) correlations are annotated. (B) Correlation between PROSE score and transcriptome, aggregated by tissues. The source HeLa tissue (uterine cervix) is highlighted in red. Examples of tissues with cancer cell lines of broadly similar epithelial morphology to HeLa are bolded.
Figure 3

Further visualization of proteome/transcriptome correlations with PROSE scores. HeLa DDA replicate R1 (from Figure 2) is used as a representative example. (A) Correlation between HeLa PROSE score and underlying transcriptome from HeLa versus the unrelated NCI-H345 small cell lung carcinoma transcriptome, as well as independent iBAQ levels from the Bekker-Jensen et al. (2017) HeLa proteome. Linear (Pearson’s r) and rank (Spearman’s ρ) correlations are annotated. (B) Correlation between PROSE score and transcriptome, aggregated by tissues. The source HeLa tissue (uterine cervix) is highlighted in red. Examples of tissues with cancer cell lines of broadly similar epithelial morphology to HeLa are bolded.

We subsequently considered if this phenotypic specificity could be leveraged to predict missing proteins. We considered the set of proteins observed in the HeLa DIA data but missing in the DDA data and compared it to those specific to THP1 samples only (Supplementary Figure 6A and B). We observed that, under both the random and non-random dropout scenarios, class probabilities inferred by PROSE using a logistic regression model allow strong discrimination between HeLa-specific missing proteins and THP1-specific proteins. Further correlative analyses demonstrate that PROSE scores are more strongly correlated to phenotype than acquisition method, with HeLa DDA and DIA clustering together, distinctly from THP1 (Supplementary Figures 7 and 8), as compared with MaxLink and RWRH.

Reanalysis of the CCLE proteomics dataset preserves key biological features

To further assess PROSE scores as a potential heuristic, we applied it to the Cancer Cell Line Encyclopedia (CCLE) proteomics dataset [44], which quantifies individual proteins between cancer cell lines derived using a tandem mass tagging (TMT) approach. These quants represent relative expression between individual samples in a multiplex and contain extensive missing proteins across various cell lines [45]. We used two paradigms for the generation of our high- and low-confidence protein lists from this data (Figure 4A). Paradigm P1 only considers proteins that were quantified within a multiplex (i.e. confidently identified in at least one cell line in the multiplex), while paradigm P2 accounts for unquantified proteins by also including those with missing quantitation. We observed that PROSE performs well under both paradigms (Supplementary Figure 9), though PROSE is noticeably better at discriminating differentially expressed proteins (mean P1 auROC = 0.883, P2 auROC = 0.809). Nevertheless, under both paradigms, we observed near-identical clustering patterns based on PROSE scores, showing clear tissue specificity (Figure 4B and C). Interestingly, we observed that P1 produced strongly differential and generally tissue-specific correlations (Figure 4D–F) between PROSE scores and gene/protein expression, whereas P2 produced weaker differential effects (Figure 4G–I), but preserving similar patterns to P1. This may be because P2 essentially considers lists of expressed and non-expressed proteins, whereas P1 focuses on relatively up/downregulated elements. As discussed in our above HeLa analyses (c.f. Figure 3), P2 may therefore capture a basal proteomic/transcriptomic composition.

Reanalysis of the CCLE proteomic dataset captures cell line- and tissue-specific biological features. (A) Definition of observed and unobserved protein lists from the dataset. Paradigm P1 only considers detected, differentially expressed proteins. Paradigm P2 additionally considers undetected proteins in the dataset. (B, C) Clustering of cell lines by PROSE score under paradigm P1 or P2 produces essentially identical tissue-specific patterns. (D–F) Correlation of PROSE scores and matched transcriptome or proteome across cell lines under Paradigm P1. The clustering from panel D is used throughout (clustering here was performed within tissue of origin). (G–I) As before, but under paradigm P2. The clustering from panel G is used throughout.
Figure 4

Reanalysis of the CCLE proteomic dataset captures cell line- and tissue-specific biological features. (A) Definition of observed and unobserved protein lists from the dataset. Paradigm P1 only considers detected, differentially expressed proteins. Paradigm P2 additionally considers undetected proteins in the dataset. (B, C) Clustering of cell lines by PROSE score under paradigm P1 or P2 produces essentially identical tissue-specific patterns. (DF) Correlation of PROSE scores and matched transcriptome or proteome across cell lines under Paradigm P1. The clustering from panel D is used throughout (clustering here was performed within tissue of origin). (GI) As before, but under paradigm P2. The clustering from panel G is used throughout.

We subsequently evaluated PROSE’s ability in differentiating essential and non-essential proteins (Figure 5A), a crucial biological feature in the identification of potential therapeutic targets. We defined essential proteins here as those with a 95th percentile or higher dependency score, derived from the DEMETER2-corrected DRIVE deep RNAi screen. Here, the separability of the cumulative score distributions of high- and low-dependency proteins was measured with the Kolmogorov–Smirnov distance statistic (DKS, Figure 5B). We observed that, within a cell line, essential proteins were not readily separable using TMT quants (mean DKS = 0.131) or PROSE under paradigm P1 (DKS = 0.126), whereas P2 produced clear separability (DKS = 0.432). We visualized the underlying PROSE score distributions in the THP1 and MCF7 cell lines (Figure 5C), demonstrating that essential genes have higher PROSE P2 scores. This suggests that P2 can effectively generate a meaningful intra-sample distribution from relative quants that reflects actual upregulation of essential proteins, despite slightly lower class separation, as measured by the auROC previously.

Separation of high-dependency proteins under paradigm P2. (A) High-dependency proteins, defined here as those in the top 95th percentile or above by -DEMETER2 score, derived from the DRIVE deep shRNA screen for gene dependencies across cancer cell lines. (B) Separation of high-dependency and low-dependency protein score distributions, measured by the Kolmogorov–Smirnov test statistic DKS. Values closer to 0 indicate no distance between empirical cumulative distributions, while values closer to 1 indicate strong separability. The non-training (NT) subset shows a similar distribution. Known cancer-associated genes are represented by larger opaque dots on the scatterplot. (C) Visualization of high-dependency and low-dependency protein distributions under TMT quants or PROSE scoring paradigms. Two common cell lines, THP1 and MCF7, are shown as examples. In the upper-left corner, the corresponding P-value is shown alongside the DKS. n represents the number of quantified proteins with available DEMETER2 scores.
Figure 5

Separation of high-dependency proteins under paradigm P2. (A) High-dependency proteins, defined here as those in the top 95th percentile or above by -DEMETER2 score, derived from the DRIVE deep shRNA screen for gene dependencies across cancer cell lines. (B) Separation of high-dependency and low-dependency protein score distributions, measured by the Kolmogorov–Smirnov test statistic DKS. Values closer to 0 indicate no distance between empirical cumulative distributions, while values closer to 1 indicate strong separability. The non-training (NT) subset shows a similar distribution. Known cancer-associated genes are represented by larger opaque dots on the scatterplot. (C) Visualization of high-dependency and low-dependency protein distributions under TMT quants or PROSE scoring paradigms. Two common cell lines, THP1 and MCF7, are shown as examples. In the upper-left corner, the corresponding P-value is shown alongside the DKS. n represents the number of quantified proteins with available DEMETER2 scores.

Reanalysis of the CPTAC breast cancer proteome dataset highlights subtype-specific expression programs

Lastly, we explored if PROSE could be applied to the analysis of clinical proteomics data. Here, we reanalyzed the CPTAC breast cancer proteomics dataset [46], comprising TMT quants of 122 clinical specimens. In the original analysis, Krug et al. applied a non-negative matrix factorization (NMF) approach on the integrated proteogenomic data to assign molecular subtypes to individual samples. The resultant NMF clusters represent HER2-positive, Luminal A, Luminal B and basal-like (triple-negative breast cancer; TNBV) molecular profiles.

Akin to Paradigm P1 applied to the CCLE dataset, for each sample, we considered the top and bottom 1000 proteins by TMT quant to identify differential expression programs. Compared with the CCLE data, overall accuracy is lower (mean auROC = 0.778) (Supplementary Figure 10). We observed clear clustering of samples by their annotated NMF subtype (Figure 6A and B). Proteins with high subtype-specific PROSE scores can be identified based on the top proteins by variance (5000 shown here) (Figure 6C), of which 1677 proteins were initially undetected in the original screen. We then considered if PROSE scores could be used to recover functional enrichment scores obtained through commonly used tools such as Gene Set Enrichment Analysis (GSEA) [47]. We applied GSEA to generate normalized enrichment scores (NES) for MSigDB_Hallmark_2020 gene sets [48], which represent distinct, well-defined biological processes (Figure 6D).

PROSE supports recovery of known and novel, putative disease-linked proteins that are undetected in the initial proteomic screen. PROSE scoring was applied to the CPTAC breast cancer TMT dataset by Krug et al. (2020) and used to predict regulatory programs and missing proteins. (A, B) UMAP projection and heatmap demonstrating clustering of breast cancer subtypes by their inferred NMF label from the original dataset. HER2-positive breast cancer represents a particularly heterogenous population. (C) Mean PROSE scores of the top 5000 highest-variance proteins. The 2357 proteins with high-variance PROSE scores that were undetected in the initial proteomic screen are annotated above as black ticks. (D) GSEA using the above 5000 proteins, using MSigDB_Hallmark_2020 annotations, retaining well-established signatures for each subtype. (E) Mean PROSE scores of undetected genes with putative driver roles in the Basal-like/triple-negative (TN) subtype. Example annotated oncogenes are shown in orange (right) and are represented by larger dots. Example genes without annotated roles in cancer are shown in blue (left).
Figure 6

PROSE supports recovery of known and novel, putative disease-linked proteins that are undetected in the initial proteomic screen. PROSE scoring was applied to the CPTAC breast cancer TMT dataset by Krug et al. (2020) and used to predict regulatory programs and missing proteins. (A, B) UMAP projection and heatmap demonstrating clustering of breast cancer subtypes by their inferred NMF label from the original dataset. HER2-positive breast cancer represents a particularly heterogenous population. (C) Mean PROSE scores of the top 5000 highest-variance proteins. The 2357 proteins with high-variance PROSE scores that were undetected in the initial proteomic screen are annotated above as black ticks. (D) GSEA using the above 5000 proteins, using MSigDB_Hallmark_2020 annotations, retaining well-established signatures for each subtype. (E) Mean PROSE scores of undetected genes with putative driver roles in the Basal-like/triple-negative (TN) subtype. Example annotated oncogenes are shown in orange (right) and are represented by larger dots. Example genes without annotated roles in cancer are shown in blue (left).

As previously observed, PROSE scores likewise result in distinct NES profiles for each subtype, including terms linked to high transcriptional and replicative activity observed in TNBC (e.g. high NES scores for the Myc and E2F target annotations). We further show specific recovery of proteins known to be associated with TNBC, such as MYC [49, 50], HMGA2 [51] and those related to tumorigenic signalling (e.g. mTOR) [52]. Additionally, we observed proteins without well-understood roles in TNBC (e.g. MAGOH), which may be interesting for further study. Overall, this analysis further supports the use of PROSE when applied to TMT-based relative expression data.

Discussion

Given the frequent coverage issues observed in proteomics, we find that candidate prioritization approaches based on biological networks can generate biologically meaningful scores—in particular for missing proteins. We applied PROSE scores to complement the basic proteomic workflow, where it can inform missing protein prediction and downstream analyses. The simplicity and fast running time (~10 s with <0.8 GB memory on a standard laptop computer, for k = 200 bootstrap samples) makes PROSE particularly useful for biologists with rudimentary Python programing knowledge, and a user-friendly webtool implementation is in development. While we focused on creating protein lists based on observed/unobserved peptides as well as relative quants, these input lists are flexible and could also be generated from transcriptomic data or other proteomic assay designs.

Limitations

Our current implementation of PROSE is limited to a co-expression matrix derived from transcriptomic data, which may explain its stronger correlation to gene expression compared with protein expression. One key issue to consider when using such candidate prioritization approaches is that the results are idiosyncratic to the feature space used. For example, it is not clear that our co-expression matrix derived from a panel of cancer cell lines generalizes well to clinical samples: since networks show high cell-type specificity [53], a heterogenous tissue might have network dynamics that we are unable to capture. We speculate that this may be the reason for poorer auROC during analysis of the breast cancer dataset compared with the CCLE dataset. While in the future, using a richer feature space (e.g. by adding functional annotations to create a heterogenous network), there is unlikely to be a ‘one size fits all’ solution when applying candidate prioritization approaches to diverse proteomic data.

Additionally, PROSE is currently limited to scoring protein identifications. However, in the future, it can plausibly be extended to predicting more complex features, such as PTMs, if supplied with a correlation matrix on both PTM identifications and protein quants.

Finally, it is also important to note that PROSE does not consider explicit network topology outside of first-neighbor relations (similar to MaxLink), though this does not appear essential to the performance of the model. More sophisticated extensions that leverage network topology could in principle better capture local connectivity patterns, which may lead to further improvements in performance. For instance, reducing dependency of class labels on first neighbors in the network may improve the robustness of a model for sparse datasets.

Materials and methods

Gene co-expression matrix construction

The Klijn et al. dataset [36], comprising RNA-Seq results for 675 common cancer cell lines, was downloaded as TPM values from ArrayExpress [54] (https://www.ebi.ac.uk/arrayexpress/) under the accession number E-MTAB-2706. The original 35 522 gene identifiers were converted to UniProt IDs where possible, with only protein-coding genes retained. Genes with consistently low expression (median TPM < 1) were excluded due to low signal-to-noise ratio. A co-expression matrix was obtained by calculating the Spearman correlation coefficient ρij for all gene pairs (i, j). To constrain the feature space to a tight set of panel genes, we selected only those with high variability in expression and co-regulation. These were defined as the top 25% of genes ranked by coefficient of variation (CV; of TPM), as well as standard deviation of correlation coefficients, yielding 1216 panel genes (Supplementary Figure 11A) [32]. We verified that the panel genes enabled distinct clustering of cell lines by tissue origin via visual inspection of UMAP plots (Supplementary Figure 11B) and heatmaps of gene expression profiles (Supplementary Figure 11C and D). The resultant 17 054 × 1216 matrix is an array of protein identifiers with associated Spearman correlations to each panel gene. The matrix is provided in Supplementary File 1. For subsequent comparisons between PROSE scores and gene expression, TPM values were log2 transformed using a pseudocount of 1.

Prose

The core of the PROSE module is built from the scikit.learn [55] Python library. Using a list of positive (observed) proteins, negative (unobserved) proteins and the gene co-expression matrix above, PROSE returns a table containing a list of these proteins, their associated PROSE scores, predicted labels and their logistic regression probabilities.

In the basic implementation, we define observed proteins as those reported from a proteomic screen at a given protein-level FDR threshold (1% is a general standard). To avoid issues with ambiguous protein groups, we only consider those mapping to a single protein accession. An equal number of unobserved proteins are randomly sampled from the UniProt database. If the size of the supplied observed/unobserved protein lists differ, individual elements are re-weighted by the inverse of their class proportion.

An ensemble of linear SVMs is applied to generate hyperplanes separating the observed/unobserved based on their correlation to the panel genes. Via the LIBLINEAR [56] implementation, each SVM solves the primal problem

(1)

Where w is a vector of weights, Xi and yi are the feature values and the corresponding label for each training protein i, C is a penalty hyperparameter (default = 1), b is the hyperplane constant and L is the L1 hinge loss function

(2)

To enhance stability while reducing overfitting and computational expense, a bagging approach [57] was used to generate a series of weak learners. In the default implementation, k sub-classifiers are trained on a bootstrap sample of 100 proteins drawn from each class, under 50 panel genes. By default, k = 200 bootstrap samples are used. Bagging renders PROSE scalable to any number of panel genes and observations, as for any k subsets of fixed dimensions, the running time is essentially the same regardless of the input matrix size. This could for instance be useful when considering a transcript-level co-expression network. However, to ensure that critical features are being captured by the sampling procedure, it may be necessary to increase k alongside the matrix size. Alternatively, the highest variance panel genes/transcripts can be selected beforehand, as we have done here.

Formally, the decision function g(i) for a given protein i is defined as the mean signed distance between the hyperplanes and the point of interest

(3)

where g(i) > 0 represents positive classes, while g(i< 0 represents the null class. For standardization, we provide the final PROSE enrichment scores as Z-scores

(4)

where μ and σ are the mean and standard deviation over the set of all tested proteins in the co-expression matrix. The ranked output score vector is generally sufficient for functional analyses of phenotype. Class probabilities are estimated via logistic regression on score(i) for class prediction, where protein probability is estimated via 5-fold cross validation. Performance was evaluated on a set of 200 additional holdout proteins per. For correlation analyses between the HeLa PROSE scores and gene expression, the TPM data from Klijn et al. (2015) were used. For correlation analyses on protein expression, the corresponding MaxQuant or DIA-NN quantification output was used. A subsequent correlative analysis was applied to the independent HeLa iBAQ data previously reported from a deep proteomic screen by Bekker Jensen et al. (2017) [42].

Alternative benchmark approaches

In addition to PROSE, we considered the performance of other network models, including a k-nearest neighbor (KNN) approach on the UMAP-transformed co-expression matrix, PROTREC [30], ToppGene [24], RWRH [28], Node2Vec [29] and MaxLink [25, 26]. PROTREC and MaxLink were implemented as described originally. RWRH was implemented as in the gp-tools suite by Zhang et al. [27]. Node2Vec was implemented through PecanPy [58]. For details of each implementation, please refer to Supplementary Methods.

Proteomic datasets

HeLa label-free DDA and DIA

The Mehta et al. (2022) dataset [37] comprises commercially available HeLa digest standards in quadruplicate, run on an Orbitrap Fusion Lumos Tribrid instrument under DDA mode. Samples were run without offline fractionation. Raw data were downloaded from the PRIDE database [59] (https://www.ebi.ac.uk/pride/; accession number PXD022448).

THP-1 macrophage label-free DDA

The Li et al. (2020) dataset [38] comprises PMA-treated THP-1 cells (M0 macrophages) in triplicate, run on an Orbitrap Fusion Lumos Tribrid instrument under DDA mode. Samples were run without offline fractionation. Raw data were downloaded from PRIDE (accession number PXD019800).

C‌CLE TMT

The Nusinow et al. (2020) dataset [44, 45] comprises a phenotypically diverse panel of 375 CCLE cancer cell lines. Each TMT10 multiplex comprises 9 samples and a bridge channel (a pool of various cell lines) to facilitate normalization across plexes. Each multiplex was fractionated by basic reverse-phase chromatography (bRP) and fractions were run on either an Orbitrap Fusion or Fusion Lumos Tribrid instrument. Spectra in the study were searched using SEQUEST, followed by FDR refinement by linear discriminant analysis and additional filtering, reporting proteins at a 1% protein-level FDR. Protein lists with normalized TMT quants were downloaded from https://gygi.hms.harvard.edu/publications/ccle.html. For each cell line, protein identifications were obtained via parsimonious assembly within the multiplexed sample without imputation, and the assembly displays a wide number of missing proteins [45]. PROSE scores for the CCLE proteomics dataset are provided as Supplementary File 3.

CPTAC breast cancer clinical specimen TMT

The Krug et al. (2020) dataset [46] comprises 122 clinical breast cancer samples representing several well-established molecular subtypes. Similar to above, each TMT10 multiplex comprises 9 samples and a bridge channel (a pool of tumor samples). Each multiplex was fractionated by bRP and run on a Q exactive or Orbitrap Fusion Lumos Tribrid instrument. The normalized TMT data were downloaded directly from the study Supplementary Material (https://doi.org/10.1016%2Fj.cell.2020.10.036). PROSE scores for the CPTAC dataset are provided as Supplementary File 4.

Raw proteomic data processing

Raw label-free DDA data were processed using MaxQuant 2.2.0 [60]. Spectra were searched against the set of reviewed human UniProt sequences (SwissProt; UP000005640). Unique protein groups (MaxQuant) were reported using a 1% FDR cutoff. Carbamidomethylation (C) was set as a fixed modification, while oxidation (M) was set as a variable modification. For protein quantitation using MaxQuant, the match between runs (MBRs) option was disabled. Proteins were quantified using the iBAQ approach with log iBAQ correction. Default parameters were used otherwise.

Raw label-free DIA data were processed using DIA-NN 1.8.1 [20]. The spectral library was generated in silico from the set of SwissProt sequences above using the inbuilt deep learning approach with default parameters. Protein groups were reported using a 1% FDR cutoff. Carbamidomethylation (C) was set as a fixed modification, while oxidation (M) and acetylation (N-term) was set as a variable modification. The ‘unrelated runs’ option was enabled and MBR was disabled. Default parameters were used otherwise.

For downstream analyses, housekeeper genes/proteins were defined based on the HRT Atlas v1.0 database [39] and included or excluded in the analyses as stated.

C‌CLE analyses

PROSE scoring was performed on the CCLE proteomics dataset. Only SwissProt proteins were considered. Observed and unobserved subsets were generated based on intra-sample TMT quants. Note that while these quants were previously found to correlate to gene expression levels [44], they specifically measure protein expression relative to other samples in the multiplex.

For correlation analysis to RNA-Seq, we used the mRNA expression data from Barretina et al. [43]. Data were downloaded from ArrayExpress (accession number E-MTAB-2770). Further analyses were performed on the DRIVE shRNA data from McDonald et al. [61], which comprises deep RNAi screens for gene dependencies in 398 cancer cell lines. We used DEMETER2-adjusted scores from McFarland et al. [62] as highly accurate estimates of gene dependency. Data were downloaded from https://figshare.com/articles/dataset/DEMETER2_data/6025238. For simplicity, gene dependence is defined as the negative of the DEMETER2 score (higher dependence indicates essentiality). Cancer-linked genes were annotated using the Sanger COSMIC Cancer Gene Census [63]. Our analyses generally focused on cell lines occurring across these datasets.

Gene set enrichment analysis

GSEA was performed on the PROSE scores generated from the CPTAC breast cancer dataset, using ranked lists of the top 5000 highest variance proteins by PROSE score. The MSigDB_Hallmark_2020 gene sets [48] were used. GSEA was implemented through the GSEApy module [64] with default parameters.

Generic analyses

All analyses were performed in a Python 3.8 environment except where otherwise noted. Plotting and visualization was performed using the matplotlib [65] and seaborn [66] libraries. The Kolmogorov–Smirnov distance statistic DKS was calculated using the scipy [67] library. UMAP clustering was performed with the umap [68] library.

Key Points
  • Network-based candidate prioritization techniques offer a promising approach to analysis of sparse proteomic data.

  • PROSE leverages co-expression patterns from public transcriptomic data to reliably discriminate between observed and unobserved proteins in a proteomic screen.

  • Resultant scores can recapitulate key phenotypic features including expression patterns, gene dependency and putative driver genes in disease.

  • PROSE can be applied to protein lists generated under different paradigms (e.g. DDA/DIA proteomes ranked by FDR or expression; label-based DDA ranked by relative expression).

Acknowledgements

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore. We thank Dr Sunny Hei Wong for our discussions. We wish to acknowledge the funding support for this project from Nanyang Technological University under the URECA Undergraduate Research Program.

Funding

National Research Foundation, Singapore (SDSC-2020-005).

Authors’ contributions

B.J.H.W. conceptualized and developed PROSE. B.J.H.W. and W.K. implemented the analyses and benchmarked the tools. B.J.H.W. and W.W.B.G. co-wrote the manuscript. B.J.H.W., W.K., H.P. and W.W.B.G. edited. W.W.B.G. supervised.

Data availability

The PROSE source code and key code used in our analyses are available at https://github.com/bwbio/PROSE, which also contains installation and example run instructions. Other data are provided as Supplementary Files 14. All other data and code are available from the authors upon reasonable request.

Author Biographies

Bertrand Jern Han Wong received his BSc in biological sciences from Nanyang Technological University. He is currently a Biomedical Science doctoral student at Harvard University. His work under Wilson Wen Bin Goh focused on network-centric, statistical and machine learning approaches to resolving challenges in proteomic analyses.

Weijia Kong received her BSc from Beijing Normal University. She is currently a Bioinformatics doctoral student at Nanyang Technological University. Her current work under Wilson Wen Bin Goh focuses on missing protein prediction and validation using biological networks.

Hui Peng is a research scientist at the School of Biological Sciences. His current work under Wilson Wen Bin Goh is concerned with network-assisted protein assembly and automated pipeline development.

Wilson Wen Bin Goh is an Assistant Professor of Biomedical Informatics and Head of the Data Science Research Programme at the Lee Kong Chian School of Medicine and School of Biological Sciences, Nanyang Technological University.

References

1.

Mann
M
,
Hendrickson
RC
,
Pandey
A
.
Analysis of proteins and proteomes by mass spectrometry
.
Annu Rev Biochem
2001
;
70
:
437
73
.

2.

Aebersold
R
,
Mann
M
.
Mass spectrometry-based proteomics
.
Nature
2003
;
422
:
198
207
.

3.

Karpievitch
YV
,
Polpitiya
AD
,
Anderson
GA
, et al.
Liquid chromatography mass spectrometry-based proteomics: biological and technological aspects
.
Ann Appl Stat
2010
;
4
:
1797
823
.

4.

Yu
F
,
Haynes
SE
,
Teo
GC
, et al.
Fast quantitative analysis of timsTOF PASEF data with MSFragger and IonQuant
.
Mol Cell Proteomics
2020
;
19
:
1575
85
.

5.

Meier
F
,
Beck
S
,
Grassl
N
, et al.
Parallel accumulation-serial fragmentation (PASEF): multiplying sequencing speed and sensitivity by synchronized scans in a trapped ion mobility device
.
J Proteome Res
2015
;
14
:
5378
87
.

6.

Suckau
D
,
Evers
W
,
Belau
E
, et al.
Use of PASEF for accelerated protein sequence confirmation and de novo sequencing with high data quality
.
Methods Mol Biol
2022
;
2313
:
207
17
.

7.

Gillet
LC
,
Navarro
P
,
Tate
S
, et al.
Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis
.
Mol Cell Proteomics
2012
;
11
:
O111.016717
.

8.

Rost
HL
,
Rosenberger
G
,
Navarro
P
, et al.
OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data
.
Nat Biotechnol
2014
;
32
:
219
23
.

9.

McAlister
GC
,
Huttlin
EL
,
Haas
W
, et al.
Increasing the multiplexing capacity of TMTs using reporter ion isotopologues with isobaric masses
.
Anal Chem
2012
;
84
:
7469
78
.

10.

Li
J
,
Cai
Z
,
Bomgarden
RD
, et al.
TMTpro-18plex: the expanded and complete set of TMTpro reagents for sample multiplexing
.
J Proteome Res
2021
;
20
:
2964
72
.

11.

Pappireddi
N
,
Martin
L
,
Wühr
M
.
A review on quantitative multiplexed proteomics
.
Chem Bio Chem
2019
;
20
:
1210
24
.

12.

Liu
M
,
Dongre
A
.
Proper imputation of missing values in proteomics datasets for differential expression analysis
.
Brief Bioinform
2021
;
22
:
bbaa112
.

13.

Webb-Robertson
B-JM
,
Wiberg
HK
,
Matzke
MM
, et al.
Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics
.
J Proteome Res
2015
;
14
:
1993
2001
.

14.

McGurk
KA
,
Dagliati
A
,
Chiasserini
D
, et al.
The use of missing values in proteomic data-independent acquisition mass spectrometry to enable disease activity discrimination
.
Bioinformatics
2020
;
36
:
2217
23
.

15.

Lazar
C
,
Gatto
L
,
Ferro
M
, et al.
Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies
.
J Proteome Res
2016
;
15
:
1116
25
.

16.

Bateman
NW
,
Goulding
SP
,
Shulman
NJ
, et al.
Maximizing peptide identification events in proteomic workflows using data-dependent acquisition (DDA)
.
Mol Cell Proteomics
2014
;
13
:
329
38
.

17.

Li
KW
,
Gonzalez-Lozano
MA
,
Koopmans
F
, et al.
Recent developments in data independent acquisition (DIA) mass spectrometry: application of quantitative analysis of the brain proteome
.
Front Mol Neurosci
2020
;
13
:
564446
.

18.

Bogdanow
B
,
Zauber
H
,
Selbach
M
.
Systematic errors in peptide and protein identification and quantification by modified peptides
.
Mol Cell Proteomics
2016
;
15
:
2791
801
.

19.

Ahrne
E
,
Muller
M
,
Lisacek
F
.
Unrestricted identification of modified proteins using MS/MS
.
Proteomics
2010
;
10
:
671
86
.

20.

Demichev
V
,
Messner
CB
,
Vernardis
SI
, et al.
DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput
.
Nat Methods
2020
;
17
:
41
4
.

21.

Guala
D
,
Sonnhammer
ELL
.
A large-scale benchmark of gene prioritization methods
.
Sci Rep
2017
;
7
:
46598
.

22.

Goh
WWB
,
Wong
L
.
Integrating networks and proteomics: moving forward
.
Trends Biotechnol
2016
;
34
:
951
9
.

23.

Li
J
,
Zimmerman
LJ
,
Park
B
, et al.
Network-assisted protein identification and data interpretation in shotgun proteomics
.
Mol Syst Biol
2009
;
5
:
303
.

24.

Chen
J
,
Bardes
EE
,
Aronow
BJ
, et al.
ToppGene suite for gene list enrichment analysis and candidate gene prioritization
.
Nucleic Acids Res
2009
;
37
:
W305
11
.

25.

Guala
D
,
Sjölund
E
,
Sonnhammer
ELL
.
MaxLink: network-based prioritization of genes tightly linked to a disease seed set
.
Bioinformatics
2014
;
30
:
2689
90
.

26.

Östlund
G
,
Lindskog
M
,
Sonnhammer
ELL
.
Network-based identification of novel cancer genes
.
Mol Cell Proteomics
2010
;
9
:
648
55
.

27.

Zhang
H
,
Ferguson
A
,
Robertson
G
, et al.
Benchmarking network-based gene prioritization methods for cerebral small vessel disease
.
Brief Bioinform
2021
;
22
:
bbab006
.

28.

Jiang
R
.
Walking on multiple disease-gene networks to prioritize candidate genes
.
J Mol Cell Biol
2015
;
7
:
214
30
.

29.

Grover
A
,
Leskovec
J
. node2vec: Scalable Feature Learning for Networks.
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
.
2016
, 855–64. https://doi.org/10.1145/2939672.2939754.

30.

Kong
W
,
Wong
BJH
,
Gao
H
, et al.
PROTREC: a probability-based approach for recovering missing proteins based on biological networks
.
J Proteomics
2022
;
250
:
104392
.

31.

Giurgiu
M
,
Reinhard
J
,
Brauner
B
, et al.
CORUM: the comprehensive resource of mammalian protein complexes—2019
.
Nucleic Acids Res
2019
;
47
:
D559
63
.

32.

Goh
WWB
,
Guo
T
,
Aebersold
R
, et al.
Quantitative proteomics signature profiling based on network contextualization
.
Biol Direct
2015
;
10
:
71
.

33.

Goh
WWB
,
Sergot
MJ
,
Sng
JC
, et al.
Comparative network-based recovery analysis and proteomic profiling of neurological changes in Valproic acid-treated mice
.
J Proteome Res
2013
;
12
:
2116
27
.

34.

Liu
Y
,
Beyer
A
,
Aebersold
R
.
On the dependency of cellular protein levels on mRNA abundance
.
Cell
2016
;
165
:
535
50
.

35.

Koussounadis
A
,
Langdon
SP
,
Um
IH
, et al.
Relationship between differentially expressed mRNA and mRNA-protein correlations in a xenograft model system
.
Sci Rep
2015
;
5
:
10775
.

36.

Klijn
C
,
Durinck
S
,
Stawiski
EW
, et al.
A comprehensive transcriptional portrait of human cancer cell lines
.
Nat Biotechnol
2015
;
33
:
306
12
.

37.

Mehta
D
,
Scandola
S
,
Uhrig
RG
.
BoxCar and library-free data-independent acquisition substantially improve the depth, range, and completeness of label-free quantitative proteomics
.
Anal Chem
2022
;
94
:
793
802
.

38.

Li
P
,
Hao
Z
,
Wu
J
, et al.
Comparative proteomic analysis of polarized human THP-1 and mouse RAW264.7 macrophages
.
Front Immunol
2021
;
12
:
700009
.

39.

Hounkpe
BW
,
Chenou
F
,
de
Lima
F
, et al.
HRT atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets
.
Nucleic Acids Res
2021
;
49
:
D947
55
.

40.

Wiśniewski
JR
,
Mann
M
.
A proteomics approach to the protein normalization problem: selection of unvarying proteins for MS-based proteomics and western blotting
.
J Proteome Res
2016
;
15
:
2321
6
.

41.

Persson
E
,
Castresana-Aguirre
M
,
Buzzao
D
, et al.
FunCoup 5: functional association networks in all domains of life, supporting directed links and tissue-specificity
.
J Mol Biol
2021
;
433
:
166835
.

42.

Bekker-Jensen
DB
,
Kelstrup
CD
,
Batth
TS
, et al.
An optimized shotgun strategy for the rapid generation of comprehensive human proteomes
.
Cell Syst
2017
;
4
:
587
599.e4
.

43.

Barretina
J
,
Caponigro
G
,
Stransky
N
, et al.
The cancer cell line Encyclopedia enables predictive modelling of anticancer drug sensitivity
.
Nature
2012
;
483
:
603
7
.

44.

Nusinow
DP
,
Szpyt
J
,
Ghandi
M
, et al.
Quantitative proteomics of the cancer cell line Encyclopedia
.
Cell
2020
;
180
:
387
402.e16
.

45.

Nusinow
DP
,
Gygi
SP
.
A guide to the quantitative proteomic profiles of the cancer cell line encyclopedia
.
Cell
2020
;
180
(2):387–402.

46.

Krug
K
,
Jaehnig
EJ
,
Satpathy
S
, et al.
Proteogenomic landscape of breast cancer tumorigenesis and targeted therapy
.
Cell
2020
;
183
:
1436
1456.e31
.

47.

Subramanian
A
,
Tamayo
P
,
Mootha
VK
, et al.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc Natl Acad Sci U S A
2005
;
102
:
15545
50
.

48.

Liberzon
A
,
Birger
C
,
Thorvaldsdóttir
H
, et al.
The molecular signatures database hallmark gene set collection
.
Cell Systems
2015
;
1
:
417
25
.

49.

Gabay
M
,
Li
Y
,
Felsher
DW
.
MYC activation is a hallmark of cancer initiation and maintenance
.
Cold Spring Harb Perspect Med
2014
;
4
:
a014241
1
.

50.

Horiuchi
D
,
Kusdra
L
,
Huskey
NE
, et al.
MYC pathway activation in triple-negative breast cancer is synthetic lethal with CDK inhibition
.
J Exp Med
2012
;
209
:
679
96
.

51.

Mansoori
B
,
Terp
MG
,
Mohammadi
A
, et al.
HMGA2 supports cancer hallmarks in triple-negative breast cancer
.
Cancer
2021
;
13
:
5197
.

52.

Pópulo
H
,
Lopes
JM
,
Soares
P
.
The mTOR signalling pathway in human cancer
.
IJMS
2012
;
13
:
1886
918
.

53.

Cha
J
,
Lee
I
.
Single-cell network biology for resolving cellular heterogeneity in human diseases
.
Exp Mol Med
2020
;
52
:
1798
808
.

54.

Athar
A
,
Füllgrabe
A
,
George
N
, et al.
ArrayExpress update – from bulk to single-cell expression data
.
Nucleic Acids Res
2019
;
47
:
D711
5
.

55.

Pedregosa
F
,
Varoquaux
G
,
Gramfort
A
, et al.
Scikit-learn: machine learning in python
.
J Mach Learn Res
2011
;
12
:
2825
30
.

56.

Fan
R-E
,
Chang
K-W
,
Hsieh
C-J
, et al.
LIBLINEAR: a library for large linear classification
.
Journal of Machine Learning Research
2008
;
9
(61):1871−4.

57.

Breiman
L
.
Bagging predictors
.
Mach Learn
1996
;
24
:
123
40
.

58.

Liu
R
,
Krishnan
A
.
PecanPy: a fast, efficient and parallelized python implementation of node2vec
.
Bioinformatics
2021
;
37
:
3377
9
.

59.

Perez-Riverol
Y
,
Csordas
A
,
Bai
J
, et al.
The PRIDE database and related tools and resources in 2019: improving support for quantification data
.
Nucleic Acids Res
2019
;
47
:
D442
50
.

60.

Tyanova
S
,
Temu
T
,
Cox
J
.
The MaxQuant computational platform for mass spectrometry-based shotgun proteomics
.
Nat Protoc
2016
;
11
:
2301
19
.

61.

McDonald
ER
, 3rd,
de
Weck
A
,
Schlabach
MR
, et al.
Project DRIVE: a compendium of cancer dependencies and synthetic lethal relationships uncovered by large-scale, deep RNAi screening
.
Cell
2017
;
170
:
577
592.e10
.

62.

McFarland
JM
,
Ho
ZV
,
Kugener
G
, et al.
Improved estimation of cancer dependencies from large-scale RNAi screens using model-based normalization and data integration
.
Nat Commun
2018
;
9
:
4610
.

63.

Tate
JG
,
Bamford
S
,
Jubb
HC
, et al.
COSMIC: the catalogue of somatic mutations in cancer
.
Nucleic Acids Res
2019
;
47
:
D941
7
.

64.

Fang
Z
,
Liu
X
,
Peltz
G
.
GSEApy: a comprehensive package for performing gene set enrichment analysis in python
.
Bioinformatics
2023
;
39
:
btac757
.

65.

Hunter
JD
.
Matplotlib: a 2D graphics environment
.
Comput Sci Eng
2007
;
9
:
90
5
.

66.

Waskom
M
.
Seaborn: statistical data visualization
.
JOSS
2021
;
6
:
3021
.

67.

Virtanen
P
,
Gommers
R
,
Oliphant
TE
, et al.
SciPy 1.0: fundamental algorithms for scientific computing in python
.
Nat Methods
2020
;
17
:
261
72
.

68.

McInnes
L
,
Healy
J
,
Saul
N
, et al.
UMAP: uniform manifold approximation and projection
.
Journal of Open Source Software
2018
;
3
:861. https://doi.org/10.21105/joss.00861.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)