Abstract

Co-fractionation mass spectrometry (CFMS) enables the discovery of protein complexes and the systems-level analysis of multimer dynamics that facilitate responses to environmental and developmental conditions. A major challenge in CFMS data analysis, and omics approaches in general, is the development of reliable benchmarks for accurate evaluation of prediction methods. CORUM is commonly used as a source of benchmark complexes for protein complex composition predictions; however, its assumption of fully assembled subunit pools often conflicts with size exclusion chromatography (SEC) and interaction predictions from CFMS experiments. To address this, we developed an integrative analysis method that leverages cross-kingdom evolutionary conservation among specific CORUM complexes and high-resolution SEC profile data from cell extracts. The resulting benchmark complexes are supported by statistical significance and consistent sizes between calculated and measured apparent masses. The approach was robust, revealing both conserved and species-specific complexes. Designed specifically for benchmark identification, this method can be applied to any species and used to evaluate protein complex predictions from other studies.

Introduction

Co-fractionation mass spectrometry (CFMS) is a high-throughput, mass spectrometry-based protein quantification method coupled with biochemical fractionation to analyze protein complex compositions under non-denaturing conditions. This “guilt by association” method was initially used to predict protein organelle localization based on co-elution with known marker proteins [1] and was subsequently applied to predict protein interactions [2, 3]. The technique evolved from the principle that proteins present in a stable complex co-migrate independent of the separation method used. In the life sciences, CFMS has been extensively employed to analyze protein complex dynamics, including protein–ligand interactions, circadian changes, and multimerization variants across various species and tissues [4–10]. However, a major and universal challenge lies in evaluating CFMS-based predictions, because reliable benchmarks are lacking [11, 12]. Protein complex predictions rely on known protein complexes from reference databases, such as the CORUM mammalian protein complex database [13], as benchmark data. Reliable benchmarks are crucial for defining accuracy measures, such as precision and recall, in protein complex predictions. Inaccurate benchmark data produce flawed validation datasets, which in turn lead prediction models to unreliable predictions.

Ortholog mapping revealed that a substantial number of CORUM complexes are highly conserved across animals and plants (see the Results section). This conservation is expected, as many of these complexes perform essential functions in eukaryotic cells. For example, very similar protein complexes are involved in processes such as DNA replication, gene expression, vesicle trafficking, signal transduction, and metabolism in eukaryotic cells [14]. Orthologs of CORUM complex subunits can be accurately identified across species using tools such as InParanoid [15] or EggNOG [9, 16–18]. Most CORUM complexes were derived from highly purified multimers and used to generate solved crystal structures [19–24]. However, this does not mean that the fully assembled state is the major protein pool in cell extracts. CORUM subunits may instead exist primarily in partially assembled complexes, or they may interact with unknown proteins as part of a “moonlighting” function [25].

CORUM complexes are widely used as benchmarks for evaluating protein complex predictions. During evaluation, pairs of subunits in the reference CORUM complexes are assumed to comprise a positive set of protein–protein interactions, while negative interactions are created from proteins not present in the positive interaction set. These positive and negative sets have been used for training and testing computational methods of protein complex predictions [9, 12, 18, 26], inherently assuming that CORUM complexes exist in a fully assembled state. However, our previous work employing size exclusion chromatography (SEC) profiling of CORUM subunits revealed widespread violations of this assumption [6, 27]. Furthermore, recent studies of animal cell extracts have also reported weak correlations among subunits of CORUM complexes in CFMS datasets [12, 18]. Pang et al. [12] displayed probability density plots of the Pearson correlations between fractionation profiles of shared subunits of CORUM human complexes (Fig. 2 in [12]), in which the densities have a mode near zero correlation. Similarly, Wan et al. [18] reported low correlation among subunits of CORUM complexes in their Extended Data (Fig. 2 in [18]). These plots, along with our published experimental data, indicate that the most abundant cellular pools of CORUM complex subunits, which are primarily detected in liquid chromatography mass spectrometry, do not always co-elute. This suggests that CORUM complexes are unlikely to be reliable benchmarks for protein complex predictions.

Here, we implement a machine learning method that integrates two sources of information to define reliable protein complex benchmarks: evolutionary conservation from CORUM and co-elution patterns from CFMS data obtained on analytical SEC columns. We first map plant orthologs to CORUM protein complexes and then analyze their CFMS elution profiles. Subcomplex predictions are generated using self-organizing maps (SOMs) [28], a robust unsupervised learning method, and are further refined through statistical testing and mass comparisons.

To generate reliable protein complex benchmarks, we refine CORUM complexes by integrating their subunit composition with CFMS co-elution patterns. Our method eliminates random co-elution by ensuring that only subunits within the same CORUM complex are considered. We then detect subcomplexes, identifying subsets of CORUM complexes that exhibit strong similarity based on CFMS co-elution. This approach is justified by the premise that functionally related subunits tend to co-occur within stable subcomplexes, even if chance co-elution exists in the overall CFMS data. By incorporating both CORUM composition and CFMS patterns, our method refines CORUM complexes to better reflect their assembly state in cell extracts and improves benchmarking for protein complex predictions. We demonstrate the generalizability of this method through its application to rice and Arabidopsis.

Materials and methods

SEC data acquisition and filtering

We used a previously published rice SEC profiling dataset for protein complex predictions [10]. The chromatography conditions are robust [4, 6, 27] and this particular dataset contains two replicates of SEC fraction profiles along with protein apparent mass (Mapp), monomeric mass (Mmono), and multimerization state (Rapp), which is defined as the ratio of Mapp to Mmono and serves as a multimerization predictor [4, 29]. Here, Mapp represents the measured protein mass from SEC experiments. To ensure reliable protein profiles, we applied established data filtering criteria [10, 17, 27]. Reproducible protein profiles were selected if their elution peaks between two replicates fell within two SEC fractions, a threshold met by 90% of peaks. If a reproducible peak was located in the first (void) fraction, as measured by blue dextran [4, 27], across all replicates, it was defined as an unresolvable peak and removed from the dataset. Proteins with multiple peaks were deconvolved into individual profiles representing distinct multimerization states [6, 10, 27]. Profiles with a mean Mapp < 850 kDa, ensuring resolvability on the SEC column, were retained for further analysis.
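The filtering rules above can be illustrated with a minimal Python sketch. It is not the authors' code: the `PeakProfile` record, its field names, and the void fraction index are hypothetical, and "within two SEC fractions" is read here as an absolute peak shift of at most two fractions.

```python
from dataclasses import dataclass

VOID_FRACTION = 1          # assumption: index of the first (void) fraction
MAX_PEAK_SHIFT = 2.0       # assumption: replicate peaks at most two fractions apart
MAX_MEAN_MAPP_KDA = 850.0  # resolvability limit stated in the text


@dataclass
class PeakProfile:
    """One deconvolved peak of a protein, measured in two SEC replicates."""
    protein: str
    peak_rep1: float   # peak fraction location, replicate 1
    peak_rep2: float   # peak fraction location, replicate 2
    mapp_rep1: float   # apparent mass in kDa, replicate 1
    mapp_rep2: float   # apparent mass in kDa, replicate 2
    mmono: float       # monomeric mass in kDa

    @property
    def mean_mapp(self) -> float:
        return (self.mapp_rep1 + self.mapp_rep2) / 2

    @property
    def rapp(self) -> float:
        # Multimerization state: ratio of apparent mass to monomeric mass.
        return self.mean_mapp / self.mmono


def is_reproducible(p: PeakProfile) -> bool:
    return abs(p.peak_rep1 - p.peak_rep2) <= MAX_PEAK_SHIFT


def is_void_peak(p: PeakProfile) -> bool:
    # Unresolvable only if the peak sits in the void fraction in all replicates.
    return round(p.peak_rep1) == VOID_FRACTION and round(p.peak_rep2) == VOID_FRACTION


def keep_profile(p: PeakProfile) -> bool:
    return is_reproducible(p) and not is_void_peak(p) and p.mean_mapp < MAX_MEAN_MAPP_KDA
```

A profile is retained only when all three conditions hold, so the order of the checks does not affect the outcome.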

Interkingdom ortholog mapping for CORUM orthocomplex assignments

We mapped human CORUM complexes to rice through orthology. Rice protein complexes inferred from orthologous human CORUM complexes are referred to as orthocomplexes. Human-to-rice orthologs were first identified using InParanoid 8 [15] with proteomes from UniProt (human) and Phytozome V12 (rice). Human CORUM complexes were then converted to rice orthocomplexes based on these ortholog mappings, as detailed in Supplemental Tables S1A and S2. Polyploidization and gene duplication in plants often resulted in multiple rice orthologs mapping to a single human ortholog. These were treated as a single “ortho-paralog” group for subunit overlap calculations. The subunit coverage of a rice orthocomplex is defined as the ratio of the number of subunits with inferred orthologs to the total number of subunits in a CORUM complex. Rice orthocomplexes with at least 2/3 subunit coverage of a CORUM complex were classified as high-coverage candidates for the subsequent benchmark identification.
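As an illustration of the coverage definition, the following Python sketch counts each human subunit once regardless of how many rice ortho-paralogs map to it. The function names and the toy identifiers are hypothetical. Exact fractions are used so that a complex sitting exactly at the 2/3 threshold is not lost to floating-point rounding.

```python
from fractions import Fraction

COVERAGE_THRESHOLD = Fraction(2, 3)


def subunit_coverage(corum_subunits, human_to_rice):
    """Fraction of CORUM subunits with at least one inferred rice ortholog.

    corum_subunits: list of human subunit identifiers in one CORUM complex.
    human_to_rice:  dict mapping a human identifier to its rice orthologs;
                    several rice ortho-paralogs still count as one subunit.
    """
    mapped = sum(1 for human_id in corum_subunits if human_to_rice.get(human_id))
    return Fraction(mapped, len(corum_subunits))


def is_high_coverage(corum_subunits, human_to_rice):
    return subunit_coverage(corum_subunits, human_to_rice) >= COVERAGE_THRESHOLD


# Toy example with made-up identifiers: two of three subunits have orthologs.
toy_orthologs = {"H1": ["Os_a", "Os_b"], "H2": ["Os_c"]}
toy_coverage = subunit_coverage(["H1", "H2", "H3"], toy_orthologs)
```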

Integration of rice orthocomplexes with rice SEC data

After assigning rice orthocomplexes through the mapping of human CORUM complexes, we integrated this orthocomplex information with the experimental SEC profile data. We identified rice orthocomplexes that contained at least two subunits in the SEC profile data. To refine these orthocomplexes, we removed redundant ones based on the following criteria: (i) orthocomplexes consisting of identical rice subunits, (ii) smaller orthocomplexes that were subsets of larger ones, and (iii) orthocomplexes composed solely of rice subunits mapped to a single human ortholog.
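The three redundancy criteria can be sketched in Python as set operations. This is an illustrative reading rather than the published implementation: the data layout is hypothetical, "at least two subunits" is counted in detected rice proteins, and when two orthocomplexes are identical the one whose name sorts first is kept.

```python
def filter_orthocomplexes(orthocomplexes, detected):
    """Return non-redundant orthocomplexes as {name: frozenset of rice proteins}.

    orthocomplexes: {name: {human_subunit: set of rice orthologs}}
    detected:       set of rice proteins present in the SEC profile data
    """
    candidates = {}
    for name, mapping in orthocomplexes.items():
        # Keep only rice proteins that were actually observed by SEC.
        observed = {h: rice & detected for h, rice in mapping.items() if rice & detected}
        rice_set = frozenset().union(*observed.values())
        if len(rice_set) < 2:
            continue  # fewer than two detected subunits
        if len(observed) < 2:
            continue  # (iii) all detected subunits map to one human ortholog
        candidates[name] = rice_set

    kept = {}
    # Visit larger complexes first so subsets are compared against supersets.
    for name, rice_set in sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if any(rice_set <= other for other in kept.values()):
            continue  # (i) identical composition or (ii) subset of a larger one
        kept[name] = rice_set
    return kept
```

Sorting by decreasing size means a single pass is enough: any complex that is a subset of another is always visited after its superset.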

Experimental design and statistical rationale

Distance metric

In our SOM analysis, we used a distance metric that combines two similarity measurements of a pair of proteins in an orthocomplex: the correlation of their fractionation profiles and the distance between their peak fraction locations. First, we calculated a weighted cross-correlation (WCC) [30] between a pair of protein profiles using the formula:

(1)

Here, xi and xj are column vectors representing the SEC profiles of the two proteins, and W is a weight matrix with 1s on the main diagonal and decreasing values on the sub- and super-diagonals. The bandwidth of W was set to 2 with a weighting function of (1 − distance/3). That is, the weight matrix W had 2 sub-diagonals and 2 super-diagonals; the weights on the first sub-/super-diagonals were 2/3, and the weights on the second sub-/super-diagonals were 1/3. A distance (WCCd) based on the weighted cross-correlation for a pair of proteins was given by

(2)

Another distance metric was calculated based on the Euclidean distance between the peak locations of the proteins:

(3)

where yi and yj were the peak fraction locations of the proteins, obtained using the Gaussian peak fitting algorithm [6]. These peak locations had been standardized by dividing them by the largest peak location of a complex, ensuring they fell within the range of 0 to 1 for the Euclidean distance calculation in Formula (3). Both distance measurements were computed for the two replicates of the SEC data. An overall distance between two proteins was defined as

(4)

where w is the weight obtained by model training, as described below. The subscripts 1 and 2 denote the distances computed from the two SEC data replicates.
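The weight matrix and distance components described in this section can be sketched in Python as follows. The banded weights (1, 2/3, 1/3) follow the text directly. Everything else is an assumption made for illustration: the normalized form of the weighted cross-correlation, the conversion WCCd = 1 − WCC, and in particular the convex combination in `overall_distance`, which is only one plausible way to combine the two replicates with a single weight w and may differ from the published Formula (4).

```python
import math


def weight_matrix(n, bandwidth=2):
    """Banded weights 1 - |i - j| / (bandwidth + 1), zero outside the band."""
    return [[max(0.0, 1.0 - abs(i - j) / (bandwidth + 1)) for j in range(n)]
            for i in range(n)]


def _quad(x, W, y):
    # Bilinear form x' W y for plain Python lists.
    return sum(x[i] * W[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def wcc(x, y, W):
    # Assumed normalization: x'Wy / sqrt(x'Wx * y'Wy), so wcc(x, x, W) == 1.
    denom = math.sqrt(_quad(x, W, x) * _quad(y, W, y))
    return _quad(x, W, y) / denom if denom > 0 else 0.0


def wcc_distance(x, y, W):
    # Assumption: similarity is turned into a distance as 1 - WCC.
    return 1.0 - wcc(x, y, W)


def peak_distance(peak_i, peak_j, largest_peak):
    # Peaks are standardized by the largest peak location of the complex.
    return abs(peak_i / largest_peak - peak_j / largest_peak)


def overall_distance(wccd_reps, peak_reps, w):
    # Hypothetical combination: average each component over the replicates,
    # then mix them with weight w. Not confirmed by the source.
    mean_wccd = sum(wccd_reps) / len(wccd_reps)
    mean_peak = sum(peak_reps) / len(peak_reps)
    return w * mean_wccd + (1.0 - w) * mean_peak
```

The band in W is what makes the measure tolerant of small elution shifts: two otherwise identical profiles whose peaks are one fraction apart have zero overlap under an unweighted product, but a weighted cross-correlation of 2/3 under this sketch.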

Pre-clustering evaluation of orthocomplexes

Before conducting the unsupervised clustering analysis described next, we first assessed whether an orthocomplex formed a single cluster. If an orthocomplex had only two subunits in the SEC profile data, it was excluded from clustering and labeled as “only 2 proteins in the complex”. For larger orthocomplexes, we evaluated whether all subunits exhibited similar profiles, as indicated by small pairwise distance metrics. If the distances for all subunit pairs within an orthocomplex were below the fifth percentile threshold of an empirical distribution, we inferred that all subunits belonged to the same complex. This empirical distribution was based on pairwise distances across all proteins in the entire SEC dataset, not limited to CORUM ortholog SEC profiles (more details in the Supplementary Information). Orthocomplexes meeting these criteria were labeled accordingly in the result tables.
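A minimal Python sketch of this pre-clustering decision is given below. The `dist` callable, the linear-interpolation percentile, and the labels other than “only 2 proteins in the complex” are hypothetical choices made for illustration.

```python
from itertools import combinations


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def pre_cluster_label(subunits, dist, background_distances, q=5):
    """Decide whether an orthocomplex needs the clustering step.

    subunits:             detected subunits of one orthocomplex
    dist:                 function returning the distance between two subunits
    background_distances: pairwise distances across the whole SEC dataset
    """
    if len(subunits) == 2:
        return "only 2 proteins in the complex"
    threshold = percentile(background_distances, q)
    if all(dist(a, b) < threshold for a, b in combinations(subunits, 2)):
        return "single cluster"   # hypothetical label
    return "needs clustering"     # hypothetical label
```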

Two-stage clustering algorithm for subcomplex prediction

Only a very small number of rice orthocomplexes were found to be fully assembled using the pre-clustering method described above. For most orthocomplexes, we employed a two-stage clustering algorithm to identify subcomplexes within each orthocomplex individually. The first stage applied the SOM algorithm [28], which is a specific type of neural network model known for its robustness to data measurement errors and missing values. The SOM algorithm grouped the subunits of each orthocomplex into clusters, with distinct subgroups representing potential subcomplexes. In the second stage, clusters from the SOM were merged using the affinity propagation (AP) algorithm [31]. We fine-tuned three parameters for this two-stage approach: the weight w in doverall of Formula (4), the number of clusters in the SOM, and the merging threshold for the AP algorithm.

We used four well-known rice complexes (19S proteasome, 20S proteasome, 14-3-3 hetero-/homo-oligomers, and the exosome) to determine the three tuning parameters, ensuring that subunits of each complex were clustered together. The obtained tuning parameters were robust when any three of the four complexes were used as training data (details in the Supplementary Information). These parameters were set as the default parameter values in our computational package for subcomplex discovery in rice as well as Arabidopsis. Subcomplexes were classified as either “subcomplex” or “singleton”, with the latter referring to single-member clusters. If all members of a subcomplex mapped to a single human ortholog, it was labeled as an “ortho-paralog subcomplex” in the Supplemental Tables and Figures.

Statistical significance and size evaluation of identified subcomplexes

The identified subcomplexes, as well as the potentially fully assembled complexes, were evaluated using statistical P-values and by comparing the apparent mass (Mapp) with the calculated mass (Mcalc). Here, Mcalc denotes the sum of the monomeric masses of all subunits in the subcomplex. First, a bootstrap P-value was calculated for each subcomplex using Monte Carlo simulations, in which the mean pairwise distance for a subcomplex was compared to that of random complexes created by randomly sampling the same number of proteins as the identified subcomplex from all SEC profiles in the dataset. A P < 0.05 indicated statistical significance. We also calculated false discovery rates (FDRs) to account for multiple testing. Additionally, the Mapp of a subcomplex was compared with its Mcalc. More specifically, Mapp for a subcomplex/complex was given by

(5)

where n was the number of subunits in the subcomplex/complex, and Mappi was the apparent mass of a subunit. We designated a subcomplex/complex as a benchmark if it was obtained from the preceding clustering analysis and met two criteria: (i) P < 0.05 and (ii) agreement between the mass measured in the SEC experiment (Mapp) and the mass calculated from subunit composition (Mcalc), such that:

(6)

This ensured that the relative difference between Mapp and Mcalc was constrained within a factor of 1. Notably, some of these benchmarks were derived from the pre-clustering complexes. These were confirmed as fully assembled CORUM orthocomplexes when their apparent masses (Mapp) matched the calculated masses (Mcalc) of their subunits.
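The two benchmark criteria can be sketched in Python under explicit assumptions. The Monte Carlo P-value follows the description in this section and in the Fig. 1D caption (the fraction of random draws whose mean pairwise distance falls below the observed mean). The subcomplex Mapp is taken as the arithmetic mean of subunit Mapp values, consistent with the "Mapp-avg" column of Table 1. The mass check `abs(Mapp - Mcalc) / Mcalc <= 1` is only one literal reading of "within a factor of 1" and may not match Formula (6) exactly. All function names are hypothetical, and the default number of draws is far smaller than the 134 350 used in the study.

```python
import random
from itertools import combinations
from statistics import mean


def mean_pairwise_distance(members, dist):
    return mean(dist(a, b) for a, b in combinations(members, 2))


def monte_carlo_p_value(subcomplex, all_proteins, dist, n_draws=10_000, seed=0):
    """Fraction of random same-size protein sets that are tighter than observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_pairwise_distance(subcomplex, dist)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(all_proteins, len(subcomplex))
        if mean_pairwise_distance(sample, dist) < observed:
            hits += 1
    return hits / n_draws


def mapp_avg(subunit_mapps):
    # Assumption: subcomplex Mapp is the mean of its subunits' apparent masses.
    return mean(subunit_mapps)


def masses_agree(mapp, mcalc, tolerance=1.0):
    # Assumed form of the size criterion: relative difference at most 1.
    return abs(mapp - mcalc) / mcalc <= tolerance


def is_benchmark(p_value, mapp, mcalc, alpha=0.05):
    return p_value < alpha and masses_agree(mapp, mcalc)
```

With the values listed in Table 1 for the 17S U2 snRNP subcomplex (Mcalc 392.7 kDa, Mapp-avg 550.0 kDa), the relative difference is about 0.40, so this reading of the criterion would accept it.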

Dimerization prediction by AlphaFold-Multimer on COSMIC2

To run the AlphaFold-Multimer software package v2.2.0 [32], the COSMIC2 cloud platform was used [33]. Protein sequences for each dimeric subcomplex were obtained from the rice proteome file, Osativa_323_v7.0.protein.fa [34], in Phytozome V12 [35] and then searched against the full database (full_dbs) by default. Ranking confidence scores, a weighted combination of the interface predicted Template Modeling score (ipTM) and the predicted Template Modeling score (pTM), were used as model confidence metrics. The model confidence scores of the top five predicted models were averaged to evaluate the predicted dimeric subcomplexes.
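The scoring step can be sketched in Python as follows. The 0.8/0.2 weighting is the ranking confidence reported for AlphaFold-Multimer; the input format, a list of (ipTM, pTM) pairs, and the function names are hypothetical and do not reflect the COSMIC2 output files.

```python
from statistics import mean


def ranking_confidence(iptm, ptm):
    """AlphaFold-Multimer model confidence: 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm


def mean_top_confidence(models, top_n=5):
    """Average the ranking confidence of the top_n highest-scoring models.

    models: iterable of (ipTM, pTM) pairs, one per predicted model.
    """
    scores = sorted((ranking_confidence(i, p) for i, p in models), reverse=True)
    return mean(scores[:top_n])
```

Because ipTM dominates the score, a dimer with a well-predicted fold but a poorly predicted interface still receives a low confidence, which is the behavior wanted when the question is whether two subunits interact.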

Arabidopsis subcomplex predictions using the two-stage clustering

The identical analysis pipeline was employed to identify Arabidopsis subcomplexes. We replicated all analysis steps using an Arabidopsis SEC profile dataset [27], including CFMS data quality filtering, ortholog mapping for CORUM complexes in Arabidopsis, and two-stage clustering analysis on the Arabidopsis SEC profile data. Crucially, we applied the same set of tuning parameter values for the clustering algorithm in Arabidopsis subcomplex predictions.

Statistical tests and data analysis

Statistical analysis was performed using R version 4.2.0 [36] on RStudio 2022.07.1 [37]. The Flexible SOMs in Kohonen 3.0 package for R [38] and the APCluster package for R [39] were used for the SOM and AP algorithms, respectively. Gaussian fitting code (https://github.com/dlchenstat/Gaussian-fitting) was run on MATLAB (R2022a). Microsoft Excel on Office 365 for Mac was used to organize and display the analyzed data.

Results

Figure 1 presents an overview of the workflow for identifying benchmarks by integrating two sources of information: the CORUM database and SEC profile data from rice tissue extracts. The first step involves mapping orthologs to identify complex subunits based on CORUM information, while the second step focuses on grouping similar SEC profiles to characterize protein complex assembly. By combining these two complementary data sources, we identified both fully and partially assembled complexes in rice cell extracts, providing reliable benchmarks for protein complex predictions.

Identification of bona fide benchmarks to evaluate prediction accuracy in co-fractionation mass spectrometry-based protein complex discovery in plant species. (A) Human and rice proteins are assigned to orthologous groups (OGs) using the InParanoid algorithm [15]. Rice orthocomplexes are built from CORUM complexes based on the OGs. (B) SEC profiles from CFMS analysis are used to identify benchmark complexes. (C) A SOM is used to cluster the protein profiles generated in B. Each code/group (outer circle) contains a group profile (peak profile), which represents the profiles of all protein members (dots). (D) Subcomplex predictions in SEC datasets are evaluated via a bootstrap P-value calculation. Among the experimentally detected subunits (upper panel) of a rice orthocomplex, subunits with similar SEC profiles are clustered into a subcomplex (dotted line). Profile dissimilarity scores are calculated between all possible pairs of subunits in the subcomplex and then averaged to obtain the mean dissimilarity of the subcomplex. In parallel, the same number of proteins is sampled at random to generate a random plant subcomplex (lower panel). The random mean is calculated as the mean dissimilarity for pairs of proteins in the random subcomplex. The P-value for each subcomplex is computed as the fraction of times the observed mean is larger than the random mean. (E) Predicted benchmarks can be used to evaluate protein complex predictions from CFMS performed with any type of biochemical separation.
Figure 1


Generating rice orthocomplexes from CORUM

We identified CORUM orthologs in rice using the InParanoid algorithm (Standalone Version 4.2 from https://inparanoid.sbc.su.se/) [15]. The InParanoid search compared 20 834 human protein sequences with 42 160 proteins in rice. The algorithm assigned 5363 human proteins and 8178 rice proteins into 3131 distinct orthologous groups (OGs) at the whole proteome level (Supplemental Table S1A). The higher number of orthologs in rice was due to an elevated gene copy number compared to humans [40]. We next created predicted rice orthocomplexes from CORUM human complex compositions [13] using the ortholog dataset generated above (Fig. 1A). Among the 3047 human complexes curated in CORUM, 1964 (64.5% of total CORUM complexes) had at least one subunit orthologous to one or more rice proteins. Of these, 920 rice orthocomplexes exhibited subunit coverage greater than or equal to 2/3. Four hundred and thirty-six (14.3% of total CORUM complexes) of the 1964 rice orthocomplexes displayed completely conserved subunit compositions and are expected to have very similar core functions in the cell (Fig. 2A; Supplemental Fig. S1; Supplemental Table S2).

Assumed CORUM complexes do not exist in a fully assembled state in plant species. (A) Genomic-level subunit coverage of CORUM complexes to rice orthocomplexes. Coverage is defined as the ratio of the number of subunits in a rice orthocomplex to the number of subunits in its orthologous human CORUM complexes. The genome coverage of 1964 rice orthocomplexes was calculated and plotted at different subunit coverage levels. The sharp peaks are influenced by the number of members within the CORUM complexes. The various colors within each peak represent the relative proportions of different complex sizes covered. (B) Conserved predicted masses of human complexes and rice orthocomplexes. Mcalc of 436 rice orthocomplexes with 100% subunit coverage to CORUM complexes were plotted. Mapp values of the one-to-multi rice orthologs were averaged in Mcalc calculation (Fig. 1A). (C) CORUM complex subunits rarely exist in a fully assembled complex. Mapp values of subunits of the 258 rice orthocomplexes were obtained from the reference rice CFMS datasets. (D) Conserved masses of human CORUM subunits and rice orthocomplex subunits. A scatter plot shows conserved Mmono values between human and rice subunits that assemble into the known complexes. The rice orthocomplexes with 100% subunit coverage to CORUM complexes were plotted.
Figure 2


To assess variability in the sizes of the rice and human orthologs, we compared the predicted masses of the 436 human complexes and rice orthocomplexes that shared 100% subunit coverage (Fig. 2B; Supplemental Table S2B). The Mcalc value was determined by summing the monomeric masses of the individual subunits within an orthocomplex. Fig. 2B shows that rice orthocomplexes and human complexes have similar Mcalc values (r = 0.986 and slope = 1.01), indicating that most complexes possess similar masses. Among these 436 orthocomplexes, 258 had detected subunit(s) in the SEC datasets. However, when Mcalc values of these highly conserved rice orthocomplexes were compared to their subunit Mapp values measured in the rice SEC experiments, a low correlation was found (Fig. 2C, r = 0.53 and R2 = 0.28). This weak correlation indicates that the Mapp values of rice orthocomplex subunits from SEC experiments do not align with the calculated complex sizes. To confirm that the low correlation shown in Fig. 2C was not due to differences in the Mmono of subunits in these 258 orthocomplexes, we compared Mmono values for this subset of human and rice orthocomplex subunits and found a strong positive correlation (Fig. 2D).

Furthermore, 25 proteins from the rice orthocomplexes with 100% subunit coverage had Rapp ≤ 1.6 and were therefore considered likely monomers. The existence of monomeric subunit pools and partially assembled complexes can explain the large number of data points falling well below the diagonal in Fig. 2C. Data points above the diagonal may reflect novel complexes in which CORUM orthocomplexes and/or subcomplexes interact with unknown proteins. These results are consistent with previous observations [6, 12, 17, 18] and indicate that CORUM subunits detected in CFMS experiments rarely agree with the predicted mass of the fully assembled state.

Benchmark identification

We extracted reproducible protein elution profiles from the reference rice SEC datasets [10]. There were 3426 proteins present in both of the two SEC replicates, and 197 had multiple peaks that arose when the protein existed in multiple multimerization states. We deconvolved these multiple peaks to generate 350 reproducible peaks, and a total of 2618 protein subunits with Rapp > 1 were used as the rice SEC reference profiles. We further curated rice orthocomplexes with subunit coverage greater than or equal to 2/3 in the rice SEC profiles. Among the 920 orthocomplexes meeting this coverage threshold, 531 orthocomplexes had at least one rice subunit detected by the rice SEC reference profiles, and 287 of these 531 orthocomplexes had at least two rice subunits detected. After eliminating redundant orthocomplexes, 103 rice orthocomplexes were selected for further analysis by integrating rice orthocomplexes with the rice SEC data. Clustering analysis was performed across the 103 rice orthocomplexes one by one to predict the composition of subcomplexes. The results of the clustering analysis, including essential details such as cluster composition, singleton, statistical significance, and size evaluation, are reported in Supplemental Table S3 and illustrated in Supplemental Figs. S2 and S3. More specifically, in Supplemental Fig. S2, we plotted Mapp versus Mcalc of the identified subcomplexes after the clustering analysis, demonstrating the existence of partially assembled complexes in the rice cell extract. Additionally, SEC profile plots illustrating co-elution patterns of the clustering analysis results are generated in Supplemental Fig. S3.

During the process of clustering analysis, subunits within each rice orthocomplex were clustered based on their profiles and distances. The outcome of this analysis included 162 subcomplexes/complexes from the 103 rice orthocomplexes (Fig. 3A; Supplemental Figs. S2 and S3; Supplemental Table S3). Random sampling was conducted a large number of times, specifically 134 350 times, which is equal to 50 times the total number of the rice SEC reference profiles in our dataset. Supplemental Fig. S4 displays the random empirical distributions for various subcomplex sizes. The P-values and FDR values for the 162 subcomplexes/complexes are reported in Supplemental Table S3, with 112 of them having P-values <5% and the corresponding FDR values less than 7%. After removing redundancies from the list of 112, and discarding potential monomers (Rapp ≤ 1.6), we identified 79 unique subcomplexes/complexes with small P-values (Supplemental Table S4).

Useful benchmarks near the diagonal. (A) A process flow to identify benchmarks from the rice data. (B) Benchmark subcomplexes/complexes of rice with matched Mcalc and Mapp-avg are rendered in pink (in the middle). Asterisks (*) indicate fully assembled CORUM orthocomplexes. Numbers in parentheses point out predicted subcomplexes present in the corresponding panels in C. (C) Mcalc values of predicted subcomplexes and Mapp values of subunits of the subcomplexes. Benchmark subcomplexes (P-values = .0) are highlighted in bold text.
Figure 3


We then compared the Mapp and Mcalc values of this set of 79 subcomplexes/complexes using the criterion defined in Formula (6) (Fig. 3B; Supplemental Table S4). Compared with Fig. 2C, Fig. 3B has considerably fewer data points below the diagonal. This is because a subcomplex comprises fewer subunits than the full CORUM orthocomplex, so its lower Mcalc value lies closer to the Mapp value obtained from the SEC experiment. This result indicates that our algorithm identified more reliable subcomplex compositions. In Fig. 3B, subcomplexes/complexes with Mapp greater than two times Mcalc might be attributable to undetected or unknown subunits, non-spherical shapes, or unknown subunit stoichiometries. Among the set of 79 subcomplexes/complexes, 40 demonstrated substantial agreement between Mapp and Mcalc, as shown in Fig. 3B, meeting the criterion of Formula (6).

These 40 subcomplexes/complexes were stable, supported by both statistically significant P-values and consistent apparent masses (Fig. 3B and 3C; Supplemental Table S4). They could serve as benchmarks for evaluating CFMS predictions. Additionally, within these 40 benchmarks, our algorithm identified 8 fully assembled CORUM orthocomplexes. These fully assembled complexes met the criteria of containing all CORUM subunits, being statistically significant, and having similar Mapp and Mcalc values according to Formula (6). A summary list of these 40 benchmarks is provided in Table 1, with additional details in Supplemental Table S5.

Table 1

Predicted subcomplexes that could be used as benchmarks to evaluate CFMS-based protein complex predictions

| CORUM complex | Rice subcomplex IDs (# of subunits orthocomplex) | Subcomplex prediction P-value | Mcalc of subcomplex (kDa) | Mapp-avg of subcomplex (kDa) | # of CORUM subunits predicted/total # of CORUM subunits |
| --- | --- | --- | --- | --- | --- |
| 17S U2 snRNP | Subcomplex 3 (11) | .000008 | 392.7 | 550.0 | 9/33 |
| Gamma-BAR-AP1 complex | All proteins defined in the complex by Gaussian peak distance before clustering (3) | .004744 | 391.6 | 421.3 | 5/9 |
| APPBP1-UBA3 complex | Only 2 proteins in the complex before clustering (2) | .000947 | 108.7 | 121.3 | 2/2 |
| RNA polymerase II holoenzyme complex | Subcomplex 1 (2) | .010076 | 50.5 | 68.2 | 2/24 |
| RNA polymerase II holoenzyme complex | Subcomplex 3 (8) | .000008 | 460.2 | 636.4 | 7/24 |
| RNA polymerase II holoenzyme complex | Subcomplex 4 (3) | .011642 | 201.5 | 372.4 | 3/24 |
| C complex spliceosome | Subcomplex 2 (5) | .000252 | 880.0 | 730.4 | 5/80 |
| CAPZalpha-CAPZbeta complex | Only 2 proteins in the complex before clustering (2) | .030328 | 61.7 | 89.5 | 2/2 |
| CCT complex (chaperonin containing TCP1 complex) | All proteins defined in the complex by Gaussian peak distance before clustering (10) | .000008 | 937.1 | 698.2 | 8/8 |
| Prefoldin complex | All proteins defined in the complex by Gaussian peak distance before clustering (6) | .000008 | 99.1 | 136.8 | 6/6 |
| SEC23-SEC24 adaptor complex | All proteins defined in the complex by Gaussian peak distance before clustering (3) | .000069 | 197.6 | 341.0 | 2/2 |
| CSA-POLIIa complex | Subcomplex 1 (6) | .000008 | 301.8 | 438.7 | 6/13 |
| CUL4A-DDB1-RBBP5 complex | Subcomplex 1 (3) | .031703 | 278.6 | 325.9 | 3/3 |
| EIF2B1-EIF2B2-EIF2B3-EIF2B4-EIF2B5 complex | All proteins defined in the complex by similarity before clustering (5) | .000008 | 563.4 | 863.2 | 5/5 |
| Elongator holo complex | All proteins defined in the complex by Gaussian peak distance before clustering (3) | .000122 | 305.4 | 590.2 | 3/4 |
| ESCRT-III complex | Only 2 proteins in the complex before clustering (2) | .013789 | 98.8 | 57.9 | 4/10 |
| Exosome | Subcomplex 1 (7) | .000008 | 199.9 | 278.1 | 7/11 |
| FIB-associated protein complex | Subcomplex (ortho-paralog) 2 (4) | .000023 | 49.8 | 90.1 | 1/6 |
| HCF-1 complex | Subcomplex 1 (6) | .000008 | 177.3 | 352.0 | 2/18 |
| KAT2A-Oxoglutarate dehydrogenase complex | Subcomplex (ortho-paralog) 1 (2) | .000122 | 108.8 | 193.6 | 1/4 |
| KAT2A-Oxoglutarate dehydrogenase complex | Subcomplex 1 (3) | .002040 | 109.5 | 114.8 | 2/4 |
| LSm1–7 complex | Subcomplex 2 (4) | .000642 | 46.6 | 78.3 | 4/7 |
| LSm2–8 complex | Subcomplex 2 (5) | .000863 | 63.1 | 94.6 | 5/7 |
| MCM2-MCM4-MCM6-MCM7 complex | All proteins defined in the complex by similarity before clustering (4) | .000046 | 377.6 | 466.3 | 4/4 |
| Membrane protein complex (VCP, UFD1L, SEC61B) | Subcomplex (ortho-paralog) 2 (2) | .000405 | 538.7 | 563.7 | 1/3 |
| mRNA decay complex (UPF1, UPF2, UPF3B, DCP2, XRN1, XRN2, EXOSC2, EXOSC4, EXOSC10, PARN) | Subcomplex 1 (3) | .004362 | 173.7 | 293.0 | 3/10 |
| p400-associated complex | Subcomplex 1 (4) | .001597 | 346.2 | 614.7 | 3/6 |
| PA700-20S-PA28 complex | Subcomplex 1 (15) | .000008 | 600.7 | 610.7 | 11/36 |
| PA700-20S-PA28 complex | Subcomplex 2 (21) | .000008 | 761.6 | 876.6 | 15/36 |
| Parvulin-associated pre-rRNP complex | Subcomplex (ortho-paralog) 1 (3) | .000290 | 49.7 | 91.1 | 1/12 |
| RAF1-PPP2-PIN1 complex | Subcomplex 1 (2) | .003018 | 121.9 | 198.7 | 2/5 |
| Retromer complex (SNX1, SNX2, VPS35, VPS29, and VPS26B) | All proteins defined in the complex by similarity before clustering (3) | .002559 | 248.0 | 211.7 | 4/5 |
| Ribosome, cytoplasmic | Subcomplex 2 (2) | .001604 | 42.3 | 68.5 | 2/80 |
| SNW1 complex | Subcomplex 1 (3) | .042039 | 268.6 | 223.1 | 2/18 |
| SNW1 complex | Subcomplex 2 (4) | .002040 | 559.1 | 751.7 | 3/18 |
| SNW1 complex | Subcomplex 3 (5) | .000008 | 50.1 | 90.8 | 1/18 |
| Spliceosome | Subcomplex 1 (3) | .000733 | 64.3 | 93.8 | 4/143 |
| Spliceosome | Subcomplex 2 (2) | .010122 | 57.5 | 61.8 | 2/143 |
| TBCD-ARL2-tubulin (beta-TBCE complex) | Subcomplex 1 (2) | .000970 | 146.7 | 282.0 | 2/4 |

To further justify the machine learning approach, we applied the method to randomly shuffled rice proteins. Specifically, we randomly permuted the protein IDs among the total of 42 160 rice proteins, with 2377 of them having SEC profiles. This permutation generated a random dataset where a SEC profile (if available) did not correspond to its true protein but was instead assigned to a random protein. We conducted the entire process of benchmark analysis on the shuffled data, including CORUM ortholog mapping, SOM clustering analysis, statistical significance testing, and comparison of the Mapp and Mcalc values. Among multiple random simulations with shuffled data, we typically identified either no benchmark or one benchmark.
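A minimal sketch of the shuffling step of this negative control is given below. It only illustrates how SEC profiles can be reassigned to random protein IDs; the function name `shuffle_profiles`, the toy IDs, and the profile values are hypothetical, and the downstream steps (ortholog mapping, SOM clustering, significance testing, mass comparison) are not reproduced.

```python
import random


def shuffle_profiles(protein_ids, profiles, seed=None):
    """Randomly reassign SEC profiles among all protein IDs.

    protein_ids: list of every (unique) protein ID in the proteome.
    profiles: dict mapping the subset of IDs with SEC data to their profiles.
    Returns a new dict in which each profile is attached to a random ID.
    """
    rng = random.Random(seed)
    permuted = list(protein_ids)
    rng.shuffle(permuted)
    # One-to-one relabeling: original ID -> randomly chosen ID
    relabel = dict(zip(protein_ids, permuted))
    return {relabel[pid]: profile for pid, profile in profiles.items()}


# Toy example: 10 proteins, 3 of which have (hypothetical) elution profiles
ids = [f"P{i}" for i in range(10)]
sec = {"P0": [0.1, 0.9, 0.2], "P3": [0.5, 0.4, 0.1], "P7": [0.0, 0.2, 0.8]}
shuffled = shuffle_profiles(ids, sec, seed=1)
```

Because the relabeling is a permutation of the full ID list, the number of profiles and their shapes are preserved while the link between a profile and its true protein is broken.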

Confirmation of monomer identification by the algorithm

Within the set of 103 CORUM orthocomplexes, there existed a list of 50 proteins with small Rapp values (Rapp ≤ 1.6), likely being monomers or multimers with a restricted type of binding partner (Supplemental Table S6). This list, derived directly from the original input data, served as prior information to evaluate and validate the accuracy of our algorithm in discerning monomeric proteins within orthocomplexes. None of these 50 potential monomers were predicted to form subcomplexes, confirming that the algorithm effectively distinguishes monomeric proteins from those likely to be involved in larger complex assemblies.
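The monomer check can be illustrated with a short sketch. It assumes that Rapp is the ratio of a protein's apparent mass to its calculated monomer mass (Mapp/Mmono), as defined earlier in the paper; the helper names and the toy identifiers in the usage lines are hypothetical.

```python
def r_app(m_app: float, m_mono: float) -> float:
    """Rapp, assumed here to be apparent mass divided by monomer mass."""
    if m_mono <= 0:
        raise ValueError("monomer mass must be positive")
    return m_app / m_mono


def is_putative_monomer(m_app: float, m_mono: float, threshold: float = 1.6) -> bool:
    """Flag proteins with Rapp <= threshold (1.6 in the text) as putative monomers."""
    return r_app(m_app, m_mono) <= threshold


def monomers_in_subcomplexes(monomer_ids, subcomplexes):
    """Return putative monomers that were nevertheless assigned to a predicted subcomplex.

    subcomplexes: dict mapping a subcomplex name to a set of protein IDs.
    An empty result corresponds to the outcome reported in the text.
    """
    assigned = set().union(*subcomplexes.values())
    return sorted(set(monomer_ids) & assigned)


# Toy usage with hypothetical IDs
conflicts = monomers_in_subcomplexes(["A", "B"], {"sub1": {"C", "D"}, "sub2": {"E"}})
```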

Structural validation of RNA polymerase II subcomplexes

We identified several novel subcomplexes using our two-stage clustering approach. To validate these predictions, we focused on RNA polymerase II (Pol II) as a model example (Fig. 4). Because no solved structure of rice Pol II is available, we mapped the rice subunits onto the human Pol II structure (PDB: 6O9L). Using the rice CFMS data, we identified 12 reproducible subunits of the Pol II complex, including RPB3, RPB9, and RPB11a, which exhibited multiple resolvable peaks (Fig. 4A). Our clustering method assigned these subunits to four distinct subcomplexes; subcomplexes 1, 3, and 4 had significant P-values (≤.01), whereas subcomplex 2 did not (P-value = .56), suggesting monomeric forms for TFIIB and RPB9 (Fig. 4B). To further validate the predictions, we compared the Mapp values of the subcomplexes with their Mcalc values, which supported the presence of three significant subcomplexes (Fig. 4B and 4D). We also used the AlphaFold-Multimer algorithm to explore potential interactions among subunits. The highest-confidence interaction was observed between RPB3 and RPB11, supporting the subcomplex 1 prediction (Fig. 4C). Structural mapping of the predicted subcomplexes onto the holoenzyme structure (PDB: 6O9L) showed consistency with known features, including the TFIIH and Pol II complexes (Fig. 4D). These findings support the existence of discrete RNA Pol II subcomplexes and highlight the utility of CFMS data in uncovering regulated assembly events. Our results suggest that these subcomplexes may also exist in the cytosol, reflecting potential functional roles beyond the nucleus. Importantly, our approach is not specific to plants and can be applied to other species to detect partial complex assembly.

Figure 4

Validation of subcomplexes predicted in CORUM RNA polymerase II complex. (A) Protein elution profiles of subunits in each predicted subcomplex. Profiles from the second replicate are shown in Supplemental Fig. S3. (B) Mcalc values of predicted subcomplexes and Mapp values of their subunits. Circles indicate multimers, while triangles represent monomers (Rapp ≤ 1.6). (C) Dimerization predictions between subcomplex subunits using AlphaFold-Multimer [32]. AlphaFold-Multimer was run on COSMIC2 to predict the top five models [33]. Each value in the table represents the mean ranking confidence score (ipTM + pTM) ± standard deviation. (D) Structural validation of predicted subcomplexes. Predicted subcomplexes were searched in the RCSB Protein Data Bank (https://www.rcsb.org/). The structure of the fully assembled CORUM RNA polymerase II complex is available (PDB: 6O9L). (1) Undetected subunits in the CFMS dataset are shown in gray. (2)–(4) Subcomplexes were predicted in B. Mcalc and Mapp values of predicted rice subcomplexes are summarized next to the corresponding subcomplex structures.

Benchmark identification in Arabidopsis

We applied the same pipeline to Arabidopsis, as outlined in Fig. 1. InParanoid was used to identify orthologs between 20 834 human proteins and 35 368 Arabidopsis proteins, resulting in 3227 OGs (Supplemental Table S1B). Of the 3047 human complexes in CORUM, 1985 (65.1%) had at least one orthologous subunit in Arabidopsis. Among these, 970 complexes had >2/3 subunit coverage (Supplemental Table S7). Using a previously published Arabidopsis SEC dataset [27], we applied the same filtering criteria as for the rice data to identify reproducible protein elution profiles. This yielded 1738 subunits with Rapp > 1. Of the 970 orthocomplexes, 289 had at least one detected Arabidopsis subunit, and 184 had at least two. After removing redundant complexes, 63 orthocomplexes were selected for further analysis (Supplemental Table S8). Similar to rice, there was poor agreement between Mapp of Arabidopsis subunits and Mcalc of the assumed fully assembled Arabidopsis orthocomplexes (Supplemental Fig. S5).
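The subunit-coverage filter used to select orthocomplexes can be sketched as below. The data layout (a dict from human subunit to a list of plant orthologs) and the function names are assumptions for illustration; only the strict >2/3 threshold and the counts in the final line come from the text.

```python
from fractions import Fraction


def subunit_coverage(human_subunits, orthologs):
    """Fraction of a complex's human subunits that have at least one plant ortholog."""
    subunits = set(human_subunits)
    if not subunits:
        return Fraction(0)
    mapped = {s for s in subunits if orthologs.get(s)}
    return Fraction(len(mapped), len(subunits))


def select_orthocomplexes(complexes, orthologs, min_coverage=Fraction(2, 3)):
    """Keep complexes whose coverage is strictly greater than min_coverage."""
    return {
        name: subunits
        for name, subunits in complexes.items()
        if subunit_coverage(subunits, orthologs) > min_coverage
    }


# Toy example with hypothetical identifiers
orthologs = {"H1": ["AT1"], "H2": ["AT2"], "H3": [], "H4": ["AT4a", "AT4b"]}
complexes = {"cplx_a": ["H1", "H2", "H3"], "cplx_b": ["H1", "H2", "H4"]}
selected = select_orthocomplexes(complexes, orthologs)

# Share of CORUM complexes with at least one Arabidopsis ortholog (counts from the text)
share = round(100 * 1985 / 3047, 1)
```

Exact fractions avoid any floating-point ambiguity at the boundary: a complex with exactly two of three subunits mapped is excluded by the strict inequality.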

We used the same two-stage clustering algorithm applied to rice, identifying 23 Arabidopsis benchmarks with statistically significant P-values and matching Mapp and Mcalc values (Fig. 5A; Supplemental Figs. S6 and S7; Supplemental Table S9). To analyze the extent to which subcomplexes were conserved between a monocot and a dicot species, we compared the 40 rice subcomplexes with the 23 Arabidopsis subcomplexes. A stringent comparison, requiring the complete overlap of all subunits between an Arabidopsis subcomplex and its corresponding rice subcomplex, yielded only 5 predicted benchmarks. This was largely due to non-overlapping proteome coverage obtained from different tissue types and different mass spectrometers. As an alternative approach, we compared Mapp values of subcomplexes predicted in both species for cases in which at least one conserved subunit was detected (Fig. 5B; Supplemental Table S10). Eleven subcomplexes were represented in the datasets from both species and showed the same or similar Mapp values, suggesting conserved assemblies. This result demonstrates the generalizability of our benchmark identification across species that diverged >100 million years ago. However, the agreement was not absolute, as three rice orthologs had an elevated Mapp compared to their Arabidopsis counterparts (Fig. 5B).
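The two comparison modes described above (complete subunit overlap versus at least one conserved subunit) can be sketched as follows. Subunits are compared through ortholog-group (OG) identifiers; the mapping dicts, function names, and toy IDs are hypothetical and are not taken from the published code.

```python
def ortholog_groups(subunits, protein_to_og):
    """Map subunit IDs to ortholog-group IDs; return None if any subunit is unmapped."""
    groups = set()
    for protein in subunits:
        og = protein_to_og.get(protein)
        if og is None:
            return None
        groups.add(og)
    return groups


def compare_subcomplexes(rice_sub, ara_sub, rice_to_og, ara_to_og):
    """Classify a rice/Arabidopsis subcomplex pair as 'complete', 'partial', or 'none'."""
    rice_ogs = ortholog_groups(rice_sub, rice_to_og)
    ara_ogs = ortholog_groups(ara_sub, ara_to_og)
    # Stringent mode: every subunit maps to an OG and both OG sets are identical
    if rice_ogs and ara_ogs and rice_ogs == ara_ogs:
        return "complete"
    # Relaxed mode: at least one ortholog group is shared
    rice_any = {rice_to_og[p] for p in rice_sub if p in rice_to_og}
    ara_any = {ara_to_og[p] for p in ara_sub if p in ara_to_og}
    return "partial" if rice_any & ara_any else "none"


# Toy mappings with hypothetical identifiers
rice_to_og = {"Os1": "OG1", "Os2": "OG2", "Os3": "OG3"}
ara_to_og = {"At1": "OG1", "At2": "OG2", "At9": "OG9"}
```

Pairs classified as "partial" would then be compared by their Mapp values, as in the alternative approach described in the text.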

Figure 5

Benchmarks identified in Arabidopsis. (A) Benchmark subcomplexes/complexes of Arabidopsis with matched Mcalc and Mapp-avg are rendered in pink (in the middle). (B) Conservation of rice and Arabidopsis subcomplexes. Mapp values of rice benchmarks and Mapp values of their orthologous Arabidopsis benchmarks were plotted. The conserved benchmarks (on the diagonal line) and non-conserved ones (above the diagonal line) are reported in Supplemental Table S10.

Discussion

Protein correlation profiling (also known as CFMS) is a powerful tool to predict the multimerization state, composition, and localization of endogenous protein complexes. As mass spectrometry coverage and subcellular fractionation methods improve, CFMS will increasingly link in vitro biochemistry with cell function. A key limitation in CFMS is the lack of data on true protein multimerization states. Solved CORUM structures certainly exist in cells, but the extent to which the fully assembled state dominates the subunit pools is not known. Our analyses of Arabidopsis and rice soluble extracts indicate that CORUM orthocomplexes are rarely fully assembled. We anticipate the same to be true in non-plant systems. This study introduces robust experimental methods and statistical/machine learning approaches to identify more reliable benchmarks that better reflect the multimerization states of CORUM orthologs in cell extracts. Tools like InParanoid and eggNOG enable accurate mapping of orthologs across species, and the two-stage clustering algorithm developed here successfully predicts reliable complexes. Evolutionarily conserved multimeric assemblies were identified, but benchmarks cannot be transferred across species or kingdoms without testing for multimerization variability.

Accurate benchmark databases will improve CFMS predictions, though some errors, such as chance co-elution, remain a challenge due to high sample complexity. Optimizing cluster numbers with benchmarks or conducting serial IEX and SEC separations may help, but increased sample fractionation will be key. Future advances in mass spectrometry and computational tools will enhance our ability to analyze protein multimerization dynamics.

Key points
  • CORUM complexes are not reliable benchmarks for protein complex predictions.

  • Ortholog mapping identifies evolutionarily conserved CORUM complexes.

  • Integrating evolutionary conservation from CORUM and co-elution patterns from CFMS enhances benchmark protein complexes.

Conflict of interest: None declared.

Funding

This work was supported by the National Science Foundation (NSF) Plant Genome Research Project 1951819 to D.B.S.

Data availability

The source code and sample input data for the clustering analysis are publicly available on GitHub (https://github.com/yangpengchengstat/R-code-S4_Class-protein-clustering-based-on-data-integration-of-corum-and-inparanoid.git). The package at the GitHub link contains comprehensive information on running the code, description of the input data, and steps of performing hyperparameter tuning.

References

1. Andersen JS, Wilkinson CJ, Mayor T. et al. Proteomic characterization of the human centrosome by protein correlation profiling. Nature 2003;426:570–4.

2. Kristensen AR, Gsponer J, Foster LJ. A high-throughput approach for measuring temporal changes in the interactome. Nat Methods 2012;9:907–9.

3. Havugimana PC, Hart GT, Nepusz T. et al. A census of human soluble protein complexes. Cell 2012;150:1068–81.

4. Aryal UK, Xiong Y, McBride Z. et al. A proteomic strategy for global analysis of plant protein complexes. Plant Cell 2014;26:3867–82.

5. Aryal UK, McBride Z, Chen D. et al. Analysis of protein complexes in Arabidopsis leaves using size exclusion chromatography and label-free protein correlation profiling. J Proteomics 2017;166:8–18.

6. McBride Z, Chen D, Reick C. et al. Global analysis of membrane-associated protein oligomerization using protein correlation profiling. Mol Cell Proteomics 2017;16:mcp.000276.002017–1989.

7. Gilbert M, Schulze WX. Global identification of protein complexes within the membrane proteome of Arabidopsis roots using a SEC-MS approach. J Proteome Res 2019;18:107–19.

8. Mergner J, Frejno M, List M. et al. Mass-spectrometry-based draft of the Arabidopsis proteome. Nature 2020;579:409–14.

9. McWhite CD, Papoulas O, Drew K. et al. A pan-plant protein complex map reveals deep conservation and novel assemblies. Cell 2020;181:460–474.e414.

10. Lee Y, Okita TW, Szymanski DB. A co-fractionation mass spectrometry-based prediction of protein complex assemblies in the developing rice aleurone-subaleurone. Plant Cell 2021;33:2965–80.

11. Salas D, Stacey GR, Akinlaja M. et al. Next-generation Interactomics: Considerations for the use of co-elution to measure protein interaction networks. Mol Cell Proteomics 2020;19:1–10.

12. Pang CNI, Ballouz S, Weissberger D. et al. Analytical guidelines for co-fractionation mass spectrometry obtained through global profiling of gold standard Saccharomyces cerevisiae protein complexes. Mol Cell Proteomics 2020;19:1876–95.

13. Giurgiu M, Reinhard J, Brauner B. et al. CORUM: The comprehensive resource of mammalian protein complexes—2019. Nucleic Acids Res 2018;47:D559–63.

14. Alberts B, Heald R, Johnson A. et al. Molecular Biology of the Cell. 7th Edition, New York: W. W. Norton & Company, 2022.

15. Sonnhammer ELL, Östlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 2015;43:D234–9.

16. Huerta-Cepas J, Szklarczyk D, Heller D. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res 2018;47:D309–14.

17. Lee Y, Szymanski DB. Multimerization variants as potential drivers of neofunctionalization. Sci Adv 2021;7:eabf0984.

18. Wan C, Borgeson B, Phanse S. et al. Panorama of ancient metazoan macromolecular complexes. Nature 2015;525:339–44.

19. Hendil KB, Hartmann-Petersen R, Tanaka K. 26 S proteasomes function as stable entities. J Mol Biol 2002;315:627–36.

20. Takatsu H, Sakurai M, Shin H-W. et al. Identification and characterization of novel Clathrin adaptor-related proteins. J Biol Chem 1998;273:24693–700.

21. Chen C-Y, Gherzi R, Ong S-E. et al. AU binding proteins recruit the exosome to degrade ARE-containing mRNAs. Cell 2001;107:451–64.

22. Martín-Benito J, Boskovic J, Gómez-Puertas P. et al. Structure of eukaryotic prefoldin and of its complexes with unfolded actin and the cytosolic chaperonin CCT. EMBO J 2002;21:6377–86.

23. Yan C, Dodd T, He Y. et al. Transcription preinitiation complex structure and dynamics provide insight into genetic diseases. Nat Struct Mol Biol 2019;26:397–406.

24. Nesic D, Krämer A. Domains in human splicing factors SF3a60 and SF3a66 required for binding to SF3a120, assembly of the 17S U2 snRNP, and Prespliceosome formation. Mol Cell Biol 2001;21:6406–17.

25. Huberts, Der Klei. et al. Moonlighting proteins: an intriguing mode of multitasking. Biochim Biophys Acta 2010;1803:520–5.

26. Fossati A, Li C, Uliana F. et al. PCprophet: a framework for protein complex prediction and differential analysis using proteomic data. Nat Methods 2021;18:520–7.

27. McBride Z, Chen D, Lee Y. et al. A label-free mass spectrometry method to predict endogenous protein complex composition. Mol Cell Proteomics 2019;18:1588.

28. Kohonen T. The self-organizing map. Proc IEEE 1990;78:1464–80.

29. Liu X, Yang W-C, Gao Q. et al. Toward chromatographic analysis of interacting protein networks. J Chromatogr A 2008;1178:24–32.

30. de Gelder R, Wehrens R, Hageman JA. A generalized expression for the similarity of spectra: application to powder diffraction pattern classification. J Comput Chem 2001;22:273–89.

31. Frey BJ, Dueck D. Clustering by passing messages between data points. Science 2007;315:972–6.

32. Evans R, O’Neill M, Pritzel A. et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2022;2021:2010.2004.463034.

33. Cianfrocco MA, Wong-Barnum M, Youn C. et al. COSMIC2: A science gateway for Cryo-electron microscopy structure determination. In: Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, Vol. 22. New Orleans, LA, USA: Association for Computing Machinery, 2017.

34. Ouyang S, Zhu W, Hamilton J. et al. The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res 2007;35:D883–7.

35. Goodstein DM, Shu S, Howson R. et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 2011;40:D1178–86.

36. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2018.

37. RStudio Team. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc., 2018.

38. Wehrens R, Kruisselbrink J. Flexible self-organizing maps in kohonen 3.0. J Stat Softw 2018;87:1–18.

39. Bodenhofer U, Kothmeier A, Hochreiter S. APCluster: an R package for affinity propagation clustering. Bioinformatics 2011;27:2463–4.

40. Clark JW, Donoghue PCJ. Whole-genome duplication and plant macroevolution. Trends Plant Sci 2018;23:933–45.

Author notes

Pengcheng Yang and Youngwoo Lee contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]