Abstract

Protein complexes are the fundamental units for many cellular processes. Identifying protein complexes accurately is critical for understanding the functions and organizations of cells. With the increment of genome-scale protein–protein interaction (PPI) data for different species, various computational methods focus on identifying protein complexes from PPI networks. In this article, we give a comprehensive and updated review on the state-of-the-art computational methods in the field of protein complex identification, especially focusing on the newly developed approaches. The computational methods are organized into three categories, including cluster-quality-based methods, node-affinity-based methods and ensemble clustering methods. Furthermore, the advantages and disadvantages of different methods are discussed, and then, the performance of 17 state-of-the-art methods is evaluated on two widely used benchmark data sets. Finally, the bottleneck problems and their potential solutions in this important field are discussed.

Introduction

Protein complexes are multimolecular machines composed of proteins that bind by protein–protein interactions (PPIs) [1, 2]. They achieve functions that transcend the sum of individual proteins and carry out nearly all major processes inside cells [3, 4]. Therefore, identifying protein complexes accurately is a critical step for revealing the integration and coordination of cellular functions [5].

There are many techniques for determining protein complexes from cells through biological experiments. Among these techniques, tandem affinity purification with mass spectrometry (TAP-MS) [6, 7] is one of the most popular approaches, and its large-scale experiments revealed a global map of complexome for various species [8–14]. However, protein complexes determined by TAP-MS are often incomplete and unreliable because of the TAP tag used in experiments and inherent technical biases [5]. On the other hand, high-throughput (HTP) techniques, like yeast two-hybrid (Y2H) [15], have determined genome-wide PPI data for different organisms [16–18], and the large-scale PPI networks constructed from these data revealed the global interaction patterns of PPIs [19] and functional organization of cells [1, 20]. Based on PPI networks, many powerful computational predictors have been developed for protein complex identification, and several review papers [21–25] have reviewed the development of this field and summarized the existing computational methods. All these studies have greatly promoted the development of this important field. However, an updated and more comprehensive review is highly required for the following reasons: (1) Outdated. The review papers aforementioned only discuss protein complex predictors before 2013. However, more than 30 computational predictors [26–57] have been proposed since the year, which should be discussed. (2) Incomplete. All these review papers only focus on computational methods aiming to static and dynamic PPI networks, ignoring those methods for interolog PPI networks. A comprehensive review of the methods based on all these three types of networks and the discussions on the relationship of these methods are desired, which will be very useful for insightfully understanding the development of this field. (3) Insufficiently categorizing. The ensemble clustering methods are the new trend for developing computational predictors for protein complex identification. Unfortunately, this important category is ignored by the previous reviews.

In this study, we performed a comprehensive and updated review on the computational methods for identifying protein complexes from PPI networks, concentrating on the recently proposed methods. First, the databases of PPIs and protein complexes are introduced. Second, the three types of PPI networks (static, dynamic and interolog) are discussed. Next, protein complex identification methods are organized into three categories. Furthermore, the performance of 17 state-of-the-art methods is fairly evaluated on two benchmark data sets. Finally, some open challenges and future research directions in this important field are extensively discussed.

PPI databases and protein complex databases

With the development of this important filed, many databases have been established. For examples, DIP [58], BioGRID [59], HPRD [60], MINT [61] and STRING [62] are PPI databases; CYGD [63], CORUM [64], CYC2008[65] and PISA [66] are protein complex databases. These databases will be discussed in the following sections.

PPI databases

DIP [58] is one of the most commonly used PPI databases for protein complex identification, collecting PPIs curated manually by biology experts and automatically by using computational techniques from scientific articles. Furthermore, DIP evaluates the reliability of individual PPIs and provides a highly reliable PPI core subset for each species. The latest version of DIP released in 5 February 2017 contains 76 881 PPIs among 28 255 proteins from more than 800 organisms.

BioGRID [59] is another commonly used PPI database for protein complex identification, where PPIs are mainly determined by HTP approaches [5, 17, 67]. In addition, PPIs drawn from low-throughput studies [68] are used to augment the HTP data. In May 2019, the latest BioGRID version 3.5.172 contains 1 683 670 PPIs for 68 model organisms.

HPRD [60] and MINT [61] are other two useful PPI databases for protein complex identification. HPRD is a human PPI database where the experimental evidences for PPIs are derived based on in vivo and in vitro experiments, as well as Y2H [15]. MINT collects PPIs reported in peer-reviewed journals. In MINT, the reliability of individual PPIs is estimated based on various experimental evidences.

The aforementioned databases are constructed based on the PPIs validated by biological experiments. Therefore, these PPIs are incomplete. In this regard, STRING [62] is developed to integrate different PPI databases and PPIs predicted by computational algorithms. Furthermore, STRING evaluates the confidence score of each PPI by using a scoring system. The latest version of STRING (11.0) contains 3 123 056 667 PPIs among 24 584 628 proteins from 5090 organisms. A summary of these PPI databases is shown in Table 1.

Table 1

The summary of PPI databases

NameVersionSpecies# proteins# PPIsWebsite
DIP [58]Feb 2017All28 25576 881http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]3.5.172All71 6191 683 670https://thebiogrid.org/
HPRD [60]Apr 2010Human30 04741 327http://www.hprd.org/
MINT [61]2019All26 099130 733https://mint.bio.uniroma2.it
STRING [62]11.0All24 584 6283 123 056 667https://string-db.org/
NameVersionSpecies# proteins# PPIsWebsite
DIP [58]Feb 2017All28 25576 881http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]3.5.172All71 6191 683 670https://thebiogrid.org/
HPRD [60]Apr 2010Human30 04741 327http://www.hprd.org/
MINT [61]2019All26 099130 733https://mint.bio.uniroma2.it
STRING [62]11.0All24 584 6283 123 056 667https://string-db.org/
Table 1

The summary of PPI databases

NameVersionSpecies# proteins# PPIsWebsite
DIP [58]Feb 2017All28 25576 881http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]3.5.172All71 6191 683 670https://thebiogrid.org/
HPRD [60]Apr 2010Human30 04741 327http://www.hprd.org/
MINT [61]2019All26 099130 733https://mint.bio.uniroma2.it
STRING [62]11.0All24 584 6283 123 056 667https://string-db.org/
NameVersionSpecies# proteins# PPIsWebsite
DIP [58]Feb 2017All28 25576 881http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]3.5.172All71 6191 683 670https://thebiogrid.org/
HPRD [60]Apr 2010Human30 04741 327http://www.hprd.org/
MINT [61]2019All26 099130 733https://mint.bio.uniroma2.it
STRING [62]11.0All24 584 6283 123 056 667https://string-db.org/

Protein complex databases

The Munich Information Center for Protein Sequence (MIPS) [69] provides protein complex databases for various species, where protein complexes are curated manually by biologists from scientific articles and are hierarchically structured. CYGD [63] is a yeast protein complex database in MIPS, which also documents physical and functional protein interactions and other genetic information. CORUM [64] is another protein complex database in MIPS, collecting mammalian protein complexes verified by small-scale experiments. In the latest version CORUM 3.0, most of the protein complexes are originated from human (67%), mouse (15%) and rat (10%).

CYC2008 [65] is the most used yeast protein complex database in this field, which documents 408 curated heteromeric protein complexes that were validated by small-scale experiments.

PISA [66] is a database of predicted protein complexes based on six different physicochemical properties of protein quaternary structures. A summary of these protein complex databases is shown in Table 2.

Table 2

The summary of protein complex databases

NameVersionSpecies# proteins# complexesWebsite
CYGD [63]2006Yeast1189a203ahttp://mips.gsf.de/genre/proj/yeast
CORUM [64]3.0Mammalian44734274http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]2.0Yeast1627408http://wodaklab.org/cyc2008
PISA [66]1.48AllUnknown12241http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html
NameVersionSpecies# proteins# complexesWebsite
CYGD [63]2006Yeast1189a203ahttp://mips.gsf.de/genre/proj/yeast
CORUM [64]3.0Mammalian44734274http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]2.0Yeast1627408http://wodaklab.org/cyc2008
PISA [66]1.48AllUnknown12241http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html
a

The number of complexes are reported in [70].

Table 2

The summary of protein complex databases

NameVersionSpecies# proteins# complexesWebsite
CYGD [63]2006Yeast1189a203ahttp://mips.gsf.de/genre/proj/yeast
CORUM [64]3.0Mammalian44734274http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]2.0Yeast1627408http://wodaklab.org/cyc2008
PISA [66]1.48AllUnknown12241http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html
NameVersionSpecies# proteins# complexesWebsite
CYGD [63]2006Yeast1189a203ahttp://mips.gsf.de/genre/proj/yeast
CORUM [64]3.0Mammalian44734274http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]2.0Yeast1627408http://wodaklab.org/cyc2008
PISA [66]1.48AllUnknown12241http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html
a

The number of complexes are reported in [70].

The flowchart of identifying protein complexes from dynamic PPI networks. First, using gene expression profiles and subcellular location data to project the static PPI network as different network slices to construct the dynamic PPI network. In each network slice, proteins are expressed at the same time point, and only the proteins in the same subcellular location can interact with each other. Second, clustering methods are performed on each network slice to generate clusters. Finally, merging or filtering highly overlapping clusters to form protein complexes.
Figure 1

The flowchart of identifying protein complexes from dynamic PPI networks. First, using gene expression profiles and subcellular location data to project the static PPI network as different network slices to construct the dynamic PPI network. In each network slice, proteins are expressed at the same time point, and only the proteins in the same subcellular location can interact with each other. Second, clustering methods are performed on each network slice to generate clusters. Finally, merging or filtering highly overlapping clusters to form protein complexes.

PPI networks

PPI networks are constructed from PPI data sets by denoting proteins as nodes and PPIs as edges. Typically, PPI networks are static, assuming every PPI is constant in any cell and condition. However, PPIs are dynamic inside cells, and their occurrences depend on spatial, temporal and/or contextual variation. Therefore, it is more reasonable to model PPI networks as dynamic systems [71]. Besides, as more and more PPI data from different species are available, constructing interolog PPI networks becomes a new trend to transfer biological knowledge across species [72].

Static PPI networks

Usually, static PPI networks are modeled as simple undirected graphs. For preserving the raw information of PPI determination experiments, several studies represent static PPI network as a bipartite graph, where one part of nodes corresponds to bait proteins and the other part corresponds to prey proteins [39, 73, 74]. Because network weights are crucial for protein complex identification [70], different kinds of biological data have been used to assign weights to PPI networks.

First, TAP-MS data are important information source to calculate PPI affinity scores. Different techniques derive scores that quantify the likelihood of two proteins being observed together by using different procedures and adopt these scores as the weights of edges in PPI networks [75].

Second, topological structure information is widely used to quantify the closeness between node pairs in PPI networks. On this aspect, several methods measure the closeness between two connected nodes based on their common neighbors and degrees [38, 76–78]. Later, NEOComplex [53] further accounts the common neighbors of common neighbors for defining the closeness score of two connected nodes. To improve the reliability of weights, AdjustCD [79] and PE-measure [30] are defined in an iterative manner. Because PPI networks are noisy and incomplete, Chen et al. [80] adapt networks by removing edges with low FS-weight that measuring the overlap degree of the neighborhoods of two nodes and introduce edges for indirectly connected nodes with high FS-weight.

Third, Gene Ontology (GO) [81, 82] is commonly used for evaluating PPI reliability. During the past decades, various semantic similarity measures of GO terms have been defined [83] and assigned as the weights of edges in PPI networks. Gene expression data are also widely used for assigning weights to edges in PPI networks. For example, Luo et al. [84] adopt the Pearson correlation coefficient (PCC) of the coding genes of proteins as the weight of an edge. SGNMF [41] assigns the ‘signs’ for edges based on the PCCs of genes, whereas DSMP [85] calculates the distances between gene expression profiles as weights of edges. Besides, ProRank [86] computes the protein sequence similarity score for PPIs as their weights.

Finally, integrating multiple biological data will lead to more accurate weights. In this regard, several approaches combine PPI reliability scores derived from different experimental evidences by different means [87, 88]. Other methods combine topology-based weights with other scores, such as GO similarity score [33, 89, 90] and PCCs of gene expression data [26].

The flowchart of identifying protein complexes from interolog PPI networks. First, the interolog PPI network is constructed by aligning multiple species’ PPI networks. Then, the conserved protein complexes are identified by directly clustering the interolog PPI network. In addition, the orthology protein complex groups are identified by projecting the predicted protein complexes from the source species to other species via the interolog PPI network.
Figure 2

The flowchart of identifying protein complexes from interolog PPI networks. First, the interolog PPI network is constructed by aligning multiple species’ PPI networks. Then, the conserved protein complexes are identified by directly clustering the interolog PPI network. In addition, the orthology protein complex groups are identified by projecting the predicted protein complexes from the source species to other species via the interolog PPI network.

Dynamic PPI networks

In order to capture the dynamic characteristics of PPIs in vivo, a number of studies construct dynamic PPI networks by mapping additional biological information into large-scale PPI data sets [91, 92]. Generally, dynamic PPI networks contain a series of network slices, each consisting of proteins expressed under a condition or at a time point and the corresponding PPIs [92].

Gene expression data are widely used for dynamic PPI network construction. In order to determine expressed proteins or co-expression protein pairs, several methods compare the expression level of proteins with different thresholds [92]. For reducing the impact of inherent background noise in gene expression data, DPPN [52] assigns the active probability at each time point for each protein according to the three-sigma principle [28]. The dynamic PPI networks aforementioned ignore the connections among network slices. To solve this problem, TS-OCD [34] distinguishes stable PPIs and transient PPIs to construct dynamic PPI networks. Stable PPIs are reserved in all network slices, while transient interactions only present in certain time point. Because PPIs also depend on protein subcellular location, several methods constructs more refined dynamic PPI networks by combining protein subcellular information with gene expression data [32, 46]. Other than groups PPIs by time or condition, several methods [55, 93] constructs dynamic PPI networks where each network slice consists of proteins with similar functions and the corresponding PPIs.

Figure 1 shows the general process of identifying protein complexes from dynamic PPI networks. The strategy of identifying protein complexes from dynamic PPI networks is clustering each network slice first and then merges or filters clusters from different network slices based on certain criteria.

Interolog PPI networks

PPIs and protein complexes are frequently conserved in evolution [94]. In order to accurately identify conserved protein complexes and transfer biology information across species, different interolog PPI networks have been developed.

In interolog PPI networks, nodes represent groups of similar proteins, each of which comes from a species, and edges represent conserved PPIs. To determine which proteins are similar and assign edges for which node pairs, various approaches have been developed. The simplest way is assuming every two proteins of different species are similar and linking every two nodes by an edge [95]. A more refined approach is assigning nodes for sequence-similar protein pairs and linking two nodes only if at least one species’ protein pair interacts; meanwhile, the other species’ protein pair share a common interaction partner [96]. In order to control the size of interolog PPI networks, CAPPI [97] groups proteins into families and then assigns a node for each family. The approaches aforementioned ignore the information of interaction partners when determining similarity relationship among proteins. In this regard, IsoRank [72] combines the protein sequences alignment scores and their PPI network neighbor match scores to determine similar protein groups. Since many conserved cellular mechanisms across species cannot be revealed only by protein sequence similarity, COCIN [98] combines domain conservation information with sequence similarity for computing functional similar protein groups.

Figure 2 illustrates the general process of identifying protein complexes from interolog PPI networks. There are two strategies to identify protein complexes from interolog PPI networks. The first strategy directly clusters the interolog PPI network and maps the clustering results to each species. The other strategy clusters the PPI network of a selected species, first, and then maps the identified results into other species via the interolog PPI network.

The schematic of computational methods for protein complex identification. Cluster-quality-based methods and node-affinity-based methods measure protein complexes from different aspect, and ensemble clustering methods are developed based on the former two categories of methods.
Figure 3

The schematic of computational methods for protein complex identification. Cluster-quality-based methods and node-affinity-based methods measure protein complexes from different aspect, and ensemble clustering methods are developed based on the former two categories of methods.

Methods

Computational methods detect meaningful clusters from PPI networks as protein complexes. According to their methodologies, these methods can be organized into three categories, including cluster-quality-based methods, node-affinity-based methods and ensemble clustering methods. Cluster-quality-based methods treat clusters as measure units; node-affinity-based methods measure the affinity between nodes inside clusters. However, the two categories are not completely disconnected. Actually, some node-affinity-based methods select subgraphs with high cluster-quality as seeds to construct clusters. Finally, ensemble clustering methods combine the outputted clusters of different methods to mine final complexes. Figure 3 shows the schematic and relations of those categories of protein complex identification methods. It should be noticed that each clustering method can be applied to all of the three types of PPI networks to predict protein complexes, as long as it is combined with the corresponding protein complex identifying strategy of each type of network as discussed in the former section.

Cluster-quality-based methods

Cluster-quality-based methods define cluster quality functions and design corresponding algorithms to extract clusters from PPI networks. According to their measuring object, cluster quality functions can be further divided into local-cluster-quality functions and global-cluster-quality functions. Local-cluster-quality functions measure individual clusters, while Global-cluster-quality functions measure all clusters by considering them as a whole.

Local-cluster-quality-based methods

Local-cluster-quality-based methods identify individual clusters with optimal local-cluster-quality in a seed growth manner. Clusters are grown from different seeds via iteratively adding or removing nodes.

Because many protein complexes show dense structures in PPI networks [1], several studies directly employ graph density as the local-cluster-quality function of individual clusters. Starting from a randomly connected node set, Spirin et al. [1] maximize the density of a cluster by ‘moving’ nodes into or out from the cluster according to the Metropolis criteria. Finally, statistically insignificant nodes are removed from clusters, and overlapping clusters are merged if their union remains dense. NCMine [51], Core&Peel [49] and TP-WDPIN [45] identify clusters with density larger than a threshold value. NCMine [51] grows a cluster from each node by adding neighbors according to degree centrality order. During addition, the density change of the cluster must be less than a threshold. Finally, highly overlapping clusters are merged if their union remains dense. Core&Peel [49] extracts clusters from the closed neighborhood of each node based on the degree and core number of nodes. TP-WDPIN [45] selects high-topology potential nodes as seeds. The algorithm first extracts cores from the closed neighborhoods of seeds and then generates a cluster from each core by adding all neighbors that increase the density of the core. After generating all clusters, both Core&Peel [49] and TP-WDPIN [45] discard clusters that are completely contained by others. In order to take advantage of edge weights, DME [99] enumerates all local maximum subgraphs with weighted density exceeding a certain threshold as clusters. NEOComplex [53] expands clusters from each node by recursively adding the neighbor with highest weight of edge connected to the newly added node, until the weighted density of the cluster falls below a threshold. Finally, the algorithm discards highly overlapping clusters and all those with sizes of less than three.

In general, a cluster is considered as a network region with internal connections denser than external connections [100], and many quantitative definitions of this property have been proposed. For example, Chen et al. [101] define the quantitative definition of a cluster based on the in-degree and out-degree of nodes inside the cluster and adapt the GN algorithm [102] to generate clusters. miPALM [77] defines the parametric local modularity of individual clusters. Clusters with maximum parametric local modularity are grown from three-node cliques in a greedy way. Finally, miPALM merges highly overlapping clusters and discards sparse clusters. Cho et al. [103, 104] define the entropy for each cluster. Clusters with minimal graph entropy are generated from the closed neighborhood of each node. Since PPI networks often contain many false-positive edges, methods such as HC-PIN [105], LF-PIN [106] and ClusterONE [70] incorporate weights of edges into cluster quantitative definitions. HC-PIN [105] defines the quantitative definition of a cluster based on the weighted in-degree and out-degree of nodes. The algorithm expands singleton clusters by greedily adding edges according to the clustering value order of edges. Finally, all small clusters are discarded. LF-PIN [106] defines the local fitness of a cluster by multiplying the HC-PIN quantitative definition and the density of the cluster. By iteratively selecting the unclustered edge with highest weight as seeds, clusters with maximal local fitness are constructed by adding or removing nodes via a greedy procedure. ClusterONE [70] defines the cohesiveness score of a cluster based on the cluster size and the intra- and extra-cluster weighted degrees of nodes. The algorithm iteratively selects the unclustered node with highest degree as the seed of a cluster. Clusters with maximal cohesiveness score are constructed by greedily adding or removing nodes. Finally, ClusterONE merges highly overlapping clusters and discards all small and sparse clusters. Later, several improved methods have been proposed. PCE-FR [44] detects pseudocliques based on the fuzzy relation between nodes as seeds of clusters; Zhang et al. [43] redefine the cohesiveness score of a cluster based on the number of triangles related to the cluster; EPOF [107] removes the penalty term from the dominator of the cohesiveness score function.

Because PPI data are imperfect, several methods [73, 95, 96, 108] quantify the likelihood ratio score that a subgraph of the PPI network being a protein complex based on different probability models. The first three methods define different types of seeds and generate individual clusters with maximum likelihood ratio score by adding or removing nodes via a greedy procedure. The last method selects each node as a seed and constructs maximal likelihood ratio score clusters by employing the simulated annealing search.

Global-cluster-quality-based methods

Global-cluster-quality-based methods search for an optimal clustering result with best global-cluster-quality function value. In order to obtain a set of dense clusters, RNSC [109] defines a naive and a scaled cost function for the clustering result. The algorithm first minimizes the naive cost and then the scaled cost of the clustering result by moving nodes between clusters via a restricted procedure. Finally, all small, sparse and weak functional homogeneity clusters are removed. MAE-FMD [33] searches for the clustering result with maximum modularity [110] by applying evolutionary operators on different clustering results. Finally, the algorithm merges functionally similar clusters and discards all singleton and sparse clusters. LCMA [111] considers maximum cliques as initial clusters and iteratively merges highly overlapping clusters until the average density of all clusters is less than a parameter. Ramadan et al. [50] define the fitness value of a clustering result based on the density of individual clusters and apply a genetic algorithm to search the clustering result with maximum fitness value. GS [112, 113] defines the graph description length of a clustering result and searches for the clustering result with minimal description length by merging clusters in a greedy manner.

The aforementioned methods are unable to detect clusters containing peripheral proteins. Therefore, several generative network models, such as RSGNM [114], GMFTP [37] and SGNMF [41], have been proposed. These models generate real PPI networks based on a set of propensity parameters that represent how likely a protein belongs to a protein complex. For identifying clusters, the probability of generating real PPI networks is optimized by tuning propensity parameters. RSGNM [114] generates the PPI network according to Poisson distribution and uses the Laplacian regularizer to smooth estimators of propensity parameters. To mitigate the effects of false-positive PPI noise, GMFTP [37] maximizes the joint probability of generating a real PPI network and protein function profiles. In GMFTP, the PPI network and protein function profiles are generated according to Bernoulli distribution. Ou-Yang et al. [47] maximize the joint probability of generating the PPI network and the domain–domain interaction (DDI) network, where the propensity parameters obey the Bernoulli distribution. SGNMF [41] incorporates a Laplacian regularizer into the Kullback–Leibler divergence between the generated network and the real PPI network.

Since some protein complexes do not show dense structures in PPI networks, Wang and Qian [36] identified clusters whose member nodes have similar neighbors. They define the conductance of a clustering result based on the two-hop transition probability between nodes and then propose two algorithms, SLCP2 and GLSP2, to search the minimum conductance clustering result. SLCP2 is a spectral approximate algorithm for identifying non-overlapping clusters, while GLSP2 is a bottom-up greedy strategy-based algorithm and can identify overlapping clusters.

In order to measure clustering results more comprehensively, several methods combine different global-cluster-quality functions to define a unified function. For example, PPSampler [27] defines the unified function of a clustering result as the negative sum of three different score functions. The algorithm searches the clustering result with minimum unified function score by using a Markov chain Monte Carlo method. Later, PPSampler2 [29] alters the distribution of the three score functions; PPSampler2-PIME [40] adds a regularizer of exclusive PPIs into the unified function of PPSampler2.

The flowchart of ensemble clustering methods. The base clusters are generated by applying different methods on the PPI network first, and then they are fused to construct a consensus matrix or a cluster-cluster adjacency matrix. Finally, protein complexes are identified from the consensus matrix or the cluster–cluster adjacency matrix.
Figure 4

The flowchart of ensemble clustering methods. The base clusters are generated by applying different methods on the PPI network first, and then they are fused to construct a consensus matrix or a cluster-cluster adjacency matrix. Finally, protein complexes are identified from the consensus matrix or the cluster–cluster adjacency matrix.

Node-affinity-based methods

Cluster-quality-based methods ignore the inherent relationship among nodes when identifying clusters. To overcome this problem, node-affinity-based methods have been proposed. Node-affinity-based methods usually generate clusters from seeds by expanding neighbors with high-affinity scores. Based on the measuring scope, node-affinity functions can be further divided into local-node-affinity functions and global-node-affinity functions. Local-node-affinity functions measure the local neighborhood affinity between nodes, while global-node-affinity functions measure the global network affinity between nodes.

Table 3

The summary of available web servers, stand-alone tools and source code for protein complex identification

a

‘LQ’ represents the local-cluster-quality-based methods, ‘GQ’ represents the global-cluster-quality-based methods, ‘LA’ represents the Local-node-affinity-based methods and ‘GA’ represents the global-node-affinity methods.

Table 3

The summary of available web servers, stand-alone tools and source code for protein complex identification

a

‘LQ’ represents the local-cluster-quality-based methods, ‘GQ’ represents the global-cluster-quality-based methods, ‘LA’ represents the Local-node-affinity-based methods and ‘GA’ represents the global-node-affinity methods.

Local-node-affinity-based methods

To detect dense clusters, MCODE [115] and SWEMODE [116] quantify local-node-affinity based on node weight. MCODE [115] assigns weight to each node based on the highest k-core that contains the node. Clusters are generated by iteratively taking the unseen highest weight node as the seed and then recursively adding unseen neighbors. When all clusters have been generated, the algorithm further processes them by the ‘fluff’ and ‘haircut’ operation. Later, SWEMODE [116] proposes two different schemes to assign weights to nodes. Some methods quantify local-node-affinity based on the number of edges among nodes. For example, Spirin et al. [1] suggest that nodes inside a cluster should completely link to each other and enumerate all maximum cliques in the PPI network as clusters. CFinder [117], DPClus [76] and ICPA [118] assume that the intra-cluster degree of each node should be larger than a threshold value. CFinder [117] sets the threshold to be a constant value and identifies k-clique communities as clusters. DPClus [76] sets the threshold of a cluster based on the density and size of the cluster. The algorithm generates a cluster by selecting the node with the highest weighted degree as the seed and then adding neighbors according to node priorities. During expanding, the density of the cluster must be higher than a threshold. When a cluster cannot be further expanded, it is removed from the network for generating the next cluster. IPCA [118] sets the threshold of a cluster based on the size of the cluster. The algorithm constructs clusters in a procedure similar as that of DPClus [76] but ensures the diameter of a cluster is less than a threshold during expanding. Furthermore, nodes would not be removed from the network after a cluster is generated. SPICi [119], OIIP [89] and MKE [35] suggest that the intra-cluster weighted degree of each node should be larger than a threshold. SPICi [119] defines the threshold based on the weighted density and size of the cluster and generates clusters in a procedure similar to that of DPClus [76]. OIIP [89] defines the threshold based on the total edge weights inside the cluster and generates clusters in a procedure similar to that of ICPA [118]. MKE [35] defines the threshold based on the edge weights of the closed neighborhood of the cluster. The algorithm generates clusters by taking high degree nodes as seeds and then iteratively adding weighted best neighbors. CMC [79] suggests that the connectivity between nodes inside a cluster should be larger than a threshold. The algorithm mines all maximal cliques in the PPI network as initial clusters and then ranks them in decreasing weighted density order. Finally, for every highly overlapping clique pair, CMC either discards the one with lower weighted density or merges them into one clique according to the interconnectivity score between them. SCAN [120] assumes that two nodes inside a cluster should share many common neighbors. The algorithm defines the structural similarity for connected node pairs and generates clusters by selecting core nodes as seeds and then recursively including structure-reachable nodes.

Because many real protein complexes exhibit the core attachment architecture [8], some algorithms first detect dense regions in PPI networks as cores and then integrate attachment nodes with each core to form clusters. For example, Pu et al. [121] adopt the non-overlapping clusters outputted by existing methods as cores and take neighbors connected with sufficient fraction of a core as the attachment nodes of that core. CORE [122], COACH [123] and PEWCC [30] take neighbors that connected with more than half of the nodes of a core as the attachment nodes of that core. CORE [122] identifies cores based on the probability of two nodes forming a protein complex core; COACH [123] identifies dense cores from the closed neighborhood of each node by using the core removal algorithm; PEWCC [30] identifies cores with weighted clustering coefficient larger than a threshold by using a procedure similar to that of COACH [123]. DCU [38] identifies high expected density cores from the closed neighborhood of each node via a greedy procedure. The algorithm selects neighbors with relative degree inside a core larger than a threshold as the attachment nodes of that core. WPNCA [42] extracts cores with high weighted density from dense preliminary clusters. For each core, its attachment nodes are selected from the preliminary cluster containing it.

Different from previous conclusions, Chen et al. [124] discover that most protein complexes are sparse in PPI networks. They identify starlike clusters by taking high degree nodes as seeds and then adding neighbors via a random procedure. ProRank [86] and ProRank+ [57] apply the spoke model to generate clusters based on node types and importance values and then merge highly overlapping clusters by different procedures. Dutkowski et al. [97] assume that nodes connected by high weighted edges are likely to form a cluster. They compute edge weights by using a PPI network evolution model, and then remove all low weight edges from the PPI network. After that, each connected component of the remaining network is outputted as a cluster.

Combining additional biological data will predict more biological meaningful clusters. For example, CSO [31] integrates the GO slims [81, 82] of proteins and the core attachment architecture to identify clusters. SPIC [125] splits clusters generated by existing methods into smaller and more refined ones by eliminating mutually exclusive PPIs. Ozawa et al. [126] discard clusters according to DDIs.

Global-node-affinity-based methods

In order to identify dense clusters, several methods such as MCL [127, 128], R-MCL [129], SR-MCL [130] and STM [131] simulate flows inside PPI networks. MCL [127, 128] alternately performs the expansion and inflation operator to the stochastic matrix of a PPI network. The expansion operator simulates flow through nodes, and the inflation operator is responsible for strengthening intra-cluster currents, meanwhile weakening inter-cluster currents. When the stochastic matrix became doubly idempotent, each attractor system in it is outputted as a cluster. To minimize the number of unreasonable size clusters, R-MCL [129] replaces the expansion operator with the ‘Regularize’ operator. In order to identify overlapping clusters, SR-MCL [130] modifies the ‘Regularize’ operator. The algorithm is iteratively re-executed via decreasing the flow to attractor nodes of previous iterations to generate different clustering results. Cho et al. [132] propagate information from an informative node to every other node in the PPI network through every possible path and take each informative node with its information coverage as a cluster. Finally, strong interconnection clusters are merged in a greedy manner. STM [131] propagates signal between every node pair based on the Erlang distribution and then generates clusters based on the signal value received among nodes. Other methods such as RRW [133], NWE [134] and RWRB [39] identify clusters based on random walks with restart. RRW [133] assumes that nodes inside a cluster have equal impact on determining new nodes of the cluster. By selecting high nodes as seeds, the algorithm expands clusters with restricted size in a greedy manner. NWE [134] argues that the nodes with different weight should have different impact on determining new nodes of a cluster. RWRB [39] incorporates the protein bait–prey relationship from TAP-MS data onto the random walk with restart and generates clusters in a procedure similar to that of MCL [127, 128].

Ensemble clustering methods

Figure 4 shows the general process of how ensemble clustering methods identify protein complexes from PPI networks. Each computational method has its unique advantages and shortcomings. Therefore, ensemble clustering methods are proposed to overcome the bias of different methods and generate more accurate clusters. According to their fusing strategies, ensemble clustering methods can be divided into consensus methods and meta-clustering methods.

Consensus methods fuse base clusters by different approa-ches to construct a consensus matrix and then identify final complexes from it. In the consensus matrix, each entry indicates the co-clustered strength of two proteins. Asur et al. [135] calculate the co-clustered strength of two proteins based on the reliability score of base clusters containing the two proteins. TINCD [48] constructs the final consensus matrix by fusing the TAP data based co-complex score matrix and consensus matrices of different PPI networks and proposes a new algorithm to identify final clusters. EnsemHC [54] also combines the co-complex scores that were derived from TAP-MS experiments [75] and the co-clustered strength of pairs of proteins to define a co-cluster matrix for each base clustering result. The consensus matrix is then constructed by fusing co-cluster matrices via an iterative procedure. Friedel et al. [136] define the co-clustered strength of a protein pair as the proportion of Bootstrap sample PPI networks where they are clustered together.

Meta-clustering methods further cluster base clustering results by considering each base cluster as a unit. Greene et al. [137] represent base clustering results into a complete weighted graph, where nodes represents base clusters and the weights of edges are assigned as the [0, 1]-normalized Pearson correlation between base clusters. By considering each node as a meta-cluster, they iteratively merge meta-clusters in a greedy manner to construct a hierarchical tree and then remove all unstable subtrees from the hierarchical tree.

Evaluation

In this section, we comprehensively evaluate the performance of different computational methods on two widely used benchmark PPI datasets, that is, the DIP dataset [58] and the Collins dataset [138].

Evaluation metrics

Seven metrics are commonly used for evaluating protein complex identification methods, including Precision, Recall, F-measure, Sensitivity (Sn), Positive Predictive Value (PPV), Accuracy (Acc), and Proportion of Significant Complexes (PSC).

Given a pair of protein sets A and B, the overlap score (OS) between them is defined as [115]
(1)
If |$\mathrm{OS}\Big(A,B\Big)\ge \upomega$|⁠, A and B are considered to match each other. Let P and R represent the sets of identified complexes and reference complexes, respectively, Precision, Recall and F-measure are defined as [80]
(2)
 
(3)
 
(4)
Given n reference protein complexes and m identified clusters, a confusion matrix T can be represented as
(5)
where |${t}_{ij}$| indicates the number of proteins that the ith protein complex and the jth cluster have in common. Based on T, Sn, PPV and Acc are defined as [139]
(6)
 
(7)
 
(8)

In Equation (6), |${N}_i$| is the size of the ith reference protein complex.

Because known protein complexes are incomplete, a predicted protein complex still could be a true complex even without matching any known complexes. Therefore, the PSC is adopted to evaluate predicted results [21]. For a predicted protein complex, if its functional homogeneity P-value does not exceed a threshold value, then it is considered to be functional significant. The P-value of a predicted protein complex C with respect to a functional homologous protein group F is defined as [140, 141]
(8)
where k is the number of common proteins shared by C and F and V is the entire protein set of the PPI network. For each predicted complex, its smallest P-value over all functional homologous protein groups is reported as its P-value [109].
Table 4

Performance comparison of 17 computational methods using CYC2008 as reference protein complex set

CategoryaMethodb,  c# predicted complexesPrecisionRecallF-measureSnPPVACC
(a) DIP dataset
LQClusterONE [70]3610.4210.3530.3840.3980.6640.514
Core&Peel [49]7880.4560.4660.4610.4610.5530.505
NCMine [51]15200.3180.5100.3920.5170.3970.453
GQRNSC [109]2380.5550.3820.4530.4060.6660.520
RSGNM [114]5600.3320.4800.3930.5380.5540.546
SGNMF [41]6860.3100.5660.4010.5940.5800.587
LAMCODE [115]1370.6280.2520.3600.3060.6320.440
CFinder [117]1330.6320.2450.3530.4820.3760.425
DPClus [76]3220.4940.4260.4580.4670.5800.521
SPICi [119]3250.3600.3530.3560.4990.5230.511
CMC [79]5600.5270.5170.5220.5910.5150.552
CORE [122]16440.1550.6300.2490.5970.4970.545
PEWCC [30]12820.5000.5740.5340.5950.4930.542
ProRank+ [57]4330.2700.2620.2660.3550.4820.414
GAMCL [127, 128]15700.1990.7080.3110.5160.7050.603
RRW [133]3840.4660.4240.4440.4600.4920.476
NWE [134]3650.4930.4210.4550.4350.5940.508
(b) Collins dataset
LQClusterONE [70]3230.6250.5690.5960.6580.6250.642
Core&Peel [49]4070.5260.3700.4340.5090.6000.553
NCMine [51]8220.6910.3850.4940.5680.5570.563
GQRNSC [109]3520.5650.5640.5650.6020.7320.664
RSGNM [114]2830.5550.4680.5080.5790.6020.590
SGNMF [41]2900.6660.4850.5610.5640.6330.598
LAMCODE [115]1420.7460.3310.4590.5860.5150.549
CFinder [117]1140.7890.2720.4050.5690.4920.529
DPClus [76]3140.6150.5590.5850.6200.6800.649
SPICi [119]1300.8150.3550.4950.5420.6430.590
CMC [79]2250.8000.3330.4710.5500.5790.565
CORE [122]3840.5390.5660.5520.6220.6580.640
PEWCC [30]5890.7880.3060.4410.5070.5860.545
ProRank+ [57]6590.7170.3630.4820.5770.4850.529
GAMCL [127, 128]2720.6400.5470.5890.6770.6410.659
RRW [133]2780.5400.4410.4850.4940.6940.586
NWE [134]5270.5070.6300.5620.5890.7080.645
CategoryaMethodb,  c# predicted complexesPrecisionRecallF-measureSnPPVACC
(a) DIP dataset
LQClusterONE [70]3610.4210.3530.3840.3980.6640.514
Core&Peel [49]7880.4560.4660.4610.4610.5530.505
NCMine [51]15200.3180.5100.3920.5170.3970.453
GQRNSC [109]2380.5550.3820.4530.4060.6660.520
RSGNM [114]5600.3320.4800.3930.5380.5540.546
SGNMF [41]6860.3100.5660.4010.5940.5800.587
LAMCODE [115]1370.6280.2520.3600.3060.6320.440
CFinder [117]1330.6320.2450.3530.4820.3760.425
DPClus [76]3220.4940.4260.4580.4670.5800.521
SPICi [119]3250.3600.3530.3560.4990.5230.511
CMC [79]5600.5270.5170.5220.5910.5150.552
CORE [122]16440.1550.6300.2490.5970.4970.545
PEWCC [30]12820.5000.5740.5340.5950.4930.542
ProRank+ [57]4330.2700.2620.2660.3550.4820.414
GAMCL [127, 128]15700.1990.7080.3110.5160.7050.603
RRW [133]3840.4660.4240.4440.4600.4920.476
NWE [134]3650.4930.4210.4550.4350.5940.508
(b) Collins dataset
LQClusterONE [70]3230.6250.5690.5960.6580.6250.642
Core&Peel [49]4070.5260.3700.4340.5090.6000.553
NCMine [51]8220.6910.3850.4940.5680.5570.563
GQRNSC [109]3520.5650.5640.5650.6020.7320.664
RSGNM [114]2830.5550.4680.5080.5790.6020.590
SGNMF [41]2900.6660.4850.5610.5640.6330.598
LAMCODE [115]1420.7460.3310.4590.5860.5150.549
CFinder [117]1140.7890.2720.4050.5690.4920.529
DPClus [76]3140.6150.5590.5850.6200.6800.649
SPICi [119]1300.8150.3550.4950.5420.6430.590
CMC [79]2250.8000.3330.4710.5500.5790.565
CORE [122]3840.5390.5660.5520.6220.6580.640
PEWCC [30]5890.7880.3060.4410.5070.5860.545
ProRank+ [57]6590.7170.3630.4820.5770.4850.529
GAMCL [127, 128]2720.6400.5470.5890.6770.6410.659
RRW [133]2780.5400.4410.4850.4940.6940.586
NWE [134]5270.5070.6300.5620.5890.7080.645

aSee the footnote of Table 3.

bAll methods are rerun by us. The predicted result of each method is obtained by tuning its parameters to maximize F-measure, and the optimized parameter values are given in Table S1 in Supporting Information S1.

cThe best performance is highlighted with bold font, the performance ranked second is highlighted with underline and the performance ranked third is highlighted with italic in each column.

Table 4

Performance comparison of 17 computational methods using CYC2008 as reference protein complex set

CategoryaMethodb,  c# predicted complexesPrecisionRecallF-measureSnPPVACC
(a) DIP dataset
LQClusterONE [70]3610.4210.3530.3840.3980.6640.514
Core&Peel [49]7880.4560.4660.4610.4610.5530.505
NCMine [51]15200.3180.5100.3920.5170.3970.453
GQRNSC [109]2380.5550.3820.4530.4060.6660.520
RSGNM [114]5600.3320.4800.3930.5380.5540.546
SGNMF [41]6860.3100.5660.4010.5940.5800.587
LAMCODE [115]1370.6280.2520.3600.3060.6320.440
CFinder [117]1330.6320.2450.3530.4820.3760.425
DPClus [76]3220.4940.4260.4580.4670.5800.521
SPICi [119]3250.3600.3530.3560.4990.5230.511
CMC [79]5600.5270.5170.5220.5910.5150.552
CORE [122]16440.1550.6300.2490.5970.4970.545
PEWCC [30]12820.5000.5740.5340.5950.4930.542
ProRank+ [57]4330.2700.2620.2660.3550.4820.414
GAMCL [127, 128]15700.1990.7080.3110.5160.7050.603
RRW [133]3840.4660.4240.4440.4600.4920.476
NWE [134]3650.4930.4210.4550.4350.5940.508
(b) Collins dataset
LQClusterONE [70]3230.6250.5690.5960.6580.6250.642
Core&Peel [49]4070.5260.3700.4340.5090.6000.553
NCMine [51]8220.6910.3850.4940.5680.5570.563
GQRNSC [109]3520.5650.5640.5650.6020.7320.664
RSGNM [114]2830.5550.4680.5080.5790.6020.590
SGNMF [41]2900.6660.4850.5610.5640.6330.598
LAMCODE [115]1420.7460.3310.4590.5860.5150.549
CFinder [117]1140.7890.2720.4050.5690.4920.529
DPClus [76]3140.6150.5590.5850.6200.6800.649
SPICi [119]1300.8150.3550.4950.5420.6430.590
CMC [79]2250.8000.3330.4710.5500.5790.565
CORE [122]3840.5390.5660.5520.6220.6580.640
PEWCC [30]5890.7880.3060.4410.5070.5860.545
ProRank+ [57]6590.7170.3630.4820.5770.4850.529
GAMCL [127, 128]2720.6400.5470.5890.6770.6410.659
RRW [133]2780.5400.4410.4850.4940.6940.586
NWE [134]5270.5070.6300.5620.5890.7080.645
CategoryaMethodb,  c# predicted complexesPrecisionRecallF-measureSnPPVACC
(a) DIP dataset
LQClusterONE [70]3610.4210.3530.3840.3980.6640.514
Core&Peel [49]7880.4560.4660.4610.4610.5530.505
NCMine [51]15200.3180.5100.3920.5170.3970.453
GQRNSC [109]2380.5550.3820.4530.4060.6660.520
RSGNM [114]5600.3320.4800.3930.5380.5540.546
SGNMF [41]6860.3100.5660.4010.5940.5800.587
LAMCODE [115]1370.6280.2520.3600.3060.6320.440
CFinder [117]1330.6320.2450.3530.4820.3760.425
DPClus [76]3220.4940.4260.4580.4670.5800.521
SPICi [119]3250.3600.3530.3560.4990.5230.511
CMC [79]5600.5270.5170.5220.5910.5150.552
CORE [122]16440.1550.6300.2490.5970.4970.545
PEWCC [30]12820.5000.5740.5340.5950.4930.542
ProRank+ [57]4330.2700.2620.2660.3550.4820.414
GAMCL [127, 128]15700.1990.7080.3110.5160.7050.603
RRW [133]3840.4660.4240.4440.4600.4920.476
NWE [134]3650.4930.4210.4550.4350.5940.508
(b) Collins dataset
LQClusterONE [70]3230.6250.5690.5960.6580.6250.642
Core&Peel [49]4070.5260.3700.4340.5090.6000.553
NCMine [51]8220.6910.3850.4940.5680.5570.563
GQRNSC [109]3520.5650.5640.5650.6020.7320.664
RSGNM [114]2830.5550.4680.5080.5790.6020.590
SGNMF [41]2900.6660.4850.5610.5640.6330.598
LAMCODE [115]1420.7460.3310.4590.5860.5150.549
CFinder [117]1140.7890.2720.4050.5690.4920.529
DPClus [76]3140.6150.5590.5850.6200.6800.649
SPICi [119]1300.8150.3550.4950.5420.6430.590
CMC [79]2250.8000.3330.4710.5500.5790.565
CORE [122]3840.5390.5660.5520.6220.6580.640
PEWCC [30]5890.7880.3060.4410.5070.5860.545
ProRank+ [57]6590.7170.3630.4820.5770.4850.529
GAMCL [127, 128]2720.6400.5470.5890.6770.6410.659
RRW [133]2780.5400.4410.4850.4940.6940.586
NWE [134]5270.5070.6300.5620.5890.7080.645

aSee the footnote of Table 3.

bAll methods are rerun by us. The predicted result of each method is obtained by tuning its parameters to maximize F-measure, and the optimized parameter values are given in Table S1 in Supporting Information S1.

cThe best performance is highlighted with bold font, the performance ranked second is highlighted with underline and the performance ranked third is highlighted with italic in each column.

Performance comparison

We compare the performance of 17 state-of-the-art computational methods: ClusterONE [70], Core&Peel [49] and NCmine [51] are local-cluster-quality-based methods; RNSC [109], RSGNM [114] and SGNMF [41] are global-cluster-quality-based methods; MCODE [115], CFinder [117], DPClus [76], SPICi [119], CMC [79], CORE [122], PEWCC [30] and ProRank+ [57] are local-node-affinity-based methods and MCL [127, 128], RRW [133] and NWE [134] are global-node-affinity methods. The reasons for selecting these methods are as follows: (1) they achieved the best performance in this field and (2) the implementations of these methods are available. These methods were evaluated on two widely used benchmark PPI datasets, including the DIP dataset [58] and the Collins dataset [138]. The DIP dataset is constructed based on the latest version of DIP database [58], containing 22 763 PPIs among 5015 yeast proteins. The Collins dataset [138] contains 9074 high-confidence PPIs among 1622 yeast proteins. We rerun the 17 methods (Table 3) on the two benchmark datasets. It is worth noting that the previous reviews [21–25] only evaluated the predictors published in 2010 and before. In this study, 7 of the 17 evaluated methods are proposed after 2010.

Table 5

Performance comparison of 17 methods on for identifying small and large protein complexes using CYC2008 as reference protein complex set

CategoryaMethodb,cSmall complexesLarge complexes
# predicted small complexesPrecisionRecallF-measure# Predicted large complexesPrecisionRecallF-measure
(a) DIP dataset
LQClusterONE [70]1420.0700.0390.0502190.4250.5370.474
Core&Peel [49]3050.1510.1700.1604830.4990.5640.529
NCMine [51]3170.1290.1310.13012030.2600.6510.372
GQRNSC [109]1020.2650.1160.1611360.4710.5230.496
RSGNM [114]1310.1530.0770.1034290.2210.6170.326
SGNMF [41]2110.0940.0810.0874750.2270.7250.346
LAMCODE [115]590.3220.0810.130780.6030.3830.468
CFinder [117]01330.3680.3960.382
DPClus [76]970.2370.0890.1292250.3780.6110.467
SPICi [119]220.3640.0310.0573030.2210.5440.314
CMC [79]750.2530.0730.1144850.3960.7110.509
CORE [122]12920.0860.4050.1423520.2240.5970.326
PEWCC [30]160.0630.0040.00712660.3520.7450.478
ProRank+ [57]2590.0850.0850.0851740.3390.3090.323
GAMCL [127, 128]12140.1250.5020.2003560.1990.5300.290
RRW [133]03840.2940.6580.407
NWE [134]2180.2060.1740.1891470.5030.4900.497
(b) Collins dataset
LQClusterONE [70]1730.4620.3360.3891500.5270.6310.574
Core&Peel [49]620.2900.0660.1073450.4410.6440.523
NCMine [51]08220.6170.6910.652
GQRNSC [109]2560.3790.4050.392960.7400.5840.653
RSGNM [114]400.2750.0580.0962430.4120.6980.518
SGNMF [41]650.2460.0730.1132250.5070.6780.580
LAMCODE [115]170.4120.0270.0511250.6320.6040.618
CFinder [117]430.3490.0620.105710.7460.4500.561
DPClus [76]1760.4550.3350.3861380.5720.6440.606
SPICi [119]280.5360.0730.1291020.6760.5970.634
CMC [79]260.3850.0370.0701990.7290.6110.665
CORE [122]2650.3660.4130.3881190.6220.6040.612
PEWCC [30]05890.7320.6380.681
ProRank+ [57]930.2800.0970.1445660.7120.6510.680
GAMCL [127, 128]1640.4760.3280.3881080.6390.6240.631
RRW [133]1610.2420.1780.2051170.6240.5910.607
NWE [134]4400.3270.5410.408870.7700.5230.623
CategoryaMethodb,cSmall complexesLarge complexes
# predicted small complexesPrecisionRecallF-measure# Predicted large complexesPrecisionRecallF-measure
(a) DIP dataset
LQClusterONE [70]1420.0700.0390.0502190.4250.5370.474
Core&Peel [49]3050.1510.1700.1604830.4990.5640.529
NCMine [51]3170.1290.1310.13012030.2600.6510.372
GQRNSC [109]1020.2650.1160.1611360.4710.5230.496
RSGNM [114]1310.1530.0770.1034290.2210.6170.326
SGNMF [41]2110.0940.0810.0874750.2270.7250.346
LAMCODE [115]590.3220.0810.130780.6030.3830.468
CFinder [117]01330.3680.3960.382
DPClus [76]970.2370.0890.1292250.3780.6110.467
SPICi [119]220.3640.0310.0573030.2210.5440.314
CMC [79]750.2530.0730.1144850.3960.7110.509
CORE [122]12920.0860.4050.1423520.2240.5970.326
PEWCC [30]160.0630.0040.00712660.3520.7450.478
ProRank+ [57]2590.0850.0850.0851740.3390.3090.323
GAMCL [127, 128]12140.1250.5020.2003560.1990.5300.290
RRW [133]03840.2940.6580.407
NWE [134]2180.2060.1740.1891470.5030.4900.497
(b) Collins dataset
LQClusterONE [70]1730.4620.3360.3891500.5270.6310.574
Core&Peel [49]620.2900.0660.1073450.4410.6440.523
NCMine [51]08220.6170.6910.652
GQRNSC [109]2560.3790.4050.392960.7400.5840.653
RSGNM [114]400.2750.0580.0962430.4120.6980.518
SGNMF [41]650.2460.0730.1132250.5070.6780.580
LAMCODE [115]170.4120.0270.0511250.6320.6040.618
CFinder [117]430.3490.0620.105710.7460.4500.561
DPClus [76]1760.4550.3350.3861380.5720.6440.606
SPICi [119]280.5360.0730.1291020.6760.5970.634
CMC [79]260.3850.0370.0701990.7290.6110.665
CORE [122]2650.3660.4130.3881190.6220.6040.612
PEWCC [30]05890.7320.6380.681
ProRank+ [57]930.2800.0970.1445660.7120.6510.680
GAMCL [127, 128]1640.4760.3280.3881080.6390.6240.631
RRW [133]1610.2420.1780.2051170.6240.5910.607
NWE [134]4400.3270.5410.408870.7700.5230.623

aSee the footnote of Table 3.

bSee the footnote of Table 4.

cSee the footnote of Table 4.

Table 5

Performance comparison of 17 methods on for identifying small and large protein complexes using CYC2008 as reference protein complex set

CategoryaMethodb,cSmall complexesLarge complexes
# predicted small complexesPrecisionRecallF-measure# Predicted large complexesPrecisionRecallF-measure
(a) DIP dataset
LQClusterONE [70]1420.0700.0390.0502190.4250.5370.474
Core&Peel [49]3050.1510.1700.1604830.4990.5640.529
NCMine [51]3170.1290.1310.13012030.2600.6510.372
GQRNSC [109]1020.2650.1160.1611360.4710.5230.496
RSGNM [114]1310.1530.0770.1034290.2210.6170.326
SGNMF [41]2110.0940.0810.0874750.2270.7250.346
LAMCODE [115]590.3220.0810.130780.6030.3830.468
CFinder [117]01330.3680.3960.382
DPClus [76]970.2370.0890.1292250.3780.6110.467
SPICi [119]220.3640.0310.0573030.2210.5440.314
CMC [79]750.2530.0730.1144850.3960.7110.509
CORE [122]12920.0860.4050.1423520.2240.5970.326
PEWCC [30]160.0630.0040.00712660.3520.7450.478
ProRank+ [57]2590.0850.0850.0851740.3390.3090.323
GAMCL [127, 128]12140.1250.5020.2003560.1990.5300.290
RRW [133]03840.2940.6580.407
NWE [134]2180.2060.1740.1891470.5030.4900.497
(b) Collins dataset
LQClusterONE [70]1730.4620.3360.3891500.5270.6310.574
Core&Peel [49]620.2900.0660.1073450.4410.6440.523
NCMine [51]08220.6170.6910.652
GQRNSC [109]2560.3790.4050.392960.7400.5840.653
RSGNM [114]400.2750.0580.0962430.4120.6980.518
SGNMF [41]650.2460.0730.1132250.5070.6780.580
LAMCODE [115]170.4120.0270.0511250.6320.6040.618
CFinder [117]430.3490.0620.105710.7460.4500.561
DPClus [76]1760.4550.3350.3861380.5720.6440.606
SPICi [119]280.5360.0730.1291020.6760.5970.634
CMC [79]260.3850.0370.0701990.7290.6110.665
CORE [122]2650.3660.4130.3881190.6220.6040.612
PEWCC [30]05890.7320.6380.681
ProRank+ [57]930.2800.0970.1445660.7120.6510.680
GAMCL [127, 128]1640.4760.3280.3881080.6390.6240.631
RRW [133]1610.2420.1780.2051170.6240.5910.607
NWE [134]4400.3270.5410.408870.7700.5230.623
CategoryaMethodb,cSmall complexesLarge complexes
# predicted small complexesPrecisionRecallF-measure# Predicted large complexesPrecisionRecallF-measure
(a) DIP dataset
LQClusterONE [70]1420.0700.0390.0502190.4250.5370.474
Core&Peel [49]3050.1510.1700.1604830.4990.5640.529
NCMine [51]3170.1290.1310.13012030.2600.6510.372
GQRNSC [109]1020.2650.1160.1611360.4710.5230.496
RSGNM [114]1310.1530.0770.1034290.2210.6170.326
SGNMF [41]2110.0940.0810.0874750.2270.7250.346
LAMCODE [115]590.3220.0810.130780.6030.3830.468
CFinder [117]01330.3680.3960.382
DPClus [76]970.2370.0890.1292250.3780.6110.467
SPICi [119]220.3640.0310.0573030.2210.5440.314
CMC [79]750.2530.0730.1144850.3960.7110.509
CORE [122]12920.0860.4050.1423520.2240.5970.326
PEWCC [30]160.0630.0040.00712660.3520.7450.478
ProRank+ [57]2590.0850.0850.0851740.3390.3090.323
GAMCL [127, 128]12140.1250.5020.2003560.1990.5300.290
RRW [133]03840.2940.6580.407
NWE [134]2180.2060.1740.1891470.5030.4900.497
(b) Collins dataset
LQClusterONE [70]1730.4620.3360.3891500.5270.6310.574
Core&Peel [49]620.2900.0660.1073450.4410.6440.523
NCMine [51]08220.6170.6910.652
GQRNSC [109]2560.3790.4050.392960.7400.5840.653
RSGNM [114]400.2750.0580.0962430.4120.6980.518
SGNMF [41]650.2460.0730.1132250.5070.6780.580
LAMCODE [115]170.4120.0270.0511250.6320.6040.618
CFinder [117]430.3490.0620.105710.7460.4500.561
DPClus [76]1760.4550.3350.3861380.5720.6440.606
SPICi [119]280.5360.0730.1291020.6760.5970.634
CMC [79]260.3850.0370.0701990.7290.6110.665
CORE [122]2650.3660.4130.3881190.6220.6040.612
PEWCC [30]05890.7320.6380.681
ProRank+ [57]930.2800.0970.1445660.7120.6510.680
GAMCL [127, 128]1640.4760.3280.3881080.6390.6240.631
RRW [133]1610.2420.1780.2051170.6240.5910.607
NWE [134]4400.3270.5410.408870.7700.5230.623

aSee the footnote of Table 3.

bSee the footnote of Table 4.

cSee the footnote of Table 4.

Tables 4 shows the overall performance of the 17 methods in terms of Precision, Recall, F-measure, Sn, PPV and Acc on the two benchmark datasets, respectively. In these experiments, the CYC2008 database [65] is used as the reference protein complexes set, and the match threshold |$\upomega$| is set as 0.2 by following those studies [21, 53, 115]. We mainly focus on Precision, Recall and F-measure because Sn, PPV and Acc have severe limitations [21]. The parameters of each method are optimized based on the F-measure, and the optimized values are given in Table S1 in Supporting Information S1. From these tables, we observe the following phenomena. (i) The performance of most methods on the Collins dataset is better than that on the DIP dataset. The reason is that the DIP dataset contains many noise PPIs. In particular, the high reliable interaction subset of the DIP dataset only contains 5470 PPIs, indicating that more than 75% PPIs in the DIP dataset may be false-positive noise. The performance of CMC [79] and PEWCC [30] is better on the DIP dataset than that on the Collins dataset because they change the PPI network by filtering low reliable edges and introduce new edges between nodes. (ii) All methods show unstable performance on different datasets, indicating they heavily rely on PPI networks. For example, ClusterONE [70] achieves the best F-measure on the Collins dataset, while its performance on the DIP dataset is very poor. The reason is that the DIP dataset lacks PPI weights and contains a large amount of false-positive PPIs. (iii) Local-node-affinity-based methods show state-of-the-art Precision and F-measure, while global-node-affinity-based methods achieve the best Recall.

Table 6

Performance comparison of 17 computational methods in terms of PSC in biological process, molecular function and cellular component

CategoryaMethodb,c# predicted complexesPSC
BPdMFdCCd
(a) DIP dataset
LQClusterONE [70]3610.5680.3960.482
Core&Peel [49]7880.7080.5280.647
NCMine [51]15200.6340.5020.568
GQRNSC [109]2380.7440.5040.580
RSGNM [114]5600.5750.4090.468
SGNMF [41]6860.4660.3760.391
LAMCODE [115]1370.8030.5180.679
CFinder [117]1330.8200.6470.729
DPClus [76]3220.7330.5710.590
SPICi [119]3250.5450.4430.452
CMC [79]5600.8570.7200.752
CORE [122]16440.1590.1190.128
PEWCC [30]12820.8000.6650.720
ProRank+ [57]4330.3740.2730.344
GAMCL [127, 128]15700.1460.1020.118
RRW [133]3840.7060.5340.609
NWE [134]3650.6580.4160.515
(b) Collins dataset
LQClusterONE [70]3230.5260.4020.427
Core&Peel [49]4070.8350.7150.789
NCMine [51]8220.9400.8520.895
GQRNSC [109]3520.3860.2590.324
RSGNM [114]2830.7770.6640.731
SGNMF [41]2900.7860.6100.641
LSMCODE [115]1420.9010.6760.796
CFinder [117]1140.8420.5440.702
DPClus [76]3140.5060.3790.427
SPICi [119]1300.9230.7150.777
CMC [79]2250.9420.7870.858
CORE [122]3840.3720.2630.315
PEWCC [30]5890.9690.8640.910
ProRank+ [57]6590.9100.7870.863
GSMCL [127, 128]2720.7720.5810.566
RRW [133]2780.6290.4890.590
NWE [134]5270.2450.1670.201
CategoryaMethodb,c# predicted complexesPSC
BPdMFdCCd
(a) DIP dataset
LQClusterONE [70]3610.5680.3960.482
Core&Peel [49]7880.7080.5280.647
NCMine [51]15200.6340.5020.568
GQRNSC [109]2380.7440.5040.580
RSGNM [114]5600.5750.4090.468
SGNMF [41]6860.4660.3760.391
LAMCODE [115]1370.8030.5180.679
CFinder [117]1330.8200.6470.729
DPClus [76]3220.7330.5710.590
SPICi [119]3250.5450.4430.452
CMC [79]5600.8570.7200.752
CORE [122]16440.1590.1190.128
PEWCC [30]12820.8000.6650.720
ProRank+ [57]4330.3740.2730.344
GAMCL [127, 128]15700.1460.1020.118
RRW [133]3840.7060.5340.609
NWE [134]3650.6580.4160.515
(b) Collins dataset
LQClusterONE [70]3230.5260.4020.427
Core&Peel [49]4070.8350.7150.789
NCMine [51]8220.9400.8520.895
GQRNSC [109]3520.3860.2590.324
RSGNM [114]2830.7770.6640.731
SGNMF [41]2900.7860.6100.641
LSMCODE [115]1420.9010.6760.796
CFinder [117]1140.8420.5440.702
DPClus [76]3140.5060.3790.427
SPICi [119]1300.9230.7150.777
CMC [79]2250.9420.7870.858
CORE [122]3840.3720.2630.315
PEWCC [30]5890.9690.8640.910
ProRank+ [57]6590.9100.7870.863
GSMCL [127, 128]2720.7720.5810.566
RRW [133]2780.6290.4890.590
NWE [134]5270.2450.1670.201

aSee the footnote of Table 3.

bSee the footnote of Table 4.

cSee the footnote of Table 4.

d‘BP’ represents biological process, ‘MF’ represents molecular function and ‘CC’ represents cellular component.

Table 6

Performance comparison of 17 computational methods in terms of PSC in biological process, molecular function and cellular component

CategoryaMethodb,c# predicted complexesPSC
BPdMFdCCd
(a) DIP dataset
LQClusterONE [70]3610.5680.3960.482
Core&Peel [49]7880.7080.5280.647
NCMine [51]15200.6340.5020.568
GQRNSC [109]2380.7440.5040.580
RSGNM [114]5600.5750.4090.468
SGNMF [41]6860.4660.3760.391
LAMCODE [115]1370.8030.5180.679
CFinder [117]1330.8200.6470.729
DPClus [76]3220.7330.5710.590
SPICi [119]3250.5450.4430.452
CMC [79]5600.8570.7200.752
CORE [122]16440.1590.1190.128
PEWCC [30]12820.8000.6650.720
ProRank+ [57]4330.3740.2730.344
GAMCL [127, 128]15700.1460.1020.118
RRW [133]3840.7060.5340.609
NWE [134]3650.6580.4160.515
(b) Collins dataset
LQClusterONE [70]3230.5260.4020.427
Core&Peel [49]4070.8350.7150.789
NCMine [51]8220.9400.8520.895
GQRNSC [109]3520.3860.2590.324
RSGNM [114]2830.7770.6640.731
SGNMF [41]2900.7860.6100.641
LSMCODE [115]1420.9010.6760.796
CFinder [117]1140.8420.5440.702
DPClus [76]3140.5060.3790.427
SPICi [119]1300.9230.7150.777
CMC [79]2250.9420.7870.858
CORE [122]3840.3720.2630.315
PEWCC [30]5890.9690.8640.910
ProRank+ [57]6590.9100.7870.863
GSMCL [127, 128]2720.7720.5810.566
RRW [133]2780.6290.4890.590
NWE [134]5270.2450.1670.201
CategoryaMethodb,c# predicted complexesPSC
BPdMFdCCd
(a) DIP dataset
LQClusterONE [70]3610.5680.3960.482
Core&Peel [49]7880.7080.5280.647
NCMine [51]15200.6340.5020.568
GQRNSC [109]2380.7440.5040.580
RSGNM [114]5600.5750.4090.468
SGNMF [41]6860.4660.3760.391
LAMCODE [115]1370.8030.5180.679
CFinder [117]1330.8200.6470.729
DPClus [76]3220.7330.5710.590
SPICi [119]3250.5450.4430.452
CMC [79]5600.8570.7200.752
CORE [122]16440.1590.1190.128
PEWCC [30]12820.8000.6650.720
ProRank+ [57]4330.3740.2730.344
GAMCL [127, 128]15700.1460.1020.118
RRW [133]3840.7060.5340.609
NWE [134]3650.6580.4160.515
(b) Collins dataset
LQClusterONE [70]3230.5260.4020.427
Core&Peel [49]4070.8350.7150.789
NCMine [51]8220.9400.8520.895
GQRNSC [109]3520.3860.2590.324
RSGNM [114]2830.7770.6640.731
SGNMF [41]2900.7860.6100.641
LSMCODE [115]1420.9010.6760.796
CFinder [117]1140.8420.5440.702
DPClus [76]3140.5060.3790.427
SPICi [119]1300.9230.7150.777
CMC [79]2250.9420.7870.858
CORE [122]3840.3720.2630.315
PEWCC [30]5890.9690.8640.910
ProRank+ [57]6590.9100.7870.863
GSMCL [127, 128]2720.7720.5810.566
RRW [133]2780.6290.4890.590
NWE [134]5270.2450.1670.201

aSee the footnote of Table 3.

bSee the footnote of Table 4.

cSee the footnote of Table 4.

d‘BP’ represents biological process, ‘MF’ represents molecular function and ‘CC’ represents cellular component.

Both small and large protein complexes are ubiquitous in cells [55]. Here, those complexes composed of two or three proteins are referred as small complexes, and complexes containing more than three proteins are referred as large complexes. In CYC2008, the numbers of small and large protein complexes are 259 and 149, respectively. In order to further evaluate different methods, Table 5 gives the Precision, Recall and F-measure values of each method on small and large protein complexes. From Table 5, it is obvious that small complexes remain difficult to identify. On the Collins dataset, the best F-measure of small complexes is only 0.408, while that of the large complexes is 0.681. Furthermore, global-node-affinity-based methods outperform other methods on identifying small complexes. For identifying large complexes, both local-cluster-affinity-based and local-node-affinity-based methods achieve the state-of-the-art performance. In particular, NEW is the best tool for identifying small complexes, and CMC is the best tool for identifying large complexes. Based on the observations from Tables 4 and 5, we conclude that each method has its own advantages, and integration of different methods will further improve the performance of protein complex identification. In this regard, several ensemble clustering methods such as TINCD [48] and EnsemHC [54] have shown better performance than base methods in terms of Recall and Acc.

Tables 6 shows the overall performance of the 17 methods in terms of PSC on the two benchmark datasets. The P-values with Bonferroni correction for predicted protein complexes are calculated by using the stand-alone tool GO::TermFinder [142], and the significant threshold is set as 0.01 by following this study [21]. In general, the performance for most methods on the Collins dataset is also better than that on the DIP dataset. Methods like ClusterONE [70], RNSC [109] and NWE [134] decrease their minimum size cutoff of protein complexes to 2 on the Collins dataset, leading to lower PSC than that on the DIP dataset because protein complexes with the size of 2 have high P-values [85]. Furthermore, some local-cluster-quality-based methods and local-node-affinity-based methods, such as NCMine [51], MCODE [115], CFinder [117] and PEWCC [30], identify dense and large regions in PPI network as complexes; therefore, they show high PSC.

Furthermore, we give a performance comparison of different predictors on the dynamic PPI networks constructed from the DIP dataset, and the results are shown in Table 7. Compared with Table 4, it is obvious that the predictors on dynamic networks outperform those on static networks. In addition, CPredictor2.0 [55] achieves the best F-measure, indicating that protein function information is highly effective for protein complex identification.

Discussion and conclusion

All the aforementioned methods have greatly promoted the development of protein complex identification. However, the following problems should be addressed for further promoting this important field.

PPI datasets

PPI datasets are critical for protein complex identification. There are two main problems of the existing PPI datasets [145]. (1) The available PPIs are incomplete [11]. For example, the estimated number of non-redundant human PPIs is about 650 000 [146], whereas the number of non-redundant human PPIs collected in the latest version of BioGRID database [59] is only 375 840, indicating the coverage is less than 60%. (2) The PPI data sets contain many false-positives [147–149] caused by experiment conditions, the contaminating nucleic acid used in experiments [150] and technical biases [151]. Table 8 shows the false-positive rate of PPIs in the three PPI databases against the Negatome 2.0 [152] database, including DIP, BioGRID and MINT. Negatome 2.0 contains 6532 non-interacting protein pairs reported in literature or derived from protein complex three-dimensional structures. It shows that the highest false-positive PPI rate of the three PPI databases is 4.33%.

Table 7

Performance comparison of methods on different dynamic PPI networks constructed from the DIP data set using CYC2008 as reference set

Networka,  b# predicted complexesIdentifying methodDIP versionPrecisionRecallF-measureSnPPVACC
TC-PINs [143]2063MCL [127, 128]10 October 20100.2150.5690.312
DPINs [28]1289CPM [144]28 February 20120.3280.5070.399
Zhang et al. [52]Zhang et al. [52]0.4830.4710.4770.3730.6940.509
ST-APIN [46]1761MCL [127, 128]10 October 20100.3390.6050.435
WDPIN [45]550TP-WDPIN [45]14 January 20160.4660.4360.451
CPredictor2.0 [55]636aMCL [127, 128]13 February 20170.5520.5420.5470.3970.7110.531
Networka,  b# predicted complexesIdentifying methodDIP versionPrecisionRecallF-measureSnPPVACC
TC-PINs [143]2063MCL [127, 128]10 October 20100.2150.5690.312
DPINs [28]1289CPM [144]28 February 20120.3280.5070.399
Zhang et al. [52]Zhang et al. [52]0.4830.4710.4770.3730.6940.509
ST-APIN [46]1761MCL [127, 128]10 October 20100.3390.6050.435
WDPIN [45]550TP-WDPIN [45]14 January 20160.4660.4360.451
CPredictor2.0 [55]636aMCL [127, 128]13 February 20170.5520.5420.5470.3970.7110.531

aThe predicted result of CPredictor2.0 is obtained by reruning the method, while the predicted results on other dynamic PPI networks are adopted from the origin papers and then transformed by us.

bThe best performance is highlighted with bold font.

Table 7

Performance comparison of methods on different dynamic PPI networks constructed from the DIP data set using CYC2008 as reference set

Networka,  b# predicted complexesIdentifying methodDIP versionPrecisionRecallF-measureSnPPVACC
TC-PINs [143]2063MCL [127, 128]10 October 20100.2150.5690.312
DPINs [28]1289CPM [144]28 February 20120.3280.5070.399
Zhang et al. [52]Zhang et al. [52]0.4830.4710.4770.3730.6940.509
ST-APIN [46]1761MCL [127, 128]10 October 20100.3390.6050.435
WDPIN [45]550TP-WDPIN [45]14 January 20160.4660.4360.451
CPredictor2.0 [55]636aMCL [127, 128]13 February 20170.5520.5420.5470.3970.7110.531
Networka,  b# predicted complexesIdentifying methodDIP versionPrecisionRecallF-measureSnPPVACC
TC-PINs [143]2063MCL [127, 128]10 October 20100.2150.5690.312
DPINs [28]1289CPM [144]28 February 20120.3280.5070.399
Zhang et al. [52]Zhang et al. [52]0.4830.4710.4770.3730.6940.509
ST-APIN [46]1761MCL [127, 128]10 October 20100.3390.6050.435
WDPIN [45]550TP-WDPIN [45]14 January 20160.4660.4360.451
CPredictor2.0 [55]636aMCL [127, 128]13 February 20170.5520.5420.5470.3970.7110.531

aThe predicted result of CPredictor2.0 is obtained by reruning the method, while the predicted results on other dynamic PPI networks are adopted from the origin papers and then transformed by us.

bThe best performance is highlighted with bold font.

Table 8

False positive PPI rate of different database against the Negatome 2.0 data set

Database# overlapping proteins# overlapping protein–PPIsaNo. of false-positive PPIsFalse-positive PPI rate
DIP [58]106645297501.65%
BioGRID [59]19441835179404.33%
MINT [61]11641572125401.61%
Database# overlapping proteins# overlapping protein–PPIsaNo. of false-positive PPIsFalse-positive PPI rate
DIP [58]106645297501.65%
BioGRID [59]19441835179404.33%
MINT [61]11641572125401.61%

aThese PPIs are nonredundant and consisting of only proteins overlapping with the Negatome 2.0.

Table 8

False positive PPI rate of different database against the Negatome 2.0 data set

Database# overlapping proteins# overlapping protein–PPIsaNo. of false-positive PPIsFalse-positive PPI rate
DIP [58]106645297501.65%
BioGRID [59]19441835179404.33%
MINT [61]11641572125401.61%
Database# overlapping proteins# overlapping protein–PPIsaNo. of false-positive PPIsFalse-positive PPI rate
DIP [58]106645297501.65%
BioGRID [59]19441835179404.33%
MINT [61]11641572125401.61%

aThese PPIs are nonredundant and consisting of only proteins overlapping with the Negatome 2.0.

In order to obtain complete and reliable PPI datasets, many research groups combine multiple techniques to determine PPIs [153]. This goal also can be achieved by assigning weights to observed interactions based on additional biological data [154], characterizing proteins across cell types and subcellular localizations [155] and aligning PPI networks [156]. Furthermore, PPI network alignment would be benefited from introducing the evolutionary relationship by the protein remote homology detection techniques [157–161]. Link prediction paradigms [162–164], which have been successfully applied in various areas [165, 166], are also potential methods for reconstructing PPI networks.

Dynamic PPI networks

Dynamic PPI networks are the key for further improving the performance of protein complex identification. Since protein function information is highly helpful for identifying protein complexes, new dynamic PPI networks should be constructed based on this kind of information. Recently, researchers point out that a PPI may have different scores in different conditions [167, 168]. Therefore, dynamic PPI networks reflecting this property would be very helpful for promoting protein complex identification. In this regard, Zhang et al. [52] assign dynamic weights to PPIs based on the probabilities that proteins expressing at different time points.

Protein complex identification methods

Most protein complex identification methods tend to identify dense and large regions in PPI networks as protein complexes. However, a large proportion of real protein complexes are small and sparse [55, 124]. In order to further improve the performance, new predictors should be designed to accurately identify not only dense and large protein complexes but also small and sparse ones. To achieve this goal, new predictors can be constructed based on the PPI weights that fuse more kinds of biological information, such as protein sequences [169, 170], disorders regions [171, 172], etc. In addition, the ensemble clustering methods are a new trend to construct more accurate meta-predictors. When developing an accurate meta-predictors, the following aspects should be considered. (1) Many protein complexes can only be predicted by some specific methods. Therefore, the meta-predictors should give these complexes high confidence scores to correctly predict these complexes. (2) Some false complexes would be predicted by multiple base predictors. The meta-predictors should correct these false positives by removing the false proteins from complexes. By deriving protein co-cluster strength from base clustering results [48, 54, 135–137] and by further integrating protein co-complex scores derived from TAP-MS experiments [48, 54], the existing meta-predictors not only reduce the impact of false-positive PPI noise but also add some missing PPI information. Therefore, they achieve better performance than base methods. However, most meta-predictors do not have a scoring system that assigns confidence scores to clusters [48, 54, 136, 137], or the scoring system only considers network topology information [135]. Therefore, they would miss many real protein complexes that can only be predicted by a single base method. Furthermore, all these meta-predictors integrate the predicted results of different base methods in an unsupervised manner. Therefore, false complexes predicted by multiple base methods will still be incorrectly predicted by these meta-predictors. More specifically, these meta-predictors cannot remove false-positive proteins predicted by multiple base methods from predicted complexes. In order to develop more accurate meta-predictors, biological relevant protein complex features should be combined with PPI network-based protein complex features in a supervised manner.

Another problem is that different methods heavily depend on the PPI networks. Since most PPI data sets are incomplete and unreliable, the performance of different methods is limited. In this regard, generative network models [37, 47, 114] are potential solutions, and their performance would be further improved by integrating more information [37, 47]. Furthermore, they can output sparse complexes, even those consist of proteins that are not connected in PPI networks. It is also feasible to assign scores to protein pairs and improve the quality of networks by removing low-reliability edges and introducing edges to protein pairs with high affinity score [30, 79, 88]. Supervised machine learning models, such as Bayesian network models [108], also can be used for identifying protein complexes. However, most of the protein complex features are network topological features, preventing the performance improvement of these models. Therefore, biologically relevant protein complex features should be explored to improve the performance of the supervised models.

Key Points

  • With the availability of genomic-scale PPI data for various species, it is important to develop computational methods to accurately and efficiently identify protein complexes from PPI networks.

  • The databases of PPIs/protein complexes are introduced and discussed.

  • The three types of PPI networks, including static, dynamic and interolog networks, are introduced, and the strategies of identifying protein complexes from them are described.

  • The computational methods for identifying protein complexes from PPI networks are introduced, and their advantages and shortcomings are discussed. The existing available web servers and stand-alone tools in this filed are given.

  • The performance of 17 state-of-the-art computational methods for identifying protein complexes from PPI networks is evaluated and compared on two widely used benchmark datasets. Finally, some challenges and perspectives in this field are discussed.

Acknowledgements

The authors are very much indebted to the three anonymous reviewers, whose constructive comments are very helpful in strengthening the presentation of this article.

Funding

This work was supported by the National Natural Science Foundation of China (61672184, 61732012 and 61822306), the Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063), the Shenzhen Overseas High Level Talents Innovation Foundation (Grant No. KQJSCX20170327161949608), the Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306008) and the Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

Zhourun Wu is a PhD candidate at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. His research areas include bioinformatics and machine learning.

Qing Liao, PhD, is an associate professor at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. Her research areas include bioinformatics and machine learning.

Bin Liu, PhD, is a professor at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China, and School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China. His expertise is in bioinformatics, nature language processing and machine learning.

References

1.

Spirin
 
V
,
Mirny
 
LA
.
Protein complexes and functional modules in molecular networks
.
Proc Natl Acad Sci U S A
 
2003
;
100
:
12123
8
.

2.

Zeng
 
J
,
Zou
 
Q
,
Wu
 
Y
, et al.  
An empirical study of features fusion techniques for protein–protein interaction prediction
.
Curr Bioinform
 
2016
;
11
:
4
12
.

3.

Alberts
 
B
.
The cell as a collection of protein machines: preparing the next generation of molecular biologists
.
Cell
 
1998
;
92
:
291
4
.

4.

Hartwell
 
LH
,
Hopfield
 
JJ
,
Leibler
 
S
, et al.  
From molecular to modular cell biology
.
Nature
 
1999
;
402
:
C47
C52
.

5.

Gavin
 
AC
,
Bosche
 
M
,
Krause
 
R
, et al.  
Functional organization of the yeast proteome by systematic analysis of protein complexes
.
Nature
 
2002
;
415
:
141
7
.

6.

Rigaut
 
G
,
Shevchenko
 
A
,
Rutz
 
B
, et al.  
A generic protein purification method for protein complex characterization and proteome exploration
.
Nat Biotechnol
 
1999
;
17
:
1030
2
.

7.

Puig
 
O
,
Caspary
 
F
,
Rigaut
 
G
, et al.  
The tandem affinity purification (TAP) method: a general procedure of protein complex purification
.
Methods
 
2001
;
24
:
218
29
.

8.

Gavin
 
AC
,
Aloy
 
P
,
Grandi
 
P
, et al.  
Proteome survey reveals modularity of the yeast cell machinery
.
Nature
 
2006
;
440
:
631
6
.

9.

Krogan
 
NJ
,
Cagney
 
G
,
Yu
 
H
, et al.  
Global landscape of protein complexes in the yeast Saccharomyces cerevisiae
.
Nature
 
2006
;
440
:
637
43
.

10.

Huttlin
 
EL
,
Ting
 
L
,
Bruckner
 
RJ
, et al.  
The BioPlex network: a systematic exploration of the human interactome
.
Cell
 
2015
;
162
:
425
40
.

11.

Huttlin
 
EL
,
Bruckner
 
RJ
,
Paulo
 
JA
, et al.  
Architecture of the human interactome defines protein communities and disease networks
.
Nature
 
2017
;
545
:
505
9
.

12.

Hein
 
MY
,
Hubner
 
NC
,
Poser
 
I
, et al.  
A human interactome in three quantitative dimensions organized by stoichiometries and abundances
.
Cell
 
2015
;
163
:
712
23
.

13.

Drew
 
K
,
Lee
 
C
,
Huizar
 
RL
, et al.  
Integration of over 9,000 mass spectrometry experiments builds a global map of human protein complexes
.
Mol Syst Biol
 
2017
;
13
:
932
.

14.

Wan
 
C
,
Borgeson
 
B
,
Phanse
 
S
, et al.  
Panorama of ancient metazoan macromolecular complexes
.
Nature
 
2015
;
525
:
339
44
.

15.

Young
 
KH
.
Yeast two-hybrid: so many interactions, (in) so little time…
.
Biol Reprod
 
1998
;
58
:
302
11
.

16.

Ito
 
T
,
Tashiro
 
K
,
Muta
 
S
, et al.  
Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins
.
Proc Natl Acad Sci U S A
 
2000
;
97
:
1143
7
.

17.

Uetz
 
P
,
Giot
 
L
,
Cagney
 
G
, et al.  
A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae
.
Nature
 
2000
;
403
:
623
7
.

18.

Ito
 
T
,
Chiba
 
T
,
Ozawa
 
R
, et al.  
A comprehensive two-hybrid analysis to explore the yeast protein interactome
.
Proc Natl Acad Sci U S A
 
2001
;
98
:
4569
74
.

19.

Schwikowski
 
B
,
Uetz
 
P
,
Fields
 
S
.
A network of protein–protein interactions in yeast
.
Nat Biotechnol
 
2000
;
18
:
1257
61
.

20.

Barabasi
 
AL
,
Oltvai
 
ZN
.
Network biology: understanding the cell's functional organization
.
Nat Rev Genet
 
2004
;
5
:
101
113
.

21.

Li
 
X
,
Wu
 
M
,
Kwoh
 
CK
, et al.  
Computational approaches for detecting protein complexes from protein interaction networks: a survey
.
BMC Genomics
 
2010
;
11
(
Suppl 1
):
S3
.

22.

Srihari
 
S
,
Leong
 
HW
.
A survey of computational methods for protein complex prediction from protein interaction networks
.
J Bioinform Comput Biol
 
2013
;
11
:
1230002
.

23.

Chen
 
B
,
Fan
 
W
,
Liu
 
J
, et al.  
Identifying protein complexes and functional modules—from static PPI networks to dynamic PPI networks
.
Brief Bioinform
 
2014
;
15
:
177
94
.

24.

Teng
 
B
,
Zhao
 
C
,
Liu
 
X
, et al.  
Network inference from AP-MS data: computational challenges and solutions
.
Brief Bioinform
 
2015
;
16
:
658
74
.

25.

Snider
 
J
,
Kotlyar
 
M
,
Saraon
 
P
, et al.  
Fundamentals of protein interaction network mapping
.
Mol Syst Biol
 
2015
;
11
:
848
.

26.

Tang
 
X
,
Feng
 
Q
,
Wang
 
J
, et al.  
Clustering based on multiple biological information: approach for predicting protein complexes
.
IET Syst Biol
 
2013
;
7
:
223
30
.

27.

Tatsuke
 
D
,
Maruyama
 
O
.
Sampling strategy for protein complex prediction using cluster size frequency
.
Gene
 
2013
;
518
:
152
8
.

28.

Wang
 
J
,
Peng
 
X
,
Li
 
M
, et al.  
Construction and application of dynamic protein interaction network based on time course gene expression data
.
Proteomics
 
2013
;
13
:
301
12
.

29.

Widita
 
CK
,
Maruyama
 
O
.
PPSampler2: predicting protein complexes more accurately and efficiently by sampling
.
BMC Syst Biol
 
2013
;
7
(
Suppl 6
):
S14
.

30.

Zaki
 
N
,
Efimov
 
D
,
Berengueres
 
J
.
Protein complex detection using interaction reliability assessment and weighted clustering coefficient
.
BMC Bioinformatics
 
2013
;
14
:
163
.

31.

Zhang
 
Y
,
Lin
 
H
,
Yang
 
Z
, et al.  
Protein complex prediction in large ontology attributed protein–protein interaction networks
.
IEEE/ACM Trans Comput Biol Bioinform
 
2013
;
10
:
729
41
.

32.

Guo
 
Y
,
Shang
 
XQ
,
Zhu
 
QP
, et al.  Identification of Protein Complexes and Functional Modules in Integrated PPI Networks. In:
2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Belfast, UK
,
2014
.

33.

Ji
 
JZ
,
Jiao
 
L
,
Yang
 
CC
, et al.  
MAE-FMD: multi-agent evolutionary method for functional module detection in protein–protein interaction networks
.
BMC Bioinformatics
 
2014
;
15
:
325
.

34.

Ou-Yang
 
L
,
Dai
 
DQ
,
Li
 
XL
, et al.  
Detecting temporal protein complexes from dynamic protein–protein interaction networks
.
BMC Bioinformatics
 
2014
;
15
:
335
.

35.

Shen
 
X
,
Zhao
 
Y
,
Li
 
Y
, et al.  
An efficient protein complex mining algorithm based on multistage Kernel extension
.
BMC Bioinformatics
 
2014
;
15
(
Suppl 12
):
S7
.

36.

Wang
 
Y
,
Qian
 
X
.
Functional module identification in protein interaction networks by interaction patterns
.
Bioinformatics
 
2014
;
30
:
81
93
.

37.

Zhang
 
XF
,
Dai
 
DQ
,
Ou-Yang
 
L
, et al.  
Detecting overlapping protein complexes based on a generative model with functional and topological properties
.
BMC Bioinformatics
 
2014
;
15
:
186
.

38.

Zhao
 
B
,
Wang
 
J
,
Li
 
M
, et al.  
Detecting protein complexes based on uncertain graph model
.
IEEE/ACM Trans Comput Biol Bioinform
 
2014
;
11
:
486
97
.

39.

Cai
 
B
,
Wang
 
H
,
Zheng
 
H
.
Identification of protein complexes from tandem affinity purification/mass spectrometry data via biased random walk
.
IEEE/ACM Trans Comput Biol Bioinform
 
2015
;
12
:
455
66
.

40.

Maruyama
 
O
,
Wong
 
L
. Regularizing predicted complexes by mutually exclusive protein–protein interactions. In:
Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (Asonam 2015)
,
2015
,
1068
75
.

41.

Ou-Yang
 
L
,
Dai
 
DQ
,
Zhang
 
XF
.
Detecting protein complexes from signed protein-protein interaction networks
.
IEEE/ACM Trans Comput Biol Bioinform
 
2015
;
12
:
1333
44
.

42.

Peng
 
W
,
Wang
 
JX
,
Zhao
 
BH
, et al.  
Identification of protein complexes using weighted pagerank-nibble algorithm and core-attachment structure
.
IEEE/ACM Trans Comput Biol Bioinform
 
2015
;
12
:
179
92
.

43.

Zhang
 
W
,
Zou
 
X
.
A new method for detecting protein complexes based on the three node cliques
.
IEEE/ACM Trans Comput Biol Bioinform
 
2015
;
12
:
879
86
.

44.

Cao
 
BW
,
Luo
 
JW
,
Liang
 
C
, et al.  Detecting overlapping protein complexes in weighted protein–protein interaction networks using pseudo-clique extension based on fuzzy relation. In:
2016 International Joint Conference on Neural Networks (Ijcnn)
.
Shenzhen, China
,
2016
,
1244
52
.

45.

Lei
 
XJ
,
Zhang
 
YC
,
Wu
 
FX
, et al.  Mining protein complexes based on topology potential from weighted dynamic PPI network. In:
2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Shenzhen, China
,
2016
,
735
8
.

46.

Meng
 
XM
,
Li
 
M
,
Wang
 
JX
, et al.  
Construction of the spatial and temporal active protein interaction network for identifying protein complexes
. In:
2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Shenzhen, China
,
2016
, pp.
631
636
.

47.

Ou-Yang
 
L
,
Hong
 
Y
,
Zhang
 
XF
. Identifying protein complexes via multi-network clustering. In:
2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Shenzhen, China
,
2016
,
645
50
.

48.

Ou-Yang
 
L
,
Wu
 
M
,
Zhang
 
XF
, et al.  
A two-layer integration framework for protein complex detection
.
BMC Bioinformatics
 
2016
;
17
:
100
.

49.

Pellegrini
 
M
,
Baglioni
 
M
,
Geraci
 
F
.
Protein complex prediction for large protein–protein interaction networks with the Core&Peel method
.
BMC Bioinformatics
 
2016
;
17
:
372
.

50.

Ramadan
 
E
,
Naef
 
A
,
Ahmed
 
M
.
Protein complexes predictions within protein interaction networks using genetic algorithms
.
BMC Bioinformatics
 
2016
;
17
(
Suppl 7
):
269
.

51.

Tadaka
 
S
,
Kinoshita
 
K
.
NCMine: core-peripheral based functional module detection using near-clique mining
.
Bioinformatics
 
2016
;
32
:
3454
60
.

52.

Zhang
 
Y
,
Lin
 
H
,
Yang
 
Z
, et al.  
Construction of dynamic probabilistic protein interaction networks for protein complex identification
.
BMC Bioinformatics
 
2016
;
17
:
186
.

53.

Ma
 
CY
,
Chen
 
YP
,
Berger
 
B
, et al.  
Identification of protein complexes by integrating multiple alignment of protein interaction networks
.
Bioinformatics
 
2017
;
33
:
1681
8
.

54.

Wu
 
M
,
Ou-Yang
 
L
,
Li
 
XL
.
Protein complex detection via effective integration of base clustering solutions and co-complex affinity scores
.
IEEE/ACM Trans Comput Biol Bioinform
 
2017
;
14
:
733
9
.

55.

Xu
 
B
,
Wang
 
Y
,
Wang
 
ZW
, et al.  
An effective approach to detecting both small and large complexes from protein–protein interaction networks
.
BMC Bioinformatics
 
2017
;
18
(
Suppl 12
):
419
.

56.

Manikandan
 
P
,
Ramyachitra
 
D
,
Banupriya
 
D
.
Detection of overlapping protein complexes in gene expression, phenotype and pathways of Saccharomyces cerevisiae using Prorank based Fuzzy algorithm
.
Gene
 
2016
;
580
:
144
58
.

57.

Hanna
 
EM
,
Zaki
 
N
.
Detecting protein complexes in protein interaction networks using a ranking algorithm with a refined merging procedure
.
BMC Bioinformatics
 
2014
;
15
:
204
.

58.

Xenarios
 
I
,
Salwínski
 
L
,
Duan
 
XJ
, et al.  
DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions
.
Nucleic Acids Res
 
2002
;
30
:
303
5
.

59.

Stark
 
C
,
Breitkreutz
 
B-J
,
Reguly
 
T
, et al.  
BioGRID: a general repository for interaction data sets
.
Nucleic Acids Res
 
2006
;
34
:
D535
9
.

60.

Keshava Prasad
 
TS
,
Goel
 
R
,
Kandasamy
 
K
, et al.  
Human protein reference database—2009 update
.
Nucleic Acids Res
 
2009
;
37
:
D767
72
.

61.

Licata
 
L
,
Briganti
 
L
,
Peluso
 
D
, et al.  
MINT, the molecular interaction database: 2012 update
.
Nucleic Acids Res
 
2012
;
40
:
D857
61
.

62.

Szklarczyk
 
D
,
Gable
 
AL
,
Lyon
 
D
, et al.  
STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental data sets
.
Nucleic Acids Res
 
2019
;
47
:
D607
13
.

63.

Güldener
 
U
,
Münsterkötter
 
M
,
Kastenmüller
 
G
, et al.  
CYGD: the comprehensive yeast genome database
.
Nucleic Acids Res
 
2005
;
33
:
D364
8
.

64.

Giurgiu
 
M
,
Reinhard
 
J
,
Brauner
 
B
, et al.  
CORUM: the comprehensive resource of mammalian protein complexes-2019
.
Nucleic Acids Res
 
2019
;
47
:
D559
63
.

65.

Pu
 
S
,
Wong
 
J
,
Turner
 
B
, et al.  
Up-to-date catalogues of yeast protein complexes
.
Nucleic Acids Res
 
2009
;
37
:
825
31
.

66.

Krissinel
 
E
,
Henrick
 
K
.
Inference of macromolecular assemblies from crystalline state
.
J Mol Biol
 
2007
;
372
:
774
97
.

67.

Ho
 
Y
,
Gruhler
 
A
,
Heilbut
 
A
, et al.  
Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry
.
Nature
 
2002
;
415
:
180
3
.

68.

Reguly
 
T
,
Breitkreutz
 
A
,
Boucher
 
L
, et al.  
Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae
.
J Biol
 
2006
;
5
:
11
.

69.

Mewes
 
HW
,
Frishman
 
D
,
Güldener
 
U
, et al.  
MIPS: a database for genomes and protein sequences
.
Nucleic Acids Res
 
2002
;
30
:
31
4
.

70.

Nepusz
 
T
,
Yu
 
H
,
Paccanaro
 
A
.
Detecting overlapping protein complexes in protein–protein interaction networks
.
Nat Methods
 
2012
;
9
:
471
2
.

71.

Przytycka
 
TM
,
Singh
 
M
,
Slonim
 
DK
.
Toward the dynamic interactome: it's about time
.
Brief Bioinform
 
2010
;
11
:
15
29
.

72.

Singh
 
R
,
Xu
 
J
,
Berger
 
B
.
Global alignment of multiple protein interaction networks with application to functional orthology detection
.
Proc Natl Acad Sci U S A
 
2008
;
105
:
12763
8
.

73.

Geva
 
G
,
Sharan
 
R
.
Identification of protein complexes from co-immunoprecipitation data
.
Bioinformatics
 
2011
;
27
:
111
7
.

74.

Wei
 
L
,
Zou
 
Q
,
Liao
 
M
, et al.  
A novel machine learning method for cytokine–receptor interaction prediction
.
Comb Chem High Throughput Screen
 
2016
;
19
:
144
52
.

75.

Xie
 
Z
,
Kwoh
 
CK
,
Li
 
XL
, et al.  
Construction of co-complex score matrix for protein complex prediction from AP-MS data
.
Bioinformatics
 
2011
;
27
:
i159
66
.

76.

Altaf-Ul-Amin
 
M
,
Shinbo
 
Y
,
Mihara
 
K
, et al.  
Development and implementation of an algorithm for detection of protein complexes in large interaction networks
.
BMC Bioinformatics
 
2006
;
7
:
207
19
.

77.

Kim
 
J
,
Tan
 
K
.
Discover protein complexes in protein–protein interaction networks using parametric local modularity
.
BMC Bioinformatics
 
2010
;
11
:
521
31
.

78.

Wang
 
J
,
Li
 
M
,
Wang
 
H
, et al.  
Identification of essential proteins based on edge clustering coefficient
.
IEEE/ACM Trans Comput Biol Bioinform
 
2012
;
9
:
1070
80
.

79.

Liu
 
G
,
Wong
 
L
,
Chua
 
HN
.
Complex discovery from weighted PPI networks
.
Bioinformatics
 
2009
;
25
:
1891
7
.

80.

Chua
 
HN
,
Ning
 
K
,
Sung
 
W-K
, et al.  
Using indirect protein–protein interactions for protein complex prediction
.
J Bioinform Comput Biol
 
2008
;
06
:
435
66
.

81.

Ashburner
 
M
,
Ball
 
CA
,
Blake
 
JA
, et al.  
Gene Ontology: tool for the unification of biology
.
Nat Genet
 
2000
;
25
:
25
9
.

82.

Expansion of the Gene Ontology knowledgebase and resources
.
Nucleic Acids Res
 
2017
;
45
:
D331
8
.

83.

Mazandu
 
GK
,
Chimusa
 
ER
,
Mulder
 
NJ
.
Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery
.
Brief Bioinform
 
2017
;
18
:
886
901
.

84.

Luo
 
F
,
Liu
 
J
,
Li
 
J
.
Discovering conditional co-regulated protein complexes by integrating diverse data sources
.
BMC Syst Biol
 
2010
;
4
:
S4
S16
.

85.

Maraziotis
 
IA
,
Dimitrakopoulou
 
K
,
Bezerianos
 
A
.
Growing functional modules from a seed protein via integration of protein interaction and gene expression data
.
BMC Bioinformatics
 
2007
;
8
:
408
22
.

86.

Zaki
 
N
,
Berengueres
 
J
,
Efimov
 
D
.
Detection of protein complexes using a protein ranking algorithm
.
Proteins
 
2012
;
80
:
2459
68
.

87.

Kiemer
 
L
,
Costa
 
S
,
Ueffing
 
M
, et al.  
WI-PHI: a weighted yeast interactome enriched for direct physical interactions
.
Proteomics
 
2007
;
7
:
932
43
.

88.

Taghipour
 
S
,
Zarrineh
 
P
,
Ganjtabesh
 
M
, et al.  
Improving protein complex prediction by reconstructing a high-confidence protein–protein interaction network of Escherichia coli from different physical interaction data sources
.
BMC Bioinformatics
 
2017
;
18
:
10
.

89.

Xu
 
B
,
Lin
 
H
,
Yang
 
Z
.
Ontology integration to identify protein complex in protein interaction networks
.
Proteome Sci
 
2011
;
9
(
Suppl 1
):
S7
.

90.

Cai
 
B
,
Wang
 
H
,
Zheng
 
H
. Incorporating semantic similarity into clustering process for identifying protein complexes from affinity purification/mass spectrometry data. In:
2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Philadelphia, USA
,
2012
,
1
4
.

91.

Levy
 
ED
,
Pereira-Leal
 
JB
.
Evolution and dynamics of protein interactions and networks
.
Curr Opin Struct Biol
 
2008
;
18
:
349
57
.

92.

Wang
 
J
,
Peng
 
X
,
Peng
 
W
, et al.  
Dynamic protein interaction network construction and applications
.
Proteomics
 
2014
;
14
:
338
52
.

93.

Hanna
 
EM
,
Zaki
 
N
,
Amin
 
A
.
Detecting protein complexes in protein interaction networks modeled as gene expression biclusters
.
PLoS One
 
2015
;
10
(
12
):
e0144163
.

94.

Pereira-Leal
 
JB
,
Levy
 
ED
,
Teichmann
 
SA
.
The origins and evolution of functional modules: lessons from protein complexes
.
Philos Trans R Soc Lond B Biol Sci
 
2006
;
361
:
507
17
.

95.

Sharan
 
R
,
Ideker
 
T
,
Kelley
 
B
, et al.  
Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data
.
J Comput Biol
 
2005
;
12
:
835
46
.

96.

Hirsh
 
E
,
Sharan
 
R
.
Identification of conserved protein complexes based on a model of protein network evolution
.
Bioinformatics
 
2007
;
23
:
e170
6
.

97.

Dutkowski
 
J
,
Tiuryn
 
J
.
Identification of functional modules from conserved ancestral protein–protein interactions
.
Bioinformatics
 
2007
;
23
:
i149
58
.

98.

Nguyen
 
PV
,
Srihari
 
S
,
Leong
 
HW
.
Identifying conserved protein complexes between species by constructing interolog networks
.
BMC Bioinformatics
 
2013
;
14
(
Suppl 16
):
S8
.

99.

Georgii
 
E
,
Dietmann
 
S
,
Uno
 
T
, et al.  
Enumeration of condition-dependent dense modules in protein interaction networks
.
Bioinformatics
 
2009
;
25
:
933
40
.

100.

Radicchi
 
F
,
Castellano
 
C
,
Cecconi
 
F
, et al.  
Defining and identifying communities in networks
.
Proc Natl Acad Sci U S A
 
2004
;
101
:
2658
63
.

101.

Chen
 
J
,
Yuan
 
B
.
Detecting functional modules in the yeast protein–protein interaction network
.
Bioinformatics
 
2006
;
22
:
2283
90
.

102.

Girvan
 
M
,
Newman
 
MEJ
.
Community structure in social and biological networks
.
Proc Natl Acad Sci U S A
 
2002
;
99
:
7821
6
.

103.

Lian
 
H
,
Song
 
CS
,
Cho
 
YR
. Decomposing protein interactome networks by graph entropy. In:
2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
, Vol.
2010
.
Hong Kong, China
,
585
9
.

104.

Kenley
 
EC
,
Cho
 
YR
.
Detecting protein complexes and functional modules from protein interaction networks: a graph entropy approach
.
Proteomics
 
2011
;
11
:
3835
44
.

105.

Wang
 
J
,
Li
 
M
,
Chen
 
J
, et al.  
A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks
.
IEEE/ACM Trans Comput Biol Bioinform
 
2011
;
8
:
607
20
.

106.

Ren
 
J
,
Wang
 
J
,
Li
 
M
. Identifying protein complexes based on local fitness method. In:
2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Philadelphia, USA
,
2012
,
1
6
.

107.

Wang
 
J
,
Chen
 
G
,
Liu
 
B
, et al.  
Identifying protein complexes from interactome based on essential proteins and local fitness method
.
IEEE Trans Nanobioscience
 
2012
;
11
:
324
35
.

108.

Qi
 
Y
,
Balem
 
F
,
Faloutsos
 
C
, et al.  
Protein complex identification by supervised graph local clustering
.
Bioinformatics
 
2008
;
24
:
i250
68
.

109.

King
 
AD
,
Pržulj
 
N
,
Jurisica
 
I
.
Protein complex prediction via cost-based clustering
.
Bioinformatics
 
2004
;
20
:
3013
20
.

110.

Newman
 
ME
.
Modularity and community structure in networks
.
Proc Natl Acad Sci U S A
 
2006
;
103
:
8577
82
.

111.

Li
 
X
,
Tan
 
S-H
,
Foo
 
C-S
, et al.  
Interaction graph mining for protein complexes using local clique merging
.
Genome Inform
 
2005
;
16
:
260
8
.

112.

Navlakha
 
S
,
Rastogi
 
R
,
Shrivastava
 
N
. Graph summarization with bounded error. In:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
.
Vancouver, Canada
:
ACM
,
2008
,
419
32
.

113.

Navlakha
 
S
,
Schatz
 
MC
,
Kingsford
 
C
.
Revealing biological modules via graph summarization
.
J Comput Biol
 
2009
;
16
:
253
64
.

114.

Zhang
 
XF
,
Dai
 
DQ
,
Li
 
XX
.
Protein complexes discovery based on protein–protein interaction data via a regularized sparse generative network model
.
IEEE/ACM Trans Comput Biol Bioinform
 
2012
;
9
:
857
70
.

115.

Bader
 
GD
,
Hogue
 
CWV
.
An automated method for finding molecular complexes in large protein interaction networks
.
BMC Bioinformatics
 
2003
;
4
:
2
27
.

116.

Lubovac
 
Z
,
Gamalielsson
 
J
,
Olsson
 
B
.
Combining functional and topological properties to identify core modules in protein interaction networks
.
Proteins
 
2006
;
64
:
948
59
.

117.

Adamcsek
 
B
,
Palla
 
G
,
Farkas
 
IJ
, et al.  
CFinder: locating cliques and overlapping modules in biological networks
.
Bioinformatics
 
2006
;
22
:
1021
3
.

118.

Li
 
M
,
J-e
 
C
,
Wang
 
J-x
, et al.  
Modifying the DPClus algorithm for identifying protein complexes based on new topological structures
.
BMC Bioinformatics
 
2008
;
9
:
398
413
.

119.

Jiang
 
P
,
Singh
 
M
.
SPICi: a fast clustering algorithm for large biological networks
.
Bioinformatics
 
2010
;
26
:
1105
11
.

120.

Mete
 
M
,
Tang
 
F
,
Xu
 
X
, et al.  
A structural approach for finding functional modules from large biological networks
.
BMC Bioinformatics
 
2008
;
9
:
S19
32
.

121.

Pu
 
S
,
Vlasblom
 
J
,
Emili
 
A
, et al.  
Identifying functional modules in the physical interactome of Saccharomyces cerevisiae
.
Proteomics
 
2007
;
7
:
944
60
.

122.

Leung
 
HCM
,
Xiang
 
Q
,
Yiu
 
SM
, et al.  
Predicting protein complexes from PPI data: a core-attachment approach
.
J Comput Biol
 
2009
;
16
:
133
44
.

123.

Wu
 
M
,
Li
 
X
,
Kwoh
 
C-K
, et al.  
A core-attachment based method to detect protein complexes in PPI networks
.
BMC Bioinformatics
 
2009
;
10
:
169
84
.

124.

Chen
 
B
,
Shi
 
J
,
Wu
 
F-X
. Not all protein complexes exhibit dense structures in S. cerevisiae PPI network. In:
2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Philadelphia, USA
,
2012
,
1
4
.

125.

Jung
 
SH
,
Jang
 
WH
,
Hur
 
HY
, et al.  
Protein complex prediction based on mutually exclusive interactions in protein interaction network
.
Genome Inform
 
2008
;
21
:
77
88
.

126.

Ozawa
 
Y
,
Saito
 
R
,
Fujimori
 
S
, et al.  
Protein complex prediction via verifying and reconstructing the topology of domain–domain interactions
.
BMC Bioinformatics
 
2010
;
11
:
350
61
.

127.

van
 
Dongen
 
SM
.
Graph Clustering by Flow Simulation
.
Universiteit Utrecht
,
2000
.

128.

Enright
 
AJ
,
Van Dongen
 
S
,
Ouzounis
 
CA
.
An efficient algorithm for large-scale detection of protein families
.
Nucleic Acids Res
 
2002
;
30
:
1575
84
.

129.

Satuluri
 
V
,
Parthasarathy
 
S
,
Ucar
 
D
. Markov clustering of protein interaction networks with improved balance and scalability. In:
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology
.
Niagara Falls, New York
:
ACM
,
2010
,
247
56
.

130.

Shih
 
YK
,
Parthasarathy
 
S
.
Identifying functional modules in interaction networks through overlapping Markov clustering
.
Bioinformatics
 
2012
;
28
:
I473
9
.

131.

Hwang
 
W
,
Cho
 
Y-R
,
Zhang
 
A
, et al.  
A novel functional module detection algorithm for protein–protein interaction networks
.
Algorithms Mol Biol
 
2006
;
1
:
24
34
.

132.

Yr
 
C
,
Hwang
 
W
,
Zhang
 
A
. Identification of overlapping functional modules in protein interaction networks: information flow-based approach. In:
Sixth IEEE International Conference on Data Mining-Workshops (ICDMW'06)
.
Hong Kong, China
,
2006
,
147
52
.

133.

Macropol
 
K
,
Can
 
T
,
Singh
 
AK
.
RRW: repeated random walks on genome-scale protein networks for local cluster discovery
.
BMC Bioinformatics
 
2009
;
10
:
283
92
.

134.

Maruyama
 
O
,
Chihara
 
A
. NWE: node-weighted expansion for protein complex prediction using random walk distances. In:
2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
.
Hong Kong, China
,
2010
,
590
4
.

135.

Asur
 
S
,
Ucar
 
D
,
Parthasarathy
 
S
.
An ensemble framework for clustering protein–protein interaction networks
.
Bioinformatics
 
2007
;
23
:
i29
i40
.

136.

Friedel
 
CC
,
Krumsiek
 
J
,
Zimmer
 
R
.
Bootstrapping the interactome: unsupervised identification of protein complexes in yeast
.
J Comput Biol
 
2009
;
16
:
971
87
.

137.

Greene
 
D
,
Cagney
 
G
,
Krogan
 
N
, et al.  
Ensemble non-negative matrix factorization methods for clustering protein–protein interactions
.
Bioinformatics
 
2008
;
24
:
1722
8
.

138.

Collins
 
SR
,
Kemmeren
 
P
,
Zhao
 
X-C
, et al.  
Toward a comprehensive atlas of the physical Interactome of Saccharomyces cerevisiae
.
Mol Cell Proteomics
 
2007
;
6
:
439
50
.

139.

Brohee
 
S
,
van
 
Helden
 
J
.
Evaluation of clustering algorithms for protein–protein interaction networks
.
BMC Bioinformatics
 
2006
;
7
:
488
.

140.

Bu
 
DB
,
Zhao
 
Y
,
Cai
 
L
, et al.  
Topological structure analysis of the protein–protein interaction network in budding yeast
.
Nucleic Acids Res
 
2003
;
31
:
2443
50
.

141.

Przulj
 
N
,
Wigle
 
DA
,
Jurisica
 
I
.
Functional topology in a network of protein interactions
.
Bioinformatics
 
2004
;
20
:
340
8
.

142.

Boyle
 
EI
,
Weng
 
S
,
Gollub
 
J
, et al.  
GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes
.
Bioinformatics
 
2004
;
20
:
3710
5
.

143.

Tang
 
X
,
Wang
 
J
,
Liu
 
B
, et al.  
A comparison of the functional modules identified from time course and static PPI network data
.
BMC Bioinformatics
 
2011
;
12
:
339
.

144.

Palla
 
G
,
Derenyi
 
I
,
Farkas
 
I
, et al.  
Uncovering the overlapping community structure of complex networks in nature and society
.
Nature
 
2005
;
435
:
814
8
.

145.

Wodak
 
SJ
,
Vlasblom
 
J
,
Turinsky
 
AL
, et al.  
Protein–protein interaction networks: the puzzling riches
.
Curr Opin Struct Biol
 
2013
;
23
:
941
53
.

146.

Stumpf
 
MP
,
Thorne
 
T
,
de
 
Silva
 
E
, et al.  
Estimating the size of the human interactome
.
Proc Natl Acad Sci U S A
 
2008
;
105
:
6959
64
.

147.

Hart
 
GT
,
Ramani
 
AK
,
Marcotte
 
EM
.
How complete are current yeast and human protein-interaction networks?
 
Genome Biol
 
2006
;
7
(
11
):
120
.

148.

Yu
 
H
,
Braun
 
P
,
Yildirim
 
MA
, et al.  
High-quality binary protein interaction map of the yeast interactome network
.
Science
 
2008
;
322
:
104
10
.

149.

De Las
 
RJ
,
Fontanillo
 
C
.
Protein–protein interactions essentials: key concepts to building and analyzing interactome networks
.
PLoS Comput Biol
 
2010
;
6
:e1000807.

150.

Nguyen
 
TN
,
Goodrich
 
JA
.
Protein–protein interaction assays: eliminating false positive interactions
.
Nat Methods
 
2006
;
3
:
135
9
.

151.

von
 
Mering
 
C
,
Krause
 
R
,
Snel
 
B
, et al.  
Comparative assessment of large-scale data sets of protein–protein interactions
.
Nature
 
2002
;
417
:
399
403
.

152.

Blohm
 
P
,
Frishman
 
G
,
Smialowski
 
P
, et al.  
Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis
.
Nucleic Acids Res
 
2014
;
42
:
D396
D400
.

153.

Stelzl
 
U
,
Wanker
 
EE
.
The value of high quality protein–protein interaction networks for systems biology
.
Curr Opin Chem Biol
 
2006
;
10
:
551
8
.

154.

Suthram
 
S
,
Shlomi
 
T
,
Ruppin
 
E
, et al.  
A direct comparison of protein interaction confidence assignment schemes
.
BMC Bioinformatics
 
2006
;
7
:
360
.

155.

Orre
 
LM
,
Vesterlund
 
M
,
Pan
 
Y
, et al.  
SubCellBarCode: proteome-wide mapping of protein localization and relocalization
.
Mol Cell
 
2019
;
73
:
166
82
.

156.

Yu
 
HY
,
Luscombe
 
NM
,
Lu
 
HX
, et al.  
Annotation transfer between genomes: protein–protein interologs and protein-DNA regulogs
.
Genome Res
 
2004
;
14
:
1107
18
.

157.

Liu
 
B
,
Zhang
 
D
,
Xu
 
R
, et al.  
Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection
.
Bioinformatics
 
2014
;
30
:
472
9
.

158.

Chen
 
J
,
Guo
 
M
,
Li
 
S
, et al.  
ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised Learning to Rank
.
Bioinformatics
 
2017
;
33
:
3473
6
.

159.

Liu
 
B
,
Chen
 
J
,
Wang
 
X
.
Application of learning to rank to protein remote homology detection
.
Bioinformatics
 
2015
;
31
:
3492
8
.

160.

Chen
 
J
,
Guo
 
M
,
Wang
 
X
, et al.  
A comprehensive review and comparison of different computational methods for protein remote homology detection
.
Brief Bioinform
 
2018
;
9
:
231
44
.

161.

Liu
 
B
,
Jiang
 
S
,
Zou
 
Q
.
HITS-PR-HHblits: protein remote homology detection by combining pagerank and hyperlink-induced topic search
.
Brief Bioinform
. DOI: .

162.

Zeng
 
XX
,
Liu
 
L
,
Lu
 
LY
, et al.  
Prediction of potential disease-associated microRNAs using structural perturbation method
.
Bioinformatics
 
2018
;
34
:
2425
32
.

163.

Zou
 
Q
,
Li
 
J
,
Song
 
L
, et al.  
Similarity computation strategies in the microRNA–disease network: a survey
.
Brief Funct Genomics
 
2016
;
15
:
55
64
.

164.

Zhang
 
X
,
Zou
 
Q
,
Rodruguez-Paton
 
A
, et al.  
Meta-path methods for prioritizing candidate disease miRNAs
.
IEEE/ACM Trans Comput Biol Bioinform
 
2019
;
16
(
1
):
283
91
.

165.

Zeng
 
X
,
Ding
 
N
,
Rodríguezpatón
 
A
, et al.  
Probability-based collaborative filtering model for predicting gene–disease associations
.
BMC Med Genomics
 
2017
;
10
:
76
.

166.

Liu
 
Y
,
Zeng
 
X
,
He
 
Z
, et al.  
Inferring microRNA–disease associations by random walk on a heterogeneous network with multiple data sources
.
IEEE/ACM Trans Comput Biol Bioinform
 
2017
;
14
:
905
15
.

167.

Schlecht
 
U
,
Miranda
 
M
,
Suresh
 
S
, et al.  
Multiplex assay for condition-dependent changes in protein–protein interactions
.
Proc Natl Acad Sci U S A
 
2012
;
109
:
9213
8
.

168.

Celaj
 
A
,
Schlecht
 
U
,
Smith
 
JD
, et al.  
Quantitative analysis of protein interaction network dynamics in yeast
.
Mol Syst Biol
 
2017
;
13
:
934
.

169.

Liu
 
B
,
Gao
 
X
,
Zhang
 
H
.
BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches
.
Nucleic Acids Res
. DOI: .

170.

Liu
 
B
.
BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches
.
Brief Bioinform
. DOI: .

171.

Liu
 
Y
,
Wang
 
X
,
Liu
 
B
.
A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction
.
Brief Bioinform
 
2019
;
20
:
330
46
.

172.

Liu
 
Y
,
Wang
 
X
,
Liu
 
B
.
IDP-CRF: intrinsically disordered protein/region identification based on conditional random fields
.
Int J Mol Sci
 
2018
;
19
:
2483
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data