A comprehensive review and evaluation of computational methods for identifying protein complexes from protein–protein interaction networks

The summary of PPI databases

Name	Version	Species	# proteins	# PPIs	Website
DIP [58]	Feb 2017	All	28 255	76 881	http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]	3.5.172	All	71 619	1 683 670	https://thebiogrid.org/
HPRD [60]	Apr 2010	Human	30 047	41 327	http://www.hprd.org/
MINT [61]	2019	All	26 099	130 733	https://mint.bio.uniroma2.it
STRING [62]	11.0	All	24 584 628	3 123 056 667	https://string-db.org/

Name	Version	Species	# proteins	# PPIs	Website
DIP [58]	Feb 2017	All	28 255	76 881	http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]	3.5.172	All	71 619	1 683 670	https://thebiogrid.org/
HPRD [60]	Apr 2010	Human	30 047	41 327	http://www.hprd.org/
MINT [61]	2019	All	26 099	130 733	https://mint.bio.uniroma2.it
STRING [62]	11.0	All	24 584 628	3 123 056 667	https://string-db.org/

Table 1

The summary of PPI databases

Name	Version	Species	# proteins	# PPIs	Website
DIP [58]	Feb 2017	All	28 255	76 881	http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]	3.5.172	All	71 619	1 683 670	https://thebiogrid.org/
HPRD [60]	Apr 2010	Human	30 047	41 327	http://www.hprd.org/
MINT [61]	2019	All	26 099	130 733	https://mint.bio.uniroma2.it
STRING [62]	11.0	All	24 584 628	3 123 056 667	https://string-db.org/

Name	Version	Species	# proteins	# PPIs	Website
DIP [58]	Feb 2017	All	28 255	76 881	http://dip.doe-mbi.ucla.edu/dip/Main.cgi
BioGRID [59]	3.5.172	All	71 619	1 683 670	https://thebiogrid.org/
HPRD [60]	Apr 2010	Human	30 047	41 327	http://www.hprd.org/
MINT [61]	2019	All	26 099	130 733	https://mint.bio.uniroma2.it
STRING [62]	11.0	All	24 584 628	3 123 056 667	https://string-db.org/

Protein complex databases

The Munich Information Center for Protein Sequence (MIPS) [69] provides protein complex databases for various species, where protein complexes are curated manually by biologists from scientific articles and are hierarchically structured. CYGD [63] is a yeast protein complex database in MIPS, which also documents physical and functional protein interactions and other genetic information. CORUM [64] is another protein complex database in MIPS, collecting mammalian protein complexes verified by small-scale experiments. In the latest version CORUM 3.0, most of the protein complexes are originated from human (67%), mouse (15%) and rat (10%).

CYC2008 [65] is the most used yeast protein complex database in this field, which documents 408 curated heteromeric protein complexes that were validated by small-scale experiments.

PISA [66] is a database of predicted protein complexes based on six different physicochemical properties of protein quaternary structures. A summary of these protein complex databases is shown in Table 2.

Table 2

The summary of protein complex databases

Name	Version	Species	# proteins	# complexes	Website
CYGD [63]	2006	Yeast	1189^a	203^a	http://mips.gsf.de/genre/proj/yeast
CORUM [64]	3.0	Mammalian	4473	4274	http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]	2.0	Yeast	1627	408	http://wodaklab.org/cyc2008
PISA [66]	1.48	All	Unknown	12241	http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html

Name	Version	Species	# proteins	# complexes	Website
CYGD [63]	2006	Yeast	1189^a	203^a	http://mips.gsf.de/genre/proj/yeast
CORUM [64]	3.0	Mammalian	4473	4274	http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]	2.0	Yeast	1627	408	http://wodaklab.org/cyc2008
PISA [66]	1.48	All	Unknown	12241	http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html

a

The number of complexes are reported in [70].

Table 2

The summary of protein complex databases

Name	Version	Species	# proteins	# complexes	Website
CYGD [63]	2006	Yeast	1189^a	203^a	http://mips.gsf.de/genre/proj/yeast
CORUM [64]	3.0	Mammalian	4473	4274	http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]	2.0	Yeast	1627	408	http://wodaklab.org/cyc2008
PISA [66]	1.48	All	Unknown	12241	http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html

Name	Version	Species	# proteins	# complexes	Website
CYGD [63]	2006	Yeast	1189^a	203^a	http://mips.gsf.de/genre/proj/yeast
CORUM [64]	3.0	Mammalian	4473	4274	http://mips.helmholtz-muenchen.de/corum
CYC2008 [65]	2.0	Yeast	1627	408	http://wodaklab.org/cyc2008
PISA [66]	1.48	All	Unknown	12241	http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html

a

The number of complexes are reported in [70].

Figure 1

The flowchart of identifying protein complexes from dynamic PPI networks. First, using gene expression profiles and subcellular location data to project the static PPI network as different network slices to construct the dynamic PPI network. In each network slice, proteins are expressed at the same time point, and only the proteins in the same subcellular location can interact with each other. Second, clustering methods are performed on each network slice to generate clusters. Finally, merging or filtering highly overlapping clusters to form protein complexes.

PPI networks

PPI networks are constructed from PPI data sets by denoting proteins as nodes and PPIs as edges. Typically, PPI networks are static, assuming every PPI is constant in any cell and condition. However, PPIs are dynamic inside cells, and their occurrences depend on spatial, temporal and/or contextual variation. Therefore, it is more reasonable to model PPI networks as dynamic systems [71]. Besides, as more and more PPI data from different species are available, constructing interolog PPI networks becomes a new trend to transfer biological knowledge across species [72].

Static PPI networks

Usually, static PPI networks are modeled as simple undirected graphs. For preserving the raw information of PPI determination experiments, several studies represent static PPI network as a bipartite graph, where one part of nodes corresponds to bait proteins and the other part corresponds to prey proteins [39, 73, 74]. Because network weights are crucial for protein complex identification [70], different kinds of biological data have been used to assign weights to PPI networks.

First, TAP-MS data are important information source to calculate PPI affinity scores. Different techniques derive scores that quantify the likelihood of two proteins being observed together by using different procedures and adopt these scores as the weights of edges in PPI networks [75].

Second, topological structure information is widely used to quantify the closeness between node pairs in PPI networks. On this aspect, several methods measure the closeness between two connected nodes based on their common neighbors and degrees [38, 76–78]. Later, NEOComplex [53] further accounts the common neighbors of common neighbors for defining the closeness score of two connected nodes. To improve the reliability of weights, AdjustCD [79] and PE-measure [30] are defined in an iterative manner. Because PPI networks are noisy and incomplete, Chen et al. [80] adapt networks by removing edges with low FS-weight that measuring the overlap degree of the neighborhoods of two nodes and introduce edges for indirectly connected nodes with high FS-weight.

Third, Gene Ontology (GO) [81, 82] is commonly used for evaluating PPI reliability. During the past decades, various semantic similarity measures of GO terms have been defined [83] and assigned as the weights of edges in PPI networks. Gene expression data are also widely used for assigning weights to edges in PPI networks. For example, Luo et al. [84] adopt the Pearson correlation coefficient (PCC) of the coding genes of proteins as the weight of an edge. SGNMF [41] assigns the ‘signs’ for edges based on the PCCs of genes, whereas DSMP [85] calculates the distances between gene expression profiles as weights of edges. Besides, ProRank [86] computes the protein sequence similarity score for PPIs as their weights.

Finally, integrating multiple biological data will lead to more accurate weights. In this regard, several approaches combine PPI reliability scores derived from different experimental evidences by different means [87, 88]. Other methods combine topology-based weights with other scores, such as GO similarity score [33, 89, 90] and PCCs of gene expression data [26].

Figure 2

The flowchart of identifying protein complexes from interolog PPI networks. First, the interolog PPI network is constructed by aligning multiple species’ PPI networks. Then, the conserved protein complexes are identified by directly clustering the interolog PPI network. In addition, the orthology protein complex groups are identified by projecting the predicted protein complexes from the source species to other species via the interolog PPI network.

Dynamic PPI networks

In order to capture the dynamic characteristics of PPIs in vivo, a number of studies construct dynamic PPI networks by mapping additional biological information into large-scale PPI data sets [91, 92]. Generally, dynamic PPI networks contain a series of network slices, each consisting of proteins expressed under a condition or at a time point and the corresponding PPIs [92].

Gene expression data are widely used for dynamic PPI network construction. In order to determine expressed proteins or co-expression protein pairs, several methods compare the expression level of proteins with different thresholds [92]. For reducing the impact of inherent background noise in gene expression data, DPPN [52] assigns the active probability at each time point for each protein according to the three-sigma principle [28]. The dynamic PPI networks aforementioned ignore the connections among network slices. To solve this problem, TS-OCD [34] distinguishes stable PPIs and transient PPIs to construct dynamic PPI networks. Stable PPIs are reserved in all network slices, while transient interactions only present in certain time point. Because PPIs also depend on protein subcellular location, several methods constructs more refined dynamic PPI networks by combining protein subcellular information with gene expression data [32, 46]. Other than groups PPIs by time or condition, several methods [55, 93] constructs dynamic PPI networks where each network slice consists of proteins with similar functions and the corresponding PPIs.

Figure 1 shows the general process of identifying protein complexes from dynamic PPI networks. The strategy of identifying protein complexes from dynamic PPI networks is clustering each network slice first and then merges or filters clusters from different network slices based on certain criteria.

Interolog PPI networks

PPIs and protein complexes are frequently conserved in evolution [94]. In order to accurately identify conserved protein complexes and transfer biology information across species, different interolog PPI networks have been developed.

In interolog PPI networks, nodes represent groups of similar proteins, each of which comes from a species, and edges represent conserved PPIs. To determine which proteins are similar and assign edges for which node pairs, various approaches have been developed. The simplest way is assuming every two proteins of different species are similar and linking every two nodes by an edge [95]. A more refined approach is assigning nodes for sequence-similar protein pairs and linking two nodes only if at least one species’ protein pair interacts; meanwhile, the other species’ protein pair share a common interaction partner [96]. In order to control the size of interolog PPI networks, CAPPI [97] groups proteins into families and then assigns a node for each family. The approaches aforementioned ignore the information of interaction partners when determining similarity relationship among proteins. In this regard, IsoRank [72] combines the protein sequences alignment scores and their PPI network neighbor match scores to determine similar protein groups. Since many conserved cellular mechanisms across species cannot be revealed only by protein sequence similarity, COCIN [98] combines domain conservation information with sequence similarity for computing functional similar protein groups.

Figure 2 illustrates the general process of identifying protein complexes from interolog PPI networks. There are two strategies to identify protein complexes from interolog PPI networks. The first strategy directly clusters the interolog PPI network and maps the clustering results to each species. The other strategy clusters the PPI network of a selected species, first, and then maps the identified results into other species via the interolog PPI network.

Figure 3

The schematic of computational methods for protein complex identification. Cluster-quality-based methods and node-affinity-based methods measure protein complexes from different aspect, and ensemble clustering methods are developed based on the former two categories of methods.

Methods

Computational methods detect meaningful clusters from PPI networks as protein complexes. According to their methodologies, these methods can be organized into three categories, including cluster-quality-based methods, node-affinity-based methods and ensemble clustering methods. Cluster-quality-based methods treat clusters as measure units; node-affinity-based methods measure the affinity between nodes inside clusters. However, the two categories are not completely disconnected. Actually, some node-affinity-based methods select subgraphs with high cluster-quality as seeds to construct clusters. Finally, ensemble clustering methods combine the outputted clusters of different methods to mine final complexes. Figure 3 shows the schematic and relations of those categories of protein complex identification methods. It should be noticed that each clustering method can be applied to all of the three types of PPI networks to predict protein complexes, as long as it is combined with the corresponding protein complex identifying strategy of each type of network as discussed in the former section.

Cluster-quality-based methods

Cluster-quality-based methods define cluster quality functions and design corresponding algorithms to extract clusters from PPI networks. According to their measuring object, cluster quality functions can be further divided into local-cluster-quality functions and global-cluster-quality functions. Local-cluster-quality functions measure individual clusters, while Global-cluster-quality functions measure all clusters by considering them as a whole.

Local-cluster-quality-based methods

Local-cluster-quality-based methods identify individual clusters with optimal local-cluster-quality in a seed growth manner. Clusters are grown from different seeds via iteratively adding or removing nodes.

Because many protein complexes show dense structures in PPI networks [1], several studies directly employ graph density as the local-cluster-quality function of individual clusters. Starting from a randomly connected node set, Spirin et al. [1] maximize the density of a cluster by ‘moving’ nodes into or out from the cluster according to the Metropolis criteria. Finally, statistically insignificant nodes are removed from clusters, and overlapping clusters are merged if their union remains dense. NCMine [51], Core&Peel [49] and TP-WDPIN [45] identify clusters with density larger than a threshold value. NCMine [51] grows a cluster from each node by adding neighbors according to degree centrality order. During addition, the density change of the cluster must be less than a threshold. Finally, highly overlapping clusters are merged if their union remains dense. Core&Peel [49] extracts clusters from the closed neighborhood of each node based on the degree and core number of nodes. TP-WDPIN [45] selects high-topology potential nodes as seeds. The algorithm first extracts cores from the closed neighborhoods of seeds and then generates a cluster from each core by adding all neighbors that increase the density of the core. After generating all clusters, both Core&Peel [49] and TP-WDPIN [45] discard clusters that are completely contained by others. In order to take advantage of edge weights, DME [99] enumerates all local maximum subgraphs with weighted density exceeding a certain threshold as clusters. NEOComplex [53] expands clusters from each node by recursively adding the neighbor with highest weight of edge connected to the newly added node, until the weighted density of the cluster falls below a threshold. Finally, the algorithm discards highly overlapping clusters and all those with sizes of less than three.

In general, a cluster is considered as a network region with internal connections denser than external connections [100], and many quantitative definitions of this property have been proposed. For example, Chen et al. [101] define the quantitative definition of a cluster based on the in-degree and out-degree of nodes inside the cluster and adapt the GN algorithm [102] to generate clusters. miPALM [77] defines the parametric local modularity of individual clusters. Clusters with maximum parametric local modularity are grown from three-node cliques in a greedy way. Finally, miPALM merges highly overlapping clusters and discards sparse clusters. Cho et al. [103, 104] define the entropy for each cluster. Clusters with minimal graph entropy are generated from the closed neighborhood of each node. Since PPI networks often contain many false-positive edges, methods such as HC-PIN [105], LF-PIN [106] and ClusterONE [70] incorporate weights of edges into cluster quantitative definitions. HC-PIN [105] defines the quantitative definition of a cluster based on the weighted in-degree and out-degree of nodes. The algorithm expands singleton clusters by greedily adding edges according to the clustering value order of edges. Finally, all small clusters are discarded. LF-PIN [106] defines the local fitness of a cluster by multiplying the HC-PIN quantitative definition and the density of the cluster. By iteratively selecting the unclustered edge with highest weight as seeds, clusters with maximal local fitness are constructed by adding or removing nodes via a greedy procedure. ClusterONE [70] defines the cohesiveness score of a cluster based on the cluster size and the intra- and extra-cluster weighted degrees of nodes. The algorithm iteratively selects the unclustered node with highest degree as the seed of a cluster. Clusters with maximal cohesiveness score are constructed by greedily adding or removing nodes. Finally, ClusterONE merges highly overlapping clusters and discards all small and sparse clusters. Later, several improved methods have been proposed. PCE-FR [44] detects pseudocliques based on the fuzzy relation between nodes as seeds of clusters; Zhang et al. [43] redefine the cohesiveness score of a cluster based on the number of triangles related to the cluster; EPOF [107] removes the penalty term from the dominator of the cohesiveness score function.

Because PPI data are imperfect, several methods [73, 95, 96, 108] quantify the likelihood ratio score that a subgraph of the PPI network being a protein complex based on different probability models. The first three methods define different types of seeds and generate individual clusters with maximum likelihood ratio score by adding or removing nodes via a greedy procedure. The last method selects each node as a seed and constructs maximal likelihood ratio score clusters by employing the simulated annealing search.

Global-cluster-quality-based methods

Global-cluster-quality-based methods search for an optimal clustering result with best global-cluster-quality function value. In order to obtain a set of dense clusters, RNSC [109] defines a naive and a scaled cost function for the clustering result. The algorithm first minimizes the naive cost and then the scaled cost of the clustering result by moving nodes between clusters via a restricted procedure. Finally, all small, sparse and weak functional homogeneity clusters are removed. MAE-FMD [33] searches for the clustering result with maximum modularity [110] by applying evolutionary operators on different clustering results. Finally, the algorithm merges functionally similar clusters and discards all singleton and sparse clusters. LCMA [111] considers maximum cliques as initial clusters and iteratively merges highly overlapping clusters until the average density of all clusters is less than a parameter. Ramadan et al. [50] define the fitness value of a clustering result based on the density of individual clusters and apply a genetic algorithm to search the clustering result with maximum fitness value. GS [112, 113] defines the graph description length of a clustering result and searches for the clustering result with minimal description length by merging clusters in a greedy manner.

The aforementioned methods are unable to detect clusters containing peripheral proteins. Therefore, several generative network models, such as RSGNM [114], GMFTP [37] and SGNMF [41], have been proposed. These models generate real PPI networks based on a set of propensity parameters that represent how likely a protein belongs to a protein complex. For identifying clusters, the probability of generating real PPI networks is optimized by tuning propensity parameters. RSGNM [114] generates the PPI network according to Poisson distribution and uses the Laplacian regularizer to smooth estimators of propensity parameters. To mitigate the effects of false-positive PPI noise, GMFTP [37] maximizes the joint probability of generating a real PPI network and protein function profiles. In GMFTP, the PPI network and protein function profiles are generated according to Bernoulli distribution. Ou-Yang et al. [47] maximize the joint probability of generating the PPI network and the domain–domain interaction (DDI) network, where the propensity parameters obey the Bernoulli distribution. SGNMF [41] incorporates a Laplacian regularizer into the Kullback–Leibler divergence between the generated network and the real PPI network.

Since some protein complexes do not show dense structures in PPI networks, Wang and Qian [36] identified clusters whose member nodes have similar neighbors. They define the conductance of a clustering result based on the two-hop transition probability between nodes and then propose two algorithms, SLCP² and GLSP², to search the minimum conductance clustering result. SLCP² is a spectral approximate algorithm for identifying non-overlapping clusters, while GLSP² is a bottom-up greedy strategy-based algorithm and can identify overlapping clusters.

In order to measure clustering results more comprehensively, several methods combine different global-cluster-quality functions to define a unified function. For example, PPSampler [27] defines the unified function of a clustering result as the negative sum of three different score functions. The algorithm searches the clustering result with minimum unified function score by using a Markov chain Monte Carlo method. Later, PPSampler2 [29] alters the distribution of the three score functions; PPSampler2-PIME [40] adds a regularizer of exclusive PPIs into the unified function of PPSampler2.

Figure 4

The flowchart of ensemble clustering methods. The base clusters are generated by applying different methods on the PPI network first, and then they are fused to construct a consensus matrix or a cluster-cluster adjacency matrix. Finally, protein complexes are identified from the consensus matrix or the cluster–cluster adjacency matrix.

Node-affinity-based methods

Cluster-quality-based methods ignore the inherent relationship among nodes when identifying clusters. To overcome this problem, node-affinity-based methods have been proposed. Node-affinity-based methods usually generate clusters from seeds by expanding neighbors with high-affinity scores. Based on the measuring scope, node-affinity functions can be further divided into local-node-affinity functions and global-node-affinity functions. Local-node-affinity functions measure the local neighborhood affinity between nodes, while global-node-affinity functions measure the global network affinity between nodes.

Table 3

The summary of available web servers, stand-alone tools and source code for protein complex identification

Method	Category^a	Year	Website
ClusterONE [70]	LQ	2012	http://www.paccanarolab.org/cluster-one
Core&Peel [49]	LQ	2016	http://bioalgo.iit.cnr.it/index.php?pg=ppin
NCmine [51]	LQ	2016	http://apps.cytoscape.org/apps/ncmine
RNSC [109]	GQ	2004	http://www.cs.toronto.edu/~juris/data/rnsc
RSGNM [114]	GQ	2012	https://github.com/Zhangxf-ccnu/RSGNM
SGNMF [41]	GQ	2015	https://github.com/hongyusu/SGNMF
MCODE [115]	LA	2003	http://apps.cytoscape.org/apps/mcode
CFinder [117]	LA	2006	http://www.cfinder.org
DPClus [76]	LA	2006	http://kanaya.naist.jp/DPClus
SPICi [119]	LA	2010	http://compbio.cs.princeton.edu/spici
CMC [79]	LA	2009	http://www.comp.nus.edu.sg/~wongls/projects/complexprediction/CMC-26may09
CORE [122]	LA	2009	http://i.cs.hku.hk/~alse/hkubrg/projects/core
PEWCC [30]	LA	2013	http://faculty.uaeu.ac.ae/nzaki/Research.htm
ProRank+ [57]	LA	2014	http://faculty.uaeu.ac.ae/nzaki/Research.htm
MCL [127, 128]	GA	2000	https://micans.org/mcl
RRW [133]	GA	2009	http://imi.kyushu-u.ac.jp/~om
NWE [134]	GA	2010	http://imi.kyushu-u.ac.jp/~om

Method	Category^a	Year	Website
ClusterONE [70]	LQ	2012	http://www.paccanarolab.org/cluster-one
Core&Peel [49]	LQ	2016	http://bioalgo.iit.cnr.it/index.php?pg=ppin
NCmine [51]	LQ	2016	http://apps.cytoscape.org/apps/ncmine
RNSC [109]	GQ	2004	http://www.cs.toronto.edu/~juris/data/rnsc
RSGNM [114]	GQ	2012	https://github.com/Zhangxf-ccnu/RSGNM
SGNMF [41]	GQ	2015	https://github.com/hongyusu/SGNMF
MCODE [115]	LA	2003	http://apps.cytoscape.org/apps/mcode
CFinder [117]	LA	2006	http://www.cfinder.org
DPClus [76]	LA	2006	http://kanaya.naist.jp/DPClus
SPICi [119]	LA	2010	http://compbio.cs.princeton.edu/spici
CMC [79]	LA	2009	http://www.comp.nus.edu.sg/~wongls/projects/complexprediction/CMC-26may09
CORE [122]	LA	2009	http://i.cs.hku.hk/~alse/hkubrg/projects/core
PEWCC [30]	LA	2013	http://faculty.uaeu.ac.ae/nzaki/Research.htm
ProRank+ [57]	LA	2014	http://faculty.uaeu.ac.ae/nzaki/Research.htm
MCL [127, 128]	GA	2000	https://micans.org/mcl
RRW [133]	GA	2009	http://imi.kyushu-u.ac.jp/~om
NWE [134]	GA	2010	http://imi.kyushu-u.ac.jp/~om

a

‘LQ’ represents the local-cluster-quality-based methods, ‘GQ’ represents the global-cluster-quality-based methods, ‘LA’ represents the Local-node-affinity-based methods and ‘GA’ represents the global-node-affinity methods.

Table 3

The summary of available web servers, stand-alone tools and source code for protein complex identification

Method	Category^a	Year	Website
ClusterONE [70]	LQ	2012	http://www.paccanarolab.org/cluster-one
Core&Peel [49]	LQ	2016	http://bioalgo.iit.cnr.it/index.php?pg=ppin
NCmine [51]	LQ	2016	http://apps.cytoscape.org/apps/ncmine
RNSC [109]	GQ	2004	http://www.cs.toronto.edu/~juris/data/rnsc
RSGNM [114]	GQ	2012	https://github.com/Zhangxf-ccnu/RSGNM
SGNMF [41]	GQ	2015	https://github.com/hongyusu/SGNMF
MCODE [115]	LA	2003	http://apps.cytoscape.org/apps/mcode
CFinder [117]	LA	2006	http://www.cfinder.org
DPClus [76]	LA	2006	http://kanaya.naist.jp/DPClus
SPICi [119]	LA	2010	http://compbio.cs.princeton.edu/spici
CMC [79]	LA	2009	http://www.comp.nus.edu.sg/~wongls/projects/complexprediction/CMC-26may09
CORE [122]	LA	2009	http://i.cs.hku.hk/~alse/hkubrg/projects/core
PEWCC [30]	LA	2013	http://faculty.uaeu.ac.ae/nzaki/Research.htm
ProRank+ [57]	LA	2014	http://faculty.uaeu.ac.ae/nzaki/Research.htm
MCL [127, 128]	GA	2000	https://micans.org/mcl
RRW [133]	GA	2009	http://imi.kyushu-u.ac.jp/~om
NWE [134]	GA	2010	http://imi.kyushu-u.ac.jp/~om

Method	Category^a	Year	Website
ClusterONE [70]	LQ	2012	http://www.paccanarolab.org/cluster-one
Core&Peel [49]	LQ	2016	http://bioalgo.iit.cnr.it/index.php?pg=ppin
NCmine [51]	LQ	2016	http://apps.cytoscape.org/apps/ncmine
RNSC [109]	GQ	2004	http://www.cs.toronto.edu/~juris/data/rnsc
RSGNM [114]	GQ	2012	https://github.com/Zhangxf-ccnu/RSGNM
SGNMF [41]	GQ	2015	https://github.com/hongyusu/SGNMF
MCODE [115]	LA	2003	http://apps.cytoscape.org/apps/mcode
CFinder [117]	LA	2006	http://www.cfinder.org
DPClus [76]	LA	2006	http://kanaya.naist.jp/DPClus
SPICi [119]	LA	2010	http://compbio.cs.princeton.edu/spici
CMC [79]	LA	2009	http://www.comp.nus.edu.sg/~wongls/projects/complexprediction/CMC-26may09
CORE [122]	LA	2009	http://i.cs.hku.hk/~alse/hkubrg/projects/core
PEWCC [30]	LA	2013	http://faculty.uaeu.ac.ae/nzaki/Research.htm
ProRank+ [57]	LA	2014	http://faculty.uaeu.ac.ae/nzaki/Research.htm
MCL [127, 128]	GA	2000	https://micans.org/mcl
RRW [133]	GA	2009	http://imi.kyushu-u.ac.jp/~om
NWE [134]	GA	2010	http://imi.kyushu-u.ac.jp/~om

a

‘LQ’ represents the local-cluster-quality-based methods, ‘GQ’ represents the global-cluster-quality-based methods, ‘LA’ represents the Local-node-affinity-based methods and ‘GA’ represents the global-node-affinity methods.

Local-node-affinity-based methods

To detect dense clusters, MCODE [115] and SWEMODE [116] quantify local-node-affinity based on node weight. MCODE [115] assigns weight to each node based on the highest k-core that contains the node. Clusters are generated by iteratively taking the unseen highest weight node as the seed and then recursively adding unseen neighbors. When all clusters have been generated, the algorithm further processes them by the ‘fluff’ and ‘haircut’ operation. Later, SWEMODE [116] proposes two different schemes to assign weights to nodes. Some methods quantify local-node-affinity based on the number of edges among nodes. For example, Spirin et al. [1] suggest that nodes inside a cluster should completely link to each other and enumerate all maximum cliques in the PPI network as clusters. CFinder [117], DPClus [76] and ICPA [118] assume that the intra-cluster degree of each node should be larger than a threshold value. CFinder [117] sets the threshold to be a constant value and identifies k-clique communities as clusters. DPClus [76] sets the threshold of a cluster based on the density and size of the cluster. The algorithm generates a cluster by selecting the node with the highest weighted degree as the seed and then adding neighbors according to node priorities. During expanding, the density of the cluster must be higher than a threshold. When a cluster cannot be further expanded, it is removed from the network for generating the next cluster. IPCA [118] sets the threshold of a cluster based on the size of the cluster. The algorithm constructs clusters in a procedure similar as that of DPClus [76] but ensures the diameter of a cluster is less than a threshold during expanding. Furthermore, nodes would not be removed from the network after a cluster is generated. SPICi [119], OIIP [89] and MKE [35] suggest that the intra-cluster weighted degree of each node should be larger than a threshold. SPICi [119] defines the threshold based on the weighted density and size of the cluster and generates clusters in a procedure similar to that of DPClus [76]. OIIP [89] defines the threshold based on the total edge weights inside the cluster and generates clusters in a procedure similar to that of ICPA [118]. MKE [35] defines the threshold based on the edge weights of the closed neighborhood of the cluster. The algorithm generates clusters by taking high degree nodes as seeds and then iteratively adding weighted best neighbors. CMC [79] suggests that the connectivity between nodes inside a cluster should be larger than a threshold. The algorithm mines all maximal cliques in the PPI network as initial clusters and then ranks them in decreasing weighted density order. Finally, for every highly overlapping clique pair, CMC either discards the one with lower weighted density or merges them into one clique according to the interconnectivity score between them. SCAN [120] assumes that two nodes inside a cluster should share many common neighbors. The algorithm defines the structural similarity for connected node pairs and generates clusters by selecting core nodes as seeds and then recursively including structure-reachable nodes.

Because many real protein complexes exhibit the core attachment architecture [8], some algorithms first detect dense regions in PPI networks as cores and then integrate attachment nodes with each core to form clusters. For example, Pu et al. [121] adopt the non-overlapping clusters outputted by existing methods as cores and take neighbors connected with sufficient fraction of a core as the attachment nodes of that core. CORE [122], COACH [123] and PEWCC [30] take neighbors that connected with more than half of the nodes of a core as the attachment nodes of that core. CORE [122] identifies cores based on the probability of two nodes forming a protein complex core; COACH [123] identifies dense cores from the closed neighborhood of each node by using the core removal algorithm; PEWCC [30] identifies cores with weighted clustering coefficient larger than a threshold by using a procedure similar to that of COACH [123]. DCU [38] identifies high expected density cores from the closed neighborhood of each node via a greedy procedure. The algorithm selects neighbors with relative degree inside a core larger than a threshold as the attachment nodes of that core. WPNCA [42] extracts cores with high weighted density from dense preliminary clusters. For each core, its attachment nodes are selected from the preliminary cluster containing it.

Different from previous conclusions, Chen et al. [124] discover that most protein complexes are sparse in PPI networks. They identify starlike clusters by taking high degree nodes as seeds and then adding neighbors via a random procedure. ProRank [86] and ProRank+ [57] apply the spoke model to generate clusters based on node types and importance values and then merge highly overlapping clusters by different procedures. Dutkowski et al. [97] assume that nodes connected by high weighted edges are likely to form a cluster. They compute edge weights by using a PPI network evolution model, and then remove all low weight edges from the PPI network. After that, each connected component of the remaining network is outputted as a cluster.

Combining additional biological data will predict more biological meaningful clusters. For example, CSO [31] integrates the GO slims [81, 82] of proteins and the core attachment architecture to identify clusters. SPIC [125] splits clusters generated by existing methods into smaller and more refined ones by eliminating mutually exclusive PPIs. Ozawa et al. [126] discard clusters according to DDIs.

Global-node-affinity-based methods

In order to identify dense clusters, several methods such as MCL [127, 128], R-MCL [129], SR-MCL [130] and STM [131] simulate flows inside PPI networks. MCL [127, 128] alternately performs the expansion and inflation operator to the stochastic matrix of a PPI network. The expansion operator simulates flow through nodes, and the inflation operator is responsible for strengthening intra-cluster currents, meanwhile weakening inter-cluster currents. When the stochastic matrix became doubly idempotent, each attractor system in it is outputted as a cluster. To minimize the number of unreasonable size clusters, R-MCL [129] replaces the expansion operator with the ‘Regularize’ operator. In order to identify overlapping clusters, SR-MCL [130] modifies the ‘Regularize’ operator. The algorithm is iteratively re-executed via decreasing the flow to attractor nodes of previous iterations to generate different clustering results. Cho et al. [132] propagate information from an informative node to every other node in the PPI network through every possible path and take each informative node with its information coverage as a cluster. Finally, strong interconnection clusters are merged in a greedy manner. STM [131] propagates signal between every node pair based on the Erlang distribution and then generates clusters based on the signal value received among nodes. Other methods such as RRW [133], NWE [134] and RWRB [39] identify clusters based on random walks with restart. RRW [133] assumes that nodes inside a cluster have equal impact on determining new nodes of the cluster. By selecting high nodes as seeds, the algorithm expands clusters with restricted size in a greedy manner. NWE [134] argues that the nodes with different weight should have different impact on determining new nodes of a cluster. RWRB [39] incorporates the protein bait–prey relationship from TAP-MS data onto the random walk with restart and generates clusters in a procedure similar to that of MCL [127, 128].

Ensemble clustering methods

Figure 4 shows the general process of how ensemble clustering methods identify protein complexes from PPI networks. Each computational method has its unique advantages and shortcomings. Therefore, ensemble clustering methods are proposed to overcome the bias of different methods and generate more accurate clusters. According to their fusing strategies, ensemble clustering methods can be divided into consensus methods and meta-clustering methods.

Consensus methods fuse base clusters by different approa-ches to construct a consensus matrix and then identify final complexes from it. In the consensus matrix, each entry indicates the co-clustered strength of two proteins. Asur et al. [135] calculate the co-clustered strength of two proteins based on the reliability score of base clusters containing the two proteins. TINCD [48] constructs the final consensus matrix by fusing the TAP data based co-complex score matrix and consensus matrices of different PPI networks and proposes a new algorithm to identify final clusters. EnsemHC [54] also combines the co-complex scores that were derived from TAP-MS experiments [75] and the co-clustered strength of pairs of proteins to define a co-cluster matrix for each base clustering result. The consensus matrix is then constructed by fusing co-cluster matrices via an iterative procedure. Friedel et al. [136] define the co-clustered strength of a protein pair as the proportion of Bootstrap sample PPI networks where they are clustered together.

Meta-clustering methods further cluster base clustering results by considering each base cluster as a unit. Greene et al. [137] represent base clustering results into a complete weighted graph, where nodes represents base clusters and the weights of edges are assigned as the [0, 1]-normalized Pearson correlation between base clusters. By considering each node as a meta-cluster, they iteratively merge meta-clusters in a greedy manner to construct a hierarchical tree and then remove all unstable subtrees from the hierarchical tree.

Evaluation

In this section, we comprehensively evaluate the performance of different computational methods on two widely used benchmark PPI datasets, that is, the DIP dataset [58] and the Collins dataset [138].

Evaluation metrics

Seven metrics are commonly used for evaluating protein complex identification methods, including Precision, Recall, F-measure, Sensitivity (Sn), Positive Predictive Value (PPV), Accuracy (Acc), and Proportion of Significant Complexes (PSC).

Given a pair of protein sets A and B, the overlap score (OS) between them is defined as [115]

$$\begin{align} \mathrm{OS}\left(A,B\right)=\frac{{\left|A\cap B\right|}^2}{\left|A\right|\left|B\right|} \end{align}$$

(1)

If |$\mathrm{OS}\Big(A,B\Big)\ge \upomega$|⁠, A and B are considered to match each other. Let P and R represent the sets of identified complexes and reference complexes, respectively, Precision, Recall and F-measure are defined as [80]

$$\begin{equation} \mathrm{Precision}=\frac{\left|\left\{p\left|p\in P\wedge \exists r\in R,p\ \mathrm{matches}\ r\right.\right\}\right|}{\left|P\right|} \end{equation}$$

(2)

$$\begin{equation} \mathrm{Recall}=\frac{\left|\left\{r\left|r\in R\wedge \exists p\in P,r\ \mathrm{matches}\ p\right.\right\}\right|}{\left|R\right|} \end{equation}$$

(3)

$$\begin{equation} \mathrm{F}-\mathrm{measure}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \end{equation}$$

(4)

Given n reference protein complexes and m identified clusters, a confusion matrix T can be represented as

$$\begin{equation} \mathbf{T}=\left[\begin{array}{ccc}{t}_{1,1}& \mathbf{\cdots}& {t}_{1,m}\\ {}\mathbf{\vdots}& \mathbf{\ddots}& \mathbf{\vdots}\\ {}{t}_{n,1}& \mathbf{\cdots}& {t}_{n,m}\end{array}\right] \end{equation}$$

(5)

where |${t}_{ij}$| indicates the number of proteins that the ith protein complex and the jth cluster have in common. Based on T, Sn, PPV and Acc are defined as [139]

$$\begin{equation} \mathrm{Sn}=\frac{\sum_{i=1}^n\underset{j}{\max }{t}_{ij}}{\sum_{i=1}^n{N}_i} \end{equation}$$

(6)

$$\begin{equation} \mathrm{PPV}=\frac{\sum_{j=1}^m\underset{i}{\max }{t}_{ij}}{\sum_{j=1}^m\sum_{i=1}^n{t}_{ij}} \end{equation}$$

(7)

$$\begin{equation} \mathrm{Acc}=\sqrt{\mathrm{Sn}\times \mathrm{PPV}} \end{equation}$$

(8)

In Equation (6), |${N}_i$| is the size of the ith reference protein complex.

Because known protein complexes are incomplete, a predicted protein complex still could be a true complex even without matching any known complexes. Therefore, the PSC is adopted to evaluate predicted results [21]. For a predicted protein complex, if its functional homogeneity P-value does not exceed a threshold value, then it is considered to be functional significant. The P-value of a predicted protein complex C with respect to a functional homologous protein group F is defined as [140, 141]

$$\begin{align} P-\mathrm{value}=1-\sum_{i=0}^{k-1}\frac{\left(\begin{array}{c}\left|F\right|\\ {}i\end{array}\right)\left(\begin{array}{c}\left|V\right|-\left|F\right|\\ {}\left|C\right|-i\end{array}\right)}{\left(\begin{array}{c}\left|V\right|\\ {}\left|C\right|\end{array}\right)} \end{align}$$

(8)

where k is the number of common proteins shared by C and F and V is the entire protein set of the PPI network. For each predicted complex, its smallest P-value over all functional homologous protein groups is reported as its P-value [109].

Table 4

Performance comparison of 17 computational methods using CYC2008 as reference protein complex set

Category^a	Method^b^, ^c	# predicted complexes	Precision	Recall	F-measure	Sn	PPV	ACC
(a) DIP dataset
LQ	ClusterONE [70]	361	0.421	0.353	0.384	0.398	0.664	0.514
	Core&Peel [49]	788	0.456	0.466	0.461	0.461	0.553	0.505
	NCMine [51]	1520	0.318	0.510	0.392	0.517	0.397	0.453
GQ	RNSC [109]	238	0.555	0.382	0.453	0.406	0.666	0.520
	RSGNM [114]	560	0.332	0.480	0.393	0.538	0.554	0.546
	SGNMF [41]	686	0.310	0.566	0.401	0.594	0.580	0.587
LA	MCODE [115]	137	0.628	0.252	0.360	0.306	0.632	0.440
	CFinder [117]	133	0.632	0.245	0.353	0.482	0.376	0.425
	DPClus [76]	322	0.494	0.426	0.458	0.467	0.580	0.521
	SPICi [119]	325	0.360	0.353	0.356	0.499	0.523	0.511
	CMC [79]	560	0.527	0.517	0.522	0.591	0.515	0.552
	CORE [122]	1644	0.155	0.630	0.249	0.597	0.497	0.545
	PEWCC [30]	1282	0.500	0.574	0.534	0.595	0.493	0.542
	ProRank+ [57]	433	0.270	0.262	0.266	0.355	0.482	0.414
GA	MCL [127, 128]	1570	0.199	0.708	0.311	0.516	0.705	0.603
	RRW [133]	384	0.466	0.424	0.444	0.460	0.492	0.476
	NWE [134]	365	0.493	0.421	0.455	0.435	0.594	0.508

(b) Collins dataset
LQ	ClusterONE [70]	323	0.625	0.569	0.596	0.658	0.625	0.642
	Core&Peel [49]	407	0.526	0.370	0.434	0.509	0.600	0.553
	NCMine [51]	822	0.691	0.385	0.494	0.568	0.557	0.563
GQ	RNSC [109]	352	0.565	0.564	0.565	0.602	0.732	0.664
	RSGNM [114]	283	0.555	0.468	0.508	0.579	0.602	0.590
	SGNMF [41]	290	0.666	0.485	0.561	0.564	0.633	0.598
LA	MCODE [115]	142	0.746	0.331	0.459	0.586	0.515	0.549
	CFinder [117]	114	0.789	0.272	0.405	0.569	0.492	0.529
	DPClus [76]	314	0.615	0.559	0.585	0.620	0.680	0.649
	SPICi [119]	130	0.815	0.355	0.495	0.542	0.643	0.590
	CMC [79]	225	0.800	0.333	0.471	0.550	0.579	0.565
	CORE [122]	384	0.539	0.566	0.552	0.622	0.658	0.640
	PEWCC [30]	589	0.788	0.306	0.441	0.507	0.586	0.545
	ProRank+ [57]	659	0.717	0.363	0.482	0.577	0.485	0.529
GA	MCL [127, 128]	272	0.640	0.547	0.589	0.677	0.641	0.659
	RRW [133]	278	0.540	0.441	0.485	0.494	0.694	0.586
	NWE [134]	527	0.507	0.630	0.562	0.589	0.708	0.645

Category^a	Method^b^, ^c	# predicted complexes	Precision	Recall	F-measure	Sn	PPV	ACC
(a) DIP dataset
LQ	ClusterONE [70]	361	0.421	0.353	0.384	0.398	0.664	0.514
	Core&Peel [49]	788	0.456	0.466	0.461	0.461	0.553	0.505
	NCMine [51]	1520	0.318	0.510	0.392	0.517	0.397	0.453
GQ	RNSC [109]	238	0.555	0.382	0.453	0.406	0.666	0.520
	RSGNM [114]	560	0.332	0.480	0.393	0.538	0.554	0.546
	SGNMF [41]	686	0.310	0.566	0.401	0.594	0.580	0.587
LA	MCODE [115]	137	0.628	0.252	0.360	0.306	0.632	0.440
	CFinder [117]	133	0.632	0.245	0.353	0.482	0.376	0.425
	DPClus [76]	322	0.494	0.426	0.458	0.467	0.580	0.521
	SPICi [119]	325	0.360	0.353	0.356	0.499	0.523	0.511
	CMC [79]	560	0.527	0.517	0.522	0.591	0.515	0.552
	CORE [122]	1644	0.155	0.630	0.249	0.597	0.497	0.545
	PEWCC [30]	1282	0.500	0.574	0.534	0.595	0.493	0.542
	ProRank+ [57]	433	0.270	0.262	0.266	0.355	0.482	0.414
GA	MCL [127, 128]	1570	0.199	0.708	0.311	0.516	0.705	0.603
	RRW [133]	384	0.466	0.424	0.444	0.460	0.492	0.476
	NWE [134]	365	0.493	0.421	0.455	0.435	0.594	0.508

(b) Collins dataset
LQ	ClusterONE [70]	323	0.625	0.569	0.596	0.658	0.625	0.642
	Core&Peel [49]	407	0.526	0.370	0.434	0.509	0.600	0.553
	NCMine [51]	822	0.691	0.385	0.494	0.568	0.557	0.563
GQ	RNSC [109]	352	0.565	0.564	0.565	0.602	0.732	0.664
	RSGNM [114]	283	0.555	0.468	0.508	0.579	0.602	0.590
	SGNMF [41]	290	0.666	0.485	0.561	0.564	0.633	0.598
LA	MCODE [115]	142	0.746	0.331	0.459	0.586	0.515	0.549
	CFinder [117]	114	0.789	0.272	0.405	0.569	0.492	0.529
	DPClus [76]	314	0.615	0.559	0.585	0.620	0.680	0.649
	SPICi [119]	130	0.815	0.355	0.495	0.542	0.643	0.590
	CMC [79]	225	0.800	0.333	0.471	0.550	0.579	0.565
	CORE [122]	384	0.539	0.566	0.552	0.622	0.658	0.640
	PEWCC [30]	589	0.788	0.306	0.441	0.507	0.586	0.545
	ProRank+ [57]	659	0.717	0.363	0.482	0.577	0.485	0.529
GA	MCL [127, 128]	272	0.640	0.547	0.589	0.677	0.641	0.659
	RRW [133]	278	0.540	0.441	0.485	0.494	0.694	0.586
	NWE [134]	527	0.507	0.630	0.562	0.589	0.708	0.645

^aSee the footnote of Table 3.

^bAll methods are rerun by us. The predicted result of each method is obtained by tuning its parameters to maximize F-measure, and the optimized parameter values are given in Table S1 in Supporting Information S1.

^cThe best performance is highlighted with bold font, the performance ranked second is highlighted with underline and the performance ranked third is highlighted with italic in each column.

Table 4

Performance comparison of 17 computational methods using CYC2008 as reference protein complex set

Category^a	Method^b^, ^c	# predicted complexes	Precision	Recall	F-measure	Sn	PPV	ACC
(a) DIP dataset
LQ	ClusterONE [70]	361	0.421	0.353	0.384	0.398	0.664	0.514
	Core&Peel [49]	788	0.456	0.466	0.461	0.461	0.553	0.505
	NCMine [51]	1520	0.318	0.510	0.392	0.517	0.397	0.453
GQ	RNSC [109]	238	0.555	0.382	0.453	0.406	0.666	0.520
	RSGNM [114]	560	0.332	0.480	0.393	0.538	0.554	0.546
	SGNMF [41]	686	0.310	0.566	0.401	0.594	0.580	0.587
LA	MCODE [115]	137	0.628	0.252	0.360	0.306	0.632	0.440
	CFinder [117]	133	0.632	0.245	0.353	0.482	0.376	0.425
	DPClus [76]	322	0.494	0.426	0.458	0.467	0.580	0.521
	SPICi [119]	325	0.360	0.353	0.356	0.499	0.523	0.511
	CMC [79]	560	0.527	0.517	0.522	0.591	0.515	0.552
	CORE [122]	1644	0.155	0.630	0.249	0.597	0.497	0.545
	PEWCC [30]	1282	0.500	0.574	0.534	0.595	0.493	0.542
	ProRank+ [57]	433	0.270	0.262	0.266	0.355	0.482	0.414
GA	MCL [127, 128]	1570	0.199	0.708	0.311	0.516	0.705	0.603
	RRW [133]	384	0.466	0.424	0.444	0.460	0.492	0.476
	NWE [134]	365	0.493	0.421	0.455	0.435	0.594	0.508

(b) Collins dataset
LQ	ClusterONE [70]	323	0.625	0.569	0.596	0.658	0.625	0.642
	Core&Peel [49]	407	0.526	0.370	0.434	0.509	0.600	0.553
	NCMine [51]	822	0.691	0.385	0.494	0.568	0.557	0.563
GQ	RNSC [109]	352	0.565	0.564	0.565	0.602	0.732	0.664
	RSGNM [114]	283	0.555	0.468	0.508	0.579	0.602	0.590
	SGNMF [41]	290	0.666	0.485	0.561	0.564	0.633	0.598
LA	MCODE [115]	142	0.746	0.331	0.459	0.586	0.515	0.549
	CFinder [117]	114	0.789	0.272	0.405	0.569	0.492	0.529
	DPClus [76]	314	0.615	0.559	0.585	0.620	0.680	0.649
	SPICi [119]	130	0.815	0.355	0.495	0.542	0.643	0.590
	CMC [79]	225	0.800	0.333	0.471	0.550	0.579	0.565
	CORE [122]	384	0.539	0.566	0.552	0.622	0.658	0.640
	PEWCC [30]	589	0.788	0.306	0.441	0.507	0.586	0.545
	ProRank+ [57]	659	0.717	0.363	0.482	0.577	0.485	0.529
GA	MCL [127, 128]	272	0.640	0.547	0.589	0.677	0.641	0.659
	RRW [133]	278	0.540	0.441	0.485	0.494	0.694	0.586
	NWE [134]	527	0.507	0.630	0.562	0.589	0.708	0.645

Category^a	Method^b^, ^c	# predicted complexes	Precision	Recall	F-measure	Sn	PPV	ACC
(a) DIP dataset
LQ	ClusterONE [70]	361	0.421	0.353	0.384	0.398	0.664	0.514
	Core&Peel [49]	788	0.456	0.466	0.461	0.461	0.553	0.505
	NCMine [51]	1520	0.318	0.510	0.392	0.517	0.397	0.453
GQ	RNSC [109]	238	0.555	0.382	0.453	0.406	0.666	0.520
	RSGNM [114]	560	0.332	0.480	0.393	0.538	0.554	0.546
	SGNMF [41]	686	0.310	0.566	0.401	0.594	0.580	0.587
LA	MCODE [115]	137	0.628	0.252	0.360	0.306	0.632	0.440
	CFinder [117]	133	0.632	0.245	0.353	0.482	0.376	0.425
	DPClus [76]	322	0.494	0.426	0.458	0.467	0.580	0.521
	SPICi [119]	325	0.360	0.353	0.356	0.499	0.523	0.511
	CMC [79]	560	0.527	0.517	0.522	0.591	0.515	0.552
	CORE [122]	1644	0.155	0.630	0.249	0.597	0.497	0.545
	PEWCC [30]	1282	0.500	0.574	0.534	0.595	0.493	0.542
	ProRank+ [57]	433	0.270	0.262	0.266	0.355	0.482	0.414
GA	MCL [127, 128]	1570	0.199	0.708	0.311	0.516	0.705	0.603
	RRW [133]	384	0.466	0.424	0.444	0.460	0.492	0.476
	NWE [134]	365	0.493	0.421	0.455	0.435	0.594	0.508

(b) Collins dataset
LQ	ClusterONE [70]	323	0.625	0.569	0.596	0.658	0.625	0.642
	Core&Peel [49]	407	0.526	0.370	0.434	0.509	0.600	0.553
	NCMine [51]	822	0.691	0.385	0.494	0.568	0.557	0.563
GQ	RNSC [109]	352	0.565	0.564	0.565	0.602	0.732	0.664
	RSGNM [114]	283	0.555	0.468	0.508	0.579	0.602	0.590
	SGNMF [41]	290	0.666	0.485	0.561	0.564	0.633	0.598
LA	MCODE [115]	142	0.746	0.331	0.459	0.586	0.515	0.549
	CFinder [117]	114	0.789	0.272	0.405	0.569	0.492	0.529
	DPClus [76]	314	0.615	0.559	0.585	0.620	0.680	0.649
	SPICi [119]	130	0.815	0.355	0.495	0.542	0.643	0.590
	CMC [79]	225	0.800	0.333	0.471	0.550	0.579	0.565
	CORE [122]	384	0.539	0.566	0.552	0.622	0.658	0.640
	PEWCC [30]	589	0.788	0.306	0.441	0.507	0.586	0.545
	ProRank+ [57]	659	0.717	0.363	0.482	0.577	0.485	0.529
GA	MCL [127, 128]	272	0.640	0.547	0.589	0.677	0.641	0.659
	RRW [133]	278	0.540	0.441	0.485	0.494	0.694	0.586
	NWE [134]	527	0.507	0.630	0.562	0.589	0.708	0.645

^aSee the footnote of Table 3.

^bAll methods are rerun by us. The predicted result of each method is obtained by tuning its parameters to maximize F-measure, and the optimized parameter values are given in Table S1 in Supporting Information S1.

^cThe best performance is highlighted with bold font, the performance ranked second is highlighted with underline and the performance ranked third is highlighted with italic in each column.

Performance comparison

We compare the performance of 17 state-of-the-art computational methods: ClusterONE [70], Core&Peel [49] and NCmine [51] are local-cluster-quality-based methods; RNSC [109], RSGNM [114] and SGNMF [41] are global-cluster-quality-based methods; MCODE [115], CFinder [117], DPClus [76], SPICi [119], CMC [79], CORE [122], PEWCC [30] and ProRank+ [57] are local-node-affinity-based methods and MCL [127, 128], RRW [133] and NWE [134] are global-node-affinity methods. The reasons for selecting these methods are as follows: (1) they achieved the best performance in this field and (2) the implementations of these methods are available. These methods were evaluated on two widely used benchmark PPI datasets, including the DIP dataset [58] and the Collins dataset [138]. The DIP dataset is constructed based on the latest version of DIP database [58], containing 22 763 PPIs among 5015 yeast proteins. The Collins dataset [138] contains 9074 high-confidence PPIs among 1622 yeast proteins. We rerun the 17 methods (Table 3) on the two benchmark datasets. It is worth noting that the previous reviews [21–25] only evaluated the predictors published in 2010 and before. In this study, 7 of the 17 evaluated methods are proposed after 2010.

Table 5

Performance comparison of 17 methods on for identifying small and large protein complexes using CYC2008 as reference protein complex set

Category^a	Method^b,c	Small complexes				Large complexes
Category^a	Method^b,c	# predicted small complexes	Precision	Recall	F-measure	# Predicted large complexes	Precision	Recall	F-measure
(a) DIP dataset
LQ	ClusterONE [70]	142	0.070	0.039	0.050	219	0.425	0.537	0.474
	Core&Peel [49]	305	0.151	0.170	0.160	483	0.499	0.564	0.529
	NCMine [51]	317	0.129	0.131	0.130	1203	0.260	0.651	0.372
GQ	RNSC [109]	102	0.265	0.116	0.161	136	0.471	0.523	0.496
	RSGNM [114]	131	0.153	0.077	0.103	429	0.221	0.617	0.326
	SGNMF [41]	211	0.094	0.081	0.087	475	0.227	0.725	0.346
LA	MCODE [115]	59	0.322	0.081	0.130	78	0.603	0.383	0.468
	CFinder [117]	0	–	–	–	133	0.368	0.396	0.382
	DPClus [76]	97	0.237	0.089	0.129	225	0.378	0.611	0.467
	SPICi [119]	22	0.364	0.031	0.057	303	0.221	0.544	0.314
	CMC [79]	75	0.253	0.073	0.114	485	0.396	0.711	0.509
	CORE [122]	1292	0.086	0.405	0.142	352	0.224	0.597	0.326
	PEWCC [30]	16	0.063	0.004	0.007	1266	0.352	0.745	0.478
	ProRank+ [57]	259	0.085	0.085	0.085	174	0.339	0.309	0.323
GA	MCL [127, 128]	1214	0.125	0.502	0.200	356	0.199	0.530	0.290
	RRW [133]	0	–	–	–	384	0.294	0.658	0.407
	NWE [134]	218	0.206	0.174	0.189	147	0.503	0.490	0.497

(b) Collins dataset
LQ	ClusterONE [70]	173	0.462	0.336	0.389	150	0.527	0.631	0.574
	Core&Peel [49]	62	0.290	0.066	0.107	345	0.441	0.644	0.523
	NCMine [51]	0	–	–	–	822	0.617	0.691	0.652
GQ	RNSC [109]	256	0.379	0.405	0.392	96	0.740	0.584	0.653
	RSGNM [114]	40	0.275	0.058	0.096	243	0.412	0.698	0.518
	SGNMF [41]	65	0.246	0.073	0.113	225	0.507	0.678	0.580
LA	MCODE [115]	17	0.412	0.027	0.051	125	0.632	0.604	0.618
	CFinder [117]	43	0.349	0.062	0.105	71	0.746	0.450	0.561
	DPClus [76]	176	0.455	0.335	0.386	138	0.572	0.644	0.606
	SPICi [119]	28	0.536	0.073	0.129	102	0.676	0.597	0.634
	CMC [79]	26	0.385	0.037	0.070	199	0.729	0.611	0.665
	CORE [122]	265	0.366	0.413	0.388	119	0.622	0.604	0.612
	PEWCC [30]	0	–	–	–	589	0.732	0.638	0.681
	ProRank+ [57]	93	0.280	0.097	0.144	566	0.712	0.651	0.680
GA	MCL [127, 128]	164	0.476	0.328	0.388	108	0.639	0.624	0.631
	RRW [133]	161	0.242	0.178	0.205	117	0.624	0.591	0.607
	NWE [134]	440	0.327	0.541	0.408	87	0.770	0.523	0.623

Category^a	Method^b,c	Small complexes				Large complexes
Category^a	Method^b,c	# predicted small complexes	Precision	Recall	F-measure	# Predicted large complexes	Precision	Recall	F-measure
(a) DIP dataset
LQ	ClusterONE [70]	142	0.070	0.039	0.050	219	0.425	0.537	0.474
	Core&Peel [49]	305	0.151	0.170	0.160	483	0.499	0.564	0.529
	NCMine [51]	317	0.129	0.131	0.130	1203	0.260	0.651	0.372
GQ	RNSC [109]	102	0.265	0.116	0.161	136	0.471	0.523	0.496
	RSGNM [114]	131	0.153	0.077	0.103	429	0.221	0.617	0.326
	SGNMF [41]	211	0.094	0.081	0.087	475	0.227	0.725	0.346
LA	MCODE [115]	59	0.322	0.081	0.130	78	0.603	0.383	0.468
	CFinder [117]	0	–	–	–	133	0.368	0.396	0.382
	DPClus [76]	97	0.237	0.089	0.129	225	0.378	0.611	0.467
	SPICi [119]	22	0.364	0.031	0.057	303	0.221	0.544	0.314
	CMC [79]	75	0.253	0.073	0.114	485	0.396	0.711	0.509
	CORE [122]	1292	0.086	0.405	0.142	352	0.224	0.597	0.326
	PEWCC [30]	16	0.063	0.004	0.007	1266	0.352	0.745	0.478
	ProRank+ [57]	259	0.085	0.085	0.085	174	0.339	0.309	0.323
GA	MCL [127, 128]	1214	0.125	0.502	0.200	356	0.199	0.530	0.290
	RRW [133]	0	–	–	–	384	0.294	0.658	0.407
	NWE [134]	218	0.206	0.174	0.189	147	0.503	0.490	0.497

(b) Collins dataset
LQ	ClusterONE [70]	173	0.462	0.336	0.389	150	0.527	0.631	0.574
	Core&Peel [49]	62	0.290	0.066	0.107	345	0.441	0.644	0.523
	NCMine [51]	0	–	–	–	822	0.617	0.691	0.652
GQ	RNSC [109]	256	0.379	0.405	0.392	96	0.740	0.584	0.653
	RSGNM [114]	40	0.275	0.058	0.096	243	0.412	0.698	0.518
	SGNMF [41]	65	0.246	0.073	0.113	225	0.507	0.678	0.580
LA	MCODE [115]	17	0.412	0.027	0.051	125	0.632	0.604	0.618
	CFinder [117]	43	0.349	0.062	0.105	71	0.746	0.450	0.561
	DPClus [76]	176	0.455	0.335	0.386	138	0.572	0.644	0.606
	SPICi [119]	28	0.536	0.073	0.129	102	0.676	0.597	0.634
	CMC [79]	26	0.385	0.037	0.070	199	0.729	0.611	0.665
	CORE [122]	265	0.366	0.413	0.388	119	0.622	0.604	0.612
	PEWCC [30]	0	–	–	–	589	0.732	0.638	0.681
	ProRank+ [57]	93	0.280	0.097	0.144	566	0.712	0.651	0.680
GA	MCL [127, 128]	164	0.476	0.328	0.388	108	0.639	0.624	0.631
	RRW [133]	161	0.242	0.178	0.205	117	0.624	0.591	0.607
	NWE [134]	440	0.327	0.541	0.408	87	0.770	0.523	0.623

^aSee the footnote of Table 3.

^bSee the footnote of Table 4.

^cSee the footnote of Table 4.

Table 5

Performance comparison of 17 methods on for identifying small and large protein complexes using CYC2008 as reference protein complex set

Category^a	Method^b,c	Small complexes				Large complexes
Category^a	Method^b,c	# predicted small complexes	Precision	Recall	F-measure	# Predicted large complexes	Precision	Recall	F-measure
(a) DIP dataset
LQ	ClusterONE [70]	142	0.070	0.039	0.050	219	0.425	0.537	0.474
	Core&Peel [49]	305	0.151	0.170	0.160	483	0.499	0.564	0.529
	NCMine [51]	317	0.129	0.131	0.130	1203	0.260	0.651	0.372
GQ	RNSC [109]	102	0.265	0.116	0.161	136	0.471	0.523	0.496
	RSGNM [114]	131	0.153	0.077	0.103	429	0.221	0.617	0.326
	SGNMF [41]	211	0.094	0.081	0.087	475	0.227	0.725	0.346
LA	MCODE [115]	59	0.322	0.081	0.130	78	0.603	0.383	0.468
	CFinder [117]	0	–	–	–	133	0.368	0.396	0.382
	DPClus [76]	97	0.237	0.089	0.129	225	0.378	0.611	0.467
	SPICi [119]	22	0.364	0.031	0.057	303	0.221	0.544	0.314
	CMC [79]	75	0.253	0.073	0.114	485	0.396	0.711	0.509
	CORE [122]	1292	0.086	0.405	0.142	352	0.224	0.597	0.326
	PEWCC [30]	16	0.063	0.004	0.007	1266	0.352	0.745	0.478
	ProRank+ [57]	259	0.085	0.085	0.085	174	0.339	0.309	0.323
GA	MCL [127, 128]	1214	0.125	0.502	0.200	356	0.199	0.530	0.290
	RRW [133]	0	–	–	–	384	0.294	0.658	0.407
	NWE [134]	218	0.206	0.174	0.189	147	0.503	0.490	0.497

(b) Collins dataset
LQ	ClusterONE [70]	173	0.462	0.336	0.389	150	0.527	0.631	0.574
	Core&Peel [49]	62	0.290	0.066	0.107	345	0.441	0.644	0.523
	NCMine [51]	0	–	–	–	822	0.617	0.691	0.652
GQ	RNSC [109]	256	0.379	0.405	0.392	96	0.740	0.584	0.653
	RSGNM [114]	40	0.275	0.058	0.096	243	0.412	0.698	0.518
	SGNMF [41]	65	0.246	0.073	0.113	225	0.507	0.678	0.580
LA	MCODE [115]	17	0.412	0.027	0.051	125	0.632	0.604	0.618
	CFinder [117]	43	0.349	0.062	0.105	71	0.746	0.450	0.561
	DPClus [76]	176	0.455	0.335	0.386	138	0.572	0.644	0.606
	SPICi [119]	28	0.536	0.073	0.129	102	0.676	0.597	0.634
	CMC [79]	26	0.385	0.037	0.070	199	0.729	0.611	0.665
	CORE [122]	265	0.366	0.413	0.388	119	0.622	0.604	0.612
	PEWCC [30]	0	–	–	–	589	0.732	0.638	0.681
	ProRank+ [57]	93	0.280	0.097	0.144	566	0.712	0.651	0.680
GA	MCL [127, 128]	164	0.476	0.328	0.388	108	0.639	0.624	0.631
	RRW [133]	161	0.242	0.178	0.205	117	0.624	0.591	0.607
	NWE [134]	440	0.327	0.541	0.408	87	0.770	0.523	0.623

Category^a	Method^b,c	Small complexes				Large complexes
Category^a	Method^b,c	# predicted small complexes	Precision	Recall	F-measure	# Predicted large complexes	Precision	Recall	F-measure
(a) DIP dataset
LQ	ClusterONE [70]	142	0.070	0.039	0.050	219	0.425	0.537	0.474
	Core&Peel [49]	305	0.151	0.170	0.160	483	0.499	0.564	0.529
	NCMine [51]	317	0.129	0.131	0.130	1203	0.260	0.651	0.372
GQ	RNSC [109]	102	0.265	0.116	0.161	136	0.471	0.523	0.496
	RSGNM [114]	131	0.153	0.077	0.103	429	0.221	0.617	0.326
	SGNMF [41]	211	0.094	0.081	0.087	475	0.227	0.725	0.346
LA	MCODE [115]	59	0.322	0.081	0.130	78	0.603	0.383	0.468
	CFinder [117]	0	–	–	–	133	0.368	0.396	0.382
	DPClus [76]	97	0.237	0.089	0.129	225	0.378	0.611	0.467
	SPICi [119]	22	0.364	0.031	0.057	303	0.221	0.544	0.314
	CMC [79]	75	0.253	0.073	0.114	485	0.396	0.711	0.509
	CORE [122]	1292	0.086	0.405	0.142	352	0.224	0.597	0.326
	PEWCC [30]	16	0.063	0.004	0.007	1266	0.352	0.745	0.478
	ProRank+ [57]	259	0.085	0.085	0.085	174	0.339	0.309	0.323
GA	MCL [127, 128]	1214	0.125	0.502	0.200	356	0.199	0.530	0.290
	RRW [133]	0	–	–	–	384	0.294	0.658	0.407
	NWE [134]	218	0.206	0.174	0.189	147	0.503	0.490	0.497

(b) Collins dataset
LQ	ClusterONE [70]	173	0.462	0.336	0.389	150	0.527	0.631	0.574
	Core&Peel [49]	62	0.290	0.066	0.107	345	0.441	0.644	0.523
	NCMine [51]	0	–	–	–	822	0.617	0.691	0.652
GQ	RNSC [109]	256	0.379	0.405	0.392	96	0.740	0.584	0.653
	RSGNM [114]	40	0.275	0.058	0.096	243	0.412	0.698	0.518
	SGNMF [41]	65	0.246	0.073	0.113	225	0.507	0.678	0.580
LA	MCODE [115]	17	0.412	0.027	0.051	125	0.632	0.604	0.618
	CFinder [117]	43	0.349	0.062	0.105	71	0.746	0.450	0.561
	DPClus [76]	176	0.455	0.335	0.386	138	0.572	0.644	0.606
	SPICi [119]	28	0.536	0.073	0.129	102	0.676	0.597	0.634
	CMC [79]	26	0.385	0.037	0.070	199	0.729	0.611	0.665
	CORE [122]	265	0.366	0.413	0.388	119	0.622	0.604	0.612
	PEWCC [30]	0	–	–	–	589	0.732	0.638	0.681
	ProRank+ [57]	93	0.280	0.097	0.144	566	0.712	0.651	0.680
GA	MCL [127, 128]	164	0.476	0.328	0.388	108	0.639	0.624	0.631
	RRW [133]	161	0.242	0.178	0.205	117	0.624	0.591	0.607
	NWE [134]	440	0.327	0.541	0.408	87	0.770	0.523	0.623

^aSee the footnote of Table 3.

^bSee the footnote of Table 4.

^cSee the footnote of Table 4.

Tables 4 shows the overall performance of the 17 methods in terms of Precision, Recall, F-measure, Sn, PPV and Acc on the two benchmark datasets, respectively. In these experiments, the CYC2008 database [65] is used as the reference protein complexes set, and the match threshold |$\upomega$| is set as 0.2 by following those studies [21, 53, 115]. We mainly focus on Precision, Recall and F-measure because Sn, PPV and Acc have severe limitations [21]. The parameters of each method are optimized based on the F-measure, and the optimized values are given in Table S1 in Supporting Information S1. From these tables, we observe the following phenomena. (i) The performance of most methods on the Collins dataset is better than that on the DIP dataset. The reason is that the DIP dataset contains many noise PPIs. In particular, the high reliable interaction subset of the DIP dataset only contains 5470 PPIs, indicating that more than 75% PPIs in the DIP dataset may be false-positive noise. The performance of CMC [79] and PEWCC [30] is better on the DIP dataset than that on the Collins dataset because they change the PPI network by filtering low reliable edges and introduce new edges between nodes. (ii) All methods show unstable performance on different datasets, indicating they heavily rely on PPI networks. For example, ClusterONE [70] achieves the best F-measure on the Collins dataset, while its performance on the DIP dataset is very poor. The reason is that the DIP dataset lacks PPI weights and contains a large amount of false-positive PPIs. (iii) Local-node-affinity-based methods show state-of-the-art Precision and F-measure, while global-node-affinity-based methods achieve the best Recall.

Table 6

Performance comparison of 17 computational methods in terms of PSC in biological process, molecular function and cellular component

Category^a	Method^b,c	# predicted complexes	PSC
Category^a	Method^b,c	# predicted complexes	BP^d	MF^d	CC^d
(a) DIP dataset
LQ	ClusterONE [70]	361	0.568	0.396	0.482
	Core&Peel [49]	788	0.708	0.528	0.647
	NCMine [51]	1520	0.634	0.502	0.568
GQ	RNSC [109]	238	0.744	0.504	0.580
	RSGNM [114]	560	0.575	0.409	0.468
	SGNMF [41]	686	0.466	0.376	0.391
LA	MCODE [115]	137	0.803	0.518	0.679
	CFinder [117]	133	0.820	0.647	0.729
	DPClus [76]	322	0.733	0.571	0.590
	SPICi [119]	325	0.545	0.443	0.452
	CMC [79]	560	0.857	0.720	0.752
	CORE [122]	1644	0.159	0.119	0.128
	PEWCC [30]	1282	0.800	0.665	0.720
	ProRank+ [57]	433	0.374	0.273	0.344
GA	MCL [127, 128]	1570	0.146	0.102	0.118
	RRW [133]	384	0.706	0.534	0.609
	NWE [134]	365	0.658	0.416	0.515

(b) Collins dataset
LQ	ClusterONE [70]	323	0.526	0.402	0.427
	Core&Peel [49]	407	0.835	0.715	0.789
	NCMine [51]	822	0.940	0.852	0.895
GQ	RNSC [109]	352	0.386	0.259	0.324
	RSGNM [114]	283	0.777	0.664	0.731
	SGNMF [41]	290	0.786	0.610	0.641
LS	MCODE [115]	142	0.901	0.676	0.796
	CFinder [117]	114	0.842	0.544	0.702
	DPClus [76]	314	0.506	0.379	0.427
	SPICi [119]	130	0.923	0.715	0.777
	CMC [79]	225	0.942	0.787	0.858
	CORE [122]	384	0.372	0.263	0.315
	PEWCC [30]	589	0.969	0.864	0.910
	ProRank+ [57]	659	0.910	0.787	0.863
GS	MCL [127, 128]	272	0.772	0.581	0.566
	RRW [133]	278	0.629	0.489	0.590
	NWE [134]	527	0.245	0.167	0.201

Category^a	Method^b,c	# predicted complexes	PSC
Category^a	Method^b,c	# predicted complexes	BP^d	MF^d	CC^d
(a) DIP dataset
LQ	ClusterONE [70]	361	0.568	0.396	0.482
	Core&Peel [49]	788	0.708	0.528	0.647
	NCMine [51]	1520	0.634	0.502	0.568
GQ	RNSC [109]	238	0.744	0.504	0.580
	RSGNM [114]	560	0.575	0.409	0.468
	SGNMF [41]	686	0.466	0.376	0.391
LA	MCODE [115]	137	0.803	0.518	0.679
	CFinder [117]	133	0.820	0.647	0.729
	DPClus [76]	322	0.733	0.571	0.590
	SPICi [119]	325	0.545	0.443	0.452
	CMC [79]	560	0.857	0.720	0.752
	CORE [122]	1644	0.159	0.119	0.128
	PEWCC [30]	1282	0.800	0.665	0.720
	ProRank+ [57]	433	0.374	0.273	0.344
GA	MCL [127, 128]	1570	0.146	0.102	0.118
	RRW [133]	384	0.706	0.534	0.609
	NWE [134]	365	0.658	0.416	0.515

(b) Collins dataset
LQ	ClusterONE [70]	323	0.526	0.402	0.427
	Core&Peel [49]	407	0.835	0.715	0.789
	NCMine [51]	822	0.940	0.852	0.895
GQ	RNSC [109]	352	0.386	0.259	0.324
	RSGNM [114]	283	0.777	0.664	0.731
	SGNMF [41]	290	0.786	0.610	0.641
LS	MCODE [115]	142	0.901	0.676	0.796
	CFinder [117]	114	0.842	0.544	0.702
	DPClus [76]	314	0.506	0.379	0.427
	SPICi [119]	130	0.923	0.715	0.777
	CMC [79]	225	0.942	0.787	0.858
	CORE [122]	384	0.372	0.263	0.315
	PEWCC [30]	589	0.969	0.864	0.910
	ProRank+ [57]	659	0.910	0.787	0.863
GS	MCL [127, 128]	272	0.772	0.581	0.566
	RRW [133]	278	0.629	0.489	0.590
	NWE [134]	527	0.245	0.167	0.201

^aSee the footnote of Table 3.

^bSee the footnote of Table 4.

^cSee the footnote of Table 4.

^d‘BP’ represents biological process, ‘MF’ represents molecular function and ‘CC’ represents cellular component.

Table 6

Performance comparison of 17 computational methods in terms of PSC in biological process, molecular function and cellular component

Category^a	Method^b,c	# predicted complexes	PSC
Category^a	Method^b,c	# predicted complexes	BP^d	MF^d	CC^d
(a) DIP dataset
LQ	ClusterONE [70]	361	0.568	0.396	0.482
	Core&Peel [49]	788	0.708	0.528	0.647
	NCMine [51]	1520	0.634	0.502	0.568
GQ	RNSC [109]	238	0.744	0.504	0.580
	RSGNM [114]	560	0.575	0.409	0.468
	SGNMF [41]	686	0.466	0.376	0.391
LA	MCODE [115]	137	0.803	0.518	0.679
	CFinder [117]	133	0.820	0.647	0.729
	DPClus [76]	322	0.733	0.571	0.590
	SPICi [119]	325	0.545	0.443	0.452
	CMC [79]	560	0.857	0.720	0.752
	CORE [122]	1644	0.159	0.119	0.128
	PEWCC [30]	1282	0.800	0.665	0.720
	ProRank+ [57]	433	0.374	0.273	0.344
GA	MCL [127, 128]	1570	0.146	0.102	0.118
	RRW [133]	384	0.706	0.534	0.609
	NWE [134]	365	0.658	0.416	0.515

(b) Collins dataset
LQ	ClusterONE [70]	323	0.526	0.402	0.427
	Core&Peel [49]	407	0.835	0.715	0.789
	NCMine [51]	822	0.940	0.852	0.895
GQ	RNSC [109]	352	0.386	0.259	0.324
	RSGNM [114]	283	0.777	0.664	0.731
	SGNMF [41]	290	0.786	0.610	0.641
LS	MCODE [115]	142	0.901	0.676	0.796
	CFinder [117]	114	0.842	0.544	0.702
	DPClus [76]	314	0.506	0.379	0.427
	SPICi [119]	130	0.923	0.715	0.777
	CMC [79]	225	0.942	0.787	0.858
	CORE [122]	384	0.372	0.263	0.315
	PEWCC [30]	589	0.969	0.864	0.910
	ProRank+ [57]	659	0.910	0.787	0.863
GS	MCL [127, 128]	272	0.772	0.581	0.566
	RRW [133]	278	0.629	0.489	0.590
	NWE [134]	527	0.245	0.167	0.201

Category^a	Method^b,c	# predicted complexes	PSC
Category^a	Method^b,c	# predicted complexes	BP^d	MF^d	CC^d
(a) DIP dataset
LQ	ClusterONE [70]	361	0.568	0.396	0.482
	Core&Peel [49]	788	0.708	0.528	0.647
	NCMine [51]	1520	0.634	0.502	0.568
GQ	RNSC [109]	238	0.744	0.504	0.580
	RSGNM [114]	560	0.575	0.409	0.468
	SGNMF [41]	686	0.466	0.376	0.391
LA	MCODE [115]	137	0.803	0.518	0.679
	CFinder [117]	133	0.820	0.647	0.729
	DPClus [76]	322	0.733	0.571	0.590
	SPICi [119]	325	0.545	0.443	0.452
	CMC [79]	560	0.857	0.720	0.752
	CORE [122]	1644	0.159	0.119	0.128
	PEWCC [30]	1282	0.800	0.665	0.720
	ProRank+ [57]	433	0.374	0.273	0.344
GA	MCL [127, 128]	1570	0.146	0.102	0.118
	RRW [133]	384	0.706	0.534	0.609
	NWE [134]	365	0.658	0.416	0.515

(b) Collins dataset
LQ	ClusterONE [70]	323	0.526	0.402	0.427
	Core&Peel [49]	407	0.835	0.715	0.789
	NCMine [51]	822	0.940	0.852	0.895
GQ	RNSC [109]	352	0.386	0.259	0.324
	RSGNM [114]	283	0.777	0.664	0.731
	SGNMF [41]	290	0.786	0.610	0.641
LS	MCODE [115]	142	0.901	0.676	0.796
	CFinder [117]	114	0.842	0.544	0.702
	DPClus [76]	314	0.506	0.379	0.427
	SPICi [119]	130	0.923	0.715	0.777
	CMC [79]	225	0.942	0.787	0.858
	CORE [122]	384	0.372	0.263	0.315
	PEWCC [30]	589	0.969	0.864	0.910
	ProRank+ [57]	659	0.910	0.787	0.863
GS	MCL [127, 128]	272	0.772	0.581	0.566
	RRW [133]	278	0.629	0.489	0.590
	NWE [134]	527	0.245	0.167	0.201

^aSee the footnote of Table 3.

^bSee the footnote of Table 4.

^cSee the footnote of Table 4.

^d‘BP’ represents biological process, ‘MF’ represents molecular function and ‘CC’ represents cellular component.

Both small and large protein complexes are ubiquitous in cells [55]. Here, those complexes composed of two or three proteins are referred as small complexes, and complexes containing more than three proteins are referred as large complexes. In CYC2008, the numbers of small and large protein complexes are 259 and 149, respectively. In order to further evaluate different methods, Table 5 gives the Precision, Recall and F-measure values of each method on small and large protein complexes. From Table 5, it is obvious that small complexes remain difficult to identify. On the Collins dataset, the best F-measure of small complexes is only 0.408, while that of the large complexes is 0.681. Furthermore, global-node-affinity-based methods outperform other methods on identifying small complexes. For identifying large complexes, both local-cluster-affinity-based and local-node-affinity-based methods achieve the state-of-the-art performance. In particular, NEW is the best tool for identifying small complexes, and CMC is the best tool for identifying large complexes. Based on the observations from Tables 4 and 5, we conclude that each method has its own advantages, and integration of different methods will further improve the performance of protein complex identification. In this regard, several ensemble clustering methods such as TINCD [48] and EnsemHC [54] have shown better performance than base methods in terms of Recall and Acc.

Tables 6 shows the overall performance of the 17 methods in terms of PSC on the two benchmark datasets. The P-values with Bonferroni correction for predicted protein complexes are calculated by using the stand-alone tool GO::TermFinder [142], and the significant threshold is set as 0.01 by following this study [21]. In general, the performance for most methods on the Collins dataset is also better than that on the DIP dataset. Methods like ClusterONE [70], RNSC [109] and NWE [134] decrease their minimum size cutoff of protein complexes to 2 on the Collins dataset, leading to lower PSC than that on the DIP dataset because protein complexes with the size of 2 have high P-values [85]. Furthermore, some local-cluster-quality-based methods and local-node-affinity-based methods, such as NCMine [51], MCODE [115], CFinder [117] and PEWCC [30], identify dense and large regions in PPI network as complexes; therefore, they show high PSC.

Furthermore, we give a performance comparison of different predictors on the dynamic PPI networks constructed from the DIP dataset, and the results are shown in Table 7. Compared with Table 4, it is obvious that the predictors on dynamic networks outperform those on static networks. In addition, CPredictor2.0 [55] achieves the best F-measure, indicating that protein function information is highly effective for protein complex identification.

Discussion and conclusion

All the aforementioned methods have greatly promoted the development of protein complex identification. However, the following problems should be addressed for further promoting this important field.

PPI datasets

PPI datasets are critical for protein complex identification. There are two main problems of the existing PPI datasets [145]. (1) The available PPIs are incomplete [11]. For example, the estimated number of non-redundant human PPIs is about 650 000 [146], whereas the number of non-redundant human PPIs collected in the latest version of BioGRID database [59] is only 375 840, indicating the coverage is less than 60%. (2) The PPI data sets contain many false-positives [147–149] caused by experiment conditions, the contaminating nucleic acid used in experiments [150] and technical biases [151]. Table 8 shows the false-positive rate of PPIs in the three PPI databases against the Negatome 2.0 [152] database, including DIP, BioGRID and MINT. Negatome 2.0 contains 6532 non-interacting protein pairs reported in literature or derived from protein complex three-dimensional structures. It shows that the highest false-positive PPI rate of the three PPI databases is 4.33%.

Table 7

Performance comparison of methods on different dynamic PPI networks constructed from the DIP data set using CYC2008 as reference set

Network^a^, ^b	# predicted complexes	Identifying method	DIP version	Precision	Recall	F-measure	Sn	PPV	ACC
TC-PINs [143]	2063	MCL [127, 128]	10 October 2010	0.215	0.569	0.312	–	–	–
DPINs [28]	1289	CPM [144]	28 February 2012	0.328	0.507	0.399	–	–	–
Zhang et al. [52]	–	Zhang et al. [52]	–	0.483	0.471	0.477	0.373	0.694	0.509
ST-APIN [46]	1761	MCL [127, 128]	10 October 2010	0.339	0.605	0.435	–	–	–
WDPIN [45]	550	TP-WDPIN [45]	14 January 2016	0.466	0.436	0.451	–	–	–
CPredictor2.0 [55]	636^a	MCL [127, 128]	13 February 2017	0.552	0.542	0.547	0.397	0.711	0.531

Network^a^, ^b	# predicted complexes	Identifying method	DIP version	Precision	Recall	F-measure	Sn	PPV	ACC
TC-PINs [143]	2063	MCL [127, 128]	10 October 2010	0.215	0.569	0.312	–	–	–
DPINs [28]	1289	CPM [144]	28 February 2012	0.328	0.507	0.399	–	–	–
Zhang et al. [52]	–	Zhang et al. [52]	–	0.483	0.471	0.477	0.373	0.694	0.509
ST-APIN [46]	1761	MCL [127, 128]	10 October 2010	0.339	0.605	0.435	–	–	–
WDPIN [45]	550	TP-WDPIN [45]	14 January 2016	0.466	0.436	0.451	–	–	–
CPredictor2.0 [55]	636^a	MCL [127, 128]	13 February 2017	0.552	0.542	0.547	0.397	0.711	0.531

^aThe predicted result of CPredictor2.0 is obtained by reruning the method, while the predicted results on other dynamic PPI networks are adopted from the origin papers and then transformed by us.

^bThe best performance is highlighted with bold font.

Table 7

Performance comparison of methods on different dynamic PPI networks constructed from the DIP data set using CYC2008 as reference set

Network^a^, ^b	# predicted complexes	Identifying method	DIP version	Precision	Recall	F-measure	Sn	PPV	ACC
TC-PINs [143]	2063	MCL [127, 128]	10 October 2010	0.215	0.569	0.312	–	–	–
DPINs [28]	1289	CPM [144]	28 February 2012	0.328	0.507	0.399	–	–	–
Zhang et al. [52]	–	Zhang et al. [52]	–	0.483	0.471	0.477	0.373	0.694	0.509
ST-APIN [46]	1761	MCL [127, 128]	10 October 2010	0.339	0.605	0.435	–	–	–
WDPIN [45]	550	TP-WDPIN [45]	14 January 2016	0.466	0.436	0.451	–	–	–
CPredictor2.0 [55]	636^a	MCL [127, 128]	13 February 2017	0.552	0.542	0.547	0.397	0.711	0.531

Network^a^, ^b	# predicted complexes	Identifying method	DIP version	Precision	Recall	F-measure	Sn	PPV	ACC
TC-PINs [143]	2063	MCL [127, 128]	10 October 2010	0.215	0.569	0.312	–	–	–
DPINs [28]	1289	CPM [144]	28 February 2012	0.328	0.507	0.399	–	–	–
Zhang et al. [52]	–	Zhang et al. [52]	–	0.483	0.471	0.477	0.373	0.694	0.509
ST-APIN [46]	1761	MCL [127, 128]	10 October 2010	0.339	0.605	0.435	–	–	–
WDPIN [45]	550	TP-WDPIN [45]	14 January 2016	0.466	0.436	0.451	–	–	–
CPredictor2.0 [55]	636^a	MCL [127, 128]	13 February 2017	0.552	0.542	0.547	0.397	0.711	0.531

^aThe predicted result of CPredictor2.0 is obtained by reruning the method, while the predicted results on other dynamic PPI networks are adopted from the origin papers and then transformed by us.

^bThe best performance is highlighted with bold font.

Table 8

False positive PPI rate of different database against the Negatome 2.0 data set

Database	# overlapping proteins	# overlapping protein–PPIs^a	No. of false-positive PPIs	False-positive PPI rate
DIP [58]	1066	4529	75	01.65%
BioGRID [59]	1944	18351	794	04.33%
MINT [61]	1164	15721	254	01.61%

Database	# overlapping proteins	# overlapping protein–PPIs^a	No. of false-positive PPIs	False-positive PPI rate
DIP [58]	1066	4529	75	01.65%
BioGRID [59]	1944	18351	794	04.33%
MINT [61]	1164	15721	254	01.61%

^aThese PPIs are nonredundant and consisting of only proteins overlapping with the Negatome 2.0.

Table 8

False positive PPI rate of different database against the Negatome 2.0 data set

Database	# overlapping proteins	# overlapping protein–PPIs^a	No. of false-positive PPIs	False-positive PPI rate
DIP [58]	1066	4529	75	01.65%
BioGRID [59]	1944	18351	794	04.33%
MINT [61]	1164	15721	254	01.61%

Database	# overlapping proteins	# overlapping protein–PPIs^a	No. of false-positive PPIs	False-positive PPI rate
DIP [58]	1066	4529	75	01.65%
BioGRID [59]	1944	18351	794	04.33%
MINT [61]	1164	15721	254	01.61%

^aThese PPIs are nonredundant and consisting of only proteins overlapping with the Negatome 2.0.

In order to obtain complete and reliable PPI datasets, many research groups combine multiple techniques to determine PPIs [153]. This goal also can be achieved by assigning weights to observed interactions based on additional biological data [154], characterizing proteins across cell types and subcellular localizations [155] and aligning PPI networks [156]. Furthermore, PPI network alignment would be benefited from introducing the evolutionary relationship by the protein remote homology detection techniques [157–161]. Link prediction paradigms [162–164], which have been successfully applied in various areas [165, 166], are also potential methods for reconstructing PPI networks.

Dynamic PPI networks

Dynamic PPI networks are the key for further improving the performance of protein complex identification. Since protein function information is highly helpful for identifying protein complexes, new dynamic PPI networks should be constructed based on this kind of information. Recently, researchers point out that a PPI may have different scores in different conditions [167, 168]. Therefore, dynamic PPI networks reflecting this property would be very helpful for promoting protein complex identification. In this regard, Zhang et al. [52] assign dynamic weights to PPIs based on the probabilities that proteins expressing at different time points.

Protein complex identification methods

Most protein complex identification methods tend to identify dense and large regions in PPI networks as protein complexes. However, a large proportion of real protein complexes are small and sparse [55, 124]. In order to further improve the performance, new predictors should be designed to accurately identify not only dense and large protein complexes but also small and sparse ones. To achieve this goal, new predictors can be constructed based on the PPI weights that fuse more kinds of biological information, such as protein sequences [169, 170], disorders regions [171, 172], etc. In addition, the ensemble clustering methods are a new trend to construct more accurate meta-predictors. When developing an accurate meta-predictors, the following aspects should be considered. (1) Many protein complexes can only be predicted by some specific methods. Therefore, the meta-predictors should give these complexes high confidence scores to correctly predict these complexes. (2) Some false complexes would be predicted by multiple base predictors. The meta-predictors should correct these false positives by removing the false proteins from complexes. By deriving protein co-cluster strength from base clustering results [48, 54, 135–137] and by further integrating protein co-complex scores derived from TAP-MS experiments [48, 54], the existing meta-predictors not only reduce the impact of false-positive PPI noise but also add some missing PPI information. Therefore, they achieve better performance than base methods. However, most meta-predictors do not have a scoring system that assigns confidence scores to clusters [48, 54, 136, 137], or the scoring system only considers network topology information [135]. Therefore, they would miss many real protein complexes that can only be predicted by a single base method. Furthermore, all these meta-predictors integrate the predicted results of different base methods in an unsupervised manner. Therefore, false complexes predicted by multiple base methods will still be incorrectly predicted by these meta-predictors. More specifically, these meta-predictors cannot remove false-positive proteins predicted by multiple base methods from predicted complexes. In order to develop more accurate meta-predictors, biological relevant protein complex features should be combined with PPI network-based protein complex features in a supervised manner.

Another problem is that different methods heavily depend on the PPI networks. Since most PPI data sets are incomplete and unreliable, the performance of different methods is limited. In this regard, generative network models [37, 47, 114] are potential solutions, and their performance would be further improved by integrating more information [37, 47]. Furthermore, they can output sparse complexes, even those consist of proteins that are not connected in PPI networks. It is also feasible to assign scores to protein pairs and improve the quality of networks by removing low-reliability edges and introducing edges to protein pairs with high affinity score [30, 79, 88]. Supervised machine learning models, such as Bayesian network models [108], also can be used for identifying protein complexes. However, most of the protein complex features are network topological features, preventing the performance improvement of these models. Therefore, biologically relevant protein complex features should be explored to improve the performance of the supervised models.

Key Points

With the availability of genomic-scale PPI data for various species, it is important to develop computational methods to accurately and efficiently identify protein complexes from PPI networks.
The databases of PPIs/protein complexes are introduced and discussed.
The three types of PPI networks, including static, dynamic and interolog networks, are introduced, and the strategies of identifying protein complexes from them are described.
The computational methods for identifying protein complexes from PPI networks are introduced, and their advantages and shortcomings are discussed. The existing available web servers and stand-alone tools in this filed are given.
The performance of 17 state-of-the-art computational methods for identifying protein complexes from PPI networks is evaluated and compared on two widely used benchmark datasets. Finally, some challenges and perspectives in this field are discussed.

Acknowledgements

The authors are very much indebted to the three anonymous reviewers, whose constructive comments are very helpful in strengthening the presentation of this article.

Funding

This work was supported by the National Natural Science Foundation of China (61672184, 61732012 and 61822306), the Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063), the Shenzhen Overseas High Level Talents Innovation Foundation (Grant No. KQJSCX20170327161949608), the Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306008) and the Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

Zhourun Wu is a PhD candidate at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. His research areas include bioinformatics and machine learning.

Qing Liao, PhD, is an associate professor at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. Her research areas include bioinformatics and machine learning.

Bin Liu, PhD, is a professor at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China, and School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China. His expertise is in bioinformatics, nature language processing and machine learning.

References

1.

Spirin

V

,

Mirny

LA

.

Protein complexes and functional modules in molecular networks

.

Proc Natl Acad Sci U S A

2003

;

100

:

12123

–

8

.

2.

Zeng

J

,

Zou

Q

,

Wu

Y

, et al.

An empirical study of features fusion techniques for protein–protein interaction prediction

.

Curr Bioinform

2016

;

11

:

4

–

12

.

3.

Alberts

B

.

The cell as a collection of protein machines: preparing the next generation of molecular biologists

.

Cell

1998

;

92

:

291

–

4

.

4.

Hartwell

LH

,

Hopfield

JJ

,

Leibler

S

, et al.

From molecular to modular cell biology

.

Nature

1999

;

402

:

C47

–

C52

.

5.

Gavin

AC

,

Bosche

M

,

Krause

R

, et al.

Functional organization of the yeast proteome by systematic analysis of protein complexes

.

Nature

2002

;

415

:

141

–

7

.

6.

Rigaut

G

,

Shevchenko

A

,

Rutz

B

, et al.

A generic protein purification method for protein complex characterization and proteome exploration

.

Nat Biotechnol

1999

;

17

:

1030

–

2

.

7.

Puig

O

,

Caspary

F

,

Rigaut

G

, et al.

The tandem affinity purification (TAP) method: a general procedure of protein complex purification

.

Methods

2001

;

24

:

218

–

29

.

8.

Gavin

AC

,

Aloy

P

,

Grandi

P

, et al.

Proteome survey reveals modularity of the yeast cell machinery

.

Nature

2006

;

440

:

631

–

6

.

9.

Krogan

NJ

,

Cagney

G

,

Yu

H

, et al.

Global landscape of protein complexes in the yeast Saccharomyces cerevisiae

.

Nature

2006

;

440

:

637

–

43

.

10.

Huttlin

EL

,

Ting

L

,

Bruckner

RJ

, et al.

The BioPlex network: a systematic exploration of the human interactome

.

Cell

2015

;

162

:

425

–

40

.

11.

Huttlin

EL

,

Bruckner

RJ

,

Paulo

JA

, et al.

Architecture of the human interactome defines protein communities and disease networks

.

Nature

2017

;

545

:

505

–

9

.

12.

Hein

MY

,

Hubner

NC

,

Poser

I

, et al.

A human interactome in three quantitative dimensions organized by stoichiometries and abundances

.

Cell

2015

;

163

:

712

–

23

.

13.

Drew

K

,

Lee

C

,

Huizar

RL

, et al.

Integration of over 9,000 mass spectrometry experiments builds a global map of human protein complexes

.

Mol Syst Biol

2017

;

13

:

932

.

14.

Wan

C

,

Borgeson

B

,

Phanse

S

, et al.

Panorama of ancient metazoan macromolecular complexes

.

Nature

2015

;

525

:

339

–

44

.

15.

Young

KH

.

Yeast two-hybrid: so many interactions, (in) so little time…

.

Biol Reprod

1998

;

58

:

302

–

11

.

16.

Ito

T

,

Tashiro

K

,

Muta

S

, et al.

Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins

.

Proc Natl Acad Sci U S A

2000

;

97

:

1143

–

7

.

17.

Uetz

P

,

Giot

L

,

Cagney

G

, et al.

A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae

.

Nature

2000

;

403

:

623

–

7

.

18.

Ito

T

,

Chiba

T

,

Ozawa

R

, et al.

A comprehensive two-hybrid analysis to explore the yeast protein interactome

.

Proc Natl Acad Sci U S A

2001

;

98

:

4569

–

74

.

19.

Schwikowski

B

,

Uetz

P

,

Fields

S

.

A network of protein–protein interactions in yeast

.

Nat Biotechnol

2000

;

18

:

1257

–

61

.

20.

Barabasi

AL

,

Oltvai

ZN

.

Network biology: understanding the cell's functional organization

.

Nat Rev Genet

2004

;

5

:

101

–

113

.

21.

Li

X

,

Wu

M

,

Kwoh

CK

, et al.

Computational approaches for detecting protein complexes from protein interaction networks: a survey

.

BMC Genomics

2010

;

11

(

Suppl 1

):

S3

.

22.

Srihari

S

,

Leong

HW

.

A survey of computational methods for protein complex prediction from protein interaction networks

.

J Bioinform Comput Biol

2013

;

11

:

1230002

.

23.

Chen

B

,

Fan

W

,

Liu

J

, et al.

Identifying protein complexes and functional modules—from static PPI networks to dynamic PPI networks

.

Brief Bioinform

2014

;

15

:

177

–

94

.

24.

Teng

B

,

Zhao

C

,

Liu

X

, et al.

Network inference from AP-MS data: computational challenges and solutions

.

Brief Bioinform

2015

;

16

:

658

–

74

.

25.

Snider

J

,

Kotlyar

M

,

Saraon

P

, et al.

Fundamentals of protein interaction network mapping

.

Mol Syst Biol

2015

;

11

:

848

.

26.

Tang

X

,

Feng

Q

,

Wang

J

, et al.

Clustering based on multiple biological information: approach for predicting protein complexes

.

IET Syst Biol

2013

;

7

:

223

–

30

.

27.

Tatsuke

D

,

Maruyama

O

.

Sampling strategy for protein complex prediction using cluster size frequency

.

Gene

2013

;

518

:

152

–

8

.

28.

Wang

J

,

Peng

X

,

Li

M

, et al.

Construction and application of dynamic protein interaction network based on time course gene expression data

.

Proteomics

2013

;

13

:

301

–

12

.

29.

Widita

CK

,

Maruyama

O

.

PPSampler2: predicting protein complexes more accurately and efficiently by sampling

.

BMC Syst Biol

2013

;

7

(

Suppl 6

):

S14

.

30.

Zaki

N

,

Efimov

D

,

Berengueres

J

.

Protein complex detection using interaction reliability assessment and weighted clustering coefficient

.

BMC Bioinformatics

2013

;

14

:

163

.

31.

Zhang

Y

,

Lin

H

,

Yang

Z

, et al.

Protein complex prediction in large ontology attributed protein–protein interaction networks

.

IEEE/ACM Trans Comput Biol Bioinform

2013

;

10

:

729

–

41

.

32.

Guo

Y

,

Shang

XQ

,

Zhu

QP

, et al. Identification of Protein Complexes and Functional Modules in Integrated PPI Networks. In:

2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Belfast, UK

,

2014

.

33.

Ji

JZ

,

Jiao

L

,

Yang

CC

, et al.

MAE-FMD: multi-agent evolutionary method for functional module detection in protein–protein interaction networks

.

BMC Bioinformatics

2014

;

15

:

325

.

34.

Ou-Yang

L

,

Dai

DQ

,

Li

XL

, et al.

Detecting temporal protein complexes from dynamic protein–protein interaction networks

.

BMC Bioinformatics

2014

;

15

:

335

.

35.

Shen

X

,

Zhao

Y

,

Li

Y

, et al.

An efficient protein complex mining algorithm based on multistage Kernel extension

.

BMC Bioinformatics

2014

;

15

(

Suppl 12

):

S7

.

36.

Wang

Y

,

Qian

X

.

Functional module identification in protein interaction networks by interaction patterns

.

Bioinformatics

2014

;

30

:

81

–

93

.

37.

Zhang

XF

,

Dai

DQ

,

Ou-Yang

L

, et al.

Detecting overlapping protein complexes based on a generative model with functional and topological properties

.

BMC Bioinformatics

2014

;

15

:

186

.

38.

Zhao

B

,

Wang

J

,

Li

M

, et al.

Detecting protein complexes based on uncertain graph model

.

IEEE/ACM Trans Comput Biol Bioinform

2014

;

11

:

486

–

97

.

39.

Cai

B

,

Wang

H

,

Zheng

H

.

Identification of protein complexes from tandem affinity purification/mass spectrometry data via biased random walk

.

IEEE/ACM Trans Comput Biol Bioinform

2015

;

12

:

455

–

66

.

40.

Maruyama

O

,

Wong

L

. Regularizing predicted complexes by mutually exclusive protein–protein interactions. In:

Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (Asonam 2015)

,

2015

,

1068

–

75

.

41.

Ou-Yang

L

,

Dai

DQ

,

Zhang

XF

.

Detecting protein complexes from signed protein-protein interaction networks

.

IEEE/ACM Trans Comput Biol Bioinform

2015

;

12

:

1333

–

44

.

42.

Peng

W

,

Wang

JX

,

Zhao

BH

, et al.

Identification of protein complexes using weighted pagerank-nibble algorithm and core-attachment structure

.

IEEE/ACM Trans Comput Biol Bioinform

2015

;

12

:

179

–

92

.

43.

Zhang

W

,

Zou

X

.

A new method for detecting protein complexes based on the three node cliques

.

IEEE/ACM Trans Comput Biol Bioinform

2015

;

12

:

879

–

86

.

44.

Cao

BW

,

Luo

JW

,

Liang

C

, et al. Detecting overlapping protein complexes in weighted protein–protein interaction networks using pseudo-clique extension based on fuzzy relation. In:

2016 International Joint Conference on Neural Networks (Ijcnn)

.

Shenzhen, China

,

2016

,

1244

–

52

.

45.

Lei

XJ

,

Zhang

YC

,

Wu

FX

, et al. Mining protein complexes based on topology potential from weighted dynamic PPI network. In:

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Shenzhen, China

,

2016

,

735

–

8

.

46.

Meng

XM

,

Li

M

,

Wang

JX

, et al.

Construction of the spatial and temporal active protein interaction network for identifying protein complexes

. In:

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Shenzhen, China

,

2016

, pp.

631

–

636

.

47.

Ou-Yang

L

,

Hong

Y

,

Zhang

XF

. Identifying protein complexes via multi-network clustering. In:

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Shenzhen, China

,

2016

,

645

–

50

.

48.

Ou-Yang

L

,

Wu

M

,

Zhang

XF

, et al.

A two-layer integration framework for protein complex detection

.

BMC Bioinformatics

2016

;

17

:

100

.

49.

Pellegrini

M

,

Baglioni

M

,

Geraci

F

.

Protein complex prediction for large protein–protein interaction networks with the Core&Peel method

.

BMC Bioinformatics

2016

;

17

:

372

.

50.

Ramadan

E

,

Naef

A

,

Ahmed

M

.

Protein complexes predictions within protein interaction networks using genetic algorithms

.

BMC Bioinformatics

2016

;

17

(

Suppl 7

):

269

.

51.

Tadaka

S

,

Kinoshita

K

.

NCMine: core-peripheral based functional module detection using near-clique mining

.

Bioinformatics

2016

;

32

:

3454

–

60

.

52.

Zhang

Y

,

Lin

H

,

Yang

Z

, et al.

Construction of dynamic probabilistic protein interaction networks for protein complex identification

.

BMC Bioinformatics

2016

;

17

:

186

.

53.

Ma

CY

,

Chen

YP

,

Berger

B

, et al.

Identification of protein complexes by integrating multiple alignment of protein interaction networks

.

Bioinformatics

2017

;

33

:

1681

–

8

.

54.

Wu

M

,

Ou-Yang

L

,

Li

XL

.

Protein complex detection via effective integration of base clustering solutions and co-complex affinity scores

.

IEEE/ACM Trans Comput Biol Bioinform

2017

;

14

:

733

–

9

.

55.

Xu

B

,

Wang

Y

,

Wang

ZW

, et al.

An effective approach to detecting both small and large complexes from protein–protein interaction networks

.

BMC Bioinformatics

2017

;

18

(

Suppl 12

):

419

.

56.

Manikandan

P

,

Ramyachitra

D

,

Banupriya

D

.

Detection of overlapping protein complexes in gene expression, phenotype and pathways of Saccharomyces cerevisiae using Prorank based Fuzzy algorithm

.

Gene

2016

;

580

:

144

–

58

.

57.

Hanna

EM

,

Zaki

N

.

Detecting protein complexes in protein interaction networks using a ranking algorithm with a refined merging procedure

.

BMC Bioinformatics

2014

;

15

:

204

.

58.

Xenarios

I

,

Salwínski

L

,

Duan

XJ

, et al.

DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions

.

Nucleic Acids Res

2002

;

30

:

303

–

5

.

59.

Stark

C

,

Breitkreutz

B-J

,

Reguly

T

, et al.

BioGRID: a general repository for interaction data sets

.

Nucleic Acids Res

2006

;

34

:

D535

–

9

.

60.

Keshava Prasad

TS

,

Goel

R

,

Kandasamy

K

, et al.

Human protein reference database—2009 update

.

Nucleic Acids Res

2009

;

37

:

D767

–

72

.

61.

Licata

L

,

Briganti

L

,

Peluso

D

, et al.

MINT, the molecular interaction database: 2012 update

.

Nucleic Acids Res

2012

;

40

:

D857

–

61

.

62.

Szklarczyk

D

,

Gable

AL

,

Lyon

D

, et al.

STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental data sets

.

Nucleic Acids Res

2019

;

47

:

D607

–

13

.

63.

Güldener

U

,

Münsterkötter

M

,

Kastenmüller

G

, et al.

CYGD: the comprehensive yeast genome database

.

Nucleic Acids Res

2005

;

33

:

D364

–

8

.

64.

Giurgiu

M

,

Reinhard

J

,

Brauner

B

, et al.

CORUM: the comprehensive resource of mammalian protein complexes-2019

.

Nucleic Acids Res

2019

;

47

:

D559

–

63

.

65.

Pu

S

,

Wong

J

,

Turner

B

, et al.

Up-to-date catalogues of yeast protein complexes

.

Nucleic Acids Res

2009

;

37

:

825

–

31

.

66.

Krissinel

E

,

Henrick

K

.

Inference of macromolecular assemblies from crystalline state

.

J Mol Biol

2007

;

372

:

774

–

97

.

67.

Ho

Y

,

Gruhler

A

,

Heilbut

A

, et al.

Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry

.

Nature

2002

;

415

:

180

–

3

.

68.

Reguly

T

,

Breitkreutz

A

,

Boucher

L

, et al.

Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae

.

J Biol

2006

;

5

:

11

.

69.

Mewes

HW

,

Frishman

D

,

Güldener

U

, et al.

MIPS: a database for genomes and protein sequences

.

Nucleic Acids Res

2002

;

30

:

31

–

4

.

70.

Nepusz

T

,

Yu

H

,

Paccanaro

A

.

Detecting overlapping protein complexes in protein–protein interaction networks

.

Nat Methods

2012

;

9

:

471

–

2

.

71.

Przytycka

TM

,

Singh

M

,

Slonim

DK

.

Toward the dynamic interactome: it's about time

.

Brief Bioinform

2010

;

11

:

15

–

29

.

72.

Singh

R

,

Xu

J

,

Berger

B

.

Global alignment of multiple protein interaction networks with application to functional orthology detection

.

Proc Natl Acad Sci U S A

2008

;

105

:

12763

–

8

.

73.

Geva

G

,

Sharan

R

.

Identification of protein complexes from co-immunoprecipitation data

.

Bioinformatics

2011

;

27

:

111

–

7

.

74.

Wei

L

,

Zou

Q

,

Liao

M

, et al.

A novel machine learning method for cytokine–receptor interaction prediction

.

Comb Chem High Throughput Screen

2016

;

19

:

144

–

52

.

75.

Xie

Z

,

Kwoh

CK

,

Li

XL

, et al.

Construction of co-complex score matrix for protein complex prediction from AP-MS data

.

Bioinformatics

2011

;

27

:

i159

–

66

.

76.

Altaf-Ul-Amin

M

,

Shinbo

Y

,

Mihara

K

, et al.

Development and implementation of an algorithm for detection of protein complexes in large interaction networks

.

BMC Bioinformatics

2006

;

7

:

207

–

19

.

77.

Kim

J

,

Tan

K

.

Discover protein complexes in protein–protein interaction networks using parametric local modularity

.

BMC Bioinformatics

2010

;

11

:

521

–

31

.

78.

Wang

J

,

Li

M

,

Wang

H

, et al.

Identification of essential proteins based on edge clustering coefficient

.

IEEE/ACM Trans Comput Biol Bioinform

2012

;

9

:

1070

–

80

.

79.

Liu

G

,

Wong

L

,

Chua

HN

.

Complex discovery from weighted PPI networks

.

Bioinformatics

2009

;

25

:

1891

–

7

.

80.

Chua

HN

,

Ning

K

,

Sung

W-K

, et al.

Using indirect protein–protein interactions for protein complex prediction

.

J Bioinform Comput Biol

2008

;

06

:

435

–

66

.

81.

Ashburner

M

,

Ball

CA

,

Blake

JA

, et al.

Gene Ontology: tool for the unification of biology

.

Nat Genet

2000

;

25

:

25

–

9

.

82.

Expansion of the Gene Ontology knowledgebase and resources

.

Nucleic Acids Res

2017

;

45

:

D331

–

8

.

83.

Mazandu

GK

,

Chimusa

ER

,

Mulder

NJ

.

Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery

.

Brief Bioinform

2017

;

18

:

886

–

901

.

84.

Luo

F

,

Liu

J

,

Li

J

.

Discovering conditional co-regulated protein complexes by integrating diverse data sources

.

BMC Syst Biol

2010

;

4

:

S4

–

S16

.

85.

Maraziotis

IA

,

Dimitrakopoulou

K

,

Bezerianos

A

.

Growing functional modules from a seed protein via integration of protein interaction and gene expression data

.

BMC Bioinformatics

2007

;

8

:

408

–

22

.

86.

Zaki

N

,

Berengueres

J

,

Efimov

D

.

Detection of protein complexes using a protein ranking algorithm

.

Proteins

2012

;

80

:

2459

–

68

.

87.

Kiemer

L

,

Costa

S

,

Ueffing

M

, et al.

WI-PHI: a weighted yeast interactome enriched for direct physical interactions

.

Proteomics

2007

;

7

:

932

–

43

.

88.

Taghipour

S

,

Zarrineh

P

,

Ganjtabesh

M

, et al.

Improving protein complex prediction by reconstructing a high-confidence protein–protein interaction network of Escherichia coli from different physical interaction data sources

.

BMC Bioinformatics

2017

;

18

:

10

.

89.

Xu

B

,

Lin

H

,

Yang

Z

.

Ontology integration to identify protein complex in protein interaction networks

.

Proteome Sci

2011

;

9

(

Suppl 1

):

S7

.

90.

Cai

B

,

Wang

H

,

Zheng

H

. Incorporating semantic similarity into clustering process for identifying protein complexes from affinity purification/mass spectrometry data. In:

2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Philadelphia, USA

,

2012

,

1

–

4

.

91.

Levy

ED

,

Pereira-Leal

JB

.

Evolution and dynamics of protein interactions and networks

.

Curr Opin Struct Biol

2008

;

18

:

349

–

57

.

92.

Wang

J

,

Peng

X

,

Peng

W

, et al.

Dynamic protein interaction network construction and applications

.

Proteomics

2014

;

14

:

338

–

52

.

93.

Hanna

EM

,

Zaki

N

,

Amin

A

.

Detecting protein complexes in protein interaction networks modeled as gene expression biclusters

.

PLoS One

2015

;

10

(

12

):

e0144163

.

94.

Pereira-Leal

JB

,

Levy

ED

,

Teichmann

SA

.

The origins and evolution of functional modules: lessons from protein complexes

.

Philos Trans R Soc Lond B Biol Sci

2006

;

361

:

507

–

17

.

95.

Sharan

R

,

Ideker

T

,

Kelley

B

, et al.

Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data

.

J Comput Biol

2005

;

12

:

835

–

46

.

96.

Hirsh

E

,

Sharan

R

.

Identification of conserved protein complexes based on a model of protein network evolution

.

Bioinformatics

2007

;

23

:

e170

–

6

.

97.

Dutkowski

J

,

Tiuryn

J

.

Identification of functional modules from conserved ancestral protein–protein interactions

.

Bioinformatics

2007

;

23

:

i149

–

58

.

98.

Nguyen

PV

,

Srihari

S

,

Leong

HW

.

Identifying conserved protein complexes between species by constructing interolog networks

.

BMC Bioinformatics

2013

;

14

(

Suppl 16

):

S8

.

99.

Georgii

E

,

Dietmann

S

,

Uno

T

, et al.

Enumeration of condition-dependent dense modules in protein interaction networks

.

Bioinformatics

2009

;

25

:

933

–

40

.

100.

Radicchi

F

,

Castellano

C

,

Cecconi

F

, et al.

Defining and identifying communities in networks

.

Proc Natl Acad Sci U S A

2004

;

101

:

2658

–

63

.

101.

Chen

J

,

Yuan

B

.

Detecting functional modules in the yeast protein–protein interaction network

.

Bioinformatics

2006

;

22

:

2283

–

90

.

102.

Girvan

M

,

Newman

MEJ

.

Community structure in social and biological networks

.

Proc Natl Acad Sci U S A

2002

;

99

:

7821

–

6

.

103.

Lian

H

,

Song

CS

,

Cho

YR

. Decomposing protein interactome networks by graph entropy. In:

2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

, Vol.

2010

.

Hong Kong, China

,

585

–

9

.

104.

Kenley

EC

,

Cho

YR

.

Detecting protein complexes and functional modules from protein interaction networks: a graph entropy approach

.

Proteomics

2011

;

11

:

3835

–

44

.

105.

Wang

J

,

Li

M

,

Chen

J

, et al.

A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks

.

IEEE/ACM Trans Comput Biol Bioinform

2011

;

8

:

607

–

20

.

106.

Ren

J

,

Wang

J

,

Li

M

. Identifying protein complexes based on local fitness method. In:

2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Philadelphia, USA

,

2012

,

1

–

6

.

107.

Wang

J

,

Chen

G

,

Liu

B

, et al.

Identifying protein complexes from interactome based on essential proteins and local fitness method

.

IEEE Trans Nanobioscience

2012

;

11

:

324

–

35

.

108.

Qi

Y

,

Balem

F

,

Faloutsos

C

, et al.

Protein complex identification by supervised graph local clustering

.

Bioinformatics

2008

;

24

:

i250

–

68

.

109.

King

AD

,

Pržulj

N

,

Jurisica

I

.

Protein complex prediction via cost-based clustering

.

Bioinformatics

2004

;

20

:

3013

–

20

.

110.

Newman

ME

.

Modularity and community structure in networks

.

Proc Natl Acad Sci U S A

2006

;

103

:

8577

–

82

.

111.

Li

X

,

Tan

S-H

,

Foo

C-S

, et al.

Interaction graph mining for protein complexes using local clique merging

.

Genome Inform

2005

;

16

:

260

–

8

.

112.

Navlakha

S

,

Rastogi

R

,

Shrivastava

N

. Graph summarization with bounded error. In:

Proceedings of the 2008 ACM SIGMOD international conference on Management of data

.

Vancouver, Canada

:

ACM

,

2008

,

419

–

32

.

113.

Navlakha

S

,

Schatz

MC

,

Kingsford

C

.

Revealing biological modules via graph summarization

.

J Comput Biol

2009

;

16

:

253

–

64

.

114.

Zhang

XF

,

Dai

DQ

,

Li

XX

.

Protein complexes discovery based on protein–protein interaction data via a regularized sparse generative network model

.

IEEE/ACM Trans Comput Biol Bioinform

2012

;

9

:

857

–

70

.

115.

Bader

GD

,

Hogue

CWV

.

An automated method for finding molecular complexes in large protein interaction networks

.

BMC Bioinformatics

2003

;

4

:

2

–

27

.

116.

Lubovac

Z

,

Gamalielsson

J

,

Olsson

B

.

Combining functional and topological properties to identify core modules in protein interaction networks

.

Proteins

2006

;

64

:

948

–

59

.

117.

Adamcsek

B

,

Palla

G

,

Farkas

IJ

, et al.

CFinder: locating cliques and overlapping modules in biological networks

.

Bioinformatics

2006

;

22

:

1021

–

3

.

118.

Li

M

,

J-e

C

,

Wang

J-x

, et al.

Modifying the DPClus algorithm for identifying protein complexes based on new topological structures

.

BMC Bioinformatics

2008

;

9

:

398

–

413

.

119.

Jiang

P

,

Singh

M

.

SPICi: a fast clustering algorithm for large biological networks

.

Bioinformatics

2010

;

26

:

1105

–

11

.

120.

Mete

M

,

Tang

F

,

Xu

X

, et al.

A structural approach for finding functional modules from large biological networks

.

BMC Bioinformatics

2008

;

9

:

S19

–

32

.

121.

Pu

S

,

Vlasblom

J

,

Emili

A

, et al.

Identifying functional modules in the physical interactome of Saccharomyces cerevisiae

.

Proteomics

2007

;

7

:

944

–

60

.

122.

Leung

HCM

,

Xiang

Q

,

Yiu

SM

, et al.

Predicting protein complexes from PPI data: a core-attachment approach

.

J Comput Biol

2009

;

16

:

133

–

44

.

123.

Wu

M

,

Li

X

,

Kwoh

C-K

, et al.

A core-attachment based method to detect protein complexes in PPI networks

.

BMC Bioinformatics

2009

;

10

:

169

–

84

.

124.

Chen

B

,

Shi

J

,

Wu

F-X

. Not all protein complexes exhibit dense structures in S. cerevisiae PPI network. In:

2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Philadelphia, USA

,

2012

,

1

–

4

.

125.

Jung

SH

,

Jang

WH

,

Hur

HY

, et al.

Protein complex prediction based on mutually exclusive interactions in protein interaction network

.

Genome Inform

2008

;

21

:

77

–

88

.

126.

Ozawa

Y

,

Saito

R

,

Fujimori

S

, et al.

Protein complex prediction via verifying and reconstructing the topology of domain–domain interactions

.

BMC Bioinformatics

2010

;

11

:

350

–

61

.

127.

van

Dongen

SM

.

Graph Clustering by Flow Simulation

.

Universiteit Utrecht

,

2000

.

128.

Enright

AJ

,

Van Dongen

S

,

Ouzounis

CA

.

An efficient algorithm for large-scale detection of protein families

.

Nucleic Acids Res

2002

;

30

:

1575

–

84

.

129.

Satuluri

V

,

Parthasarathy

S

,

Ucar

D

. Markov clustering of protein interaction networks with improved balance and scalability. In:

Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology

.

Niagara Falls, New York

:

ACM

,

2010

,

247

–

56

.

130.

Shih

YK

,

Parthasarathy

S

.

Identifying functional modules in interaction networks through overlapping Markov clustering

.

Bioinformatics

2012

;

28

:

I473

–

9

.

131.

Hwang

W

,

Cho

Y-R

,

Zhang

A

, et al.

A novel functional module detection algorithm for protein–protein interaction networks

.

Algorithms Mol Biol

2006

;

1

:

24

–

34

.

132.

Yr

C

,

Hwang

W

,

Zhang

A

. Identification of overlapping functional modules in protein interaction networks: information flow-based approach. In:

Sixth IEEE International Conference on Data Mining-Workshops (ICDMW'06)

.

Hong Kong, China

,

2006

,

147

–

52

.

133.

Macropol

K

,

Can

T

,

Singh

AK

.

RRW: repeated random walks on genome-scale protein networks for local cluster discovery

.

BMC Bioinformatics

2009

;

10

:

283

–

92

.

134.

Maruyama

O

,

Chihara

A

. NWE: node-weighted expansion for protein complex prediction using random walk distances. In:

2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

.

Hong Kong, China

,

2010

,

590

–

4

.

135.

Asur

S

,

Ucar

D

,

Parthasarathy

S

.

An ensemble framework for clustering protein–protein interaction networks

.

Bioinformatics

2007

;

23

:

i29

–

i40

.

136.

Friedel

CC

,

Krumsiek

J

,

Zimmer

R

.

Bootstrapping the interactome: unsupervised identification of protein complexes in yeast

.

J Comput Biol

2009

;

16

:

971

–

87

.

137.

Greene

D

,

Cagney

G

,

Krogan

N

, et al.

Ensemble non-negative matrix factorization methods for clustering protein–protein interactions

.

Bioinformatics

2008

;

24

:

1722

–

8

.

138.

Collins

SR

,

Kemmeren

P

,

Zhao

X-C

, et al.

Toward a comprehensive atlas of the physical Interactome of Saccharomyces cerevisiae

.

Mol Cell Proteomics

2007

;

6

:

439

–

50

.

139.

Brohee

S

,

van

Helden

J

.

Evaluation of clustering algorithms for protein–protein interaction networks

.

BMC Bioinformatics

2006

;

7

:

488

.

140.

Bu

DB

,

Zhao

Y

,

Cai

L

, et al.

Topological structure analysis of the protein–protein interaction network in budding yeast

.

Nucleic Acids Res

2003

;

31

:

2443

–

50

.

141.

Przulj

N

,

Wigle

DA

,

Jurisica

I

.

Functional topology in a network of protein interactions

.

Bioinformatics

2004

;

20

:

340

–

8

.

142.

Boyle

EI

,

Weng

S

,

Gollub

J

, et al.

GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes

.

Bioinformatics

2004

;

20

:

3710

–

5

.

143.

Tang

X

,

Wang

J

,

Liu

B

, et al.

A comparison of the functional modules identified from time course and static PPI network data

.

BMC Bioinformatics

2011

;

12

:

339

.

144.

Palla

G

,

Derenyi

I

,

Farkas

I

, et al.

Uncovering the overlapping community structure of complex networks in nature and society

.

Nature

2005

;

435

:

814

–

8

.

145.

Wodak

SJ

,

Vlasblom

J

,

Turinsky

AL

, et al.

Protein–protein interaction networks: the puzzling riches

.

Curr Opin Struct Biol

2013

;

23

:

941

–

53

.

146.

Stumpf

MP

,

Thorne

T

,

de

Silva

E

, et al.

Estimating the size of the human interactome

.

Proc Natl Acad Sci U S A

2008

;

105

:

6959

–

64

.

147.

Hart

GT

,

Ramani

AK

,

Marcotte

EM

.

How complete are current yeast and human protein-interaction networks?

Genome Biol

2006

;

7

(

11

):

120

.

148.

Yu

H

,

Braun

P

,

Yildirim

MA

, et al.

High-quality binary protein interaction map of the yeast interactome network

.

Science

2008

;

322

:

104

–

10

.

149.

De Las

RJ

,

Fontanillo

C

.

Protein–protein interactions essentials: key concepts to building and analyzing interactome networks

.

PLoS Comput Biol

2010

;

6

:e1000807.

150.

Nguyen

TN

,

Goodrich

JA

.

Protein–protein interaction assays: eliminating false positive interactions

.

Nat Methods

2006

;

3

:

135

–

9

.

151.

von

Mering

C

,

Krause

R

,

Snel

B

, et al.

Comparative assessment of large-scale data sets of protein–protein interactions

.

Nature

2002

;

417

:

399

–

403

.

152.

Blohm

P

,

Frishman

G

,

Smialowski

P

, et al.

Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis

.

Nucleic Acids Res

2014

;

42

:

D396

–

D400

.

153.

Stelzl

U

,

Wanker

EE

.

The value of high quality protein–protein interaction networks for systems biology

.

Curr Opin Chem Biol

2006

;

10

:

551

–

8

.

154.

Suthram

S

,

Shlomi

T

,

Ruppin

E

, et al.

A direct comparison of protein interaction confidence assignment schemes

.

BMC Bioinformatics

2006

;

7

:

360

.

155.

Orre

LM

,

Vesterlund

M

,

Pan

Y

, et al.

SubCellBarCode: proteome-wide mapping of protein localization and relocalization

.

Mol Cell

2019

;

73

:

166

–

82

.

156.

Yu

HY

,

Luscombe

NM

,

Lu

HX

, et al.

Annotation transfer between genomes: protein–protein interologs and protein-DNA regulogs

.

Genome Res

2004

;

14

:

1107

–

18

.

157.

Liu

B

,

Zhang

D

,

Xu

R

, et al.

Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection

.

Bioinformatics

2014

;

30

:

472

–

9

.

158.

Chen

J

,

Guo

M

,

Li

S

, et al.

ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised Learning to Rank

.

Bioinformatics

2017

;

33

:

3473

–

6

.

159.

Liu

B

,

Chen

J

,

Wang

X

.

Application of learning to rank to protein remote homology detection

.

Bioinformatics

2015

;

31

:

3492

–

8

.

160.

Chen

J

,

Guo

M

,

Wang

X

, et al.

A comprehensive review and comparison of different computational methods for protein remote homology detection

.

Brief Bioinform

2018

;

9

:

231

–

44

.

161.

Liu

B

,

Jiang

S

,

Zou

Q

.

HITS-PR-HHblits: protein remote homology detection by combining pagerank and hyperlink-induced topic search

.

Brief Bioinform

. DOI:

10.1093/bib/bby104

.

162.

Zeng

XX

,

Liu

L

,

Lu

LY

, et al.

Prediction of potential disease-associated microRNAs using structural perturbation method

.

Bioinformatics

2018

;

34

:

2425

–

32

.

163.

Zou

Q

,

Li

J

,

Song

L

, et al.

Similarity computation strategies in the microRNA–disease network: a survey

.

Brief Funct Genomics

2016

;

15

:

55

–

64

.

164.

Zhang

X

,

Zou

Q

,

Rodruguez-Paton

A

, et al.

Meta-path methods for prioritizing candidate disease miRNAs

.

IEEE/ACM Trans Comput Biol Bioinform

2019

;

16

(

1

):

283

–

91

.

165.

Zeng

X

,

Ding

N

,

Rodríguezpatón

A

, et al.

Probability-based collaborative filtering model for predicting gene–disease associations

.

BMC Med Genomics

2017

;

10

:

76

.

166.

Liu

Y

,

Zeng

X

,

He

Z

, et al.

Inferring microRNA–disease associations by random walk on a heterogeneous network with multiple data sources

.

IEEE/ACM Trans Comput Biol Bioinform

2017

;

14

:

905

–

15

.

167.

Schlecht

U

,

Miranda

M

,

Suresh

S

, et al.

Multiplex assay for condition-dependent changes in protein–protein interactions

.

Proc Natl Acad Sci U S A

2012

;

109

:

9213

–

8

.

168.

Celaj

A

,

Schlecht

U

,

Smith

JD

, et al.

Quantitative analysis of protein interaction network dynamics in yeast

.

Mol Syst Biol

2017

;

13

:

934

.

169.

Liu

B

,

Gao

X

,

Zhang

H

.

BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches

.

Nucleic Acids Res

. DOI:

10.1093/nar/gkz740

.

170.

Liu

B

.

BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches

.

Brief Bioinform

. DOI:

10.1093/bib/bbx165

.

171.

Liu

Y

,

Wang

X

,

Liu

B

.

A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction

.

Brief Bioinform

2019

;

20

:

330

–

46

.

172.

Liu

Y

,

Wang

X

,

Liu

B

.

IDP-CRF: intrinsically disordered protein/region identification based on conditional random fields

.

Int J Mol Sci

2018

;

19

:

2483

.