HKC: An Algorithm to Predict Protein Complexes in Protein-Protein Interaction Networks

With the availability of more and more genome-scale protein-protein interaction (PPI) networks, research interest is gradually shifting to the systematic analysis of these large data sets. A key topic is to predict protein complexes in PPI networks by identifying clusters that are densely connected within themselves but sparsely connected with the rest of the network. In this paper, we present a new topology-based algorithm, HKC, to detect protein complexes in genome-scale PPI networks. HKC mainly uses the concepts of highest k-core and cohesion to predict protein complexes by identifying overlapping clusters. Experiments on two data sets and two benchmarks show that our algorithm achieves a relatively high F-measure and performs better than some other methods.


systematically analyze the behaviors and properties of biological molecules. Among the various research topics on PPI networks, a key one is to predict protein complexes in PPI networks. A protein complex is a group of proteins that interact with each other at the same time and place, forming a single multimolecular machine [3]. These complexes are a cornerstone of many biological processes. PPI networks are modular [3,4] and contain modules that are densely connected within themselves but sparsely connected with the rest of the network. These modules are called clusters, and they may represent protein complexes [5,6]. Thus, protein complexes can be detected by identifying clusters in PPI networks. Detecting protein complexes in PPI networks is of vital importance to the understanding of the structural and functional properties of PPI networks and can also help to predict the function of unknown proteins.

Due to the high level of noise as well as the topological features of PPI networks, traditional clustering techniques in a metric space cannot successfully predict protein complexes in PPI networks [3]; thus various graph analysis approaches have been proposed to solve this problem. These approaches can be classified into the following three types. (1) Agglomerative methods: Bader and Hogue [6] proposed a graph theoretic clustering algorithm to detect molecular complexes in PPI networks. The method is based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. The problem with this method is that, using only vertex weighting, it can find sparsely connected subgraphs (such as a rope-like subgraph or a mixed subgraph consisting of several connected clusters) instead of dense clusters. (2) Graph partition methods: King et al. [7] partitioned networks and found an approximately optimal solution in the space of partitions using a cost-based local search algorithm. This technique is nondeterministic, giving a different result in each run, and it consumes a large amount of space. Chen and Yuan [8] extended a betweenness-based partition algorithm (the Girvan-Newman algorithm [9], GN for short) and used it to partition PPI networks into subgraphs, and then obtained function modules by filtering these subgraphs. The chief drawback of the GN algorithm is that it is time consuming. Generally, graph partition methods can only find nonoverlapping clusters, while protein complexes tend to overlap with each other. (3) Methods based on cliques (see Section 2.2 Concepts): Palla et al. [10] suggested the clique percolation method (CPM) to find overlapping communities. The core concept of CPM is the k-clique community, which is defined as the union of all k-cliques (complete subgraphs of size k) that can be reached from each other through a series of adjacent k-cliques (where adjacency means sharing k-1 nodes). Some other researchers [11,12] predicted protein complexes by identifying cliques or near-cliques in PPI networks. Clique-based approaches are too stringent about the topological structure and cannot detect protein complexes with other types of topological structure.

In this paper, we present a new topology-based algorithm, HKC (named after the two concepts it mainly uses, highest k-core and cohesion), to detect protein complexes in genome-scale PPI networks. Our algorithm predicts protein complexes by identifying overlapping clusters. It first calculates a score for each node in the PPI network based on the concept of highest k-core and then uses the nodes with high scores and high degrees as seeds. From each seed, it gets a core according to the corresponding highest k-core of the direct neighborhood of the seed and expands this core to include nodes that are highly likely to form a cluster, based on the criteria of node score and cohesion. Finally, protein complexes are predicted by filtering all the found clusters according to predefined features. We apply HKC to two data sets (the MIPS and SGD-MC data sets) and evaluate the results on two benchmarks (the complexcat and Gavin benchmarks). The experiments show that our algorithm has relatively high precision and recall; that is, most of the predicted clusters match well with known protein complexes, and at the same time most of the known protein complexes are recalled. HKC exhibits better performance compared with some other methods; besides, it is general and can be used in any type of biological interaction network and even in nonbiological networks.


Materials and Method

2.1. Materials. Protein-protein interactions of the model organism Saccharomyces cerevisiae (yeast) have been studied thoroughly, and the data on yeast protein complexes are the most comprehensive so far, so we test HKC on the yeast PPI data and use yeast complex data to evaluate its effectiveness.

We use the following two data sets as the input network: (1) The MIPS data set, which is processed from the Munich Information Center for Protein Sequences (MIPS) [13] yeast PPI data set (containing 15,456 records) downloaded from the Comprehensive Yeast Genome Database (CYGD) [14]; after removing repeated interactions and ignoring edge direction, the final input PPI network contains 4,554 proteins and 12,526 interactions. (2) The SGD-MC data set, which is downloaded from the Literature Curation Data in the SGD database [15]; the original data set contains 252,247 records, including two types of interactions, high-throughput and manually curated interactions, and we only use the manually curated interactions. After deleting all repeated interactions and ignoring edge direction, the final SGD-MC data set contains 4,448 proteins and 29,068 interactions.

To evaluate the algorithm, we collect two data sets as benchmarks: (1) The complexcat benchmark, which is obtained by processing the MIPS yeast protein complex catalog in CYGD [14]. The MIPS yeast protein complex catalog was last modified in 2006 and has been manually curated from the literature; therefore, it is more realistic than other data obtained by high-throughput methods and has been used in many studies because of its quality [6,7,16]. However, it is not appropriate to use this data set directly as the benchmark, since it contains many complexes composed of a single protein, which do not fit the definition of clusters, and it also contains complexes identified by systematic analysis [17,18], which are not as reliable as results from small-scale experiments. After removing the complexes identified by systematic analysis (type 550) and the complexes consisting of only one protein, the final complexcat benchmark contains 217 protein complexes. (2) The Gavin benchmark, which is processed from the experimental results of Gavin et al. [19]. They used affinity purification and mass spectrometry to obtain 491 complexes that differentially combine with additional attachment proteins or protein modules to enable a diversification of potential functions. We adopt the core proteins (those present in 2/3 of the isoforms) of the 491 complexes and 36 additional known complexes, and after removing those of size less than 3, we finally get 204 protein complexes in the Gavin benchmark.


Concepts

2.2.1. PPI Network. PPI networks can be intuitively modeled as a static graph G = (V, E), where V is the set of nodes (proteins), and E is the set of edges (protein-protein interactions).

An undirected edge is drawn between each pair of nodes for which there is evidence of a protein-protein interaction.


Basic Concepts in Graph Theory

Degree. The degree of a node n is the number of edges attached to it and is denoted as deg(n).

Graph Density. There is no standard graph-theoretic definition of density, but definitions are normally based on the connectivity level of a graph [6]. For undirected simple graphs, the graph density is defined as the number of real edges in the graph divided by the number of all possible edges:

$$\mathrm{den}(G) = \frac{|E|}{|V|\,(|V| - 1)/2}, \tag{1}$$

where den(G) is the density of graph G, |E| is the number of real edges in G, and |V| is the number of nodes in G.

For an undirected graph with loops (a loop is an edge that connects a node to itself), the number of all possible edges is |V| * |V|/2, and the graph density is defined as

$$\mathrm{den}(G) = \frac{|E|}{|V|\,|V|/2}. \tag{2}$$

So, the range of graph density is between 0 and 1.
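The two density definitions can be checked numerically with a minimal Python sketch. The function name `density` and the convention of returning 0.0 when no edge is possible are assumptions made for illustration, not part of the original method.

```python
def density(num_nodes: int, num_edges: int, loops: bool = False) -> float:
    """Graph density following Eq. (1) (simple graph) or Eq. (2) (with loops)."""
    if loops:
        possible = num_nodes * num_nodes / 2        # denominator of Eq. (2)
    else:
        possible = num_nodes * (num_nodes - 1) / 2  # denominator of Eq. (1)
    if possible == 0:
        return 0.0  # assumed convention: no possible edges means density 0
    return num_edges / possible


# A clique on 4 nodes has 6 edges and density 1; a path on 4 nodes has 3 edges.
print(density(4, 6))  # 1.0
print(density(4, 3))  # 0.5
```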

Clique. A clique is a fully connected subgraph, that is, a set of nodes that are all neighbors of each other [20]. For instance, Figure 1(c) is a clique consisting of 4 nodes.

k-Core. A k-core of a graph is a maximal subgraph such that each node in the subgraph has degree at least k, and it is denoted as kc(G). For example, in Figure 1, the graph in (b) is a 2-core of the graph in (a).

Highest k-Core. The highest k-core of a graph is the one with the maximal k value among all the k-cores and is denoted as hkc(G). It is the central, most densely connected subgraph. The highest k-core of G can be found in the following way [6]: suppose the lowest degree of nodes in G is l; delete all nodes with degree l; if all remaining nodes have degree at least l1 (l1 > l), we get the l1-core of G; if some of the remaining nodes have degree lower than or equal to l, continue deleting all nodes with the least degree until all remaining nodes have a degree higher than l, or until all nodes have been deleted. In this way we can find all k-cores, and the one with the maximal k value is the highest k-core. For example, Figure 1(a) shows a graph G. If we delete the node of degree 1 (i.e., node 6), we obtain a subgraph in which each node has degree at least 2, that is, the 2-core of G, as shown in Figure 1(b); if we then delete the node of degree 2 (node 1), we get a subgraph in which each node has degree at least 3, the 3-core of G, as shown in Figure 1(c), which is also the highest k-core of graph G.
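The peeling procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' Java implementation; the adjacency-dictionary representation and the example edge list (chosen to be consistent with the description of Figure 1, whose exact edges are not given in the text) are assumptions.

```python
def highest_k_core(adj):
    """Return (k_max, nodes) for the highest k-core of an undirected simple graph.

    adj maps each node to the set of its neighbours (symmetric, no loops).
    """
    alive = {n: set(nbrs) for n, nbrs in adj.items()}  # working copy
    k_max, best = 0, set(alive)
    while alive:
        k = min(len(nbrs) for nbrs in alive.values())
        if k > k_max:
            # Every remaining node has degree >= k, so this is a k-core.
            k_max, best = k, set(alive)
        # Delete all nodes whose degree equals the current minimum.
        for n in [v for v, nbrs in alive.items() if len(nbrs) == k]:
            for m in alive.pop(n):
                if m in alive:
                    alive[m].discard(n)
    return k_max, best


# Assumed example: a 4-clique {2, 3, 4, 5}, node 1 linked to 2, 3 and 6.
edges = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
graph = {n: set() for n in range(1, 7)}
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

print(highest_k_core(graph))  # (3, {2, 3, 4, 5})
```

Recording a core only when the minimum degree strictly increases means the returned node set is the first, and therefore largest, subgraph reaching the maximal k.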


Cohesion.

During the expansion of a core, we cannot reliably decide whether a node should be included into the core based only on the score information. We define a new concept, cohesion, to measure the connectivity between a node n and an existing cluster c, and we denote it as co(n, c).

Cohesion is calculated in the following way:

$$\mathrm{co}(n, c) = \frac{E_{nc}}{N_c}, \tag{3}$$

where E_nc is the number of edges between node n and cluster c, and N_c is the number of nodes in cluster c. co(n, c) is a real number between 0 and 1. The larger co(n, c) is, the more tightly node n connects to cluster c, and the more likely they belong to a larger cluster. Therefore, cohesion can be used as a criterion during the core expansion: a node can only be included into a core when the cohesion between this node and the core is greater than a specified threshold.
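As a small illustration of Eq. (3), the following Python sketch computes cohesion from an adjacency dictionary. The function name `cohesion`, the data layout, and the convention of returning 0.0 for an empty cluster are assumptions made for this example.

```python
def cohesion(adj, node, cluster):
    """co(n, c) = E_nc / N_c for an undirected simple graph.

    adj maps each node to the set of its neighbours.
    """
    cluster = set(cluster)
    if not cluster:
        return 0.0  # assumed convention for an empty cluster
    edges_to_cluster = len(adj[node] & cluster)  # E_nc
    return edges_to_cluster / len(cluster)       # divided by N_c


# Node "x" touches 2 of the 4 nodes of the cluster, so its cohesion is 0.5.
adj_example = {
    "x": {"a", "b"},
    "a": {"x", "b", "c", "d"},
    "b": {"x", "a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
}
print(cohesion(adj_example, "x", {"a", "b", "c", "d"}))  # 0.5
```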


The Algorithm.

The algorithm HKC consists of the following three steps: scoring, cluster finding, and filtering.


2.3.1. Scoring. The first step of HKC is to score all nodes in the PPI network. For each node, we first find the highest k-core of its direct neighborhood (the subgraph consisting of all nodes connected to the node, including the node itself), which we denote as H. Then we score H using the properties of the highest k-core. Larger and denser cores get higher scores, and the score is computed in the following way:

$$\mathrm{score}(H) = N_H \cdot \mathrm{den}(H) \cdot k_{\max}, \tag{4}$$

where N_H denotes the number of nodes in H, den(H) is the density of H, and k_max is the maximal k value corresponding to the highest k-core H. The larger these three values are, the larger and denser the corresponding highest k-core is.

As the highest k-core is the most densely connected central core in the local area, and so that the node score better reflects the connectivity in the local area, we assign score(H) to each node in H. A node n may be contained in more than one highest k-core and may thus be given more than one score; the final score of this node is defined as the maximal score of all highest k-cores in which node n is contained (see lines 14-16 in Pseudocode 1):

$$\mathrm{score}(n) = \max\{\mathrm{score}(H_i) \mid n \in H_i\}. \tag{5}$$

In this way, we can make sure that all nodes in densely connected subgraphs have high scores and those with few neighbors have low scores, and therefore we can distinguish the nodes in clusters from those not in clusters with the help of the node score.
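The scoring step of Eqs. (4) and (5) can be sketched in Python as below. This is a hedged illustration rather than the authors' code: the helper names (`highest_k_core`, `induced`, `score_nodes`), the adjacency-dictionary layout, and the example graph are all assumptions, and the density of a single-node core is taken to be 0.

```python
def highest_k_core(adj):
    """Peel minimum-degree nodes; return (k_max, nodes) of the highest k-core."""
    alive = {n: set(nbrs) for n, nbrs in adj.items()}
    k_max, best = 0, set(alive)
    while alive:
        k = min(len(nbrs) for nbrs in alive.values())
        if k > k_max:
            k_max, best = k, set(alive)
        for n in [v for v, nbrs in alive.items() if len(nbrs) == k]:
            for m in alive.pop(n):
                if m in alive:
                    alive[m].discard(n)
    return k_max, best


def induced(adj, nodes):
    """Subgraph of adj induced by the given nodes."""
    nodes = set(nodes)
    return {n: adj[n] & nodes for n in nodes}


def score_nodes(adj):
    """score(n) = max over highest k-cores H containing n of N_H * den(H) * k_max."""
    scores = {n: 0.0 for n in adj}
    for n in adj:
        # Direct neighbourhood of n, including n itself.
        k_max, core = highest_k_core(induced(adj, adj[n] | {n}))
        sub = induced(adj, core)
        n_h = len(core)
        edges = sum(len(nbrs) for nbrs in sub.values()) // 2
        den = edges / (n_h * (n_h - 1) / 2) if n_h > 1 else 0.0
        s = n_h * den * k_max                  # Eq. (4)
        for m in core:
            scores[m] = max(scores[m], s)      # Eq. (5)
    return scores


# Assumed example: a 4-clique {2, 3, 4, 5}, node 1 linked to 2, 3 and 6.
edges = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
graph = {n: set() for n in range(1, 7)}
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

scores = score_nodes(graph)
print(scores[2], scores[1], scores[6])  # 12.0 6.0 2.0
```

In this example the clique nodes receive 4 * 1 * 3 = 12, node 1 receives 3 * 1 * 2 = 6 from the triangle {1, 2, 3}, and the pendant node 6 receives 2 * 1 * 1 = 2.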


Cluster Finding.

The second step of this algorithm is to find clusters. The process of finding a cluster is as follows: first choose the seed, then obtain the core based on the seed, and then expand the core to include the noncore nodes. The final cluster obtained in this way is thus a circular subgraph with a densely connected core and a less densely connected noncore, as shown in Figure 2, which fits our expectation of the topological structure of protein complexes.

The detailed process of finding a cluster can be divided into the following three steps.

(1) Choose the seed. As node scores represent the local density, a node with a very high score must have a very dense neighborhood; besides, among nodes with the same score (e.g., according to our scoring scheme, nodes in the same highest k-core may have the same score), the one with the highest degree can be deemed the most densely connected. Therefore, among all the unseen nodes with the highest score we choose the one with the highest degree as the seed (see lines 2-4 in Pseudocode 2); this way of selecting the seed ensures that the seed is in the center of the corresponding cluster.

(2) Get the core based on the selected seed. First get the highest k-core of the direct neighborhood of the seed; after removing all nodes that have already been seen and those with too low scores, we get the core (see lines 8-13 in Pseudocode 2). In order to avoid repeated computation, all nodes in the core (including the seed) are marked as seen and cannot be used as the core or the seed of another cluster.

(3) Expand the core. As the node score indicates the local connectivity in the neighborhood of a node, it can be used as a criterion during core expansion. When expanding a core, in order to guarantee the local density, nodes with too low scores cannot be included into the core; on the other hand, to avoid excessively expanding a small core to include a denser and larger cluster, nodes with excessively high scores cannot be included into the core either. Furthermore, if we used the node score threshold as the only criterion to decide whether a node should be included into the core, connected nodes with similar scores would be detected as one cluster even when they do not actually make up a densely connected subgraph, such as those with a rope-like shape. So we adopt cohesion as another criterion. In the expansion criteria, T1 and T2 are the node score thresholds that set the lower bound and upper bound, respectively; T1 is a real number between 0 and 1, and T2 is an integer greater than 1; T3 is the cohesion threshold, a real number ranging from 0 to 1. Add the found nodes to the core and continue to expand the core until all nodes (including the newly added nodes) in the core have been expanded. After finding a cluster using the above method, choose another seed and repeat the cluster-finding process until no node qualifies as a seed; in this way we get all clusters in the PPI network.

It is worth noting that while expanding a core, we never consider whether a node has been seen or not. As a result, the noncore part of a cluster can include nodes that have been seen as the core of another cluster; that is to say, the different clusters detected by HKC can have overlaps between cores and noncores. In this way, we can find overlapping clusters, which better agrees with the fact that different protein complexes overlap.
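The seed-selection rule of step (1) can be illustrated with a minimal Python sketch. The function name `choose_seed` and the dictionary inputs are hypothetical; how ties in both score and degree are broken, and any minimum score a seed must reach, are not specified in the text, so the sketch simply keeps the first candidate it meets.

```python
def choose_seed(scores, degrees, seen):
    """Pick the unseen node with the highest score, breaking ties by degree.

    scores and degrees map node -> value; seen is a set of already used nodes.
    Returns None when every node has been seen.
    """
    candidates = [n for n in scores if n not in seen]
    if not candidates:
        return None
    # Compare by score first and by degree second.
    return max(candidates, key=lambda n: (scores[n], degrees[n]))


node_scores = {"a": 12.0, "b": 12.0, "c": 6.0}
node_degrees = {"a": 3, "b": 4, "c": 5}
print(choose_seed(node_scores, node_degrees, seen=set()))  # b
print(choose_seed(node_scores, node_degrees, seen={"b"}))  # a
```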


2.3.3. Filtering. The clusters found in step two include many clusters of size one or two, and these small clusters are insignificant, since they can be obtained by randomly selecting nodes in a PPI network. Thus we filter out the clusters that contain fewer than three nodes. We then score all clusters using the product of cluster density and cluster size, so that larger and denser clusters get higher scores (see lines 7-11 in Pseudocode 3). As the algorithm allows overlaps between cores and noncores, the results of step two may contain highly similar clusters, which must be filtered in the postprocessing. We use the overlap ratio (OR, see Section 3.1 for more details) to measure the similarity between clusters: we compare each pair of clusters, and when their overlap ratio is higher than 0.95, we delete the one with the lower score (see lines 12-19 in Pseudocode 3). Finally, we rank all remaining clusters in descending order according to cluster score.
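The filtering step can be sketched in Python as follows. This is a greedy variant written for illustration, not a line-by-line transcription of Pseudocode 3: clusters are visited in descending score order and a cluster is dropped when it nearly duplicates one already kept, so when two near-duplicates have equal scores the one sorted first survives. The function names and the adjacency-dictionary input are assumptions.

```python
def overlap_ratio(a, b):
    """OR = 2 * |a & b| / (|a| + |b|) for two non-empty node sets."""
    return 2 * len(a & b) / (len(a) + len(b))


def filter_clusters(adj, clusters, min_size=3, max_or=0.95):
    """Drop small clusters and near-duplicates; return clusters sorted by score."""

    def score(c):
        # Cluster score = density * size, with density as in Eq. (1).
        edges = sum(len(adj[n] & c) for n in c) / 2
        den = edges / (len(c) * (len(c) - 1) / 2) if len(c) > 1 else 0.0
        return den * len(c)

    kept = [set(c) for c in clusters if len(c) >= min_size]
    kept.sort(key=score, reverse=True)
    result = []
    for c in kept:
        # Keep c only if it is not a near-duplicate of a better cluster.
        if all(overlap_ratio(c, r) <= max_or for r in result):
            result.append(c)
    return result


# Assumed example graph: a 4-clique {2, 3, 4, 5}, node 1 linked to 2, 3 and 6.
edges = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
graph = {n: set() for n in range(1, 7)}
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

found = [{1, 6}, {1, 2, 3}, {2, 3, 4, 5}, {2, 3, 4, 5}]
print(filter_clusters(graph, found))  # [{2, 3, 4, 5}, {1, 2, 3}]
```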




Implementation. The algorithm has been implemented in Java, and we plan to convert it into a Cytoscape plugin. The source code of the algorithm is currently available free of charge for noncommercial purposes upon request. All network maps were drawn with Cytoscape [21].


Results and Discussions


Evaluation of the Algorithm.

To evaluate the performance of our algorithm, we compare the predicted clusters with the protein complexes in two different benchmarks: the complexcat and Gavin benchmarks. Each of the predicted clusters is compared with the benchmark complexes. The similarity between a predicted cluster and a benchmark complex is measured by the overlap ratio (OR), which is defined as follows:

$$\mathrm{OR} = \frac{2O}{C_1 + C_2}, \tag{6}$$

where O is the number of proteins shared by a predicted cluster and a benchmark complex, C1 is the number of proteins in the predicted cluster, and C2 is the number of proteins in the benchmark complex. OR ranges between 0 and 1. OR = 0 means the predicted cluster has no proteins in common with the benchmark complex; OR = 1 means it is perfectly matched with the benchmark complex. The higher OR is, the more biologically meaningful the detected cluster is likely to be. A detected cluster is deemed to match a benchmark complex only when their overlap ratio is above a given threshold. We call a cluster an effective cluster as long as it has at least one benchmark complex matching it. In the same way, a matched complex is a benchmark complex that has at least one detected cluster matching it. A rational OR threshold should ensure that the detected cluster shares a large proportion of proteins with the matching benchmark complex, while not being too stringent. In this paper, we adopt 0.4 as the OR threshold.

In [6] the authors use the overlap score, defined as O^2/(C1 * C2), to determine how effectively a predicted cluster matches a known complex in the benchmark set, and a predicted cluster is assumed to match a known complex if its overlap score is above 0.2. Here, we did not adopt this scoring scheme because it is biased; that is, it gives a relatively high score when a small predicted cluster matches a large known complex or a large predicted cluster matches a small complex. For example, if a predicted cluster of size 2 shares 2 proteins with a known complex of size 10, its overlap score equals 0.2 and it would thus be considered as matched with the known complex; actually, it is not appropriate to deem such a small cluster as matching a known complex much larger than it. In contrast, our definition of the overlap ratio is more balanced and only gives a high score when the matching cluster and complex have similar sizes and a large overlap. For the above example, the overlap ratio is only 0.33, less than the threshold 0.4, and the pair would not be deemed a match.
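The difference between the two measures can be reproduced with a short Python sketch. The function names are chosen for this illustration; the inputs are the overlap size O and the two set sizes C1 and C2 as defined for Eq. (6).

```python
def overlap_ratio(o, c1, c2):
    """OR = 2 * O / (C1 + C2), Eq. (6)."""
    return 2 * o / (c1 + c2)


def overlap_score(o, c1, c2):
    """Overlap score O^2 / (C1 * C2), as used in [6]."""
    return o * o / (c1 * c2)


# A cluster of size 2 fully contained in a complex of size 10.
print(round(overlap_score(2, 2, 10), 2))  # 0.2
print(round(overlap_ratio(2, 2, 10), 2))  # 0.33
# Two sets of size 7 sharing 6 proteins.
print(round(overlap_ratio(6, 7, 7), 3))   # 0.857
```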

To compare the performance of different algorithms, we define three criteria: precision, recall, and F-measure, given by the following formulas:

$$\mathrm{precision} = \frac{EC}{AC}, \qquad \mathrm{recall} = \frac{MC}{BC}, \qquad \text{F-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \tag{7}$$

where EC is the number of effective clusters found by the algorithm, AC is the number of all clusters predicted by the algorithm, MC is the number of matched complexes in the benchmark set, and BC is the total number of benchmark complexes. Note that EC may not equal MC, since one predicted cluster may match several benchmark complexes as long as their overlap ratios are higher than the given threshold, and in the same way one benchmark complex may correspond to several predicted clusters. Precision describes the accuracy of the algorithm result; recall denotes the percentage of benchmark complexes that are recovered by the algorithm. F-measure, which is the harmonic mean of precision and recall, balances precision and recall and can thus be used to measure the overall performance of algorithms.
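The evaluation criteria of Eq. (7) can be sketched in Python as below. The function name `evaluate`, the use of Python sets for clusters and complexes, and the strict comparison `> threshold` (the text says "above a given threshold") are assumptions made for this illustration; precision, recall, and F-measure are returned as 0.0 when their denominators are zero.

```python
def evaluate(clusters, complexes, threshold=0.4):
    """Return (precision, recall, f_measure) following Eq. (7)."""

    def overlap_ratio(a, b):
        return 2 * len(a & b) / (len(a) + len(b))

    # EC: predicted clusters matching at least one benchmark complex.
    ec = sum(any(overlap_ratio(c, b) > threshold for b in complexes)
             for c in clusters)
    # MC: benchmark complexes matched by at least one predicted cluster.
    mc = sum(any(overlap_ratio(c, b) > threshold for c in clusters)
             for b in complexes)
    precision = ec / len(clusters) if clusters else 0.0
    recall = mc / len(complexes) if complexes else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


predicted = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
benchmark = [{1, 2, 3, 4}, {10, 11, 12}]
p, r, f = evaluate(predicted, benchmark)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.4
```

Here only the first predicted cluster matches (OR = 6/7), so EC = 1 of AC = 3 and MC = 1 of BC = 2.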


Experiments and Comparison.

We use the MIPS and SGD-MC data sets in turn as the input PPI network and run HKC with 120 parameter combinations. T1 ranges from 0.3 to 0.7 in steps of 0.1, T2 ranges from 5 to 20 in steps of 5, and T3 ranges from 0.4 to 0.9 in steps of 0.1 (5 × 4 × 6 = 120 combinations). We evaluate the results using the two benchmarks, complexcat and Gavin, and then choose the optimized parameters as those that give the highest F-measure.
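The parameter grid described above can be enumerated with a few lines of Python. The helper `pick_best` and its callable argument `f_measure_of` are hypothetical names standing in for a full run of the algorithm plus evaluation, which is not reproduced here.

```python
from itertools import product

# Parameter ranges as stated in the text.
t1_values = [round(0.3 + 0.1 * i, 1) for i in range(5)]  # 0.3 .. 0.7
t2_values = list(range(5, 25, 5))                        # 5, 10, 15, 20
t3_values = [round(0.4 + 0.1 * i, 1) for i in range(6)]  # 0.4 .. 0.9

grid = list(product(t1_values, t2_values, t3_values))
print(len(grid))  # 120


def pick_best(param_grid, f_measure_of):
    """Return the (T1, T2, T3) triple with the highest F-measure.

    f_measure_of is a hypothetical callable that runs the algorithm with the
    given parameters and returns the resulting F-measure.
    """
    return max(param_grid, key=f_measure_of)
```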

The best result and the corresponding optimized parameters for HKC are shown in Table 1.

To show the influence of different parameters on the algorithm performance, we plot the average F-measure versus T1, T2, and T3, respectively, as shown in Figure 3. Note that here the F-measure on the y-axis is the average of all F-measures with one parameter fixed among the 120 groups of experiment results evaluated by the complexcat benchmark. From Figure 3, we can see that among the three parameters T3 has the greatest influence on the average F-measure, and for the SGD-MC data set, the average F-measure reaches its maximum value when T3 = 0.7. This is understandable, as the SGD-MC network (which contains 4,448 proteins and 29,068 interactions) is much denser than the MIPS network (which contains 4,554 proteins and 12,526 interactions), and during the core expansion process the nodes in the SGD-MC network have higher cohesion with the core than the nodes in the MIPS network. Therefore, when expanding the core in the SGD-MC network, the cohesion threshold T3 should be higher than that for the MIPS network. Furthermore, the figures show that a good range for T1 is in the middle, between 0.4 and 0.6, the best value for T2 is 10, and the best range for T3 is between 0.5 and 0.8.

To show the performance of HKC, we compare it with MCODE [6], as shown in Table 1. We run the MCODE plugin in Cytoscape with 840 parameter combinations (the same as that used in [6]) on the MIPS and SGD-MC data sets, respectively, and then use the two benchmarks to evaluate the results. The optimized parameters (see Table 1) are chosen based on the highest F-measure. The result of HKC is the best one among the 120 groups of parameters, and the corresponding optimized parameters are shown in Table 1. From this table, we can see that for the MIPS data set, the recall of HKC is 0.429 on the complexcat benchmark, considerably higher than that of MCODE (0.194); for the SGD-MC data set, HKC recalls as many as 58% of the protein complexes in the complexcat benchmark, notably higher than that of MCODE (around 22%). The experiment results show that, whichever benchmark is adopted, for both the MIPS and SGD-MC data sets, the recall and F-measure of HKC are remarkably higher than those of MCODE, and the overall performance is substantially improved.

As shown in Figure 4, whatever the OR threshold is, HKC extracts many more effective clusters than MCODE in the MIPS data set, and the number of complexes matched by HKC is also much higher than that matched by MCODE.

To show the overall performance improvement of our algorithm, we also plot precision versus recall for all results with different parameters in Figure 5. As can be seen from the figure, for all four cases, the data points produced by HKC are located in the upper right portion of the plot, corresponding to high values of F-measure, while most of the data points produced by MCODE are located in the lower left part of the plot. The figure illustrates that both precision and recall of HKC results for most parameter combinations are higher than those of MCODE, showing the overall improvement of the algorithm performance. From this figure, we can also see that the data points produced by HKC are much more concentrated than those of MCODE, indicating that our algorithm does not rely so heavily on parameter selection.

Table 1: Comparison with MCODE. P, R, and F stand for precision, recall, and F-measure, respectively, and their definitions are given in Section 3.1. The MIPS data set contains 4,554 proteins and 12,526 interactions, and the SGD-MC data set contains 4,448 proteins and 29,068 interactions. AC is the number of all clusters predicted by the algorithm; EC is the number of effective clusters (with at least one matching complex above overlap ratio 0.4) found by the algorithm; MC is the number of matched complexes in the benchmark set. The sizes of the complexcat benchmark and the Gavin benchmark are 217 and 204, respectively. For HKC the optimized parameters are T1, T2, and T3, respectively, and for MCODE the optimized parameters are NodeScoreCutoff, fluff (T for true, F for false), and haircut (T for true, F for false); other unspecified parameters adopt the default values.




3.3. Discussions. Among the 237 clusters found by HKC in the MIPS data set, 8 clusters perfectly match known protein complexes in the complexcat benchmark. Figure 6(a) gives an example of one perfectly matched cluster: cluster 23 (consisting of 11 proteins and 54 interactions) perfectly matches the TRAPP (transport protein particle) complex (catalog 260.60 in the complexcat benchmark), which plays an essential role in the vesicular transport from the endoplasmic reticulum to the Golgi. Figure 6(b) shows an example of a containment match. Cluster 12 (consisting of 14 proteins and 92 interactions) is totally contained in a known complex of size 16, the SAGA complex (catalog 510.190.10.20.10 in the complexcat benchmark), and their overlap ratio is 0.93. The two proteins YCL010c and YGL066w that are not recovered by cluster 12 each have only one interaction with the cluster and do not exhibit good graph theoretic properties. Actually, based on the currently available information, we cannot assert that YCL010c is contained in the SAGA complex, and according to [22] it is only a probable subunit of the SAGA complex.

Figure 6(c) gives an example of a well-matched cluster: cluster 103 matches the complex of cytoplasmic translation initiation factor 3 (eIF3, catalog 500.10.40 in the complexcat benchmark). Each of them contains 7 proteins, their overlap is 6, and their overlap ratio is 0.857. As shown in the figure, protein YNL062c is contained in the benchmark complex but is not included in cluster 103 predicted by HKC. Furthermore, it has only one interaction with cluster 103 and does not show the topological properties expected of a cluster member. We searched for it in the Gene Ontology (GO) database [23] and found that, according to the most recent GO annotation (release date 2011-05-14), YNL062c is not contained in the eIF3 complex (GO:0005852) but is a subunit of tRNA (1-methyladenosine) methyltransferase, with Gcd14p, required for the modification of the adenine at position 58 in tRNAs, especially tRNAi-Met. This indicates that the complexcat benchmark we use here may contain errors, because it was last modified in 2006 and many new protein complexes have been identified through experiments since then. In a way, it is possible to correct errors in the benchmark by carefully examining the differences between predicted clusters and their corresponding benchmark complexes with high overlap ratios.

Figure 6(d) shows a novel cluster (ranked 61) detected by HKC, involving 5 proteins and 10 interactions. Cluster 61 does not match any known protein complex in the complexcat benchmark but is highly homogeneous in the cellular component ontology and the biological process ontology. Search results from the Gene Ontology database show that proteins YGL153w, YLR191w, and YNL214w form the docking complex that facilitates the import of peroxisomal matrix proteins, and YGL153w is a central component of the peroxisomal protein import machinery. The other two proteins in this cluster, YDR244w and YDR142c, are the PTS1 signal recognition factor and the PTS2 signal recognition factor, respectively, and they participate in the same biological process as the docking complex, namely, protein docking during peroxisome matrix protein import (GO:0016560).

The above illustrative examples show that HKC can not only effectively detect protein complexes in genome-scale PPI networks but also, through the comparison of predicted clusters with their matching benchmark complexes, help to correct errors in the benchmark. Furthermore, HKC can discover novel protein complexes that can be used as candidates for experimental verification, which helps to reduce the time and cost of experiments. Among the 237 clusters produced by our algorithm on the MIPS data set, 147 clusters do not match known complexes in the complexcat benchmark, and we give all 49 clusters with score ≥4 and size ≥5 as novel predictions in Table 2, which could be a starting point for experimental validation in the future.


Conclusions and Future Work

A genome-scale PPI network is usually very large, consisting of thousands of proteins and tens of thousands of interactions; for example, the SGD-MC data set we use as the input PPI network in this paper contains 4,448 proteins and 29,068 interactions. It is a challenging task to extract protein complexes from such a large and complicated network.

To solve this problem, many computational methods have been proposed, including the graph theoretic clustering algorithm. In this paper, we presented a new topology-based algorithm, HKC, which mainly uses the concepts of highest k-core and cohesion to predict protein complexes by identifying overlapping clusters. The experiments on two data sets and two benchmarks showed that HKC can effectively extract protein complexes from genome-scale PPI networks and exhibits better performance compared with some other methods. Besides, HKC is general and can be used in any type of biological interaction network.

Figure 6: The nodes encircled by the red-dotted line are known complexes in the complexcat benchmark, the nodes contained within the blue circle are clusters predicted by HKC, and the yellow node in each cluster denotes the seed of the cluster. (a) A cluster of size 11 perfectly matches the TRAPP complex. (b) A cluster of size 14 shares 14 proteins with the SAGA complex (size 16), and their overlap ratio is 0.93. The two proteins YCL010c and YGL066w that are not contained in the predicted cluster are isolated nodes with only one edge connecting to the cluster. (c) An example of a well-matched cluster, involving 7 proteins, among which 6 are in common with the complex of cytoplasmic translation initiation factor 3 (eIF3). (d) A novel cluster detected by HKC, which does not match any known protein complex in the complexcat benchmark; the proteins in the black circle form the docking complex that facilitates the import of peroxisomal matrix proteins according to GO annotation.

There is a huge amount of work to be done in PPI network analysis. As for protein complex prediction, there is also a lot of research to be done in the future. One of the problems with current PPI networks is that they consist of interactions that do not necessarily happen at the same time and place. Instead, the interactions in PPI networks may be unstable, transient, or conditional and may also happen in different subcellular locations. However, by definition, a protein complex is a group of proteins that interact with each other at the same time and place. As a result, to increase the precision and recall of complex prediction algorithms, further information about the time, place, or conditions of interactions should be taken into consideration.

Figure 1: The illustration of k-cores of a graph. (a) Graph G; (b) 2-core of G; (c) 3-core of G, which is also the highest k-core of G.




Input: PPI network (an undirected simple graph): G = (V, E)
       Node score threshold: T1 and T2
       Cohesion threshold: T3
Output: The predicted protein clusters: Clusters
Call Scoring
Call ClusterFinding
Call Filtering

// step3: Filtering
Procedure Filtering
    for all cluster c in Clusters do
        if the size of c is less than 3 then
            remove c in Clusters
        end if
    end for
    for all cluster c in Clusters do
        den(c) = the density of c
        s = size of c
        score(c) = den(c) * s    // compute the score of cluster
    end for
    for all cluster c1, c2 in Clusters do
        o = the overlap ratio of c1 and c2
        if o > 0.95 then
            if score(c1) > score(c2) then
                delete c2 in Clusters
            else
                delete c1 in Clusters
            end if
        end if
    end for
    Clusters = sort Clusters descendingly according to cluster scores
end procedure

Pseudocode 3: Procedure Filtering.


Figure 3: The influence of different parameters on algorithm performance. Note that the F-measure on the y-axis is the average of all F-measures with one parameter fixed among the 120 groups of experiment results evaluated by the complexcat benchmark; triangles mark the results on the MIPS network, and circles mark the results on the SGD-MC network.


Figure 4: The number of effective clusters and the number of matched complexes by HKC and MCODE with respect to different OR thresholds. The results correspond to the MIPS data set and are evaluated on the complexcat benchmark. Triangles mark the results of HKC and circles mark the results of MCODE.


Figure 5: Precision versus recall plots of all MCODE and HKC results with different parameters on various data sets. (a) Input data set: MIPS, benchmark: complexcat. (b) Input data set: MIPS, benchmark: Gavin. (c) Input data set: SGD-MC, benchmark: complexcat. (d) Input data set: SGD-MC, benchmark: Gavin. For all four cases, the data points produced by HKC are located in the upper right portion of the plot, corresponding to high values of F-measure, while most of the data points produced by MCODE are located in the lower left part of the plot.