A Novel Algorithm for Detecting Protein Complexes with the Breadth First Search

Most biological processes are carried out by protein complexes. A substantial number of false positives in protein-protein interaction (PPI) data can compromise the utility of the datasets for complex reconstruction. To reduce the impact of such discrepancies, a number of data integration and affinity scoring schemes have been devised. These methods encode the reliabilities (confidence) of physical interactions between pairs of proteins. The challenge now is to identify novel and meaningful protein complexes from the weighted PPI network. To address this problem, a novel protein complex mining algorithm, ClusterBFS (Cluster with Breadth-First Search), is proposed. Based on the weighted density, ClusterBFS detects protein complexes in the weighted network by a breadth first search that starts from a given seed protein. The experimental results show that ClusterBFS performs significantly better than the other computational approaches in terms of the identification of protein complexes.


Introduction
Protein complexes are molecular aggregations of proteins assembled by multiple protein-protein interactions. Many proteins are functional only after they are assembled into a protein complex and interact with other proteins in this complex [1][2][3][4]. The vast number of genes and proteins that participate in biological networks makes it necessary to determine the protein complexes within a network in order to reduce its complexity; identifying these complexes is a first step in deciphering the composite genetic or cellular interactions of the overall network.
High-throughput experimental technologies, along with computational predictions, have produced a large number of protein interactions [5][6][7][8][9][10][11], which makes it possible to uncover protein complexes from protein-protein interaction (PPI) networks. Pairwise protein interactions can be modeled as a graph or network, where vertices represent proteins and edges represent protein-protein interactions. Protein complexes are groups of proteins that interact with one another, so they generally correspond to dense subgraphs in PPI networks. Different research groups have developed a wealth of algorithms to identify protein complexes from PPI networks [12][13][14][15][16][17][18]. In these approaches, the protein networks are treated as unweighted graphs. These methods work well on PPI networks and successfully extract protein complexes. Nevertheless, protein interaction data produced by high-throughput experiments are often associated with high false positive and false negative rates due to the limitations of the experimental techniques, which may have a negative impact on complex discovery algorithms [19][20][21][22][23].
To address this problem, a number of data integration and affinity scoring schemes have been devised [10][23][24][25][26][27][28][29][30]. Gavin et al. [11] defined the weights of the interactions using the so-called socio-affinity index, which is based on the log-odds of the number of times two proteins were observed together in a purification, relative to the expected frequency of such a co-occurrence given the number of times the proteins appeared in purifications. Krogan et al. [10] used MALDI-TOF mass spectrometry and LC-MS/MS to identify protein-protein interactions, based on the observation that either mass spectrometry method often fails to identify a protein, so that using two independent methods can increase the coverage and confidence of the obtained interactome. The results of the two methods were combined by supervised machine learning with two rounds of learning, using hand-curated protein complexes in the MIPS reference database as a gold standard dataset. Collins et al. [23] combined the experimentally derived PPI networks of Krogan et al. [10] and Gavin et al. [11] by reanalyzing the raw primary affinity purification data of these experiments using a novel scoring technique called purification enrichment (PE). The PE scores were motivated by the probabilistic socio-affinity scoring framework of Gavin et al. [11] but also take into account negative evidence (i.e., pairs of proteins where one fails to appear as a prey when the other is used as a bait). These affinity scores encode the reliabilities (confidence) of physical interactions between pairs of proteins. Therefore, the challenge now is to mine meaningful and novel complexes from protein interaction networks derived by combining multiple high-throughput datasets and by making use of these affinity scoring schemes. Some algorithms have already been proposed in this direction [25][31][32][33][34].
In this study, we propose a novel algorithm, ClusterBFS (Cluster with Breadth First Search), to derive yeast complexes from a weighted (affinity-scored) PPI network. ClusterBFS builds clusters by breadth first search, starting from local seeds and adding nodes that maintain the weighted density of the clusters. The experimental results show that our ClusterBFS method outperforms existing computational methods, such as MCL [31], ClusterONE [32], HC-PIN [33], SPICi [34], and MCODE [14].

Preliminaries.
Given a weighted network, the goal of our algorithm is to output a set of disjoint dense subgraphs. We model the network as an undirected graph $G = (V, E)$ with a confidence score $0 < w_{u,v} \le 1$ for every edge $(u, v) \in E$. For any two vertices $u$ and $v$ without an edge between them, we set $w_{u,v} = 0$. For each set of vertices $S \subset V$, we define its weighted density as the sum of the weights of the edges among them divided by the total number of possible edges (i.e., the density of a set is a measure of how close the induced subgraph is to a clique, and it varies from 0 to 1):

$$\mathrm{density}(S) = \frac{\sum_{\{u,v\} \subseteq S} w_{u,v}}{|S|\,(|S| - 1)/2}.$$
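As a small illustration of the weighted density defined above, the following Python sketch computes it for a vertex set. The function name `weighted_density`, the choice to store weights in a dictionary keyed by `frozenset({u, v})`, and the convention that a set with fewer than two vertices has density 0 are assumptions made for this example only; they are not taken from the original implementation.

```python
from itertools import combinations


def weighted_density(vertices, weights):
    """Sum of edge weights inside `vertices` divided by the number of possible edges.

    `weights` maps frozenset({u, v}) to a confidence score in (0, 1];
    missing pairs are treated as weight 0 (no edge).
    """
    vertices = list(vertices)
    n = len(vertices)
    if n < 2:
        return 0.0  # assumption: density of a single vertex is taken as 0
    total = sum(weights.get(frozenset(pair), 0.0)
                for pair in combinations(vertices, 2))
    return total / (n * (n - 1) / 2)


# Hypothetical toy weights; only the 0-1 edge (0.75) mirrors the Figure 1 example.
toy_weights = {frozenset((0, 1)): 0.75, frozenset((0, 2)): 0.6}
print(weighted_density({0, 1}, toy_weights))
print(weighted_density({0, 1, 2}, toy_weights))
```

With a single edge of weight 0.75, the two-vertex set has density 0.75, which agrees with the first step of the example in Figure 1.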

Algorithm Overview.
We use a breadth first search approach to build clusters. ClusterBFS builds one cluster at a time, and each cluster is expanded from an original seed protein. An unclustered node is added if it has the highest edge weight and the density of the cluster remains higher than a user-defined threshold Td; otherwise, the cluster is output. The growth process is repeated from different seeds to form multiple, possibly overlapping, groups. Although some overlaps are likely to have biological importance, groups that overlap to a very high extent in comparison to their sizes should likely be discarded. We quantify the extent of overlap between each pair of groups and discard the smaller group if the overlap score [14] is above a specified threshold. ClusterBFS thus has two parameters: Td, the weighted density threshold, and the overlap score threshold. For the overlap score threshold, we set a fixed value of 0.8 [32]. (See Figure 1 for a simplified example.)

Figure 1: Example to illustrate the clustering process. This example network has 12 vertices, and every edge has a confidence score. Suppose the weighted density threshold Td = 0.2. Vertex 0 is taken as the seed protein and the initial cluster {0} is constructed. In the first step of the breadth first search, vertex 1 has the highest edge weight, 0.75, among the neighbors of vertex 0. We add vertex 1 to the cluster, and the cluster {0, 1} now has weighted density 0.75, which is greater than the density threshold 0.2. Similarly, vertices 2, 3, 4, and 5 are added to the cluster in sequence, and the cluster {0, 1, 2, 3, 4, 5} now has weighted density 0.23, which is still more than the threshold 0.2. Next, the neighbors of vertex 4 are considered. Of these, vertex 6 has the highest edge weight, 0.52, and is added to the cluster. However, the weighted density of the cluster {0, 1, 2, 3, 4, 5, 6} is 0.19, which is less than the threshold 0.2. Thus, vertex 6 is removed and the neighbor of vertex 3 is examined. Because the weight between vertex 3 and its neighboring vertex 9 is 0.51, which is less than 0.52, vertex 9 is not added to the cluster. When the neighbors of vertex 2 are checked, vertex 10 is added to the cluster. Since the weighted density of the cluster {0, 1, 2, 3, 4, 5, 10} is less than 0.2, vertex 10 is removed. Likewise, vertex 11 is not added to the cluster. We stop extending the cluster and output the final cluster {0, 1, 2, 3, 4, 5}. For simplicity, the elimination of redundant clusters is not shown in this figure.

Seed Selection.
Every vertex in the yeast PPI network is used as a seed, and all seeds are treated as equally important.

Cluster Expansion.
After obtaining the seed vertex, we use the breadth first search method to grow each cluster in terms of the weighted density. At each step, we have a current vertex set $S$ for the cluster, which initially contains one seed protein $v$. Proceeding breadth first from the seed $v$, we search for the vertex $u$ with the maximum edge weight amongst all the unclustered adjacent vertices. If the weighted density of the cluster is smaller than a threshold, we stop expanding this cluster and output it. If not, we put vertex $u$ into $S$ and update the density value. If the density value is smaller than our density threshold Td, we do not include $u$ in the cluster and output $S$. We repeat this procedure until all vertices in the graph are clustered. Algorithm 1 illustrates the overall framework to detect protein complexes. Algorithm 2 is the breadth first search procedure.

Algorithm 1: ClusterBFS algorithm.
Input: weighted PPI network $G = (V, E)$; weighted density threshold Td; overlap score threshold.
Output: set of protein complexes SC discovered from $G$.
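The expansion step described above can be sketched in Python as follows. This is one possible reading of the prose and of the Figure 1 walk-through, not the authors' Algorithm 2: the names `expand_cluster`, `adjacency`, and `weights` are hypothetical, a rejected candidate is simply skipped so that the search continues with the next queued vertex, and a candidate is kept when the density is greater than or equal to Td (the inclusive boundary is an assumption).

```python
from collections import deque
from itertools import combinations


def weighted_density(vertices, weights):
    """Sum of internal edge weights divided by the number of possible edges."""
    n = len(vertices)
    if n < 2:
        return 0.0  # assumption: a lone seed has density 0
    total = sum(weights.get(frozenset(p), 0.0) for p in combinations(vertices, 2))
    return total / (n * (n - 1) / 2)


def expand_cluster(seed, adjacency, weights, td):
    """Grow one cluster breadth first from `seed` (illustrative sketch only).

    adjacency: dict mapping each vertex to an iterable of its neighbors.
    weights:   dict mapping frozenset({u, v}) to the edge confidence.
    td:        weighted density threshold.
    """
    cluster = [seed]
    members = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        # Neighbors outside the cluster, strongest edge to `current` first.
        candidates = sorted(
            (v for v in adjacency.get(current, ()) if v not in members),
            key=lambda v: weights[frozenset((current, v))],
            reverse=True,
        )
        for v in candidates:
            # Tentatively add v; keep it only if the density stays at or above td.
            if weighted_density(cluster + [v], weights) >= td:
                cluster.append(v)
                members.add(v)
                queue.append(v)
    return members
```

Running `expand_cluster` once per vertex yields the candidate clusters that the redundancy-filtering step then prunes.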
Since all vertices in the graph have been selected as seeds, the clusters produced have large overlaps, which results in high redundancy. Hence, a Redundancy-filtering procedure is designed to process the candidate clusters and generate the final protein complexes by eliminating this redundancy. Algorithm 3 shows the details of the redundancy-filtering process. Suppose that SC is the set of all currently detected complexes and that a new complex has just been identified. We first select the element of SC that has the highest similarity (OS, overlap score) [14] with the new complex. In Algorithm 3, the procedure Redundancy-filtering is used to decide whether to discard or preserve the new complex. If the two complexes are not very similar (OS below the overlap score threshold), the new complex is inserted into SC (lines 2-3); otherwise, we preserve whichever of the two complexes is larger (lines 4-8). For instance, suppose that a complex shown in Figure 2 belongs to the complex set SC and is the most similar to the new complex. After computing the OS of the two complexes, we obtain a score of 0.11, which is less than the threshold of 0.8, so the new complex is inserted into the complex set SC.
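The filtering rule just described can be sketched in Python as follows. The function names `overlap_score` and `redundancy_filter` are hypothetical, the overlap score uses the squared-intersection form of [14], and representing complexes as Python sets in a list is an assumption for illustration; this is not a transcription of Algorithm 3.

```python
def overlap_score(a, b):
    """Overlap score of two protein sets: |a & b|^2 / (|a| * |b|)."""
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def redundancy_filter(accepted, candidate, threshold=0.8):
    """Insert `candidate` into `accepted` unless it is redundant (sketch).

    accepted:  list of sets of protein identifiers (modified in place).
    candidate: newly identified complex, as a non-empty set.
    threshold: overlap score threshold (0.8 in the text).
    """
    if not accepted:
        accepted.append(candidate)
        return accepted
    # Most similar complex found so far.
    best = max(accepted, key=lambda c: overlap_score(c, candidate))
    if overlap_score(best, candidate) < threshold:
        accepted.append(candidate)  # not redundant: keep both
    elif len(candidate) > len(best):
        accepted[accepted.index(best)] = candidate  # redundant: keep the larger
    return accepted


complexes = []
for cluster in [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {7, 8}, {1, 2, 3, 4, 5}]:
    redundancy_filter(complexes, cluster)
print(complexes)
```

In this toy run the six-protein cluster replaces its five-protein near duplicate, the unrelated pair is kept, and the final repeat of the five-protein cluster is discarded.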

Results
We compare the performance of our ClusterBFS method with that of five competing algorithms, Markov cluster (MCL) [31], clustering with overlapping neighborhood expansion (ClusterONE) [32], hierarchical clustering on protein interaction network (HC-PIN) [33], speed and performance in clustering (SPICi) [34], and molecular complex detection (MCODE) [14], using the weighted Collins [23] and Krogan [10] datasets. For each algorithm, the final results are obtained after optimizing the algorithm parameters to yield the best possible results. We compare predicted complexes to the reference complex set CYC2008 [35]. We assess the quality of the predicted complexes by two scores: the fraction of protein complexes matched by at least one predicted complex and the maximum matching ratio (MMR) [32]. Our benchmarks show that ClusterBFS outperforms the other approaches on weighted networks, matching more complexes with a higher F-measure and providing a better one-to-one mapping with reference complexes on the three datasets. To examine the biological relevance of the detected complexes, we calculate the colocalization and coannotation scores of the entire identified complex set [24]. Comparison of the colocalization and coannotation scores of ClusterBFS complexes with those of the other algorithms reveals that ClusterBFS has higher scores on the three datasets.

Data Sources.
Yeast has long been known as a highly effective model organism for mammalian biological functions and diseases. We evaluate the effectiveness of ClusterBFS using three different weighted yeast PPI networks. The first dataset is prepared by Collins et al. [23]. For the weighted interaction map of Collins et al., we use the top 9074 interactions as suggested by the authors. These interactions among 1622 proteins have very high confidence scores. The second dataset is the Krogan core dataset [10]. It consists of 7123 reliable interactions involving 2708 proteins. We also use Krogan's extended dataset [10], containing 3672 nodes and 14317 edges, to test ClusterBFS. For evaluating our identified complexes, the set of real complexes from [35] is selected as the benchmark.

Evaluation Measures.
One evaluation method we use is to match the generated complexes with the known complex set [35] and calculate sensitivity, positive predictive value (PPV), F-measure, and MMR. In information retrieval, positive predictive value is called precision, and sensitivity is called recall. We derive 408 typical complexes containing two or more proteins from CYC2008 [35] as the benchmark complex set and use the same scoring scheme as [14] to determine how effectively a predicted complex matches a reference complex. If two complexes overlap, they must share one or more proteins. The overlap score (OS) of a predicted complex versus a benchmark complex is then a measure of the biological significance of the prediction, assuming that the benchmark set of complexes is biologically relevant. The overlap score between a predicted and a real complex is calculated using

$$\mathrm{OS} = \frac{i^{2}}{p \times h},$$

where $i$ refers to the number of proteins shared by a predicted complex and a benchmark complex, $p$ is the number of proteins in the predicted complex, and $h$ is the number of proteins in the benchmark complex. If OS is 1, the predicted complex has exactly the same proteins as the benchmark complex. Conversely, when OS is 0, the predicted complex and the benchmark complex share no protein [14]. The number of true positives (TP) is defined as the number of predicted complexes with OS over a threshold value, and the number of false positives (FP) is the total number of predicted complexes minus TP. The number of false negatives (FN) equals the number of known complexes not matched by predicted complexes. Sensitivity and PPV are defined as TP/(TP + FN) and TP/(TP + FP), respectively [14]. The F-measure, the harmonic mean of sensitivity and PPV, can then be used to evaluate the overall performance of the clustering algorithms:

$$F\text{-measure} = \frac{2 \times \mathrm{Sensitivity} \times \mathrm{PPV}}{\mathrm{Sensitivity} + \mathrm{PPV}}.$$

The MMR score, proposed by Nepusz et al. [32], is based on a maximal one-to-one mapping between detected and reference complexes. Figure 3 illustrates the maximum matching ratio.
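The matching-based scores defined above can be sketched in Python as follows. The function names `overlap_score` and `evaluate` are hypothetical, the toy complexes are invented for illustration, and treating an OS equal to the threshold as a match is an assumption (the text only says "over a threshold"). MMR is omitted because it requires a maximum weighted bipartite matching.

```python
def overlap_score(predicted, benchmark):
    """OS = i^2 / (p * h), with i shared proteins, p and h the two complex sizes."""
    shared = len(predicted & benchmark)
    return shared * shared / (len(predicted) * len(benchmark))


def evaluate(predicted, benchmark, os_threshold=0.2):
    """Return (sensitivity, ppv, f_measure) for two lists of protein sets."""
    def matched(a, others):
        # Assumption: OS equal to the threshold counts as a match.
        return any(overlap_score(a, b) >= os_threshold for b in others)

    tp = sum(1 for p in predicted if matched(p, benchmark))
    fp = len(predicted) - tp
    fn = sum(1 for b in benchmark if not matched(b, predicted))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + ppv
    f_measure = 2 * sensitivity * ppv / total if total else 0.0
    return sensitivity, ppv, f_measure


# Invented toy data: two of three predictions match, one benchmark complex is missed.
predicted = [{1, 2, 3}, {4, 5}, {9, 10}]
benchmark = [{1, 2, 3, 4}, {4, 5}, {6, 7}]
print(evaluate(predicted, benchmark))
```

Here TP = 2, FP = 1, and FN = 1, so sensitivity, PPV, and F-measure all equal 2/3.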
Owing to the fact that gold standard protein complex sets are incomplete [36], a predicted complex that does not match any of the reference complexes may belong to a valid but previously uncharacterized complex as well. To this end, the matching measures should be complemented with scores that assess the biological relevance of predicted complexes based on the colocalization and coannotation of the constituent proteins instead of relying on a predefined gold standard. Since protein complexes are formed to perform a specific cellular function, proteins within the same complex tend to share common functions and be colocalized [37]. Generally, higher coannotation and colocalization scores [24] show that proteins within the same protein complexes tend to share higher functional similarity. We employ the software suite ProCope (http://www.bio.ifi.lmu.de/Complexes/ProCope/) to compute the colocalization and coannotation scores in our experiment.

Comparison with the Real Complexes on the Collins Dataset.
Table 1 shows the number of detected complexes that match at least one real complex over a range of OS thresholds from 0 to 1.0 (in 0.1 increments). From Table 1, it can be seen that the ClusterBFS algorithm detects the most complexes that match at least one known complex at every OS threshold. The second line in Table 1 shows the number of all complexes discovered by each approach. For instance, ClusterBFS predicts altogether 1229 complexes from the Collins dataset, whereas MCL, ClusterONE, HC-PIN, SPICi, and MCODE find 300, 203, 281, 156, and 111 complexes, respectively. The third line shows that, when OS is more than 0.1, ClusterBFS yields 829 complexes that match at least one real complex. Table 2 gives the number of real complexes that match at least one predicted complex. Table 2 shows that the number of real complexes matched by predicted ones from ClusterBFS is also the largest. The experimental results demonstrate that, although ClusterBFS produces the largest number of complexes, it also produces far more matched complexes than the other techniques. That is, ClusterBFS identifies a large number of high-quality complexes from the weighted Collins network.
In addition, as shown in Tables 1 and 2, when OS is 1, ClusterBFS identifies 102 real complexes. In other words, 102 complexes predicted by ClusterBFS are identical to complexes in the known complex set [35], which is considerably more than for any of the other approaches (MCL, ClusterONE, HC-PIN, SPICi, and MCODE). More importantly, the reference set includes 408 real complexes, of which 259 are small complexes containing only 2 or 3 proteins. Our statistics (not presented in the tables of the paper) show that 78 of the 102 real complexes predicted by ClusterBFS are such small complexes. By comparison, MCL, HC-PIN, SPICi, and MCODE find only 74, 70, 33, and 30 real complexes, respectively, of which just 54, 49, 15, and 9 are small complexes. Since ClusterONE discards complex candidates that contain fewer than three proteins, we do not compare it with ClusterBFS here. The experimental results show that ClusterBFS has a significant performance advantage over the other algorithms in the identification of small complexes.
Next, we calculate the F-measure and MMR scores of the complex sets detected by the various techniques. When the F-measure is computed, the OS threshold between a predicted complex and a real complex in the benchmark is set to 0.2 [14]. Figure 4 displays the overall comparison according to F-measure and MMR. On the Collins dataset, the F-measure of ClusterBFS is 0.68, which is 23.6%, 51.1%, 30.8%, 58.1%, and 83.8% higher than that of MCL, ClusterONE, HC-PIN, SPICi, and MCODE, respectively. ClusterBFS achieves the highest F-measure, which shows that our method can predict protein complexes very accurately. From Figure 4, it can also be seen that ClusterBFS obtains the highest MMR, 0.64, which is 21.8%, 33.3%, 21.8%, 25.5%, and 36.2% higher than that of MCL, ClusterONE, HC-PIN, SPICi, and MCODE, respectively. That is, ClusterBFS provides a better one-to-one mapping with real complexes in the Collins dataset.

Biological Coherence of Predicted Complexes on Collins Dataset.
Figure 5 shows the colocalization and coannotation scores of complexes detected by the various methods. From Figure 5, it can be observed that ClusterBFS has the second highest colocalization score, after SPICi. However, SPICi cannot handle overlaps. Proteins may have multiple functions, and therefore the corresponding nodes may belong to more than one cluster; for example, 207 of 1,628 proteins in the CYC2008 hand-curated yeast complex dataset [35] participate in more than one complex. It is therefore important to detect overlapping complexes. In addition, it can be seen that the coannotation score of ClusterBFS is lower than that of MCODE. In terms of coannotation and colocalization, the complexes predicted by our ClusterBFS method are of comparable quality to those predicted by SPICi and MCODE and of much better quality than those predicted by MCL, ClusterONE, and HC-PIN.

Results Using Krogan Dataset.
To support the credibility of our method, we run ClusterBFS on Krogan's core dataset [10]. The F-measure and MMR of each method on this dataset are shown in Figure 6.

Figure 8 shows how the variation of parameter Td affects the F-measure of ClusterBFS on the Collins dataset. As shown in Figure 8, when 0.01 ≤ Td ≤ 0.22, the F-measure score of ClusterBFS is more than 0.6 and much higher than those of the other algorithms (see Figure 4). Figure 8 also shows that ClusterBFS reaches its highest F-measure score when Td ∈ [0.09, 0.11]. In this interval, the F-measure score remains unchanged. Therefore, the parameter Td is set to 0.1 in our experiment.

Conclusion
Protein complexes are important for understanding principles of cellular organization and function. Therefore, much work has been concerned with the prediction of protein complexes from PPI networks. However, the PPI datasets from high-throughput techniques contain many false interactions. In response, some research groups have proposed data integration and affinity scoring schemes and constructed various weighted networks.
In this research, we devise a novel algorithm called ClusterBFS to identify protein complexes from weighted PPI networks. ClusterBFS is derived from the breadth first search method and builds protein complexes that originate from a seed protein based on the weighted density. In order to characterize these clusters as protein complexes, we check their biological relevance using criteria such as F-measure, MMR, colocalization, and coannotation. The evaluation of our predictions demonstrates the following advantages of ClusterBFS over the compared approaches. First, ClusterBFS achieves significantly higher F-measure and MMR than the existing methods. Thus, our predicted complexes match the benchmark complexes very well. Second, ClusterBFS also performs very well in terms of other measures such as coannotation and colocalization, indicating that ClusterBFS can predict protein complexes very accurately. Last but not least, as mentioned above, the real complex set CYC2008 contains many small complexes, so it is necessary to mine them. In comparison with the other approaches, ClusterBFS discovers many more small complexes. Our identified complexes are therefore likely to include true complexes that can help biologists gain novel biological insights.