A Least Square Method Based Model for Identifying Protein Complexes in Protein-Protein Interaction Network

Protein complex formed by a group of physical interacting proteins plays a crucial role in cell activities. Great effort has been made to computationally identify protein complexes from protein-protein interaction (PPI) network. However, the accuracy of the prediction is still far from being satisfactory, because the topological structures of protein complexes in the PPI network are too complicated. This paper proposes a novel optimization framework to detect complexes from PPI network, named PLSMC. The method is on the basis of the fact that if two proteins are in a common complex, they are likely to be interacting. PLSMC employs this relation to determine complexes by a penalized least squares method. PLSMC is applied to several public yeast PPI networks, and compared with several state-of-the-art methods. The results indicate that PLSMC outperforms other methods. In particular, complexes predicted by PLSMC can match known complexes with a higher accuracy than other methods. Furthermore, the predicted complexes have high functional homogeneity.


Introduction
Proteins do not function in isolation but interact together to form complexes. Protein complex plays an important role in cellular activities, such as signal transduction, cell cycle, DNA transcription, and DNA repair [1][2][3]. Identifying protein complexes is crucial for understanding molecular mechanism in cellular activities. It is important to develop computational methods for identifying complexes [1]. Recent developments in high-throughput technologies have produced large amount of high-quality protein-protein interaction (PPI) data that can be represented as a PPI network, an undirected graph, in which nodes denote that proteins and edges are interactions between pairs of proteins. Graph clustering techniques are used to identify protein complexes by finding dense regions in a PPI network [4]. Since proteins may belong to several complexes, most of previous methods detect overlapping clusters [1,[4][5][6].
Many methods [7][8][9] detect complexes from PPI network by finding cliques, in which all nodes connect to each other. CFinder is one of the most popular clique-based methods, which searches adjacent cliques in the network [8,10,11].
OCG [12] takes the cliques as initial classes for hierarchy fusion to detect overlapping clusters in PPI networks. Another kind of methods detects complexes by expanding a set of seed proteins or clusters. MCODE [13] chooses the proteins with high weights as seeds and expands these seeds by including their neighboring proteins with weights higher than a threshold. ClusterONE [14], the latest and powerful seed-expansion method, starts from a set of seed complexes and expands them by maximizing the cohesiveness function. The expanding method depends on the density-based definition of the complexes. Random walking techniques have been also used to detect complexes. Markov clustering (MCL) algorithm [15] iteratively applies "expansion" and "inflation" steps to the transition matrix that denote the Markov chain of random walk. Reference [16] proposes a new spectral method based on the two-hop transition matrix of Markov random walk (SLCP2). In general, although much progress has been made, identifying protein complexes from PPI network still remains a challenge. The complexes derived by existing methods match few known complexes. The reason is that the topological structures of complexes are too complicated. It is difficult to define the topology by a specific type of pattern. It is necessary to develop a new method to avoid the problem of topological dependence.
In this paper, we present an optimization framework that uses a penalized least squares method to identify complexes from PPI network, named PLSMC. Intuitively, our method is on the basis of the fact that if two proteins are in a common complex, they are likely to be interacting [1,4,5]. PLSMC employs this relation to detect complexes using a penalized least squares method. By optimization, the propensities of proteins to complexes can be determined. The PLSMC is tested and compared with other methods on several public PPI networks of yeast. The results show that PLSMC has higher accuracy on matching with known complexes than other state-of-the-art methods. Moreover, the analysis of functional homogeneity indicates that complexes identified by PLSMC are biological relevance.

Penalized Least Squares Method for Complex Detection.
In order to introduce our method, we first introduce several notations. A PPI network is denoted by a matrix of × , where is the number of proteins and is equal to 1 if proteins and are interacting, 0 otherwise. Since an interaction may be a false positive one when the corresponding proteins share less common interacting partners, we compute the weight matrix for a PPI network as in [17], where ( ) is a set consisting of protein and all of its neighbors. Let ( > 0) be the propensity denoting how likely protein belongs to complex , which is an unknown variable needing to be estimated. The cocomplex coefficient of proteins and denotes the likelihood that they participate in the same complexes. Given that there are at most complexes existing in the PPI network, is calculated as Hence, the sum of distances between interaction weights and cocomplex coefficients over all pairs of proteins can be written as follows: Minimizing with respect to Θ = [ ] is to make the cocomplex coefficient close to interaction weight for each pair of proteins. If two proteins are not interacting, the cocomplex coefficient of them is supposed to be minimized to 0. However, only considering the cocomplex coefficient is not sufficient for complex detection, since a protein may have large number of propensities with high values. It will assign a protein to too many complexes and thus produce pervasive overlapping complexes. Therefore, to control overlapping rate, we augment (3) with a penalty term to shrink the propensities as in (4). Consider where ( > 0) is the parameter of the penalization. Finally, the optimization in PLSMC is written as Taking the derivation of (6) with respect to and setting it to zero give It is difficult to estimate in above equation using an analytical method, as it depends on , where ̸ = and ̸ = . Therefore, we use an iterative method, to find the optimal . Because = 0 for the Karush-Kuhn-Tucker condition, we multiply both sides of the equation by and get Then, we can write the multiplicative updating rule as As suggested in the literature [18], we use the updating rule as With the updating rule, we could estimate the propensities . The reason why we use the multiplicative updating rule is that it is a gradient descent method with an adaptive step length and is guaranteed to converge to an optimum [19][20][21].

Postprocessing.
After estimating the propensities, we could obtain complexes using the estimated propensity matrix Θ = [ ]. We introduce a propensity threshold to derive the complexes. If ≥ , the protein is allocated to the complex . Thus, a set of predicted complexes in the network is obtained, in which each element consists of a group of proteins. Moreover, as previous methods, the predicted complexes in set that include less than 3 proteins are removed.

A Speeding-Up Strategy.
The time-consuming is prohibitive when the optimizing process is directly conducted on a large-scale real world PPI network. Therefore, it is appropriate to execute the estimating process on a set of subnetworks that are of small scale but enough to identify complexes. To get the subnetworks, we recursively cluster the network into subnetworks containing proteins less than a specific size . Then, apply the optimization procedure to each subnetwork to detect complexes. We use the tool of fastCommunity [22] to cluster the network. The reason is that it is a fast and robust algorithm in the field of network clustering.
In particular, we first use fastCommunity to cluster the input network and let each cluster be a subnetwork. Redo the process on each subnetwork larger than , until there is no subnetwork larger than .

PLSMC Algorithm.
Three main steps in PLSMC are as follows: (1) get subnetworks from the input PPI network; (2) compute the weight matrix and initialize the propensity matrix with random values for each subnetwork; (3) estimate protein propensities in each subnetwork; (4) identify complexes of proteins using the postprocessing step. The pseudocode of PLSMC is in Algorithm 1.

Results and Discussion
We implemented a Java archive and a Web tool of the PLSMC algorithm, which is available at http://nclab.hit.edu .cn/PLSMC/. To examine its effectiveness, PLSMC is tested on several public PPI networks of yeast and compared with some state-of-the-art methods. The matching with known complexes and functional homogeneity of predicted complexes are both studied.

Dataset and Evaluation Metrics.
We investigate the performance on several PPI networks of yeast (Saccharomyces cerevisiae), including Krogan [23], Collins [24], Gavin [2], and BioGRID [25] datasets. For Krogan, we use high confidence interactions with the probability higher than 0.273. For Gavin, only interactions with socioaffinity index larger than 5 are considered. For Collins network, we choose the top 9074 interactions with respect to purification enrichment score. The above cutoffs are suggested by original papers and [14]. In addition, all of physical interactions in BioGRID dataset (version 3.1.92) are downloaded. The general characteristics of these networks are listed in Supplementary Table S1 available online at http://dx.doi.org/10.1155/2014/720960.
The matching between predicted complexes and known complexes is studied to evaluate the accuracy of the prediction. We use CYC2008 catalogue [26] as the gold standard of known complexes in this work, which is available at http://wodaklab.org/cyc2008/. The CYC2008 includes the complexes that are all validated by small-scale experiments and it is an up-to-date comprehensive dataset of known complexes of yeast. As in the literature [14], the known complexes in CYC2008 containing less than 3 proteins are removed.
Three metrics in the following are used to evaluate the accuracy of matching between a predicted complex set and a gold standard .

f-Measure.
A predicted complex ∈ and a known one ∈ are considered to be matching, if the overlapping score os ( , ) is greater than a matching threshold ov (ov is set to 0.25 as in [4]). The overlapping score is defined as Let be the number of predicted complexes that match at least one known complex and let be the number of known complexes that match at least one predicted complex. The precision and recall are defined as follows: The -measure is the harmonic mean of precision and recall as -measure = 2 × precision × recall (precision + recall) .

Acc Metric.
Let be the number of common proteins between a known complex and a predicted complex . Then, the sensitivity (Sn) and positive predictive value (PPV) are as follows: where is the number of proteins in a known complex . Then, the accuracy metric [14] is defined as

MMR Metric.
Recently, [14] proposed a novel metric called maximum matching ratio (MMR) as follows: where and are th known complex in and th predicted complex in , respectively. It is important to note that each of above evaluation metrics does not provide an adequate description of the matching between predicted complexes and known complexes. To make a comprehensive evaluation, we consider the composite score that is the sum of above three scores in this study. Similar composite score is also used in the literature [14].

Investigation of PLSMC. The parameter
in PLSMC controls the size of subnetwork and is significantly related to the effect of the speed-up strategy. We test different values of = {50, 100, 200, 300, 400, 500}. Because of the prohibitive cost of computation, larger than 500 is not investigated. For each value of , we try different values of penalty parameter ( ∈ {2 −5 , . . . , 2 5 }) and repeat executing the algorithm 100 times with random initialization. We choose the execution that the estimated propensity matrix gives the minimal value of in (5). We choose the values of propensity threshold from 0.05 to 0.5 with increment 0.05 that gives the best composite score. Supplementary Table S2 shows the best parameter setting for each value of .
We demonstrate the effect of with different values on the four networks in Figures 1(a) and 1(b). As in Figure 1(a), on all networks, the composite score decreases with the parameter when ≤ 200 and fluctuates when > 200. Meanwhile, the execution time increases with the parameter dramatically as in Figure 1(b). It indicates that the speedup procedure could make a good balance between the computation time and prediction performance when = 200. Interestingly, this is also consistent with that in CYC2008 [26], in which there is no known complex including more than 200 proteins. Therefore, in the following of this study, is set to 200.
To examine the effect of the penalty term introduced in (4), we compare the PLSMC using the term and the one without using it (denoted by LSMC) applied to the four networks. The parameter setting of LSMC is shown in Supplementary Table S3. Figure 1(c) illustrates the results of PLSMC and LSMC. As shown, the PLSMC outperforms LSMC applied to all four networks. This confirms that the penalty term in (4) is essential.

Comparison with Other Methods on Matching Known
Complexes. We compare PLSMC with SLCP2 [16], Clus-terONE [14], RSGNM [21], OCG [12], MCL [15], and CFinder [10]. The parameters of these algorithms are tuned as follows: ClusterONE: density ( ) and merging threshold ( ) both from 0.1 to 1.0 with increment 0.1; RSGNM: rate parameter ∈ {2 −5 , . . . , 2 5 } and the parameter ∈ {2 −5 , . . . , 2 5 }; MCL: inflation from 1.2 to 5.0 with increment of 0.1; CFinder: the size ( ) of -clique is changed from 3 to 10; OCG: using centered cliques initialization and modularity maximization; SLCP2: no parameter needs to be tuned. We remove the predicted complexes of above methods with size smaller than 3 and choose the parameter setting that yields the best composite score. The general information including parameter settings of the algorithms applied to four networks is in Supplementary  of predicted complexes, (Prot.) is the number of covered proteins, and (Size) is the average size of predicted complexes. We cannot obtain the results of CFinder on BioGRID network, as the calculation requires more memory than a typical computer.
We present the comparison result of matching with gold standard in Figure 2. On all four networks, PLSMC could get better composite score than other methods. ClusterONE gets close results to PLSMC on all networks. SLCP2 and OCG provide good performance when applied to Collins network but make poor predictions about other networks. It indicates that these two methods are prone to be affected by different networks. MCL achieves poor performance when applied to all networks.
In addition, we also investigate the number of known complexes that are matched by predicted complexes. The number of matched known complexes of various algorithms applied to Krogan, Collins, Gavin, and BioGRID networks is illustrated in Figures 3(a)-3(d), respectively. We show the results of the overlapping threshold V from 0.5 to 1.0. It denotes a perfect matching when V = 1. As shown, PLSMC can hit 15, 36, 16, and 23 known complexes with perfect matching on four networks, respectively. It can also be found that, on Krogan, Collins, and BioGRID networks, PLSMC can provide the greatest number using all thresholds. On Gavin network, PLSMC could get comparative results with ClusterONE with all thresholds and match more known complexes with perfect matching than others. Generally, the  above comparisons confirm that the PLSMC outperforms other methods in terms of matching known complexes in gold standard.
We show how the studied algorithms identify the known COMPASS complex from the Krogan network in Figure 4. The COMPASS complex is an important conserved protein complex that catalyzes methylation of histone H3, which is collected in both CYC2008 and GO (GO: 0048188). The complex contains 8 proteins (YKL018W, YPL138C, YBR175W, YDR469W, YHR119W, YLR015W, YAR003W, and YBR258C), which are denoted by hexagon nodes in Figure 4. The clusters under the shaded areas are detected by the algorithms, which have the max overlapping scores (os) with COMPASS complex. As shown, PLSMC is the only algorithm that is able to detect this complex with perfect matching. All of the other algorithms make inaccurate prediction. SLCP2 detects a part of the complex and other algorithms include unrelated proteins into the complex. The result of CFinder is not shown, because the detected cluster that has the best matching with the complex is a huge cluster, which consists of 627 proteins.

Biological Relevance of Predicted Complexes.
The known complex dataset is incomplete. For example, CYC2008 only covers 1627 proteins, while the number of proteins in yeast is more than 5000. Therefore, a predicted complex that does not match with any known complex is possibly not a false positive one and it is worth further in-depth analysis.
To this end, we also examine the biological relevance of predicted complexes in terms of functional homogeneity. This is because the proteins within a complex tend to be located in the same cellular component (CC) or are involved in a common molecular function (MF) or biological process (BP) [4,14]. We use the tool of GO::TermFinder (Version 0.83) [27] to compute the value for each predicted complex. The GO corpus is downloaded from Saccharomyces Genome Database [28]. We investigate all three aspects of GO.
A predicted complex that has more than one annotation with the value smaller than a threshold is considered functional homogeneity. The threshold is set to 1.0 − 10 [4]. The fraction of predicted complexes that are functional homogeneity is used to evaluate the performance of the prediction method. Table 1 presents the comparison of functional homogeneity of complexes predicted by different methods. The result of known complexes in CYC2008 is also listed. It can be found that the complexes predicted by PLSMC are more functional homologous than those of other methods. Moreover, the results of PLSMC applied to Krogan, Collins, and Biological networks are all better than that of CYC2008. More interestingly, on all networks, the results of PLSMC in regard to CC aspect are better than MF and BP aspects. This tendency is consistent with that of CYC2008. On the whole, the comparison demonstrates that the complexes derived by PLSMC are more biologically relevant.

Conclusion
In this paper, we present PLSMC, a penalized least squares method, to detect complexes from PPI network. PLSMC identifies complexes by minimizing the distances between cocomplex coefficients and interaction weights of all pairs of proteins. We test it on several yeast PPI networks. The results show that PLSMC achieves higher accuracy in matching with known complexes than some state-of-the-art methods. Moreover, the predicted complexes also have good biological relevance to functional homogeneity. This study confirms that PLSMC, based on a least squares method, is an effective approach to identify complexes from the PPI network.
We note that integrating multiple biological data sources in addition to PPI network [29] can improve the identification of protein complexes. On the one hand, most of available protein-protein interaction networks are static. Combining dynamic information such as expression profiles can infer the dynamic properties of protein-protein interactions under different time points or various conditions [1,30]. On the other hand, when two or more proteins form a complex, some interface information as physical folds [31], biochemical properties [32], and posttranslation modifications [33] is very important to the complex formation. In the future, based on PLSMC, we will study the identification of protein complexes from dynamic protein-protein interaction networks and interface datasets.