MOfinder: A Novel Algorithm for Detecting Overlapping Modules from Protein-Protein Interaction Network

Since organism development and many critical cell biology processes are organized in modular patterns, many algorithms have been proposed to detect modules. In this study, a new method, MOfinder, was developed to detect overlapping modules in a protein-protein interaction (PPI) network. We demonstrate that our method is more accurate than five other methods. We then applied MOfinder to the yeast and human PPI networks and explored the overlapping information. Using the overlapping modules of the human PPI network, we constructed a module-module communication network. Functional annotation showed that immune-related and cancer-related proteins tended to be present in the same modules, which offers some clues for immune therapy for cancer. Our study of overlapping modules suggests a new perspective on the analysis of PPI networks and improves our understanding of disease.


Introduction
PPI networks have been widely used to understand biology at the system level [1][2][3]. However, PPI data sets suffer from high false positive and false negative rates [4]. A network module, a group of proteins that are connected with each other to carry out a function [5], is more robust to such errors because the loss or gain of a single interaction does not break down the module structure. Modules have been applied to predict protein function [6] and disease genes [7] and to trace the evolutionary history of networks [8][9][10].
To perform complex biochemical or developmental functions, modules have to work together, so several proteins are used to pass information from one module to another. For example, three modules in S. cerevisiae (the Set3C complex, the protein phosphatase type 2A (PP2A) complex, and cell polarity budding) share a protein, Zds1 [11]. Zds1 can bind PP2A to control mitotic progression [12], and it also participates in the Set3C complex during budding and represses the meiotic process [13], so Zds1 may serve as a bridge between mitosis and meiosis. Here we define these three modules as overlapping modules and the shared protein as an overlapping node. The overlapping modules can form a module-module communication network. Constructing such a network can help in understanding the coordinated relationship between different biological processes.
The problem of identifying modules has been studied in bioinformatics, applied mathematics, and physics [14]. Many methods have been developed to identify modules within a network, and they have been reviewed and evaluated [15][16][17][18][19]. These approaches can be classified into two types. (1) Local seed-based methods, which start from a node or clique (a fully connected subgraph) and then apply an expanding search strategy. MCODE [20] was the first method for module detection; it expands high-scoring seed nodes by a local search procedure, but it detects only a few modules. CFinder [11] was the first algorithm for overlapping community detection; it implements the Clique Percolation Method (CPM), in which k-cliques are explored by rotating about their component (k-1)-cliques. CFinder is too slow when applied to dense PPI networks, and in particular it cannot detect spoke-like modules (noncliques). To overcome this problem, Zhang et al. [21] combine the Line Graph Transformation (LGT) with CPM to detect overlapping network modules and build the overlapping module network. Wu et al. [22] propose COACH (a core-attachment based method), which predicts complexes by detecting protein-complex cores and then adding attachments. The Local Protein Community Finder [23], LPCF for short, uses two local clustering algorithms to find a community close to a queried protein.
(2) Global clustering methods. NeMo [14] combines a neighbour-sharing score with hierarchical agglomerative clustering to identify both dense network and dense bipartite network structures in a single approach. Reichardt and Bornholdt [24] propose a method to detect overlapping (fuzzy) communities that maps the graph onto a zero-temperature q-Potts model with nearest-neighbor interactions. Zhang et al. [25] combine the modularity function Q, spectral relaxation, and fuzzy c-means clustering to detect overlapping community structure. Wang et al. [26] propose the BCD (Betweenness-Commonality Decomposition) algorithm, which uses edge commonality and edge betweenness. Other techniques, such as nonnegative matrix factorization (NMF), have also been used to uncover overlapping (fuzzy) communities [27,28]. Besides these topology-based methods, Chen and Yuan [29] integrate 265 microarray datasets to detect functional modules in the yeast protein-protein interaction network.
Here we describe MOfinder, an alternative method that can effectively identify functional modules, especially overlapping modules, from a PPI network. MOfinder allows flexibility and user customization through adjustable parameters. We compared the performance of MOfinder with other available methods and explored the overlapping information of modules in the yeast and human PPI networks. We used all the overlapping modules detected from the human PPI network to generate a graph of module-module communication, and we analyzed the functional properties of the overlapping modules.

Definition of Clustering Coefficient.
The clustering coefficient of node n is defined as $CC(n) = 2S_n / (K_n (K_n - 1))$, where $K_n$ is the degree of n and $S_n$ is the number of links between the neighbors of n.
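As an illustration, the definition can be sketched in a few lines of Python. The adjacency-dictionary input format and the function name are assumptions made for this example, and returning 0 for nodes of degree below 2 (where the formula is undefined) is also an assumption rather than something stated above.

```python
from itertools import combinations

def clustering_coefficient(node, adj):
    """CC(n) = 2*S_n / (K_n*(K_n - 1)) for an undirected graph.

    `adj` maps each node to the set of its neighbours (hypothetical format).
    """
    neighbours = adj[node]
    k = len(neighbours)                 # K_n: degree of the node
    if k < 2:
        return 0.0                      # assumption: CC is 0 where the formula is undefined
    # S_n: number of links among the neighbours of the node
    s = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * s / (k * (k - 1))

# Toy network: A is linked to B, C, and D; among A's neighbours only B-C are linked.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(clustering_coefficient("A", adj))   # 2*1 / (3*2) = 0.333...
```

In the toy network, A has three neighbours with one link among them, so CC(A) = 1/3, while B, whose two neighbours are linked, has CC(B) = 1.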

Definition of Functional Module.
Given a predicted module, its P-value with respect to a GO term is computed from the hypergeometric distribution in (1) and adjusted by Bonferroni correction. In (1), m is the size of the predicted module, k is the number of its proteins that share the GO term, N is the total number of proteins, and n is the number of those proteins annotated with the same GO term. A functional module is defined as a module enriched in at least one GO term (Bonferroni-corrected P-value < 0.01).
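Equation (1) is not reproduced in the text. A common form of the hypergeometric enrichment P-value that is consistent with the symbols N, n, m, and k is the upper tail $P = \sum_{i=k}^{\min(m,n)} \binom{n}{i}\binom{N-n}{m-i} / \binom{N}{m}$. The sketch below assumes this form and a Bonferroni correction that multiplies by the number of GO terms tested; the function names are hypothetical, and the exact expression used by MOfinder may differ.

```python
from math import comb

def hypergeom_pvalue(N, n, m, k):
    """Upper-tail probability P(X >= k) under the hypergeometric distribution.

    N: total proteins, n: proteins annotated with the GO term,
    m: module size, k: module proteins annotated with the GO term.
    """
    numerator = sum(comb(n, i) * comb(N - n, m - i)
                    for i in range(k, min(m, n) + 1))
    return numerator / comb(N, m)

def bonferroni(p, n_tests):
    """Bonferroni-corrected P-value, capped at 1 (n_tests = GO terms tested)."""
    return min(1.0, p * n_tests)

def is_functional(pvalues, n_tests, alpha=0.01):
    """A module counts as functional if any corrected P-value is below alpha."""
    return any(bonferroni(p, n_tests) < alpha for p in pvalues)

# Toy case: 10 proteins, 4 annotated; a module of 3 contains 2 annotated proteins.
print(hypergeom_pvalue(10, 4, 3, 2))   # (6*6 + 4*1) / 120 = 0.333...
```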

MOfinder Algorithm.
MOfinder is based on the AMD (Approximate Minimum Degree Ordering) algorithm [35,36], which has been used for network clustering in electrical engineering [37]. The AMD algorithm is usually used to order a sparse matrix prior to Cholesky factorization (or LU factorization with diagonal pivoting), and it can transform the sparse matrix so that the nonzero elements lie close to the diagonal. The approach used by MOfinder is summarized in Figure 1. MOfinder first converts the PPI file into a sparse matrix, in which a nonzero element represents a protein-protein interaction. It then performs a global AMD of the sparse matrix, after which densely connected elements (modules) are clustered along the diagonal. In addition to the global AMD, which produces the global ordering, a local AMD is performed: MOfinder uses a sliding window along the diagonal to fetch a local sparse matrix and applies AMD to it. The clustering coefficient (CC) [38] of the submatrix in the sliding window is then calculated; if the CC value is not less than the cut-off, MOfinder saves the submatrix as a module. The sliding window then moves one step along the diagonal to find new modules, and this process is repeated until the window reaches the end. Lastly, MOfinder removes redundant modules (if module A is contained in module B, A is removed) and saves the results. The pseudocode of the MOfinder algorithm is given in Algorithm 1.
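The window-scanning and redundancy-removal steps can be illustrated with a simplified Python sketch. It is not the MOfinder implementation: the node ordering is taken as given (AMD itself is not implemented), the local AMD step is omitted, each window is treated directly as a candidate module, and the CC of a window is assumed to be the mean clustering coefficient of the subgraph induced by its nodes. All function names are hypothetical.

```python
from itertools import combinations

def mean_clustering(nodes, adj):
    """Mean clustering coefficient of the subgraph induced by `nodes`."""
    inside = set(nodes)
    total = 0.0
    for n in nodes:
        nbrs = adj[n] & inside                     # neighbours inside the window only
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
            total += 2.0 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0

def sliding_window_candidates(order, adj, window, cc_cutoff):
    """Slide a fixed-size window one step at a time along a node ordering."""
    found = []
    for start in range(len(order) - window + 1):
        block = order[start:start + window]
        if mean_clustering(block, adj) >= cc_cutoff:   # "not less than the cut-off"
            found.append(frozenset(block))
    return found

def remove_redundant(modules):
    """Drop duplicates and any module that is a proper subset of another."""
    unique = list(dict.fromkeys(frozenset(m) for m in modules))
    return [a for a in unique if not any(a < b for b in unique)]

# Toy graph: triangle a-b-c with a tail c-d-e; `order` stands in for an AMD ordering.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
order = ["a", "b", "c", "d", "e"]
modules = remove_redundant(
    sliding_window_candidates(order, adj, window=3, cc_cutoff=0.67))
print(modules)
```

With a window of three nodes, only the window covering the triangle a-b-c reaches the 0.67 cut-off, so it is the single module retained in this toy case.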

MOfinder Is a Flexible Method.
MOfinder has two adjustable parameters: the CC cut-off value and the size of the sliding window. Different parameter values yield different results.
To optimize the parameters, performance was assessed in terms of the accuracy of identified modules with respect to annotated function. MOfinder was tested over a broad range of CC cut-off values (0.2-1) and sliding window sizes (20-450) using PPI data from yeast and human. First, the percentage of functional modules was plotted against a range of CC cut-off values; for each CC cut-off value, all sizes of sliding window (20-450, step = 10) were tested and the resulting percentages of functional modules were plotted as a group of points. As shown in Figure 2 ... Figure 3 shows the number of functional modules matched for the 0.67 cut-off value over all tried sizes of sliding window (20-450, step = 10).
The curve of the number of functional modules first increases and then decreases, so the sliding window should be set to 350, which maximizes the number of functional modules. To achieve the best performance, we recommend the parameter set CC cut-off value = 0.67 and sliding window size = 350.

Algorithm 1
Input: ...; cluster coefficient threshold CC; sliding window s.
Output: modules.
    ...
        if cc(ml) ≥ CC do
            insert ml to modules;
        end if
        a = a + 1;
    end while
    for each module A in modules do
        if module A ⊆ module B do  // module B in modules and module A ≠ module B
            discard module A from modules;  // delete redundancy
        end if
    end for

Performance Evaluation.
MOfinder was tested using PPI data from yeast and human and compared with five other available algorithms: MCODE (default parameters), CFinder (k = 4, as suggested), COACH (default parameters), NeMo (default parameters), and LPCF (community size set to 3-11, comparable to MOfinder). The percentage of functional modules was used to indicate accuracy, and MOfinder was the top-performing algorithm in both yeast (93.9%) (Figure 4(a)) and human (81.5%) (Figure 4(b)). We also compared the major module size of the six methods in yeast (Table 1): ... size 4 for NeMo, size 10 for LPCF, and size 5 for MOfinder. Although the number of modules and the number of proteins assigned to modules were smaller for MOfinder than for some of these methods, the percentage of functional modules was highest for MOfinder.

Journal of Biomedicine and Biotechnology

Overall Overlapping Properties in Yeast and Human.
We applied MOfinder to the yeast and human PPI networks with default parameters (CC cut-off = 0.67, sliding window = 350) and explored the distribution of overlapping size. As shown in Figure 5, the overlapping size distribution is different between yeast and human. Most of the modules in the yeast PPI network share one protein (Figure 5(a)), but in the human PPI network the most common overlapping size is 4 (Figure 5(b)). Some overlapping parts might be overcounted. For example, if three modules (A, B, and C) share a protein D, then protein D is counted three times (A-B, A-C, B-C). To avoid this overcounting, we deleted the repeats, so that protein D is counted only once. Figure 6(a) shows that the resulting distribution of overlapping size in yeast changes markedly, and the most common overlapping size becomes 4, which is similar to human (Figure 6(b)). These observations suggest that although modules in yeast tend to share fewer proteins than modules in human, the small overlapping parts (size 1 and size 2) are reused more often in yeast than in human, and thus the distribution of overlapping size becomes similar in yeast and human after removing repeats.
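The effect of removing repeats can be shown with a small Python sketch. It assumes that a "repeat" means the same set of shared proteins appearing in more than one module pair, which matches the A, B, C example but is otherwise an interpretation; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def overlap_size_distribution(modules, remove_repeats=False):
    """Count overlaps by size over all pairs of modules.

    With remove_repeats=True, an identical shared protein set is counted once
    (assumed meaning of "deleting the repeats").
    """
    mods = [frozenset(m) for m in modules]
    overlaps = [a & b for a, b in combinations(mods, 2) if a & b]
    if remove_repeats:
        overlaps = list(set(overlaps))
    return Counter(len(o) for o in overlaps)

# Three modules that all share protein D.
A = {"D", "a1", "a2"}
B = {"D", "b1"}
C = {"D", "c1"}
print(overlap_size_distribution([A, B, C]))                       # Counter({1: 3})
print(overlap_size_distribution([A, B, C], remove_repeats=True))  # Counter({1: 1})
```

Before deduplication the size-1 overlap is counted once per module pair (three times); afterwards it is counted once, which is why small, frequently reused overlaps shrink most in the corrected distribution.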
Since proteins in one module work together to perform functions, two overlapping modules are expected to have similar functions, and the larger the overlap, the more likely they are to share a function. To verify this, we used GO annotation similarity to represent functional similarity. Figure 7 shows that the average functional similarity increases with overlapping size. This trend is observed in both yeast (Figure 7(a)) and human (Figure 7(b)).
... with at least one other module. These overlapped modules were used to construct a module-module communication network (Figure 8(a)). In the communication network, each node is a module, and two modules are connected if they share at least one protein. To explore the function of this network, we used DAVID 6.7 [40,41] to search for enrichment of Gene Ontology (GO) terms and KEGG pathways. We found that GO terms and pathways related to cancer and immune response were enriched in the network proteins, so we mapped the cancer-related and immune-related proteins to the modules. As shown in Figure 8(a), of the 47 modules containing immune-related proteins, 33 included cancer-related proteins, and this ratio (33/47) was greater than expected by chance (62 of 152 modules have cancer-related proteins; binomial test, P < 0.01). Likewise, of the 62 modules containing cancer-related proteins, 33 included immune-related proteins, more than the 47/152 expected by chance (binomial test, P < 0.01). Therefore, modules containing immune-related proteins tended to include cancer-related proteins and vice versa.
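The two binomial comparisons can be checked with a short calculation. The sketch below assumes a one-sided exact binomial test (the text does not state which variant was used) and takes the background proportions 62/152 and 47/152 as the null success probabilities; the function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(successes, trials, p):
    """Exact one-sided P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, i) * p**i * (1 - p)**(trials - i)
               for i in range(successes, trials + 1))

# 33 of the 47 immune-related modules contain cancer-related proteins;
# background: 62 of 152 modules contain cancer-related proteins.
p_immune_to_cancer = binomial_upper_tail(33, 47, 62 / 152)

# 33 of the 62 cancer-related modules contain immune-related proteins;
# background: 47 of 152 modules contain immune-related proteins.
p_cancer_to_immune = binomial_upper_tail(33, 62, 47 / 152)

print(p_immune_to_cancer, p_cancer_to_immune)
```

Under these assumptions both tail probabilities fall below 0.01, in line with the reported significance level.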

Overlapping Modules in the Human
To explore the communication between functional modules, we mapped the functional annotation to each module and evaluated the functional similarity between each pair of overlapping modules. The functional similarity is shown as edge color in Figure 8: values between 0 and 1 are painted with a pink/blue color gradient, and modules without GO annotation have gray edges. Figure 8(b) gives the functional annotation of modules from the largest cluster in Figure 8(a). Some overlapping modules have the same function, such as the three modules involved in the acetylation of peptidyl-lysine, while several overlapping modules have distinct functions; for instance, a module involved in the change of mast cell overlaps with another module that takes part in reactions mediated by protein kinases. Figure 9 shows an example of two overlapping modules. One module functions in B-cell activation and contains five proteins: Q15464, O75791, O43561, Q13094, and P08575. The other module (P08575, P20963, P06729, and P06127) is involved in T-cell activation. These two modules share one protein, P08575 (receptor-type tyrosine-protein phosphatase C, CD45), which plays a critical role in receptor-mediated signalling in both B and T cells [42,43]. The shared node suggests pathway crosstalk between the two modules. Consistent with this hypothesis, several studies have illustrated T-cell-dependent B-cell activation [44].
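Construction of the module-module communication network can be sketched with the two modules of Figure 9. The dictionary input format and the function names are assumptions made for this illustration; the edge rule (two modules are connected if they share at least one protein) and the notion of an overlapping node (a protein belonging to two or more modules) follow the definitions in the text.

```python
from collections import Counter
from itertools import combinations

def module_network(modules):
    """Return edges (module_a, module_b, shared_proteins) for overlapping modules."""
    edges = []
    for (name_a, prots_a), (name_b, prots_b) in combinations(modules.items(), 2):
        shared = prots_a & prots_b
        if shared:                       # connect modules sharing >= 1 protein
            edges.append((name_a, name_b, shared))
    return edges

def overlapping_nodes(modules):
    """Proteins that belong to two or more modules."""
    counts = Counter(p for prots in modules.values() for p in prots)
    return {p for p, c in counts.items() if c >= 2}

# The two modules described for Figure 9 (UniProt accessions from the text).
modules = {
    "B-cell activation": {"Q15464", "O75791", "O43561", "Q13094", "P08575"},
    "T-cell activation": {"P08575", "P20963", "P06729", "P06127"},
}
print(module_network(modules))
print(overlapping_nodes(modules))   # {'P08575'}
```

The single edge produced here carries P08575 (CD45) as its shared protein, which is the overlapping node linking the two modules.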
The module-module communication network included 341 overlapping nodes (nodes belonging to two or more modules). Several studies have shown that modular overlaps are potential drug targets because they are key determinants of cooperation between network modules [45]. We therefore investigated the potential druggability of the overlapping nodes: 56 were established drug targets and another 43 were from druggable families [46], giving 99 druggable proteins in total. The ratio of druggable proteins (99/341) was significantly higher than expected (2000-3000 druggable proteins in human [46]; binomial test, P < 0.01).

Discussion
For both yeast and human interactomes, MOfinder surpasses the other five methods in accuracy. Furthermore, MOfinder is fast in practice for large networks. For example, when applied to a yeast network of nearly 40,000 interactions (from I2D [47]), MOfinder ran in only 15 seconds. Since the size of biological networks continues to grow, MOfinder is likely to meet the needs of biological analysis. However, MOfinder has two possible limitations. The first is that MOfinder specifically detects small modules (fewer than 12 proteins), although the major module size (5) is close to the average size of MIPS complexes (6) [20]. MOfinder detects 125 modules from the yeast PPI network, fewer than COACH and LPCF, and it ranks fourth in the number of proteins covered by predicted modules. These observations suggest the second limitation: MOfinder is too strict to detect loosely connected modules, partly because the CC cut-off value is set to 0.67. We suppose that setting the ...
Yeast is a simple single-celled eukaryote, so the overlapping modules in yeast generally use one protein for communication. In contrast, human, a multicellular organism, employs a more complex system, and thus the overlapping size in human is larger than that in yeast. We also found that the overall distribution of overlapping size is similar between yeast and human after removing repeats, and in Figure 2 the functional steps occur at similar places in yeast and human. These observations reflect evolutionary conservation across eukaryotes. Although overlaps may lead to redundant modules which overlap with each other heavily, excluding ...
Overlapping modules work together to carry out complicated jobs, such as signal transduction. Constructing a module-module communication network to explore how these modules communicate with each other can therefore help to understand biological complexity.
Although we built such a network only for human, a similar approach can be applied to other species. We found that immune-related and cancer-related proteins tend to occur in the same modules. The association between immune cells and cancer has been discussed [48], and several clinical and experimental studies have shown that the immune system is a new weapon against cancer [49]. Antitumor adaptive immune responses can suppress tumor growth [50], and several immunotherapy drugs could cure cancers [51]. Our results provide evidence for this close relationship at the system level.

Conclusions
In this paper, we describe a novel algorithm for the identification of overlapping modules in PPI networks. MOfinder performs competitively with other methods and uses two adjustable parameters that enable it to identify modules flexibly. MOfinder is a cross-platform package implemented in C/C++, and it can be downloaded and installed free of charge (http://bsb.kiz.ac.cn/mofinder/). The application of MOfinder to the human PPI network gives clues for fighting cancer using the immune system, and the overlapping nodes, which mediate intermodule crosstalk, could help to identify potential drug targets.