An Algorithm for Finding Functional Modules and Protein Complexes in Protein-Protein Interaction Networks

Biological processes are often performed by a group of proteins rather than by individual proteins, and proteins in the same biological group form a densely connected subgraph in a protein-protein interaction network. Therefore, finding a densely connected subgraph provides useful information for predicting the function or protein complex of uncharacterized proteins in that subgraph. We have developed an efficient algorithm and program for finding cliques and near-cliques in a protein-protein interaction network. Analysis of the interaction network of yeast proteins using the algorithm demonstrates that 59% of the near-cliques identified by our algorithm have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show a good agreement with the experimentally determined protein complexes catalogued in MIPS.


INTRODUCTION
Proteins in a highly connected subgraph of a protein interaction network usually share a common function [1]. Therefore, a highly connected subgraph such as a clique or near-clique in a protein interaction network can be used to predict the function of uncharacterized proteins in that subgraph. Finding a clique of maximum size in a graph is an NP-hard problem [2]. There are several heuristic algorithms for the maximum clique problem [2,3], but most of them focus on finding a complete subgraph (i.e., a clique) and cannot be used to find near-cliques.
Several topological analysis methods have been developed for identifying biologically meaningful groups from protein interaction networks or for assessing the reliability of protein interactions. A recent program called CFinder [4,5] finds overlapping cliques in protein interaction networks. It allows a protein to belong to more than one clique, but cannot find near-cliques. Our study shows that the near-cliques can reveal higher functional coherence than the overlapping cliques.
The primary focus of this study is to find functional groups by identifying cliques and near-cliques in protein interaction networks. This study attempts to answer two questions: "Can we efficiently find all cliques and near-cliques?" and "Does a dense subgraph such as a clique or near-clique indeed represent a functional module or protein complex?" This study demonstrates that the answers to both questions are "yes." This paper presents an algorithm for finding near-cliques and its application to the interaction network of yeast proteins.

ALGORITHMS FOR FINDING NEAR-CLIQUES
A clique is a complete graph G = (N, E) in which every node is connected to every other node in the graph. In our previous work, we developed a heuristic algorithm and implemented the algorithm in a program called InterViewer [6], which identifies all edge-disjoint cliques (i.e., cliques that do not share an edge).
Our experience with protein interaction networks suggests that a near-clique as well as a clique often represents a biologically meaningful unit such as functional module or protein complex. A near-clique is almost a clique but is not a clique due to a few missing edges. We consider near-cliques of the following basic types, which are biologically meaningful clusters (see Figure 1).
Figure 2: (a) After removing nodes p, q, r, and s and their edges, node x forms a near-clique of type A with the remaining nodes. (b) This graph becomes a near-clique G of type C since indegree(x, G) ≥ 0.5|G|. (c) A big near-clique is too big (e.g., near-clique with more than 50 nodes) and is split into smaller near-cliques (in this example, 3 small near-cliques).

Type A
When a protein outside a clique interacts with two or more proteins in the clique, the protein and the clique form a near-clique.

Type B
When a clique shares a protein with other cliques, the cliques form a near-clique.

Type C
When two or more cliques interact with a common protein outside them and the protein has at least two interactions with each clique, the cliques and the protein form a near-clique.
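The type A rule above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the adjacency-dictionary representation, the function name `type_a_candidates`, and the toy node names are hypothetical and are not taken from InterViewer.

```python
def type_a_candidates(adj, clique, min_links=2):
    """Return outside nodes that have at least `min_links` neighbours in `clique`.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    clique = set(clique)
    counts = {}
    for member in clique:
        for neighbour in adj[member]:
            if neighbour not in clique:
                counts[neighbour] = counts.get(neighbour, 0) + 1
    return sorted(node for node, c in counts.items() if c >= min_links)


# Toy graph (hypothetical): clique {a, b, c}; x touches two clique members, y only one.
adj = {
    "a": {"b", "c", "x"},
    "b": {"a", "c", "x", "y"},
    "c": {"a", "b"},
    "x": {"a", "b"},
    "y": {"b"},
}
print(type_a_candidates(adj, {"a", "b", "c"}))  # ['x']
```

Here x qualifies as a type A candidate because it interacts with two proteins of the clique, whereas y has a single interaction and does not.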
The near-cliques of types A and C can be refined using the indegree and outdegree of a node (there is no change to the near-clique of type B). For a node x in a subgraph G' ⊂ G, indegree(x, G') is the number of edges connecting node x to other nodes in G', and outdegree(x, G') is the number of edges connecting node x to nodes that are in G but not in G'. We use the definition of a community in a strong sense [7], in which every node x of G' satisfies indegree(x, G') > outdegree(x, G'), to find more near-cliques in a graph.
The original definition of a strong community misses many near-cliques due to a single node in the communities. For example, in Figure 2(a), node x cannot belong to a near-clique since indegree(x, G') = 3 < outdegree(x, G') = 4. Likewise, node x in Figure 2(b) cannot belong to a near-clique because indegree(x, G') < outdegree(x, G'). Thus, nodes with only one edge connected to them, together with their edges, are removed from the graph when we search for near-cliques. In the graph of Figure 2(a), nodes p, q, r, and s and their edges are removed. After removing them, node x and the existing clique form a near-clique of type A. A cluster that satisfies indegree(x, G') ≥ 0.5|G'| for every x in G', where |G'| is the number of nodes in G', also forms a near-clique. The example shown in Figure 2(b) becomes a near-clique since it satisfies indegree(x, G') ≥ 0.5|G'|, even though it does not satisfy indegree(x, G') ≥ outdegree(x, G').
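The removal of nodes with a single edge can be sketched as follows. This is a minimal sketch, assuming an adjacency dictionary of neighbour sets and a single removal pass (the text does not say whether the removal is repeated); the function name `prune_leaves` and the toy graph, which is only loosely modelled on the situation described for Figure 2(a), are hypothetical.

```python
def prune_leaves(adj):
    """Drop every node that has exactly one edge, together with that edge.

    Single pass only: nodes that become degree-1 after pruning are kept.
    """
    leaves = {node for node, nbrs in adj.items() if len(nbrs) == 1}
    return {node: nbrs - leaves for node, nbrs in adj.items() if node not in leaves}


# Toy graph: 4-clique {a, b, c, d}; x has 3 edges into the clique and 4 edges
# to the degree-1 nodes p, q, r, s.
adj = {
    "a": {"b", "c", "d", "x"},
    "b": {"a", "c", "d", "x"},
    "c": {"a", "b", "d", "x"},
    "d": {"a", "b", "c"},
    "x": {"a", "b", "c", "p", "q", "r", "s"},
    "p": {"x"}, "q": {"x"}, "r": {"x"}, "s": {"x"},
}
pruned = prune_leaves(adj)
print(sorted(pruned["x"]))  # ['a', 'b', 'c']
```

After pruning, all remaining edges of x point into the clique, so its outdegree with respect to the group {a, b, c, d, x} drops from 4 to 0.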
Therefore, a near-clique G' of basic types A and C should satisfy at least one of the following conditions:
(1) indegree(x, G') ≥ outdegree(x, G'),
(2) indegree(x, G') ≥ 0.5|G'|.
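The two tests used in this section (indegree at least as large as outdegree, or indegree at least half the group size) can be sketched in Python as below. The helper names, the adjacency-dictionary representation, and the toy graph are assumptions made for illustration; the graph is assumed to have no self-loops.

```python
from itertools import combinations


def indegree(adj, x, group):
    """Number of edges from x to other nodes inside `group` (no self-loops assumed)."""
    return len(adj[x] & group)


def outdegree(adj, x, group):
    """Number of edges from x to nodes outside `group`."""
    return len(adj[x] - group)


def passes_refinement(adj, x, group):
    """True if x meets at least one of the two conditions for `group` (x included)."""
    group = set(group)
    ind = indegree(adj, x, group)
    return ind >= outdegree(adj, x, group) or ind >= 0.5 * len(group)


def add_edge(adj, u, v):
    adj[u].add(v)
    adj[v].add(u)


# Toy graph (hypothetical): 4-clique {a, b, c, d}; x has 2 edges into it and 3 outside.
adj = {node: set() for node in "abcdxpqr"}
for u, v in combinations("abcd", 2):
    add_edge(adj, u, v)
for v in "ab":
    add_edge(adj, "x", v)
for v in "pqr":
    add_edge(adj, "x", v)

group = set("abcd") | {"x"}
before = passes_refinement(adj, "x", group)  # indegree 2, outdegree 3, 0.5*|G'| = 2.5
add_edge(adj, "x", "c")
after = passes_refinement(adj, "x", group)   # indegree 3 >= 2.5
print(before, after)  # False True
```

In the second call x still has more edges leaving the group than the first condition allows for equality (3 versus 3 is in fact equal here), and it also meets the second condition, which is the one that rescues nodes whose outdegree exceeds their indegree.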
After finding all edge-disjoint cliques, we identify near-cliques as follows. Algorithms 1 and 2 outline the procedure in more detail; in the algorithms, cIdx represents the index of a clique.
(1) Assign every node of a clique the index of the clique containing the node.
(2) When a node of a clique already has an assigned clique index, assign the index to all nodes of the clique, and merge the two cliques into a near-clique of type B.
(3) When a node x outside a clique forms a basic near-clique G of type A due to interactions with two or more proteins in the clique, and either indegree(x, G) ≥ outdegree(x, G) or indegree(x, G) ≥ 0.5|G| is true, assign the index of the clique to the node.
(4) When two or more cliques form a near-clique G due to two or more interactions with a common protein x outside the cliques, and either indegree(x, G) ≥ outdegree(x, G) or indegree(x, G) ≥ 0.5|G| is true, merge the cliques and the protein into a near-clique of type C.
A near-clique is then formed by collecting the nodes that share the same positive clique index (cIdx > 0).
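Steps (1) and (2) can be illustrated with a small index-relabelling sketch in Python. It is not the authors' implementation: the function name `merge_shared_cliques`, the dictionary `c_idx` (standing in for cIdx), and the toy cliques are assumptions, and the input is assumed to be a list of already detected cliques given as node sets.

```python
def merge_shared_cliques(cliques):
    """Group cliques that share a node by propagating a common clique index."""
    c_idx = {}  # node -> current clique index (plays the role of cIdx)
    for cur, clique in enumerate(cliques, start=1):
        # Indices already carried by nodes of this clique (step 2).
        seen = {c_idx[node] for node in clique if node in c_idx}
        for node in list(c_idx):
            if c_idx[node] in seen:
                c_idx[node] = cur  # relabel the earlier group
        for node in clique:
            c_idx[node] = cur      # step 1
    groups = {}
    for node, idx in c_idx.items():
        groups.setdefault(idx, set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Toy input (hypothetical): the first two cliques share node c.
cliques = [{"a", "b", "c"}, {"c", "d", "e"}, {"f", "g", "h"}]
print(merge_shared_cliques(cliques))
# [['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h']]
```

The relabelling loop scans all indexed nodes each time, which mirrors the nested "for all tmpN" loop of the pseudocode; a union-find structure would be a faster alternative but is a different technique from the one described in the text.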
Algorithm 1
    ...
    {set the current clique index to 1}
    for all node N ∈ G do
        if (isClique(N)) then {if the node N belongs to a clique}
            for all edge E ∈ N do
                if (E.target.cIdx > 0) then {if the cIdx of the node connected to N is positive}
                    for all tmpN ∈ G do {for all nodes in G}
                        if (tmpN.cIdx = E.target.cIdx) then
                            tmpN.cIdx = curCIdx {assign curCIdx to tmpN as its cIdx}
    ...

Algorithm 2
    ...
    qCliqueCnts = ∅ {qCliqueCnts holds, for each near-clique, the number of edges connecting node N to it}
    for all edge E ∈ N do
        if (E.target.cIdx > 0) then
            qCliqueCnts[E.target.cIdx]++
        end if
    end for
    qCvalue = 0 {initialize cIdx of node N}
    for all (c ∈ qCliqueCnts) do
        if ((c > 1) and indegree(x, G') ≥ outdegree(x, G')) or ((c > 1) and indegree(x, G') ≥ 0.5 * |G'|) then {a node outside a clique interacts with multiple nodes in the clique, and either indegree( ...
    ...

Since the most relevant processes form a group of proteins of moderate size in biological networks [8], we obtain near-cliques smaller than the maximum size specified by the user. That is, when a near-clique bigger than the maximum size is found (e.g., a near-clique with more than 50 nodes), it is split into smaller near-cliques (3 near-cliques in Figure 2(c)). A big near-clique is split as follows. When our program finds a big near-clique with the minimum clique size set to k, we rerun the program on the big near-clique with the minimum clique size set to k + 1 to find a new clique and the near-clique built around that clique. After removing the new near-clique from the original big near-clique, we run the program again with the minimum clique size set to k. The big near-clique shown in Figure 2(c) is split into 3 small near-cliques with at least 4 proteins each.

RESULTS AND COMPARISON WITH EXPERIMENTAL DATA
We tested the algorithms on the data with 8,397 interactions between 4,380 yeast proteins, which is the combined data of Ito et al. [9], Uetz et al. [10], and MIPS (http://mips.gsf.de) with redundant data removed. To every protein in the near-cliques, we assigned the functional categories of the Functional Catalog (FunCat) version 2.0 [11], which includes 97 functional categories. There are six levels of hierarchy in the FunCat structure.
In the data with 8,397 interactions between 4,380 yeast proteins, we found 100 near-cliques with the minimum size of a clique set to 3 and the maximum size of a near-clique set to 40. Only one near-clique contained more than 40 proteins, so it was split into 17 small near-cliques, resulting in a total of 116 near-cliques. Figure 3 shows an example of the network of yeast protein interactions with 6 near-cliques. Proteins in each near-clique share at least one function with other proteins within the near-clique.
As shown in Table 1, 68 (59%) out of the 116 near-cliques have at least one function shared by all the proteins in the near-clique (100% sharing), and 39 near-cliques have a function shared by more than 50% of the proteins in the near-clique. Supporting data are available at http://wilab.inha.ac.kr/ppi/homepage.mht. Only 9 near-cliques have no function shared by >50% of the proteins in the near-clique. As shown in Figure 4, the functional coherence of each near-clique is high. The functional coherence was computed as the ratio of the number of proteins having a specific functional category to the group size (i.e., the number of proteins in the group).
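The functional coherence ratio defined above can be computed as in the following Python sketch. The function name, the annotation dictionary, and the protein identifiers P1 to P4 are hypothetical; only the two FunCat category codes are taken from the text.

```python
from collections import Counter


def functional_coherence(annotations, group):
    """Map each functional category to (#proteins in group having it) / (group size).

    `annotations` maps a protein to an iterable of its functional categories;
    proteins without annotations contribute to the group size only.
    """
    counts = Counter(
        category
        for protein in group
        for category in set(annotations.get(protein, ()))
    )
    size = len(group)
    return {category: count / size for category, count in counts.items()}


# Hypothetical annotations for a group of four proteins.
annotations = {
    "P1": {"32.01", "01.07.01"},
    "P2": {"32.01"},
    "P3": {"32.01", "01.07.01"},
    "P4": {"32.01"},
}
coherence = functional_coherence(annotations, ["P1", "P2", "P3", "P4"])
print(coherence["32.01"], coherence["01.07.01"])  # 1.0 0.5
```

A category with coherence 1.0 corresponds to "100% sharing" in the sense used for Table 1, and a value above 0.5 corresponds to a function shared by more than 50% of the proteins.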
Interestingly, most near-cliques found by our algorithm belong to multifunctional categories. For example, two functional categories are common to all the proteins in a near-clique of Figure 5. As shown in Table 2, the near-clique identified as group 93 by our program is involved in both stress response (functional category 32.01) and biosynthesis of vitamins, cofactors, and prosthetic groups (functional category 01.07.01).
Table 1: Functional groups identified from the yeast protein interaction data (columns include the group ID and the proteins in the group). 68 groups have at least one function shared by all the proteins in the group (100% sharing), and 39 groups have a function shared by more than 50% of the proteins in the group. Only 9 groups have no function shared by >50% of the proteins in the group. This table shows only the one function with the highest functional coherence in each group. All the functions shared by more than 50% of the proteins in each group are available at http://wilab.inha.ac.kr/ppi/homepage.mht.
Journal of Biomedicine and Biotechnology
Near-cliques may correspond to protein complexes in addition to functional modules. We therefore compared the near-cliques identified by our algorithms with known yeast protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database (http://mips.gsf.de/genre/proj/yeast). For each near-clique, we found the best-matching protein complex by minimizing the probability of a random overlap between the two, P_overlap, computed with the equation of [4,5], where n1 and n2 are the sizes of a known protein complex and a computed module, k is the number of their common proteins, and N is the size of the network. As shown in Table 3, 65 near-cliques (56% of the total 116 near-cliques) identified by our algorithm show a good agreement (ln(P_overlap) < −14) with the protein complexes cataloged in MIPS.
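The random-overlap probability can be sketched in Python under an explicit assumption: the equation itself does not appear in the text above, so the sketch assumes the standard hypergeometric form P_overlap = C(n1, k) · C(N − n1, n2 − k) / C(N, n2), which matches the variables n1, n2, k, and N as defined. The function name `ln_p_overlap` and the example sizes are hypothetical.

```python
import math


def ln_p_overlap(n1, n2, k, N):
    """Natural log of the ASSUMED hypergeometric overlap probability.

    n1, n2: sizes of the known complex and the computed module
    k:      number of proteins they share
    N:      number of proteins in the network
    """
    if not (0 <= k <= min(n1, n2) and n2 - k <= N - n1 and max(n1, n2) <= N):
        raise ValueError("inconsistent sizes for an overlap of k proteins")
    # Work in log space so that large binomial coefficients do not overflow a float.
    return (
        math.log(math.comb(n1, k))
        + math.log(math.comb(N - n1, n2 - k))
        - math.log(math.comb(N, n2))
    )


# Hypothetical example: a 10-protein module sharing 8 proteins with a
# 10-protein complex in a network of 4,380 proteins.
value = ln_p_overlap(10, 10, 8, 4380)
print(value < -14)  # True
```

Under this assumed form, a pair counts as a good match when the returned value falls below the −14 threshold quoted in the text.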
To compare the functional coherence of the groups found by our program with that of cliques found by CFinder, we tested both programs on the same dataset. 75.9% of the groups identified by our program have at least two functional categories shared by all the proteins in the groups, whereas 63.1% of the groups identified by CFinder have at least two functional categories shared by all the proteins in the groups (Table 4). This result indicates that our program finds groups with stronger functional coherence than CFinder. Table 5 shows the actual running times of our program and CFinder on three datasets of yeast protein interactions. Our program is faster than CFinder on all datasets, and the difference in speed becomes more obvious as the dataset becomes bigger.

CONCLUSION
Identifying hidden topological structures of protein interaction networks often unveils biologically relevant functional groups and structural complexes. We developed an efficient heuristic algorithm for finding cliques and near-cliques in protein interaction networks. From the interaction data of yeast proteins, the algorithm identified 116 near-cliques. Comparison with the experimental data showed that 59% of the near-cliques have at least one function shared by all the proteins within a near-clique, and that 56% of the near-cliques show a good agreement with known protein complexes, which are cataloged in the MIPS Saccharomyces cerevisiae genome database.