Investigating Topological and Functional Features of Multimodular Proteins

Generating functional modules as functionally and structurally cohesive formations in protein interaction networks (PINs) is an important step towards understanding cell functionality. However, we also need to understand how individual modules communicate and are organised into the higher-order structure(s) of the PIN organisation that underlies cell functionality. In an attempt to contribute to this understanding, we assume that the proteins reappearing in several modules, termed here multimodular proteins (MMPs), may be useful in building higher-order structure(s), as they may constitute communication points between different modules. In this paper, we investigate common properties shared by these proteins and compare them with the properties of so-called single-modular proteins (SMPs) by analysing three aspects: the functional aspect, that is, annotation of the proteins; the topological aspect, that is, betweenness centrality of the proteins; and lethality. Furthermore, we investigate the interconnectivity role of some proteins that are identified as functionally and topologically important.


Introduction
One of the challenges that systems biology is facing consists of explaining biological organisation in the light of the existence of modules in networks [1][2][3][4]. A proposal that cellular function is carried out by modules [5] has fired a "modular era" of systems biology in which the focus has been on studying modularity at different levels of cellular organisation. A series of studies attempting to reveal the modules in cellular networks, ranging from metabolic [6] to protein networks [7,8], support the proposal that modular architecture is one of the principles underlying biological organisation.
To generate functional modules as functionally and structurally cohesive formations in PINs is an important step towards understanding how individual modules communicate and are organised on a higher level of the PIN organisation that underlies cell functionality. We here investigate whether the proteins that appear in several modules, that we term multimodular proteins (MMPs), may be useful in building higher-order structure(s) as they may constitute communication points between different modules.
In this paper, we investigate common properties shared by these proteins and compare them with the properties of single-modular proteins (SMPs), that is, proteins that occur in only one module, by analysing three aspects: the functional aspect, that is, annotation of the proteins using Gene Ontology (GO); the topological aspect, that is, betweenness centrality of the proteins, which is used to find topologically important proteins; and lethality. Furthermore, we investigate the interconnectivity role of some proteins that are identified as functionally and topologically important.

Experimental Data Sets.
The data set referred to as CORE data consists of protein-protein interactions that were downloaded from the Database of Interacting Proteins (DIP: http://dip.doe-mbi.ucla.edu/). DIP stores and organises experimentally determined interactions between proteins in Saccharomyces cerevisiae [9]. The majority of the interactions were identified with high-throughput yeast 2 Journal of Biomedicine and Biotechnology two-hybrid (Y2H) screens [10]. We used the subset of DIP-YEAST, denoted as CORE, which has been validated in [11]. After removal of 195 self-interactions, the CORE subset contained 6375 interactions between 2231 proteins.
The second data set, referred to as von Mering data, consists of protein interactions critically evaluated by von Mering et al. (2002) [11], where a quality assessment of large-scale data sets of protein-protein interactions in Saccharomyces cerevisiae was performed. In [12], data sets from yeast two-hybrid (Y2H) systems, protein complex purification techniques that rely on mass spectrometry (TAP and HMS-PCI), correlated mRNA expression profiles, genetic interactions, and in silico interaction predictions were analysed. As further stated in that study, each of these methods can be used to predict protein interactions, even though their goals are slightly different. While the main purpose of yeast two-hybrid and mass spectrometry is to identify physical binding between pairs of proteins, the remaining methods focus mainly on predicting functional associations, which in many cases also require physical binding [12]. The authors integrated about 80 000 interactions between yeast proteins and found that only 2455 were supported by more than one method. This low overlap between sets of protein interactions obtained from different methods may be due to a high fraction of false positives but may also be caused by the difficulty some methods have in capturing certain types of interactions. All interactions are classified by level of confidence (low, medium, high), based on the evidence that supports them. In this paper, we have used the interaction set with a high level of confidence, meaning that all interactions are confirmed by several methods. We will refer to this data set as "von Mering." The data set contains 2455 interactions between 988 proteins.

Algorithm for Module Identification.
In previous work by Bader and Hogue (2003), an algorithm for finding complexes in large-scale networks, called MCODE, based on the weighting of nodes with a core-clustering coefficient was proposed. The core-clustering coefficient of a node i is defined as the density of the highest k-core of the closed neighbourhood N[i]. The highest k-core of a graph is the central most densely connected subgraph. We have earlier proposed a weighted core-clustering coefficient for identifying topologically and functionally cohesive clusters [13]. The weighting scheme uses the weighted core-clustering coefficient of node i, which is defined as the weighted clustering coefficient of the highest k-core of the closed neighbourhood N[i] multiplied by the highest core number.
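As an illustration of the definition above, the following minimal sketch computes the unweighted core-clustering coefficient of a node from an adjacency dictionary. It is not the MCODE or SWEMODE implementation: the function names, the toy graph, and the choice of plain edge density are assumptions made for this example, and the semantic (GO-based) weighting is left out.

```python
def highest_k_core(adj):
    """Return (k, nodes) for the highest k-core of an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    best_k, best_nodes = 0, set(adj)
    k = 1
    while adj:
        # Repeatedly peel off nodes whose degree is below k.
        changed = True
        while changed:
            low = [u for u, vs in adj.items() if len(vs) < k]
            changed = bool(low)
            for u in low:
                for v in adj.pop(u):
                    adj[v].discard(u)
        if adj:  # a non-empty k-core exists
            best_k, best_nodes = k, set(adj)
        k += 1
    return best_k, best_nodes


def density(nodes, adj):
    """Fraction of possible edges that are present among `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(adj[u] & nodes) for u in nodes) / 2
    return 2 * edges / (n * (n - 1))


def core_clustering_coefficient(adj, i):
    """Density of the highest k-core of the closed neighbourhood N[i],
    returned together with the core number k."""
    closed = {i} | adj[i]
    sub = {u: adj[u] & closed for u in closed}
    k, core = highest_k_core(sub)
    return density(core, sub), k


# Toy graph (hypothetical): a triangle A-B-C with a pendant node D on A.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
coefficient, k = core_clustering_coefficient(toy, "A")
node_weight = coefficient * k  # unweighted analogue of the node weight
```

In the toy graph, the pendant node D is peeled away, so the highest k-core of N[A] is the triangle with k = 2 and density 1.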
We called the algorithm SWEMODE (Semantic Weights for MODule Elucidation). SWEMODE has three options concerning traversal of nodes that are considered for inclusion in a module, as described in [13]. Here, we use depth-first search; that is, the protein graph is searched starting from the seed node, which is the highest weighted node, followed by recursively traversing the graph outwards from the seed node, identifying new module members according to the given NWP (Node Weight Percentage) criterion. As in [14], the requirement for inclusion of the neighbours in a module is that their weights are higher than a threshold, which is a given NWP of the seed node. At this stage, once a node has been visited and added to the module, it cannot be added to another module [13]. However, in the postprocessing step, overlap is allowed to some extent. Because we here choose to go further by inspecting the interconnectedness, it is valuable to traverse not only the immediate neighbours but also other indirect neighbours.
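The traversal described above can be sketched as follows. This is a simplified illustration, not the SWEMODE code: the function name and toy data are hypothetical, the node weights are assumed to be precomputed, and the threshold is taken literally as NWP times the seed weight, which is one possible reading of the criterion.

```python
def grow_modules(adj, weight, nwp):
    """Grow non-overlapping modules by depth-first search from seed nodes.

    adj    : dict mapping node -> set of neighbours
    weight : dict mapping node -> precomputed node weight
    nwp    : node weight percentage in [0, 1] (assumed interpretation:
             a neighbour is accepted if its weight exceeds nwp * seed weight)
    """
    visited = set()
    modules = []
    # Seeds are tried in order of decreasing weight.
    for seed in sorted(weight, key=weight.get, reverse=True):
        if seed in visited:
            continue
        threshold = nwp * weight[seed]
        module, stack = [], [seed]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)  # a visited node cannot join another module
            module.append(node)
            for nb in sorted(adj[node]):
                if nb not in visited and weight[nb] > threshold:
                    stack.append(nb)
        modules.append(module)
    return modules


# Toy network with hypothetical weights.
toy_adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
toy_weight = {"A": 4.0, "B": 3.5, "C": 3.0, "D": 1.0, "E": 0.9}
toy_modules = grow_modules(toy_adj, toy_weight, nwp=0.5)
```

Because the search continues from every accepted node, indirect neighbours of the seed can join a module, in line with the motivation given above. Overlap between modules would only be introduced by the later postprocessing step, which is not part of this sketch.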
In a postprocessing step, modules that contain fewer than three members may be removed, both before and after applying a so-called "fluffing" step. The degree of "fluffing" is referred to as the "fluff" parameter and can vary between 0.0 and 1.0 [14]. For every member in the module, its immediate neighbours are added to the module if they have not been visited and if their neighbourhood weighted cohesiveness is higher than the given fluff threshold f.
To identify topologically and functionally important proteins, we calculated the number of module occurrences for each protein across 200 sets of overlapping modules (the fluff parameter was varied between 0 and 1 in increments of 0.1 and the NWP parameter was varied between 0 and 0.95 in increments of 0.05). All three GO aspects were combined into a single weight for each protein. All modules that only contain a single member are removed from further analysis.
For each protein, we calculated the number of times it appears in different modules in each module set, summed these counts, and divided the sum by the number of module sets the protein appears in. For example, if protein Nup100 is a member of 10 modules in one module set and 20 modules in another module set, the average number of module occurrences of the protein will be (10 + 20)/2 = 15.
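The averaging step can be written down in a few lines. The sketch below assumes that each module set is available as a list of modules and each module as a collection of protein names; the function name and the placeholder partner proteins are hypothetical.

```python
from collections import Counter


def average_module_occurrences(module_sets):
    """Average number of modules per protein, taken over the module sets
    in which the protein appears at least once."""
    total_memberships = Counter()  # module memberships summed over all sets
    sets_with_protein = Counter()  # number of module sets containing the protein
    for modules in module_sets:
        per_set = Counter(p for module in modules for p in set(module))
        for protein, count in per_set.items():
            total_memberships[protein] += count
            sets_with_protein[protein] += 1
    return {p: total_memberships[p] / sets_with_protein[p]
            for p in total_memberships}


# Reproduces the Nup100 example: 10 modules in one set, 20 in another.
set_one = [{"Nup100", f"partner{i}"} for i in range(10)]
set_two = [{"Nup100", f"partner{i}"} for i in range(20)]
averages = average_module_occurrences([set_one, set_two])
print(averages["Nup100"])  # 15.0
```

Under this definition, a protein whose average exceeds 1 belongs to more than one module in at least one module set.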

Betweenness Centrality.
Betweenness centrality has been applied in the context of social networks to measure the centrality and influence of a person or a group [15]. The betweenness centrality of a node v was originally defined by Freeman (1977) as the number of shortest paths between other nodes that pass through v, and it is given by

$$C_B(v) = \sum_{i \neq v \neq j} \frac{g_{ivj}}{g_{ij}},$$

where $g_{ivj}$ is the number of shortest paths linking i and j that contain v, and $g_{ij}$ is the total number of shortest paths between i and j. High-betweenness nodes occur on a large number of nonredundant shortest paths between other nodes. If a node with high betweenness centrality is removed, it may disconnect different parts of the network completely. Thus, such nodes may be thought of as potential bridges between modules in the network and have the most influence on the information transfer.
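For readers who wish to reproduce this measure, the sketch below computes unnormalised betweenness centrality for an unweighted, undirected graph using Brandes' accumulation scheme. The choice of algorithm and the toy graphs are assumptions made for illustration; the text does not state which implementation was used in this study.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adj maps each node to the set of its neighbours. Every unordered
    pair (i, j) contributes g_ivj / g_ij to node v exactly once.
    """
    centrality = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {v: [] for v in adj}
        n_paths = dict.fromkeys(adj, 0)
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        dependency = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair was visited from both of its end points.
    return {v: c / 2 for v, c in centrality.items()}


# Toy examples: a path A-B-C and a four-node cycle.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
cycle = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
path_scores = betweenness_centrality(path)
cycle_scores = betweenness_centrality(cycle)
```

In the path, B lies on the only shortest path between A and C and scores 1. In the cycle, each node carries one of the two shortest paths between its two neighbours and scores 0.5.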

Lethality.
We obtained lethality data from the MIPS database [16]. The manually curated MIPS database lists 1015 lethal proteins. The list of MMPs and SMPs observed across modules in both data sets was compared to the list of lethal proteins.

CORE Data Set.
We started by analysing annotations with the help of SGD GO Term Finder (http://www.yeastgenome.org/help/goTermFinder.html), in order to identify the most significantly shared GO terms among the MMPs with varying numbers of module occurrences. The subontology "biological process" was chosen. The majority of the most frequent multimodular proteins (top 10) are annotated with the GO biological process term "cell organization and biogenesis," which has the following GO definition: "the processes involved in the assembly and arrangement of cell structures, including the plasma membrane and any external encapsulation structures such as the cell wall and cell envelope," as described in [17]. Table 1 shows the top ten MMPs, where 80% (highlighted proteins) belong to the above-mentioned class. GO Frequency in Table 1 shows the percentage of those proteins that are annotated with the given GO term. The most significantly shared term is obtained by examining the group of proteins to find the GO term to which the highest fraction of the proteins is associated, compared to the number of times that the term is associated with other yeast proteins. The significance (P value) of the shared GO term describing the biological process for the ten most frequent proteins is shown in the last row of Table 1.
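The significance of a shared term of this kind is commonly expressed as a hypergeometric tail probability, that is, the chance of seeing at least k annotated proteins in a group of n when K of the N background proteins carry the term. The sketch below shows this calculation under that assumption. It is not the GO Term Finder implementation, it applies no multiple-testing correction, and all numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for a hypergeometric variable X.

    k : proteins in the group annotated with the term
    n : size of the group
    K : proteins in the background annotated with the term
    N : size of the background
    """
    upper = min(n, K)
    favourable = sum(comb(K, x) * comb(N - K, n - x)
                     for x in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 8 of 10 group members share a term that is
# carried by 1500 of 6000 background proteins.
p_value = enrichment_p_value(k=8, n=10, K=1500, N=6000)
```

A small P value indicates that the term is shared by more group members than expected from its background frequency, which is the sense in which a term is called significantly shared above.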
In addition, we have repeated the same evaluation procedure by adding proteins with decreasing module frequency to analyse how the annotation statistics are affected by adding those proteins. The summary of those results may be found in Table 6. The first column shows the statistics for the top 50 proteins, where all proteins are present in approximately 2 modules on average. Still, the majority of the proteins share the GO term "cell organization and biogenesis", which is also the most significant term (P = 1.3 · 10^−11), and the GO frequency has increased slightly from 80% to 82%. For comparison, 50 random SMPs were evaluated with the same procedure. Here we found that the most significant term, shared among 96% of those proteins, is the GO biological process term "cellular process" (P = 2.1 · 10^−5), which may not help us to derive any conclusions about the more specific roles of those proteins. Also in this subset of proteins, we found that the GO term "cell organization and biogenesis" is shared among proteins, but the GO frequency for this term is 63%, compared to 82% of the most frequent MMPs that are annotated with this term.
GO term frequency for the most significant terms decreases gradually as we add more proteins with decreasing module frequency. Several nonsignificant annotation terms appear as we add proteins with decreasing module frequency, meaning that those proteins have more dispersed annotation, while high-frequent MMPs seem to have more consistent annotation dominated by their participation in cellular organisation.
Cdc28, which appears most frequently in modules, is one of five different cyclin-dependent protein kinases (CDKs) in yeast and has a fundamental role in the control of the main events of the yeast cell cycle [18]. Topologically, it acts as a hub; that is, it holds together several functionally related clusters in the interaction network. In previous work, this protein was suggested to be a part of the intramodule path within the yeast filamentation network, because it had the highest intracluster connectivity; that is, it was the protein with the highest number of interactions with other members of the same cluster [1]. It is therefore highly interesting that we have identified this protein as the most frequent in our modules, as described in [17].
We further evaluated the proteins by analysing their MIPS functional categories [16], to determine what functional characteristics may be derived by studying proteins based on their module frequency. We observed that proteins involved in cellular organisation (O) appear more frequently among the top 100 MMPs, compared to the random set of SMPs. Among SMPs, we found that transcription seems to be enriched, as 13% of proteins are annotated with T-transcription and 10% with B-transcriptional control. This result supports our findings based on studying GO biological process annotation, where "cell organization and biogenesis" was the most significant term among multimodular proteins. We have also found a lower percentage of uncharacterised proteins in the chart that shows the statistics for the 100 most frequent MMPs (see Figure 1), while none of the proteins in the top 50 MMPs is uncharacterised (see Figure 7). This indicates that the more often a protein takes part in different modules, the higher the probability that the protein has a defined function. In the same chart (see Figure 1(a)), the proteins that belong to amino acid metabolism and energy production are absent. By studying Figure 7, we can conclude that there is a high fraction of the proteins belonging to the cellular organisation category in each of the module frequency intervals. To make the charts comparable, we have sorted the proteins in decreasing order of module frequency and divided them into four groups of high-frequency proteins, where each group contains 50 proteins (see pie charts in the first row), and four different groups that contain SMPs (see pie charts in the bottom row).
The fraction of proteins that belong to the category "cellular organisation" in multimodular proteins is consistently higher (varies between 18% and 26%) than the fraction of such proteins in the single-modular groups of proteins (varies between 4% and 8%).
For comparison, the method proposed here is set against another related method. In previous work by Pržulj et al. (2004) [18], topologically important proteins are identified by using the most frequent "bottle neck" nodes [19]. The method starts from a tree of the shortest paths for each node v. Such a tree consists of n_v nodes that are directly or indirectly connected to v. All nodes w from the tree, such that more than n_v/4 paths from v to other nodes meet at node w, are defined as "bottle necks". Pržulj et al. (2004) presented only the top ten most frequent "bottle neck" proteins, and stated that 70% of those are involved in supporting cellular structure and organisation. We here evaluate the annotations for different groups of proteins based on how often they appear in different modules (see Table 2). After each specific GO term in the first column, the total percentage of all proteins that are annotated with this term is given. It can be noticed that the percentage of proteins that are annotated with the chosen terms drops for the proteins with module frequency ≤1, with the exception of the term in the last row, "primary metabolic process", which is the most common of all presented terms.
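The "bottle neck" definition summarised above can be illustrated with the following sketch. It is an interpretation under stated assumptions rather than the procedure of Pržulj et al.: a single breadth-first tree is built per node (shortest-path trees are generally not unique), the number of paths meeting at w is taken to be the number of tree descendants of w, and the function name and toy graph are hypothetical.

```python
from collections import Counter, deque


def bottleneck_counts(adj):
    """Count, for every node w, in how many shortest-path trees w is a
    bottleneck, i.e. more than n_v / 4 tree paths from the root v to
    other nodes pass through w."""
    counts = Counter()
    for root in adj:
        # Build one breadth-first (shortest-path) tree rooted at `root`.
        parent = {root: None}
        order = []
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in sorted(adj[u]):
                if w not in parent:
                    parent[w] = u
                    queue.append(w)
        n_v = len(order)  # nodes directly or indirectly connected to root
        # paths_through[w] = number of descendants of w in the tree.
        paths_through = dict.fromkeys(order, 0)
        for u in reversed(order):  # children are handled before parents
            if parent[u] is not None:
                paths_through[parent[u]] += paths_through[u] + 1
        for w in order:
            if w != root and paths_through[w] > n_v / 4:
                counts[w] += 1
    return counts


# Toy graph: two hubs H1 and H2, three leaves each, joined by one edge.
two_hubs = {
    "H1": {"a", "b", "c", "H2"},
    "H2": {"d", "e", "f", "H1"},
    "a": {"H1"}, "b": {"H1"}, "c": {"H1"},
    "d": {"H2"}, "e": {"H2"}, "f": {"H2"},
}
frequencies = bottleneck_counts(two_hubs)
```

In this toy graph, only the two hubs are ever counted as bottlenecks; ranking nodes by these frequencies gives the "most frequent" bottleneck nodes referred to above.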
We also present a more systematic comparison between our protein groups, chosen based on their average occurrence in the modules, and the bottle neck proteins (see Table 3). The top 25 proteins obtained by our approach significantly share the term "ribonucleoproteins complex biogenesis and assembly", which is a child term of "cellular component organization and biogenesis". No significantly shared ontology terms appear in the corresponding set of bottle-neck proteins.

CORE Data Set.
We started by investigating general properties of the data set by studying the relation between degree and betweenness centrality. Figure 2 shows degree k versus betweenness centrality plotted on a logarithmic scale. The few highly connected nodes (hubs) in the PIN must have high betweenness values because there are many nodes directly and exclusively connected to these hubs and the shortest paths between these nodes go through these hubs. However, the low-connectivity nodes also exhibited a wide range of betweenness values in the yeast PIN. In Figure 3, node betweenness centrality is plotted as a function of the average number of module occurrences. We can notice that all proteins with average module frequency ≥ 2 have considerably high betweenness values. However, the single-modular nodes also exhibited a wide range of betweenness values in the yeast PIN.

Von Mering Data Set.
We repeated the same experiment for the von Mering data set. In Figure 4, betweenness is plotted as a function of degree k. Here, we could not use any characteristic degree k or any interval of k values to denote the importance of nodes (based on the betweenness).  Also in Figure 5, besides the most frequent multimodular proteins (MMPs) that have high betweenness values, there is a wide range of betweenness centrality values for single-modular proteins (SMPs) as well. However, modular frequency seems to be a better indicator of node importance, in terms of betweenness centrality.

Lethality.
The manually curated MIPS database lists 1015 lethal proteins. The list of MMPs and SMPs observed across modules in both data sets was compared to the list of lethal proteins. The results from this comparison are presented in Tables 4 and 5.
In the CORE data set, we found 222 lethal proteins among the multimodular proteins (MMPs). This corresponds to 46.3%, as there are 480 frequently occurring proteins in total. The corresponding percentage for MMPs derived from modules in the von Mering data set is 68.7, as there are 57 lethal proteins among the 83 MMPs (see Table 4).
We made the same comparison for single-modular proteins (SMPs) across the modules based on both data sets. In the CORE data set, we found 173 lethal proteins among the SMPs, which corresponds to 34.5%, as there are 502 SMPs in total (see Table 5). The corresponding percentage of lethal proteins among SMPs derived from modules in the von Mering data set is 54.5, as there are 116 lethal proteins among the 213 SMPs, as shown in Table 5.
In both cases, the difference is statistically significant at a 95% confidence level, meaning that there is a significantly larger proportion of lethal proteins, also referred to as important proteins, among multimodular proteins. These results are obtained by performing a z-test for the difference between the two proportions (z = 3.8 in the CORE data set, and z = 2.2 in the von Mering data set).
Figure 6 shows the result of an example run of the module-identifying method, where Cdc28 was predicted as taking part in six modules matching MIPS complexes. In addition, this protein occurs 830 times in 200 module sets and hereby has the highest average number of module occurrences (4.2). Cdc28 is a cyclin-dependent kinase and it is believed to be a key regulator of the cell-division cycle. In this example, it is connected to several proteins from the Origin Recognition Complex (ORC), which is involved in DNA replication. Cdc28 is also connected to the actin cytoskeleton-associated complex, which is reorganised in accordance with cell-cycle progression. According to a previous study, this process is believed to be controlled, directly or indirectly, by Cdc28 [22]. Furthermore, there is an important connection between Cdc28 and the proteasome complex. The central role of this complex is to direct a cell to proceed with the decision to replicate itself. In yeast cells, a critical trigger for cell replication is degradation of Sic1, which is a protein that inhibits the chemical activity of Cdc28. After the biochemical Sic1 "brake" has been eliminated by the action of SCF and the proteasome, the kinase is free to trigger the progress toward DNA replication and associated events of cell replication. This is a clear example of a network hub that interconnects several functional modules. This example is supported by several topological and functional features, such as average number of occurrences in modules, betweenness centrality, and node degree.
However, there are several examples where those features are conflicting, which will be interesting to evaluate in future.
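The z values reported above for the lethality comparison can be checked with a standard two-proportion z-test. The sketch below uses the pooled estimate of the standard error, which is an assumption (the text does not specify the variant), although it does reproduce both reported values from the counts given in Tables 4 and 5.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two proportions x1/n1 and
    x2/n2, using the pooled proportion for the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Lethal MMPs versus lethal SMPs, counts as given in the text.
z_core = two_proportion_z(222, 480, 173, 502)
z_von_mering = two_proportion_z(57, 83, 116, 213)
print(round(z_core, 1), round(z_von_mering, 1))  # 3.8 2.2
```

Both values exceed 1.96, the two-sided critical value at the 95% confidence level, which agrees with the significance statement made for the two data sets.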

Conclusions
We have here presented approaches for identifying topologically and functionally important proteins by calculating the frequency of each protein across 200 sets of overlapping modules. Initial results show that the majority of frequently appearing proteins that connect several modules are involved in the assembly and arrangement of cell structures, such as the cell wall and cell envelope, which indicates that they are involved in supporting the cell structure rather than, for example, signal transduction. We also observed, by studying MIPS functional classes of the MMPs and SMPs, that proteins involved in cellular organisation (O) appear more frequently among the top 100 MMPs, compared to the random sets of SMPs. The results from studying lethality show a significantly higher fraction of lethal proteins among multimodular proteins (MMPs) than among single-modular proteins (SMPs), reflecting the tendency of MMPs to be lethal more often and thereby indicating their essentiality.
The investigation of different features of so-called multimodular proteins, that is, proteins that take part in multiple modules within the PIN, shows that these may be involved in the assembly and arrangement of cell structures (according to GO annotation) to a greater extent than single-modular proteins or proteins with lower numbers of occurrences across the generated module sets. Also, the analysis of MIPS functional categories, along with the analysis of GO annotation, shows that the fraction of the proteins that belong to the category "cellular organisation" in multimodular proteins is higher than the fraction of such proteins in the single-modular groups of proteins. Another frequently occurring GO term that is assigned to multimodular proteins is "ribonucleoproteins complex biogenesis and assembly", which is a child term of "cellular component organisation and biogenesis". Hence, we find evidence supporting the hypothesis that this GO term reveals the role of modules in building and supporting higher-order structure(s) of the PIN organisation. Other features that we have analysed to characterise possible differences between multimodular and single-modular proteins are betweenness centrality and lethality. In both data sets, it is shown that there is a significantly higher fraction of lethal proteins among multimodular proteins, also pointing at their significance. From the analysis of betweenness centrality, it is also notable that proteins with high average module frequency have considerably high betweenness values, while the single-modular nodes exhibit a wide range of betweenness values in the yeast PIN. This also points to the greater importance of the multimodular proteins, as those nodes may be potential bridges between modules in the network and have the most influence on the information transfer between communicating modules. If a node with high betweenness centrality is removed, it may disconnect different parts of the network completely.
Possible limitations of this approach should finally be discussed. The method for assigning weights to proteins relies to a great extent on GO terms; these weights are used for module identification, which in turn constitutes the basis for identifying the multimodular feature of the proteins. Proteins may be annotated at different levels in the hierarchy; that is, some are more specifically described than others. Another limitation is that the quality of GO annotation in terms of experimental evidence may vary. Currently, all evidence types are used, but some types of evidence, such as "traceable author statement", are considered more reliable than others. As we used protein-protein interactions that are validated by different methods and are generally well annotated, this should not affect the performance of the module-identifying method to a great extent, but the method may benefit from future, more fine-grained versions of GO.
In future, it would be very interesting to make a systematic comparison with other module-identifying methods and other topological features used to identify essential proteins in protein interaction networks.