Generating functional modules as functionally and structurally cohesive formations in protein interaction networks (PINs) is an important step towards understanding cell functionality. However, we also need to understand how individual modules communicate and are organised into the higher-order structure(s) of the PIN organisation that underlies cell functionality. To contribute to this understanding, we assume that proteins recurring in several modules, termed here multimodular proteins (MMPs), may be useful in building higher-order structure(s), as they may constitute communication points between different modules. In this paper, we investigate common properties shared by these proteins and compare them with the properties of so-called single-modular proteins (SMPs) by analysing three aspects: the functional aspect, that is, the annotation of the proteins; the topological aspect, that is, the betweenness centrality of the proteins; and lethality. Furthermore, we investigate the interconnectivity role of some proteins that are identified as functionally and topologically important.
One of the challenges that systems biology is facing consists of explaining biological organisation in the light of the existence of modules in networks [
Generating functional modules as functionally and structurally cohesive formations in PINs is an important step towards understanding how individual modules communicate and are organised on a higher level of the PIN organisation that underlies cell functionality. Here we investigate whether the proteins that appear in several modules, which we term multimodular proteins (MMPs), may be useful in building higher-order structure(s), as they may constitute communication points between different modules.
In this paper, we investigate common properties shared by these proteins and compare them with the properties of single-modular proteins (SMPs), that is, proteins that occur in only one module. We analyse three aspects: the functional aspect, that is, the annotation of the proteins using Gene Ontology (GO); the topological aspect, that is, the betweenness centrality of the proteins, which is used to find topologically important proteins; and their lethality. Furthermore, we investigate the interconnectivity role of some proteins that are identified as functionally and topologically important.
The data set referred to as CORE data consists of protein-protein interactions that were downloaded from the Database of Interacting Proteins (DIP:
The second data set, referred to as von Mering data, consists of protein interactions critically evaluated by von Mering et al. (2002) [
In previous work by Bader and Hogue (2003), an algorithm for finding complexes in large-scale networks, called MCODE, based on the weighting of nodes with a core-clustering coefficient was proposed. The core-clustering coefficient of a node
We called the algorithm SWEMODE (Semantic Weights for MODule Elucidation). SWEMODE has three options concerning traversal of nodes that are considered for inclusion in a module, as described in [
In a postprocessing step, modules that contain fewer than three members may be removed, both before and after applying a so-called “fluffing” step. The degree of “fluffing” is referred to as the “fluff” parameter and can vary between 0.0 and 1.0 [
To identify topologically and functionally important proteins, we calculated the number of module occurrences for each protein across 200 sets of overlapping modules (the fluff parameter was varied between 0 and 1 in increments of 0.1 and the
For each seed protein, we calculated the number of times each protein appears in different modules in each module set, divided by the number of module sets it appears in. For example, if protein Nup100 is a member of 10 modules in one module set and 20 modules in another module set, the average number of module occurrences of the protein will be
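The averaging described above can be illustrated with a minimal Python sketch. The function name, the data layout (a list of module sets, each a list of sets of protein names) and the toy partner names are assumptions made for illustration only and are not taken from SWEMODE itself.

```python
from collections import defaultdict


def average_module_occurrences(module_sets):
    """Average number of modules a protein belongs to per module set.

    module_sets: list of module sets; each module set is a list of modules,
    and each module is a set of protein names (hypothetical layout).
    The average is taken only over the module sets in which the protein
    appears at least once, as described in the text.
    """
    total_memberships = defaultdict(int)   # sum of module memberships
    sets_with_protein = defaultdict(int)   # module sets containing the protein
    for modules in module_sets:
        counts = defaultdict(int)
        for module in modules:
            for protein in set(module):
                counts[protein] += 1
        for protein, n in counts.items():
            total_memberships[protein] += n
            sets_with_protein[protein] += 1
    return {p: total_memberships[p] / sets_with_protein[p]
            for p in total_memberships}


# Toy data mirroring the Nup100 example: 10 modules in one module set,
# 20 modules in another (partner names are placeholders).
set_a = [{"Nup100", f"a{i}"} for i in range(10)]
set_b = [{"Nup100", f"b{i}"} for i in range(20)]
result = average_module_occurrences([set_a, set_b])
print(result["Nup100"])  # (10 + 20) / 2 = 15.0
```

Proteins that occur in exactly one module of every module set in which they appear obtain an average of 1.0 under this definition.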
Betweenness centrality has been applied in the context of social networks, to measure the centrality and influence of a person or a group [
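As an illustration of how betweenness centrality can be computed for an unweighted, undirected network, the following sketch implements Brandes' algorithm in plain Python. It is not necessarily the implementation used in this study; the adjacency-dictionary layout and the node names in the toy network are hypothetical.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adj: dict mapping each node to an iterable of its neighbours; every
    neighbour must also be a key of adj (assumed layout).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Toy network: two triangles joined through a single bridging node X.
toy = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "X"],
    "X": ["C", "D"],
    "D": ["X", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
bc = betweenness_centrality(toy)
print(bc["X"], bc["C"], bc["A"])  # 9.0 8.0 0.0
```

In the toy network, the bridging node X lies on every shortest path between the two triangles and therefore receives the highest value, which is the property that makes the measure useful for finding potential connectors between modules.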
We obtained lethality data from the MIPS database [
We started by analysing annotations with the help of the SGD GO Term Finder (
Annotation statistics for top ten multimodular proteins.
Proteins | Cdc28 | Nap1 | Prp43 | Pre1 | Pwp2 | Sed5 | Tfp1 | Nop4 | Utp7 | Rpc40 |
Module frequency | 4.2 | 3.9 | 2.9 | 2.7 | 2.7 | 2.6 | 2.6 | 2.6 | 2.5 | 2.5 |
GO biological process | cell organization and biogenesis | |||||||||
GO frequency | 80 | |||||||||
In addition, we repeated the same evaluation procedure while adding proteins with decreasing module frequency, to analyse how the annotation statistics are affected by those proteins. The summary of those results may be found in Table
The GO term frequency for the most significant terms decreases gradually as we add more proteins with decreasing module frequency. Several nonsignificant annotation terms also appear, meaning that those proteins have more dispersed annotation, whereas high-frequency MMPs seem to have more consistent annotation dominated by their participation in cellular organisation.
Cdc28, which appears most frequently in modules, is one of five different cyclin-dependent protein kinases (CDKs) in yeast and has a fundamental role in the control of the main events of the yeast cell cycle [
We further evaluated the proteins by analysing their MIPS functional categories [
We have also found a lower percentage of uncharacterised proteins in the chart that shows the statistics for the 100 most frequent MMPs (see Figure
Statistics for MIPS functional categories: D: genome maintenance; T: transcription; F: protein fate; C: cellular fate/organisation; O: cellular organisation; G: amino acid metabolism; M: other metabolism; E: energy production; R: stress and defence; B: transcriptional control; P: translation; A: transport and sensing; U: uncharacterized.
One of the important goals in systems biology is to find relations between the topological properties and functional features of genes and proteins in the networks. In previous network studies, the focus has been on highly connected proteins, so-called “hubs”, and proteins with high betweenness centrality, so-called “bottle necks” [
For this purpose, the method proposed here is compared with another related method. In previous work by Pržulj et al. (2004) [
Statistics for the most significant GO terms based on GO biological process. Module frequency decreases from left to right, and the last column contains a group of proteins that occur in only one module or are not present in any of the modules.
Module frequency | ||||||
GO biological process | GO term frequency | |||||
Ribonucleoprotein complex biogenesis and assembly (5.5%) | ||||||
Cellular component organization and biogenesis (30%) | ||||||
Organelle organization and biogenesis (17.8%) | ||||||
RNA metabolic process (14.2%) | ||||||
Primary metabolic process (44%) | ||||||
We also present a more systematic comparison between our protein groups, chosen based on their average occurrence in the modules, and the bottle neck proteins (see Table
Comparison between the top 100 most frequent multimodular proteins and the most frequent “bottle neck” proteins, identified by Pržulj et al. (2004).
Module freq. | ||||||||
“bottle necks” | ||||||||
GO biological Process | GO term frequency | |||||||
Cellular process (64.1%) | — | — | — | |||||
Ribonucleoprotein complex biogenesis and assembly (5.5%) | ||||||||
Cellular component organization and biogenesis (30%) | — | — | ||||||
Organelle organization and biogenesis (17.8%) | — | — | ||||||
Cellular metabolic process (46.6%) | — | — | — | |||||
RNA metabolic process (14.2%) | — | — | — | |||||
Primary metabolic process (44%) | — | — | — | |||||
We started by investigating general properties of the data set by studying the relation between degree and betweenness centrality. Figure
Degree (
In Figure
Average number of module occurrences versus betweenness centrality plotted on a logarithmic scale.
We repeated the same experiment for the von Mering data set. In Figure
Degree (
Also in Figure
Average number of module occurrences versus betweenness centrality plotted on a logarithmic scale.
We obtained 1015 lethal proteins from the manually curated MIPS database. The lists of MMPs and SMPs observed across modules in both data sets were compared to the list of lethal proteins. The results from this comparison are presented in Tables
Lethality among multimodular proteins (MMPs) across both data sets.
Data set | No. of MMPs | No. of lethal proteins | Percentage |
---|---|---|---|
CORE | 480 | 222 | 46.3% |
Von Mering | 83 | 57 | 68.7% |
Lethality among single-modular proteins (SMPs) across both data sets.
Data set | No. of SMPs | No. of lethal proteins | Percentage |
---|---|---|---|
CORE | 502 | 173 | 34.5% |
Von Mering | 213 | 116 | 54.5% |
Annotation statistics for multimodular proteins of different module frequency versus single-modular proteins from the yeast CORE data set. Statistics for the most significant annotation terms of the multimodular proteins with varying occurrence intervals, compared to the corresponding statistics for single-modular proteins (CORE data set).
Module frequency | ||||||||||
No. proteins | ||||||||||
GO biological process | GO term frequency | |||||||||
GO:0016043 cellular component organization and biogenesis (30%) | ||||||||||
GO:0006996 organelle organization and biogenesis (17.8%) | ||||||||||
GO:0043283 biopolymer metabolic process (30.2%) | ||||||||||
GO:0006139 nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (20.7%) | ||||||||||
GO:0016070 RNA metabolic process (14.2%) | ||||||||||
GO:0044238 primary metabolic process (44.0%) | ||||||||||
In the CORE data set, we found 222 lethal proteins among the multimodular proteins (MMPs). This corresponds to 46.3%, as there are 480 MMPs in total. The corresponding percentage for MMPs derived from modules in the von Mering data set is 68.7%, as there are 57 lethal proteins among the 83 MMPs (see Table
We made the same comparison for single-modular proteins (SMPs) across the modules based on both data sets. In the CORE data set, we found 173 lethal proteins among the SMPs, which corresponds to 34.5%, as there are 502 SMPs in total (see Table
In both cases, the difference is statistically significant at a 95% confidence level, meaning that there is a significantly larger proportion of lethal proteins, also referred to as important proteins, among multimodular proteins. These results are obtained by performing a
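The comparison of lethal fractions can be reproduced approximately from the counts reported in the lethality tables. The sketch below uses a pooled two-proportion z-test, which is an assumption made here for illustration; it is one common choice for comparing two proportions and is not necessarily the test used in this study. The function name is hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test (assumed test, two-sided p-value).

    x1 of n1 items in group 1 and x2 of n2 items in group 2 have the
    property of interest (here: lethality).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts taken from the lethality tables: lethal MMPs versus lethal SMPs.
core = two_proportion_z_test(222, 480, 173, 502)
von_mering = two_proportion_z_test(57, 83, 116, 213)
print("CORE:       z = %.2f, p = %.4f" % core)
print("von Mering: z = %.2f, p = %.4f" % von_mering)
```

Under these assumptions, the z statistic exceeds the 1.96 threshold in both data sets, which is in line with the significance at the 95% confidence level reported above.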
Figure
Modular network involving the modules in which Cdc28 occurs.
Functional groups statistics for proteins in von Mering data set. The first row shows charts with statistics for multimodular proteins (MMP) in varying intervals of module frequency (in decreasing order of frequency). There are 50 proteins in each interval. For comparison, the second row shows the corresponding statistics for the same number of single-modular proteins.
This is a clear example of a network involving a hub that interconnects several functional modules. This example is supported by several topological and functional features, such as the average number of occurrences in modules, betweenness centrality, and node degree. However, there are several examples where those features conflict, which will be interesting to evaluate in the future.
We have presented approaches for identifying topologically and functionally important proteins by calculating the frequency of each protein across 200 sets of overlapping modules. Initial results show that the majority of frequently appearing proteins that connect several modules are involved in the assembly and arrangement of cell structures, such as the cell wall and cell envelope, which indicates that they support the cell structure rather than, for example, signal transduction. By studying the MIPS functional classes of the MMPs and SMPs, we also observed that proteins involved in cellular organisation (O) appear more frequently among the top 100 MMPs than in the random sets of SMPs. The results from studying lethality show a significantly higher fraction of lethal proteins among multimodular proteins (MMPs) than among single-modular proteins (SMPs), reflecting the tendency of MMPs to be lethal and thereby indicating their essentiality.
The investigation of different features of so-called multimodular proteins, that is, proteins that take part in multiple modules within the PIN, shows that these may be involved in the assembly and arrangement of cell structures (according to GO annotation) to a greater extent than single-modular proteins or proteins with lower numbers of occurrences across the generated module sets. The analysis of MIPS functional categories, along with the analysis of GO annotation, also shows that the fraction of proteins belonging to the category “cellular organisation” is higher among multimodular proteins than in the single-modular groups of proteins. Another frequently occurring GO term assigned to multimodular proteins is “ribonucleoprotein complex biogenesis and assembly”, which is a child term of “cellular component organization and biogenesis”. Hence, we find evidence supporting the hypothesis that this GO term reveals the role of modules in building and supporting higher-order structure(s) of the PIN organisation. Other features that we have analysed to characterise possible differences between multimodular and single-modular proteins are betweenness centrality and lethality. In both data sets, there is a significantly higher fraction of lethal proteins among multimodular proteins, which also points to their significance. From the analysis of betweenness centrality, it is also notable that proteins with high average module frequency have considerably high betweenness values, while the single-modular nodes exhibit a wide range of betweenness values in the yeast PIN. This also points to the greater importance of the multimodular proteins, as those nodes may be potential bridges between modules in the network and have the most influence on the information transfer between communicating modules. Removing a node with high betweenness centrality may completely disconnect parts of the network.
Finally, possible limitations of this approach should be discussed. The method for assigning weights to proteins, which are used for module identification and in turn form the basis for identifying the multimodular feature of the proteins, relies to a great extent on GO terms. Proteins may be annotated at different levels in the hierarchy, that is, some are more specifically described than others. Another limitation is that the quality of GO annotation in terms of experimental evidence may vary. Currently, all evidence types are used, but some types of evidence, such as “traceable author statement”, are considered more reliable than others. As we used protein-protein interactions that are validated by different methods and are generally well annotated, this should not affect the performance of the module-identifying method to a great extent, but the method may benefit from future, more fine-grained versions of GO.
In the future, it would be very interesting to make a systematic comparison with other module-identifying methods and other topological features used to identify essential proteins in protein interaction networks.