Multiscale fragPIN Modularity

Modularity in protein interactome networks (PINs) is a central theme involving aspects such as the study of the resolution limit, the comparative assessment of module-finding algorithms, and the role of data integration in systems biology. It is less common to study the relationships between the topological hierarchies embedded within the same network. Such hierarchies are not unusual, particularly in PINs, which can be considered assemblies of various interactions depending on specialized biological processes. The integrated view offered so far by modularity maps generally represents a synthesis of a variety of possible interaction maps, each reflecting a certain biological level of specialization. The driving hypothesis of this work builds on such network components. Subnetworks are therefore generated by fragmentation, a process aimed at isolating parts of a common network source; these parts are here called fragments, hence the acronym fragPIN. The characteristics of modularity in each obtained fragPIN are elucidated and compared. Finally, because it was hypothesized that different timescales may underlie the biological processes from which the fragments are computed, the analysis was centered on an example involving the fluctuation dynamics inherent to the signaling process; it aimed to show how timescales can be identified from such dynamics, in particular by assigning the interactions to timescales based on selected topological properties.


Introduction
PINs [1] are almost pervasively studied in genomics, but, especially when H. sapiens is considered, they present limitations due to sparse coverage and suboptimal accuracy of both experimental measurements (yeast two-hybrid, for instance) and in silico measurements (literature mining, orthology, etc.) [2,3]. This overall uncertainty is reflected in a pathological presence of false positives and negatives and ultimately complicates data mining and analysis tasks. In order to bypass the complexities induced by such factors, data integration strategies are widely pursued (for instance, the studies in [4,5] have become quite popular). However, a difficulty comes from the fact that the integrated entities are usually heterogeneous, and thus normalization and rescaling need to be considered. An excellent example of the complexity underlying a sequence of integrative omics tasks is offered by the personal omics profiling work recently published by Chen et al. [6], soon considered a reference for personalized medicine research.
The working hypothesis of this short paper is to adopt an investigation strategy opposite to aggregation: instead of integrating the PIN dataset with data from other omics sources, its constituent entities were explored, considering the building blocks that biologically allow for the protein interactions to be observed and measured, at least in part. A PIN map consists of three main types of constituent entities: positive data, that is, the measured physical interactions, which represent the real evidence; negative data, that is, the interactions that are not present, considered as latent variables; and uncertain data, that is, noisy information (false positives) for which partial recovery is possible through data integration. Notably, this mix is usually measured through both transient and persistent PIN dynamics, together with the related degree of uncertainty. This work aims to elucidate, through the comparative assessment of module-finding algorithms, the relationships between topologies that belong to the same network. In particular, a PIN can be considered to assemble various interactions which depend on specialized biological processes. The integrated view generally offered by modularity maps indeed represents a synthesis of a variety of possible interaction maps [7][8][9] embedded in the same network. Individual reference was made to such maps, at least for a list of them, and the term fragPIN was used to indicate the type of network generated from fragmentation, a process that retrieves from the same network source a certain number of biologically differentiated subnetworks. Then, the characteristics of modularity in each obtained fragment were elucidated, helping to investigate the hypothesis that different timescales may underlie the interactive dynamics related to the biological processes from which the fragments are computed. As an example, an analysis of PIN fluctuation dynamics for signaling was carried out to show how the inherent timescales can be identified, and how interactions can be assigned to them based on selected topological properties.
Following the work of Huthmacher et al. [10], previous examples of comparative network biology analysis have been suggested by Durek and Walther [11] in an attempt to elucidate the implications of PINs for the regulation of the underlying reaction networks. A comprehensive analysis of enzyme-enzyme interactions in the metabolic networks of E. coli and S. cerevisiae has thus been performed. The latter involved the analysis of the topological properties of these different but related networks and addressed issues such as the efficiency of metabolic processes and how the organization of enzyme interactions correlates with metabolic efficiency. The methods adopted in the above papers required the study of the global network connectivity properties, various filtering steps to reveal organization differences between all interaction sets and networks targeted to metabolism, and the analysis of scale-free exponents, average cluster coefficient, degree correlation, distance, and centrality. In the present work, priority was assigned to fragPIN modularity: modules were computed according to two popular techniques, and the differential configurations thus obtained were assessed. Modules are characterized by interactions occurring at different timescales and to a degree that depends on the involved biological processes. Unfortunately, technological and experimental sources cannot provide the needed detail of information. Therefore, the timescale decomposition offered by fragPIN and inherent to each particular process must be determined in some other way, for instance in silico through the computational approach described below.

Methods
Like all interactome datasets, the S. cerevisiae (yeast) interactome presents its complexities; the work of Reguly et al. [12] is an optimal choice, particularly with regard to the literature-curated interactions from small-scale experiments (among other disaggregated interactome information presented by the authors). The dataset involves 31793 publications and reports about 11334 nonredundant interactions (from a total of 33311) and 3289 proteins. Given this yeast source, a compilation of PINs was built and studied to compare their modular properties. Each subinteractome was analyzed according to its characterizing biological process. This process was called PIN fragmentation. The natural consequence of fragmentation is that specific PINs are built whose connectivity patterns reflect the dynamics inherent to the separately involved biological process. The list is reported below.
(iii) Metabolism PIN. It is obtained by filtering the rPIN such that proteins with GO terms not associated with metabolism are excluded (source: SGD db, http://www.yeastgenome.org/).
(iv) pPIN = pathways PIN. It is obtained by filtering the rPIN through pathways retrieved from the KEGG db. pPIN contains only interactions between proteins involved in annotated pathways.
(v) cPIN = cell cycle PIN. It is obtained by filtering the rPIN through proteins involved in cell cycle processes (source: MIPS, mips.helmholtz-muenchen.de/genre/proj/yeast/ and SGD db). cPIN contains interactions between proteins involved in the cell cycle process.
(vi) ttPIN. It is obtained by filtering the rPIN through transcription factors from the YEASTRACT db. It contains interactions between transcription factors and their target proteins.
(vii) sPIN = signalling PIN. It is obtained by filtering the rPIN using signalling pathways retrieved from the KEGG db. sPIN contains only interactions between proteins involved in annotated signalling pathways.
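The filtering step shared by all the fragments above can be sketched as follows. This is only a minimal illustration, assuming the reference network is available as a list of protein pairs and each annotation source as a set of protein identifiers; the function name `fragment_pin`, the `require_both` switch, and the toy identifiers are hypothetical and do not come from the original pipeline.

```python
def fragment_pin(edges, annotated, require_both=True):
    """Return the interactions of `edges` selected by the annotation set.

    require_both=True keeps an interaction only if both proteins are
    annotated (e.g. both in a pathway); require_both=False keeps it if
    at least one is (e.g. a transcription factor and its target).
    """
    keep = set(annotated)
    if require_both:
        return [(a, b) for a, b in edges if a in keep and b in keep]
    return [(a, b) for a, b in edges if a in keep or b in keep]


# Hypothetical toy reference network; protein names are placeholders.
rpin = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P4")]
cell_cycle = {"P1", "P2", "P3"}  # assumed annotation set

cpin = fragment_pin(rpin, cell_cycle)
print(cpin)
```

Whether a given fragment requires one or both interaction partners to be annotated is a design choice that the descriptions above leave partly open, which is why it is exposed here as a parameter.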

Results
As a first check, distributional properties are computed through power laws, that is, $P(k) \propto k^{-\gamma}$, and reported in Figure 1 with reference to each fragPIN, together with the corresponding estimated exponents (see [13][14][15][16][17] for a general treatment of the topic). The distributions appear quite different, as expected, and this depends on the structure and size of the fragPIN which is considered.
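The text does not state how the exponents were estimated. As one possible sketch, the exponent of a degree distribution can be approximated by maximum likelihood using the commonly used closed form $\hat{\gamma} \approx 1 + n / \sum_i \ln\big(k_i / (k_{\min} - 1/2)\big)$ for discrete data. The function name and the choice of `k_min` are assumptions made for illustration.

```python
import math


def powerlaw_exponent(degrees, k_min=1):
    """Approximate maximum-likelihood exponent for degrees >= k_min.

    Uses the closed-form approximation for discrete data; this is an
    assumed estimator, not necessarily the one used for Figure 1.
    """
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degree is >= k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sequence of a small network.
gamma_hat = powerlaw_exponent([1, 1, 1, 2, 2, 3, 5, 8], k_min=1)
print(round(gamma_hat, 3))
```

The estimate is sensitive to `k_min`, so comparing exponents across fragPINs of different size is more meaningful when the same cutoff rule is applied to each.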

Modularity.
Modularity is oen naturally computed when networks are employed.Many algorithms have become available, and a couple of them have been selected based on the popularity and consensus achieved.e �rst of such methods that we employed is MCODE [18], which exploits local graph density to suggest possible associations between protein complexes and locally dense regions of a graph computed from a clustering coefficient, that is,   = 2/(  (  −1)), where   is the node size of the neighborhood of node , and  is the number of edges in the neighborhood.e -core is the structure that one �nds in a graph; it is a network of minimal degree  de�ned as the remaining subgraph, aer that all the nodes with degrees 1− have been removed successively.e procedure is as follows: (a) when a node is removed, all its adjacent edges will also be removed; (b) aer a node of degree ≤1 −  is removed, in the remaining graph all the remaining nodes with a new degree ≤1 −  also need to be removed.In other terms, given  = ( ), the -core is computed by pruning all the  (with their ) with degree less than  until all nodes in the remaining network have at least degree .
Then, if a node belongs to the $k$-core but not to the $(k + 1)$-core of the graph, it has coreness degree $k$. The highest $k$-core of a network is the central, most densely connected subnetwork. After vertex weighting, complex prediction is conducted, where the relevance of each cluster is validated against known complexes or functional modules, and final statistics are computed about cluster size, density, and functional homogeneity.
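The pruning procedure and the coreness definition described above can be sketched in a few lines. This is an illustrative implementation only, not the MCODE code: it assumes an undirected graph given as an edge list, uses the standard clustering coefficient over the neighbors of a node (excluding the node itself), and all function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list (self-loops ignored)."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def clustering_coefficient(adj, node):
    """2n / (k(k-1)): n edges among the k neighbors of `node`."""
    neigh = adj[node]
    k = len(neigh)
    if k < 2:
        return 0.0
    # Each edge among neighbors is seen from both ends, hence // 2.
    n = sum(1 for u in neigh for w in adj[u] if w in neigh) // 2
    return 2.0 * n / (k * (k - 1))


def k_core(adj, k):
    """Nodes left after repeatedly removing nodes of degree < k."""
    work = {v: set(ns) for v, ns in adj.items()}
    while True:
        low = [v for v, ns in work.items() if len(ns) < k]
        if not low:
            return set(work)
        for v in low:
            for u in work.pop(v):
                if u in work:
                    work[u].discard(v)


def coreness(adj):
    """Coreness k: node is in the k-core but not in the (k+1)-core."""
    core, k, remaining = {}, 0, set(adj)
    while remaining:
        nxt = k_core(adj, k + 1)
        for v in remaining - nxt:
            core[v] = k
        remaining, k = nxt, k + 1
    return core


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
adj = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(sorted(k_core(adj, 2)))
print(coreness(adj))
print(clustering_coefficient(adj, "C"))
```

In the toy graph the pendant node is pruned from the 2-core, leaving the triangle as the highest core, which mirrors the role the highest $k$-core plays as the most densely connected region.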
The main modules identified for all fragPIN are reported in Figure 2 (table format). To obtain them, parameters for network scoring have been set as follows: degree cutoff = 2; for cluster finding: node score cutoff = 0.2; haircut = true; fluff = false; $k$-core: 2; and maximum depth from seed: 100.
Modularity can then be computed by another very popular community-finding method called maximum modularity (MaxMod). To implement such a method, greedy optimization algorithms have been employed by Clauset et al. [19] to find the best possible modularity structure in networks. In summary, a greedy procedure iteratively merges the pair of modules showing the largest modularity increase, for as long as a gain is observed.
The optimization function $Q$ [20] is reported below. It is defined as an approximate difference between links observed in a modular network versus those expected in a network of equivalent size where they have been randomly placed. Therefore, a value of zero for $Q$ indicates that the fraction of within-module links is not different from what would be expected from a randomized network of equivalent size. Nonzero values of $Q$ indicate deviation from randomness, and values around 0.3 suggest the presence of modular structure (this result comes from extensive simulations reported in the above references), as

$$Q = \sum_i \left( e_{ii} - a_i^2 \right).$$

The formula reports the fraction of links related to nodes within a module $i$, $e_{ii}$, and the fraction of links attached to module $i$, $a_i = \sum_j e_{ij}$, which includes those coming from all other modules relative to module $i$. Therefore, a good partition into modules leads $Q$ to approach 1; vice versa, the presence of random links between nodes (i.e., poor modularity) would make the two terms not too different, thus delivering a $Q$ close to 0. Figure 3 shows cores detected in cPIN, while Figure 4 shows a community map for it.
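To make the behavior of the optimization function concrete, the sketch below evaluates it for a given partition, assuming the standard form $Q = \sum_i (e_{ii} - a_i^2)$, with $e_{ii}$ the fraction of links inside module $i$ and $a_i$ the fraction of link ends attached to module $i$. The function name and the toy network are hypothetical; the greedy merging of MaxMod itself is not reproduced here.

```python
def modularity(edges, membership):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list.

    `membership` maps each node to a module label.
    """
    m = len(edges)
    within = {}  # number of links with both ends in module i
    ends = {}    # number of link ends attached to module i
    for a, b in edges:
        ca, cb = membership[a], membership[b]
        ends[ca] = ends.get(ca, 0) + 1
        ends[cb] = ends.get(cb, 0) + 1
        if ca == cb:
            within[ca] = within.get(ca, 0) + 1
    return sum(within.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Toy network: two triangles joined by a single bridging link.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_modules = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_module = {v: "all" for v in range(1, 7)}

print(modularity(edges, two_modules))  # 5/14, about 0.357
print(modularity(edges, one_module))   # 0.0
```

Placing all nodes in a single module gives $Q = 0$, while the natural split into the two triangles gives a value above the 0.3 level mentioned in the text.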

Timescale Decomposition.
Biological processes embed dynamics that respond to different timescales; a major problem is how to measure them, in particular in relation to interactive associations [21]. One way to introduce dynamics at the interactome scale is to integrate gene expression values, ideally obtained through time course measurements. However, when such coupled measurements are not available, the problem of deciphering network dynamics is difficult to solve. In a companion paper [22], a special network decomposition approach elucidating both coarse and fine timescales through wavelets [23][24][25][26] was proposed. While the focus in previous work was on some particular pathways, a generalization is put forth here.
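As a point of reference for what a decomposition into coarse and fine scales produces, the sketch below applies an orthonormal Haar transform to a short signal. The Haar family, the power-of-two length requirement, and the function name are assumptions chosen for brevity; the wavelet family used in the companion paper is not specified here.

```python
import math


def haar_decompose(signal):
    """Multilevel orthonormal Haar transform.

    Returns (approx, details), where details[0] holds the finest-scale
    coefficients and details[-1] the coarsest ones.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / root2 for a, b in pairs])
        approx = [(a + b) / root2 for a, b in pairs]
    return approx, details


# Hypothetical feature signal: one value per node, in a fixed node order.
approx, details = haar_decompose([1, 3, 2, 6])
print(approx)
print(details)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared signal values, so the share of energy at each level indicates which scales dominate the signal.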
The use of wavelets depends on the entities to be measured, and those allowing for a suitable timescale decomposition are good candidates. Such entities, in our case, can be identified by topological features that, once measured at each protein (i.e., node), contribute to quantifying a vector-valued signal. The latter can then be decomposed by wavelets. In our application, every entry of the feature vector computed from the PIN, and to be decomposed across timescales, represents a topological property. An example, apart from the usually exploited degree feature, is provided by betweenness [27][28][29]. This centrality measure is computed at each network node and increases with the volume of crossing at the node, that is, with the shortest paths (geodesics) going from an origin to a destination through the node, relative to the total number of geodesics observed between the start and end nodes. For distinct nodes $s \neq v \neq t$, with $\sigma_{st}$ the number of shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ the number of those shortest paths passing through $v$, it holds that

$$B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}.$$

Another problem is how to establish significant variation between timescales in the wavelet values. The approach proposed in our previous methodological paper was centered around two steps: (a) denoising [30][31][32][33], applied to get rid of disturbances of random nature; (b) clustering [34], aimed to discriminate between significant and nonsignificant values.
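The betweenness feature described above can be computed for every node with a breadth-first accumulation scheme. The sketch below assumes an undirected, unweighted graph stored as a dictionary of neighbor sets and counts each unordered pair of end nodes once; the function name and the toy graphs are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Shortest-path betweenness for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}   # number of geodesics from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}  # predecessors on geodesics
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve.
    return {v: b / 2.0 for v, b in score.items()}


path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
star = {"X": {"a", "b", "c"}, "a": {"X"}, "b": {"X"}, "c": {"X"}}
print(betweenness(path))
print(betweenness(star))
```

Collecting the resulting values in a fixed node order gives one entry of the feature signal that is subsequently decomposed across timescales.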
The variability in the measures was initially analyzed through the IQR (interquartile range) robust statistic in order to select the most variable fraction of the data (the half that was selected was called the coreset), while discarding the residual part (the box values proximal to the median). A second partitioning was then made of the selected data fraction. In order to control the timescale specificity of the coreset, some clusters were retrieved. However, the remaining scattered values, that is, the values not assigned to clusters, were also evaluated.
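The first partitioning step can be sketched as a split around the interquartile box: values outside the range between the first and third quartiles form the coreset, and values inside it are set aside. The quartile interpolation rule (here the default of `statistics.quantiles`) and the function name are assumptions for illustration.

```python
import statistics


def iqr_split(values):
    """Split values into (coreset, box) using the interquartile range.

    coreset: values below Q1 or above Q3 (the most variable part).
    box:     values between Q1 and Q3, proximal to the median.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    coreset = [x for x in values if x < q1 or x > q3]
    box = [x for x in values if q1 <= x <= q3]
    return coreset, box


# Hypothetical wavelet coefficient values.
coreset, box = iqr_split(list(range(1, 12)))
print(coreset)
print(box)
```

With ties at the quartiles the coreset can hold somewhat less than half of the values, so the split is only approximately balanced on small samples.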
A tight clustering technique was adopted, based on a mix of hierarchical and $k$-means approaches integrated by bootstrap to form stable clusters. Overall, the clusters did not identify significant protein modules through which to analyze connectivity or an inherent association power of biological relevance. Clusters were also computed over the entire sets of values (without the IQR split into coreset and scattered values), and yet did not deliver biological evidence. Conversely, the analysis of the scattered feature values proved to be more fruitful with reference to timescale specificity, especially for the impact on pathway proximity rather than on network connectivity.

Transient versus Permanent Interactions.
A final aspect is how to measure the transiency and permanence of interaction dynamics. The emphasis was placed on their specific interaction dynamics relative to modular connectivity computed within and between timescales, together with pathway proximity. Graphical evidence is reported in Figures 5 and 6. Basically, a scan was first produced through the entire wavelet resolution spectrum for each module under differential conditions, followed by back projection to the PIN of the established associations between particular protein interactions and timescales. Thus, the cases for which interactive dynamics are simultaneously present at multiple timescales were visualized, together with the links that possibly appear between them. S1 (see S1 in the Supplementary Material available at http://dx.doi.org/10.1155/2013/307608) reports timescale proximity at pathway level (signaling), which complements the graphical evidence reported at modular network scale. S2 reports the histograms of wavelet-decomposed feature signals (levels and their differences) and diagnostic plots; S3 reports module connectivities detected from each feature across timescales; and S4 reports GO annotation for the identified interactions.
Figure 5 shows timescale-specific interactions computed from feature-dependent modules in sPIN. Note that the diversity of colors identifies the different timescales that have been detected by the algorithms. Figure 6 reports instead much denser modules, with reference to ttPIN. In terms of comparative evaluation, Figures 3 and 4 refer to cores and communities, respectively, which are typical modules found in many studies after applying very well-known methodologies; the proposed approach shows the limitations of such methodologies in detecting resolutions or timescales. Therefore, by involving topological properties computed over specialized PINs, and in particular the information coming from the biological processes, the induced connectivity dynamics between proteins can be emphasized and suitably represented. From a biological point of view, this step might be important for two reasons: (a) the possibility to adopt a differential network analysis based on a comparison of PINs evaluated before and after certain perturbations; (b) the assessment of PIN module configuration changes that might explain phenotypical alterations based on well-characterized protein dynamics.

Concluding Remarks
Fragments of PIN offer interesting inference perspectives. The most important aspect is that, with reduced dimensionality and complexity, some specialized module functions could be analyzed and possibly validated with reference to specific aspects related to a target pathway or biological process. The second aspect of potential interest is the development of differential network analysis in response to conditions that may affect network dynamics. Finally, time and space are two dimensions that define network dynamics and are often overlooked; the timescale analysis proposed here is an example of computational analysis that might provide relevant information to build more accurate profiles. Since protein interactome dynamics, and thus their generating timescales, could not be observed directly from measurements at the experimental level, an attempt was made to computationally dissect the interactome by separating the effects induced by all the biological processes that were found to be involved. The differences that were detected are justified by a variety of reasons that cannot be inferred from the plain interactome data; however, after examining each separate PIN, one result was that in some cases the timescale dynamics can be revealed through the employed PIN topologies.

F 1 :
Distributional laws for fragPIN: pattern comparisons and goodness of �t between degree distributions and power laws.

F 3 :
Cores (a) and best core computed in cPIN.Graphs obtained by using MCODE.

F 4 :
Community map computed in cPIN, with hubs indicated as red points, and red links connecting them to underline the highconnectivity patterns.Graphs obtained from MaxMod computations.

F 5 :
Timescale interactions computed for sPIN by betweenness (a), cluster, coe�cient, and degree (c).�ach color identi�es a di�erent timescale under which detection occurred through wavelets.

F 6 :
Timescale interactions computed by degree (a) and betweenness (b) for ttPIN.�ach color identi�es a di�erent timescale under which detection occurred through wavelets.