Constrained Network Modularity

Static representations of protein interaction networks (PINs) reflect measurements taken under a variety of conditions, including time. To partially bypass this limitation, gene expression information is usually integrated into the network to measure its “activity level.” In general, the entire modular organization of a PIN (complexes, pathways) can reveal changes of configuration whose functional significance depends on biological annotation. However, since network dynamics arise from different conditions, leading to comparisons between normal and disease states or between networks observed sequentially in time, our working hypothesis concerns the analysis of differential networks based on varying modularity and uncertainty. Two popular methods, k-core and Q-modularity, were applied and evaluated over a reference yeast dataset comprising a PIN of literature-curated data obtained from the fusion of heterogeneous measurement sources. While the functional aspect of interest is the cell cycle, and the corresponding interactions were isolated, the PIN dynamics were externally induced by time-course gene expression values, which we consider one of the “modularity drivers.” Notably, because these expression values refer to the “just-in-time method,” we could specialize our approach according to three constrained modular configurations, which were then comparatively assessed through local entropy measures.


Introduction
Although research on PINs [1] is quite mature at both the methodological (systems biology) and applied (biomedical and clinical bioinformatics) levels, some domains remain partially unexplored, in particular from an integrative dynamic standpoint. The first attribute, integrative, involves considering complementary omic layers that provide information on, for instance, causality (through gene coexpression, transcription factors, microRNAs, etc.). The second attribute, dynamic, aims at investigating differential properties of networks and is based on assessing the effects of the different conditions under which network properties are measured.
The field of "differential network biology" has already been explored under a variety of differential conditions, such as expression during drug and stress response [2] or condition-responsive subnetwork identification [3]. Recently, Ideker and Krogan [4] reviewed the field and suggested interesting new directions. Some of the main limitations currently encountered can be summarized as follows.
(i) The available interactome coverage [5] varies between organisms and depends on the technologies, but is in general quite limited [6]. Consequently, data integration can play a relevant role in ensuring control of data uncertainty and validation quality.
(ii) The accuracy of measurements (experimental and predictive) is limited too. Evidence provided by [7] showed that literature-curated interactome data need quality control filters for reliable inference.
(iii) The inherent reliability of modular configurations is also limited and leads to only approximate solutions (see [8]). In particular, all methods suffer from the "network resolution limit" problem [9,10], which bounds the detection power.
Various types of errors thus influence the accuracy of PIN maps to a degree that remains difficult to quantify. Because both positive and negative interactions include many false entries, PINs can be regarded as samples taken from a quite sparse interactome space. However, due to the presence of modular organizations, local densities identify "modules" that may drive inference despite the limitations.
Modules represent a sort of static entity when they are computed for networks measured under specific (including temporal) conditions. When the conditions change, the effects of the network-induced dynamics should be assessed, and the uncertainty inherent in the resulting configurations should be quantified to establish robustness and reliability.
A common strategy to control the uncertainty level is data integration. For instance, in PIN applications, gene coexpression and pathway information sources can complement the information underlying the constituent interactions by allowing for extensive quality annotation.
Another standard strategy involves the analysis of topological properties [11,12], which may characterize intra- and intermodular architectures. Further refinement of interactome data can be expected from the analysis of global similarity (dissimilarity) measures, which are useful for performing differential network analysis and for assigning confidence scores to the interactions depending on both biological and computational features.
The implementation of the novel paradigm of differential network biology characterizes this work, which relies on comparing a variety of modular configurations produced by different algorithms applied to PINs. The structure of the paper is as follows. We describe our methodological approach in Section 2, present our results in Section 3, and report concluding remarks with a discussion in Section 4.

Data Generation Approaches.
For the purpose of our work, the reference dataset is the yeast interactome available from Reguly et al. [13] through a series of disaggregated interactomes. We have considered in particular the literature-curated interactome (LIT-Int) obtained from small-scale experiments. We refer to this dataset as rPIN. From it we have built another series of interactomes, based on the observation that rPIN comprises heterogeneous entities such as biological processes, each of which determines a subset of interactive protein dynamics (specific ones, in particular, when referred to the process itself).
Given such a hypothesis, we have attempted to separate the individual contributions of biological processes and generated subnetworks through a "PIN fragmentation" approach. The latter consists of filtering the original PIN according to the biological process of interest in order to obtain differentiated subinteractomes. The idea of multiple analyses from the same PIN source can be found in Durek and Walther [14] and Huthmacher et al. [15,16] for comparisons between topological characteristics of protein and metabolic interactomes from the E. coli and S. cerevisiae model organisms.
The general advantage of such a decomposition is a reduction of the overall dimensionality when each individual "fragment" is considered. In terms of differential analysis, PIN fragmentation builds a sequence of "constrained" subinteractomes that are functionally specialized, depending on the biological process that has been selected. By changing the interactome scale, pathway isolation could also be achieved following the same strategy, but this step is not pursued in the present work. We have instead focused on extracting the cell cycle sub-interactome, thus inducing a specific functional constraint. In order to embed dynamic information in the PIN, and thus further constrain our network, extra work was needed.
Gene coexpression and physical protein interaction tend to correlate, in addition to the coupling of colocalization and coexpression observed at the transcriptional level. Thus, network integration of gene expression values is expected to improve the module detection power of computational algorithms and to further corroborate the protein modularity maps [17]. Evidence that interacting protein pairs in a complex show mRNA coexpression was provided, for instance, by Dezso et al., and is also available from human interactome experimental work [18] and tissue-specific interactome analysis [19].
The functionally constrained PIN was thus further perturbed by dynamics generated from gene expression. In particular, the yeast cell cycle study of de Lichtenberg et al. offers experimentally validated instruments for this type of analysis through the characterization of mRNA transcripts by the time-course expression peaks reached during their observed periodic variation. Consequently, a functionally and dynamically constrained PIN is obtained, and protein interactions are monitored in terms of pairwise or partial (just one of the interactors) association with the corresponding gene expression peaks. From a variety of constrained network configurations, the influence exerted by modularity drivers could be assessed. After the rPIN fragmentation, the following sub-interactome list is available:
(i) rPIN: proteins = 3289, interactions = 11333. It represents the LIT-Int reference PIN.
(ii) cePIN: proteins = 190, interactions = 381. This is the maximally constrained cell cycle PIN, built from the following two-step procedure. (a) The gene expression profiles obtained from the time-course experiments generated the gene signatures, based on the "expression peaks" (maximal expression levels observed during the cell cycle phases). (b) Mapping of such values onto rPIN yielded "peak-to-peak" protein interactions (i.e., only proteins with related gene expression peaks are considered, thus making the network maximally constrained).
(iii) cePIN-1: proteins = 444, interactions = 977. This is a less constrained cell cycle PIN whose constituent interactions depend only on a peak signature associated with one of the interacting proteins (i.e., the other interactor may be any other cell cycle protein whose related gene may have any expression level).
(iv) cePIN-2: proteins = 1193, interactions = 2254. This is the minimally constrained cell cycle PIN, since interacting proteins with a peak signature may link to any other protein, not necessarily related to the cell cycle.
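The three constraint levels in the list above can be sketched as edge filters over the reference PIN. This is only an illustration under stated assumptions, not the pipeline used in the paper: the function name `constrained_pins`, the input sets `peaked` and `cell_cycle`, and the nesting of the three edge sets are all hypothetical.

```python
def constrained_pins(edges, peaked, cell_cycle):
    """Split an edge list into three nested, constrained sub-interactomes.

    edges      : iterable of (protein_a, protein_b) pairs from the reference PIN
    peaked     : set of proteins whose gene carries an expression-peak signature
    cell_cycle : set of proteins annotated to the cell cycle
    Assumption: every peaked protein is also treated as a cell cycle protein.
    """
    cc = set(cell_cycle) | set(peaked)
    ce_pin, ce_pin_1, ce_pin_2 = [], [], []
    for a, b in edges:
        n_peaked = (a in peaked) + (b in peaked)
        if n_peaked == 0:
            continue  # neither interactor has a peak signature: edge is dropped
        ce_pin_2.append((a, b))  # minimal constraint: one peaked interactor, any partner
        if a in cc and b in cc:
            ce_pin_1.append((a, b))  # partner must be a cell cycle protein
            if n_peaked == 2:
                ce_pin.append((a, b))  # maximal constraint: peak-to-peak
    return ce_pin, ce_pin_1, ce_pin_2


# Toy usage with made-up protein names.
toy_edges = [("A", "B"), ("A", "C"), ("A", "X"), ("C", "X"), ("C", "D")]
ce, ce1, ce2 = constrained_pins(toy_edges, peaked={"A", "B"}, cell_cycle={"A", "B", "C", "D"})
```

In this toy case only the A-B link survives the maximal constraint, while the A-X link appears only under the minimal one.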
In terms of biological annotation of the modules, a comparison between the cePIN-2, cePIN-1, and cePIN maps was recently proposed by Travaglione et al. [20] through multiple steps involving protein complexes, GO categories, and pathways. The novelty of the present work compared to that companion paper lies in the analysis of the uncertainty of such configurations through the classical instrument of entropy.

Modularity Structures.
Modular structures underlying PINs [21] can be retrieved by algorithms inspired by different principles (see [22] for a wide review and examples). For example, some strategies direct the search deterministically (divisive and greedy algorithms), others stochastically (random walk). The different module structures can be characterized by topological properties, but the modules present differences that depend on the generating algorithms.
We applied two popular methods to retrieve module structures represented by communities and cores. The community finding method works through the maximization of a Q-modularity function and is based on a greedy optimization algorithm [23]. This very popular procedure iteratively merges pairs of modules grown from seeds and keeps expanding while monitoring a modularity index: merging continues as long as a gain in the index is detected and stops otherwise. Q-modularity is simply defined as the difference between the links falling within modules in a network and those expected to fall within modules in a network of equivalent size but with links placed at random.
In particular, for a network partitioned into N modules, with modules $m_i$ and $m_j$ linked by a fraction of links $e_{ij}$, the modularity function reads

$$Q = \sum_{i=1}^{N} \left[ e_{ii} - \Big( \sum_{j=1}^{N} e_{ij} \Big)^{2} \right]. \tag{1}$$

Links that connect nodes within a module i are compared with all links from any other module j connected to module i. A good partition into modules leads to Q ∼ 1, while a random partition (i.e., poor modularity) would deliver Q ∼ 0, meaning that the fractions of modular and randomized links are not significantly different. A generally accepted rule is that values Q > 0.3 may already suggest the presence of modular structure. Overall, modular partitions obtained by this procedure show relatively dense intramodular links and sparse intermodular links, which reflects the presence of few local maxima capturing the most relevant information about the internal network organization.
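To make the modularity index concrete, the following sketch evaluates Q for a given partition, assuming the standard form Q = Σ_i (e_ii − a_i²), where e_ii is the fraction of links inside module i and a_i is the fraction of link ends attached to it. The function name `q_modularity` and the toy network are illustrative, and the code only scores a partition; it does not perform the greedy optimization itself.

```python
from collections import defaultdict


def q_modularity(edges, membership):
    """Q = sum_i (e_ii - a_i**2) for an undirected, unweighted network.

    edges      : list of (u, v) pairs, each undirected link listed once
    membership : dict mapping every node to its module label
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # number of links with both ends in a module
    link_ends = defaultdict(int)  # number of link ends attached to a module
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        link_ends[cu] += 1
        link_ends[cv] += 1
        if cu == cv:
            internal[cu] += 1
    q = 0.0
    for module, ends in link_ends.items():
        e_ii = internal[module] / m  # fraction of links inside the module
        a_i = ends / (2 * m)  # fraction of link ends attached to the module
        q += e_ii - a_i ** 2
    return q


# Two triangles joined by a single bridge, split into their natural modules.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = q_modularity(toy_edges, toy_modules)  # 5/14, above the 0.3 rule of thumb
```

Placing all six nodes in a single module gives Q = 0, which matches the reading of Q ∼ 0 as absence of modular structure.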
MCODE [24] is another well-known method that exploits locally dense areas of the network to identify clusters supposed to match protein complexes. In particular, the dependence between nodes is represented by structures called "cliques", and a hierarchy of modules of different clique sizes is obtained at the end. A clique is a maximally connected structure, that is, a network in which every pair of distinct nodes is connected by a link. MCODE starts its exploration from locally dense regions, identified through a clustering coefficient computed at a given node, that is, $CC_i = 2n / [k_i (k_i - 1)]$, where $k_i$ is the size of the neighborhood of node i and n is the number of edges in it. A k-core is delivered by the method based on the clique benchmark, and represents a network of minimal degree k, obtained after all the nodes with degree less than k have been successively eliminated. As several groups are formed at each k, an internal ranking is obtained through scores based on node weighting.
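The two ingredients named above, the clustering coefficient and the k-core, can be sketched as follows. This is not MCODE itself (its node weighting, scoring, and seed expansion are omitted), and it assumes the neighborhood of a node excludes the node; the helper names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def clustering_coefficient(adj, node):
    """CC_i = 2n / (k_i (k_i - 1)): k_i neighbours of the node, n links among them."""
    neighbours = adj.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than two neighbours
    # Each link among neighbours is seen from both of its ends, hence the halving.
    n = sum(1 for u in neighbours for v in adj[u] if v in neighbours) // 2
    return 2 * n / (k * (k - 1))


def k_core(adj, k):
    """Node set of the k-core, found by successively removing nodes of degree < k."""
    degree = {u: len(vs) for u, vs in adj.items()}
    alive = set(degree)
    stack = [u for u in alive if degree[u] < k]
    while stack:
        u = stack.pop()
        if u not in alive:
            continue  # already removed through an earlier stack entry
        alive.remove(u)
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                if degree[v] < k:
                    stack.append(v)
    return alive


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy_adj = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
core_2 = k_core(toy_adj, 2)  # the pendant node is peeled away
```

Removing one node can push its neighbours below k, which is why the elimination has to be repeated until no such node remains.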

Module Maps.
The implementation of MCODE requires some parameters to be set in order to compute the network scoring values. For instance, we set degree cutoff = 2 and node score cutoff = 0.2, and then fixed the other values: haircut = true; fluff = false; k-core = 2; max. depth from seed = 100. The output for our PIN list is reported in Table 1 for the retrieved communities and in Table 2 for the cores computed at different k values. The relevance of each module can finally be biologically validated against known protein complexes.
Notably, both strategies consider modules that have an inherent resolution structure (see [25]). Overall, the interest here is to establish how modules are influenced by the two drivers, functional and dynamic. While the comparative analysis of constrained PINs provides some evidence, the uncertainty of modularity remains an open problem. We thus turn to the consideration that the probability distribution of the accessible states of a constrained system, both in equilibrium and far from equilibrium, can be associated with an entropy, and we develop this part in the next section.

Entropy.
Shannon entropy is considered in the context of measure-preserving transformations [26][27][28], where it may address information and uncertainty. As such, it can be seen as an ex-post measure, based on the information gained from a finite number of experimental outcomes, and also as an ex-ante measure, based on the uncertainty about such outcomes before performing the experiment.
The equilibrium and disequilibrium conditions are important factors, especially given the complexity involved in the latter. The entropy E for a certain number n of accessible states is thus

$$E = -c \sum_{i=1}^{n} p_i \log p_i, \tag{2}$$

for c a positive real constant and $p_i$ normalized probabilities (i.e., $\sum_i p_i = 1$).
A system in equilibrium would have $p_i = 1/n$ for all its accessible states, thus reaching the condition of maximal entropy $c \log n$. Instead, a system far from equilibrium would have additional terms to be considered, characterizing the disequilibrium in terms of distance from the equilibrium, that is, from the equiprobable configuration.
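As a small numerical check of the equilibrium statement, the sketch below assumes the usual Shannon form E = −c Σ_i p_i log p_i, which is consistent with the maximal value c log n quoted above. The helper `disequilibrium_gap` is a hypothetical name for one simple way to express the distance from the equiprobable configuration; it is not a quantity defined in the text.

```python
import math


def shannon_entropy(probabilities, c=1.0):
    """E = -c * sum_i p_i log p_i, using the convention 0 log 0 = 0."""
    total = sum(probabilities)
    if not math.isclose(total, 1.0, rel_tol=1e-9, abs_tol=1e-12):
        raise ValueError("probabilities must sum to 1")
    return -c * sum(p * math.log(p) for p in probabilities if p > 0)


def disequilibrium_gap(probabilities, c=1.0):
    """Difference between the equiprobable maximum c log n and the entropy."""
    n = len(probabilities)
    return c * math.log(n) - shannon_entropy(probabilities, c)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
e_uniform = shannon_entropy(uniform)  # equals log 4, the maximum for n = 4
gap_skewed = disequilibrium_gap(skewed)  # positive: the system is away from equilibrium
```

The gap is zero only for the equiprobable case and grows as probability mass concentrates on fewer states.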
In the network context, entropy is a measure of uncertainty that adapts to PINs at both global and local scales (see [29]). In particular, when local dynamics are investigated, entropy allows one to assess the uncertainty of all the constituent modules of the map. This measure thus also applies to both the cores and the communities computed in our examples.
Additionally, comparing modularity configurations in entropy terms allows one to assess the influence exerted by the PIN constraints. Estimating entropies from finite samples remains a complicated task due to statistical fluctuations, and this limitation also holds for any sampled PIN. Nevertheless, entropy represents a practical approach that adapts to any network scale, for example, in principle approximating the steady-state conditions that deliver the ensemble entropy E associated with the network distribution p: (3)
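The text does not state how the distribution p is built for each module, so the following is only one possible illustration. It assumes, purely for the sake of example, that p is the normalized intramodule degree distribution, which makes hub-dominated modules score below the equiprobable maximum. The function name `module_entropies` and the toy modules are hypothetical.

```python
import math
from collections import defaultdict


def module_entropies(edges, modules):
    """Entropy of each module from its normalized intramodule degrees (assumed p).

    edges   : list of undirected (u, v) links
    modules : dict mapping a module label to the set of its member nodes
    """
    result = {}
    for label, members in modules.items():
        degree = defaultdict(int)
        for u, v in edges:
            if u in members and v in members:  # count only links inside the module
                degree[u] += 1
                degree[v] += 1
        total = sum(degree.values())
        if total == 0:
            result[label] = 0.0  # no internal links, so no distribution to measure
            continue
        result[label] = -sum(
            (d / total) * math.log(d / total) for d in degree.values()
        )
    return result


# A triangle (even degrees) and a star centred on a hub "h" (uneven degrees).
toy_edges = [(0, 1), (1, 2), (0, 2), ("h", "a"), ("h", "b"), ("h", "c")]
toy_modules = {"triangle": {0, 1, 2}, "star": {"h", "a", "b", "c"}}
entropies = module_entropies(toy_edges, toy_modules)
```

Under this assumption the triangle reaches log 3, while the star stays below log 4 because the hub concentrates the link ends.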

Graphical Evidence.
The entropy landscapes obtained for the cores (red) and communities (blue) retrieved from the three constrained subinteractomes are shown in Figures 1, 2, and 3. In particular, the maximally constrained PIN (cePIN) appears in Figure 1 and the minimally constrained PIN (cePIN-2) appears in Figure 3. The top plots report entropies as circles for each module (in sorted order on the X axis). The observed entropy sizes (with values reported on the Y axis) depend on the presence of hubs (as reported) in both communities and cores (for Figure 3 they are indicated separately due to the long list).
The individual contributions of each module to the overall entropy may be observed from such plots. The bottom plots show circles differentiated not by size but by the number of included nodes (as reported on the X axis). Again, both core and community modules can be compared with the entropy value for the global PIN (based on their value on the Y axis). Interestingly, Figure 1, with the maximally constrained PIN, shows especially the entropy contributions from communities, while quite similar contributions from cores and communities can be observed in the distribution of the nodes (bottom plots). Overall, no substantial redundancy appears in such a system. In Figure 2 the communities show instead substantial additional redundancy, and so do the cores, even if to a lesser extent. Communities remain larger than cores in terms of module size. We checked the complementary plots too: for module sizes of up to 40 proteins, communities and cores behave similarly even with fewer constraints, while for bigger sizes the community redundancy appears. Finally, Figure 3 represents an exceedingly redundant system given minimal constraints, as reflected at both community and core levels in all dimensions (entropy value, number of modules, and number of included nodes). Overall, the cell cycle dynamics may be monitored by measuring entropy in relation to the module sizes, and the evidence reflects the expected fact that randomness plays a major role under relaxed conditions or no constraints.

Discussion.
A careful examination of the modular space represented in the plots reveals that, for the maximally constrained PIN, small-entropy communities prevail over cores, and that this intensifies for the less constrained PIN, where the module entropy of communities increases. The minimally constrained PIN, which is functionally more relaxed than the previous one, amplifies this feature to an extreme.
Overall entropy is substantially controlled in constrained networks due to a certain stability in the system obtained through the modules, regardless of whether they are computed by hierarchically agglomerative (bottom-up merging) or divisive (top-down split) algorithms. Under less stringent constraints, the equilibrium is broken in terms of the strength of the connectivity links (less stringent dynamics exert their influence) and of functional characterization (links outside the cell cycle are allowed). As a result of such disequilibrium, a larger number of high-entropy modules appears, reflecting the unknown uncertainty of the system. It is known that modularity suffers from a resolution limit, and for this reason it might be hard to detect small functional modules. Equivalently, the interactome space is expansive in entropy terms, which justifies moving away from the resolution limit when fewer constraints apply. Although cores and communities show a different sensitivity to the presence of constraints, communities participate quite heavily in the observed expansion, while cores appear to be more stable structures. Following this line of reasoning, a possible future direction of study would look at extensive and nonextensive entropies to capture the complexities that appear in the form of dependencies and convolutions in cores and communities and that are difficult to represent with the above entropies.
The observed evidence would suggest that cores, rather than communities, are nonextensive structures. In other terms, nonextensivity implies a different form of complexity embedded in the entropy, while the extensivity observed in communities suggests that redundant dynamics prevail. Similarly, since communities simply reflect assigned internal links in the given network relative to an equivalent (in node degree distribution) random network, it is possible that constraints reduce the overall impact of network randomness.

Concluding Remarks and Future Directions
The proposed approach started with PIN fragmentation, which offers the possibility of building a compilation of PINs selected according to specific biological criteria. An immediate advantage is the possibility to comparatively evaluate both the general topological features and the modularity of multiple PINs with reference to a common source. In order to explore the dynamical aspects of PINs, time-course experiments were considered together with their associated gene expression signatures. Thus, we could center the rest of the analysis on the influence exerted on modularity by functional drivers, that is, through the cell cycle, and by expression drivers, that is, through the integrated gene measurements.
Modularization has been investigated mainly in terms of structure, based on clique-centric methods. The comparison between community and core maps therefore offers an initial coarse-grained analysis, useful to verify which complexes are matched by modules and to what extent, together with the involved pathways. The module characterization pursued here aims at including dynamic conditions and then measuring the associated uncertainty, as reflected in the modular configurations. This last aspect is almost always overlooked in network studies, since methods that assign scores or confidence measures usually focus on the individual network entities rather than on the modular structures in which they participate.
Two final notes for future follow-up work: one specific and one more general. The modularization induced by the employed methods remains conditioned on the resolution allowed, which determines the configuration to be uncovered. However, we showed that differential modularity is an integrative approach that, through the combination of biological process-driven interactions and coexpression dynamics, elucidates in part the corresponding complexities. The proposed approach for PINs could then be extended to biological contexts where a crucial goal is establishing a role for the biological processes involved in disease. Other applications could also be examined through our approach. For instance, a clinical context characterized by comorbidity could be investigated in order to remodularize the network after the occurrence of an acute phase in one of the pathologies. Also, drug-target networks could be studied to establish the effects of treatments on the variation of modularity when they act selectively (e.g., through activation of some pathway), thus specifically constraining the network.
(i) rPIN: proteins = 3289, interactions = 11333. It represents the LIT-Int reference PIN. (ii) cePIN: proteins = 190, interactions = 381. This is the maximally constrained cell cycle PIN built from the following two-step procedure. (a) The gene expression profiles obtained from the time-course experiments, and based on the "expression peaks" (maximal expression levels observed during the cell cycle phases), generated the gene signatures. (b)

Figure 3: cePIN-2 entropies for both cores and communities, and sorted by size (b).

Table 2: Bolded numbers of MCODE-detected k-cores at k ranging between a minimum of 2 and a maximum of 13 across PINs.