Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.365
Conference Review
The evolution of protein interaction

Interactions between proteins are essential for intracellular communication. They form complex networks which have become an important source for functional analysis of proteins. Combining phylogenies with network analysis, we investigate the evolutionary history of interaction networks from the bHLH, NR and bZIP transcription-factor families. The bHLH and NR networks show a hub-like structure with varying γ values. Mutation and gene duplication play an important role in adding and removing interactions. We conclude that in several of the protein families that we have studied, networks have primarily arisen by the development of heterodimerizing transcription factors, from an ancestral gene which interacts with any of the newly emerging proteins but also homodimerizes.


Introduction
Genome, proteome and other 'ome' projects have generated a vast amount of data over the last few years, and these data lend themselves to comparative analysis. Proteome analysis has become especially important for studying regulatory proteins, since many signalling proteins, from membrane-bound receptors down to the DNA-binding regulatory elements (transcription factors), interact directly with each other via so-called protein-protein interactions (PPIs). Some of these interactions are 'unspecific', e.g. both SH3 and SH2 domains interact with many other domains [8,27]. Since some of the interaction partners interact with yet other proteins, this gives rise to a 'network of interactions', in which proteins can be imagined as nodes connected by edges that represent their physical interactions. However, interactions can also be very specific and limited to only a few partner proteins, e.g. many transcription factors heterodimerize with several partners under specific physiological conditions and expression states, while they may homodimerize under other conditions. Examples are the leucine zipper-mediated interactions, such as the competing Jun-Fos/Jun-Jun interactions in the bZIP family [7,19] or the much weaker Mad-Max/Max-Max interactions in the bHLH family [7,12,17].
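The node-and-edge picture described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `build_network` and `dimerization_modes` are hypothetical, a homodimer is represented as a self-loop by assumption, and the edge list contains only the four pairs named in the text rather than a complete dataset.

```python
from collections import defaultdict


def build_network(interactions):
    """Return {protein: set of partners}; a homodimer is stored as a self-loop."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def dimerization_modes(network):
    """Label each protein as 'homo', 'hetero' or 'homo+hetero'."""
    modes = {}
    for protein, partners in network.items():
        labels = []
        if protein in partners:        # self-loop: homodimerization
            labels.append("homo")
        if partners - {protein}:       # any other partner: heterodimerization
            labels.append("hetero")
        modes[protein] = "+".join(labels)
    return modes


# Only the pairs mentioned in the text; not a complete interaction set.
interactions = [("Jun", "Fos"), ("Jun", "Jun"), ("Mad", "Max"), ("Max", "Max")]
network = build_network(interactions)
modes = dimerization_modes(network)
```

In this toy example the labels describe only the listed pairs; a protein labelled 'hetero' here may well homodimerize in a fuller dataset.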
The molecular details of dimerization are very complicated and are not the focus of this study. Several groups have begun to investigate the 'global' features of PPI networks, trying to infer functional properties by applying statistical methods to the analysis of the networks [9,14,24,25]. Among the most intriguing findings are the small-world characteristics of these networks: although the network is very big, each protein is linked to every other by chains of only a few edges. This is possible because some proteins interact with many others, representing so-called 'hubs', while many have only very few interactions [14]. Furthermore, these interactions appear to be confined to cellular compartments [24].
The reliability of experimental PPI data remains a matter of debate [13,22,23]. However, one should consider that binding is not an all-or-nothing process and that, at least in vivo, interactions are frequently competitive, as shown in the bZIP network [18]. Furthermore, the combination of data from various sources can significantly improve their quality. In our research efforts we have collated database information with data mined from the literature to generate more reliable, 'confirmed', datasets [1].
Several models of network evolution have been developed recently [20,21,26,29]. Some models are based on gene duplication, some on domain rearrangements. Others assume that an existing initial network is duplicated when all genes coding for the interacting proteins are duplicated simultaneously. This could happen, for example, via a whole-genome duplication or other large-scale duplication events, such as tandem duplications. A certain fraction of the interactions is then assumed to be lost again.
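The duplication-and-loss idea behind some of these models can be illustrated with a minimal sketch. It is not an implementation of any of the cited models [20,21,26,29]: the function names, the prime suffix for duplicated genes, the two-protein ancestral network and the independent, uniform loss probability are all assumptions made for illustration.

```python
import random


def copy_name(node):
    """Hypothetical naming convention: the duplicate of 'A' is called "A'"."""
    return node + "'"


def duplicate_all(edges):
    """Duplicate every node at once; each copy inherits all interactions.

    Edges are undirected and stored as sorted tuples, so a homodimer (A, A)
    yields (A, A), (A, A') and (A', A') after duplication.
    """
    new_edges = set()
    for a, b in edges:
        for x in (a, copy_name(a)):
            for y in (b, copy_name(b)):
                new_edges.add(tuple(sorted((x, y))))
    return new_edges


def lose_interactions(edges, loss_fraction, rng):
    """Keep each edge independently with probability 1 - loss_fraction."""
    return {edge for edge in edges if rng.random() >= loss_fraction}


rng = random.Random(0)                      # fixed seed for reproducibility
ancestral = {("A", "A"), ("A", "B")}        # toy network: A homodimerizes and binds B
duplicated = duplicate_all(ancestral)
surviving = lose_interactions(duplicated, 0.5, rng)
```

Repeating the two steps over several rounds, with different values of `loss_fraction`, is one simple way to explore how the rate of interaction loss shapes the resulting connectivity.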
The goal of this study is to complement existing perspectives on network evolution with studies based on phylogenies and comparative analysis of genomic and proteomic data. We have chosen to work on several families of eukaryotic transcription factors for which abundant data from different sources are available and for which phylogenies are either known or can be computed with reasonable reliability. In particular, we concentrate on the question of how the evolution of interaction specificity, such as homo- vs. heterodimerization, may reveal the dynamics of network evolution. Accordingly, results on the families of NR, bZIP and bHLH proteins are reported and discussed in the following.

Methods
Interactions for all three networks were extracted from a literature search in PubMed, with the focus on mammalian transcription factors (http://www4.ncbi.nlm.nih.gov/PubMed/). Specifically for the bZIPs, interactions were extracted from Newman and Keating [18], in which protein arrays were used to test 492 pairings of a nearly complete set of coiled-coil strands from all known human bZIP proteins. We included interactions that were symmetrical in the consensus interaction matrix, i.e. those for which using a protein on the array surface or as a probe did not inhibit the interaction. We also disregarded interactions with Z < 1, which is the threshold value for the signal-to-noise ratio. Data for bHLH proteins are the same as in Amoutzias et al. [1] but with close family members collapsed to one node. For the NR family we excluded all data for which interactions were shown not to be direct.
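The two filtering criteria for the bZIP array data can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually applied to the published matrix: the function name `confirmed_interactions`, the dictionary layout keyed by (surface protein, probe protein), the placeholder protein names and all Z values are hypothetical, and symmetry is interpreted here as both orientations reaching the threshold.

```python
def confirmed_interactions(z_scores, threshold=1.0):
    """Keep pairs whose Z value reaches the threshold in both orientations.

    z_scores maps (surface_protein, probe_protein) to a Z value. A pair is
    kept only if the reverse orientation was also measured and both values
    are >= threshold. Results are undirected, stored as sorted tuples.
    """
    confirmed = set()
    for (surface, probe), z in z_scores.items():
        reverse = z_scores.get((probe, surface))
        if reverse is None:
            continue                      # orientation missing: not symmetric
        if z >= threshold and reverse >= threshold:
            confirmed.add(tuple(sorted((surface, probe))))
    return confirmed


# Hypothetical values for placeholder proteins P1-P3 (not measured data).
z_scores = {
    ("P1", "P2"): 3.2, ("P2", "P1"): 2.5,   # symmetric, kept
    ("P1", "P3"): 4.0, ("P3", "P1"): 0.4,   # one orientation below threshold
    ("P2", "P2"): 1.8,                      # homodimer, its own reverse
}
kept = confirmed_interactions(z_scores)
```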

bHLH proteins
We first focus on bHLH proteins. They represent an ancient family of transcription factors that is present in all eukaryotic clades and that has expanded since the emergence of multicellularity [16]. bHLH proteins are involved in cell-cycle regulation, metabolic sensing and tissue-specific development. Their main constituent is a ≈60 amino acid-long domain comprising the DNA-binding basic region and the HLH motif, which mediates interaction. However, they usually also contain additional dimerization domains which contribute to the specification of homo- and heterodimerization. While further details on their evolution are reported elsewhere [4,16], the feature which is most relevant for this study is that five major groups exist. These can be distinguished by their domain architecture, in particular the presence of an additional dimerization domain (leucine zipper (LZ), PAS or Orange), such that their clustering into groups is fairly reliable. Recently, we found that PPIs between bHLH proteins form two hub-based networks. One of them can be divided into two hub-based subnetworks. The two networks are strikingly similar in their topology, but no interactions between any member of one network and any member of the other are known to exist [1].
When these networks of protein families are analysed (Figure 1), their most prominent feature is that, just like the overall network, they have a hub-like topology. Interaction data have been obtained from a number of different sources and the low number of interactions for the majority of the nodes has been explicitly confirmed by experiments on these nodes. Hubs are a feature of scale-free networks and such properties have been observed in social networks, the world-wide web, the western US power grid, citations of scientific publications, metabolic networks, protein domains, protein interaction networks and the distribution of proteins in sequence space [3,5,6,31,32]. For the bHLH protein-interaction network, we calculated the frequency of nodes with K interactions, P(K), and plotted this against the number of interactions, K. The plot of ln[P(K)] against ln(K) (Figure 2) shows clearly that the bHLH PPI network is scale-free, because the distribution of connectivity decays as a power law, $P(K) \sim K^{-\gamma}$, where γ ≈ 1. A scale-free network is a non-homogeneous network with a few highly connected nodes (the hubs) and many poorly connected nodes (the peripheral members). Other analyses of scale-free networks estimated that γ is usually in the range 2-3 [11]. The lower value of γ found here appears to be a direct consequence of gene duplication events (single or large-scale) that have generated new peripheral proteins, which then interact preferentially with the hub. Apparently the homodimerizing factors are the most highly linked (at least within their networks) and represent hubs (Figure 2). Consequently, we conclude that homodimerization was the ancestral function. This feature was maintained in the emergent hub, even though the peripheral members, which emerged as constitutive interaction partners, became free to bind only under more limited physiological conditions. This is reflected by the fact that hubs are typically constitutive and widely expressed, while peripheral members are often tissue-specific. It is also noteworthy that in a recently published model [20], which is based on gene duplication alone, γ ≈ 1.2 when only few links (PPIs) are lost, while the 'classical' values of γ ≈ 2-3 require a relatively high rate of loss.
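The estimate of γ from a plot of ln[P(K)] against ln(K) can be sketched with an ordinary least-squares fit. The fitting procedure used for Figure 2 is not stated in the text, so the straight-line fit below is an assumption, the function names are hypothetical, and the degree list is synthetic (constructed so that P(K) is exactly proportional to 1/K), not the bHLH data.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Return {K: P(K)}, the fraction of nodes with K interactions (K > 0)."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: c / total for k, c in counts.items() if k > 0}


def estimate_gamma(distribution):
    """Fit ln P(K) = c - gamma * ln K by least squares and return gamma."""
    xs = [math.log(k) for k in distribution]
    ys = [math.log(p) for p in distribution.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx                      # gamma is minus the slope


# Synthetic degrees: 8 nodes with K=1, 4 with K=2, 2 with K=4, 1 with K=8.
degrees = [1] * 8 + [2] * 4 + [4] * 2 + [8]
distribution = degree_distribution(degrees)
gamma = estimate_gamma(distribution)       # close to 1 for this synthetic input
```

For small networks such a fit is sensitive to the few high-degree nodes, so the resulting value of γ is best read as a rough descriptor rather than a precise parameter.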

NR proteins
Members of the superfamily of nuclear receptor proteins are transcription factors which can homo- or heterodimerize or even bind to DNA as monomers. With the exception of a few so-called 'orphan receptors', they are activated by binding a ligand and regulate metabolic pathways, development, homoeostasis and reproduction [15].
NR proteins are organized in four domains: the N-terminal transactivation domain A, the DNA-binding domain (DBD) that contains two zinc fingers, the ligand-binding domain (LBD) and a flexible hinge between the DBD and the LBD. The DBD and the LBD are involved in dimerization.
Phylogenies based on sequence analysis between LBD and DBD revealed six distinct subfamilies (I-VI), with two of them (I and IV) being more closely related than the rest [15]. No other phylogenetic relationship between the subfamilies has been reliably inferred as yet.
Again, the interaction pattern is well correlated with the group membership. While there are, in general, no interactions between different groups of receptors within subfamilies I and IV, most members of subfamilies I and IV tend to form efficient heterodimers with some members of subfamily II, whereas in other subfamilies, homodimerization is most frequent (see Figure 1).
Laudet and co-workers [15] investigated the evolutionary rates for the members of the NR family. While they reported strong differences in evolutionary rates between individual proteins, there were no significant differences among the six subfamilies. However, by relating individual proteins to protein-protein interaction data, it becomes apparent that it is the hubs, which are generally also homodimerizing, that evolved more slowly than the peripheral members. This is in agreement with the idea that hubs are the predecessors from which new interaction partners (repressors or activators) emerged by gene duplication, followed by mutation. The two weakly linked proteins ER and ERR (see Figure 1b) are homodimerizing. Apparently they arose more recently and have not as yet differentiated into an independent network. Further evidence that the hubs are the ancestral part of the network comes from the fact that they were present in early metazoans such as sponges and cnidarians, whereas the peripheral members that belong to subfamilies I and IV appeared much later, after the emergence of the Bilateria [10,30].

bZIP proteins
bZIP proteins are an ancient family of transcription factors present in all eukaryotic clades. They regulate genes that are involved in proliferation, immune response, cell death and response to stress and toxicity [2]. bZIP proteins are named after their well-conserved α-helical bZIP domain. The bZIP domain comprises the DNA-binding basic region (BR) and, C-terminally adjacent to it, the leucine zipper (LZ), which forms a coiled coil and determines the dimerization partner for homo- and heterodimers [7,17-19].
It is difficult to reconstruct the phylogeny from sequence information alone and domain arrangements are not as conclusive as for the bHLH family.

However, a classification has been suggested, based on the amino acid composition and on dimerization partners [28]. More comprehensive interaction data from bZIP proteins have been analysed most recently, using protein chip technology [18]. Applying network analysis as above reveals a more even distribution of connectivities; however, there is still a fair amount of clustering (Figure 2). This is an indication of an evolutionary mechanism for this family which is different from the bHLH and NR families. Also, there is no such clear differentiation between homo- and heterodimerization, since most proteins have at least a limited capacity to homodimerize.
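One common way to quantify clustering in an interaction network is the local clustering coefficient, i.e. the fraction of a node's partner pairs that also interact with each other. The text does not state which measure underlies the observation above, so the following is only a sketch under that assumption; the function names and both toy networks are hypothetical, and self-loops (homodimers) are ignored when counting neighbours.

```python
from itertools import combinations


def clustering_coefficient(network, node):
    """Fraction of pairs of a node's partners that interact with each other."""
    neighbours = network[node] - {node}    # ignore homodimer self-loops
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(neighbours), 2)
                if b in network[a])
    return 2.0 * links / (k * (k - 1))


def mean_clustering(network):
    """Average of the local clustering coefficients over all nodes."""
    return sum(clustering_coefficient(network, n) for n in network) / len(network)


# Toy hub-like network: a homodimerizing hub H with three peripheral partners.
star = {"H": {"H", "a", "b", "c"}, "a": {"H"}, "b": {"H"}, "c": {"H"}}
# Toy fully connected trio: every protein binds the other two.
trio = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}

star_clustering = mean_clustering(star)
trio_clustering = mean_clustering(trio)
```

A pure hub-and-spoke topology gives a mean coefficient of zero, whereas densely interconnected partners push it towards one, which is why this measure helps to separate the two kinds of network structure discussed here.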

Discussion
In this study we have combined network analysis and phylogenies to investigate the emergence of new interactions in the gene networks of eukaryotic transcription factors.
In all three families we have studied (bHLH, bZIP and NR), there is an indication that homodimerization preceded the development of heterodimerization. In NRs, strong evidence comes from phylogenetic studies and the distribution of NR families in early metazoans [10,30]. The evolution of the bHLH networks is also consistent with the ancestral nature of homodimerization. Typically, the ability to homodimerize appears to be conserved, such that hubs emerge from the ancestral homodimerizing proteins. Subsequent gene duplication (large- or small-scale) and mutation result in changes in dimerization properties, thus forming a complex network. In particular, the bHLH family has apparently evolved by repeated single-gene duplications which led to the initial network topology [1]. Subsequent large-scale gene duplications may have increased the complexity of the bHLH network. While gene duplication was also important in the evolution of the NR and bZIP networks, the central role of single-gene duplication cannot be confirmed with current data.
The basic principles, i.e. the hub-like structure of the interaction networks, agree with the global features shown by other groups. However, the statistical properties (γ) of the subnetworks differ between the families and deviate to varying degrees from the global properties of PPI networks analysed previously.
Obviously, these differences reflect different evolutionary dynamics, such as the relative frequency of gene duplication, large-scale duplication events and loss of interactions. The loss of interactions appears to be particularly important in the initial stages of network development and its influence on the value of γ appears to be in good agreement with the predictions by Pastor-Satorras and co-workers [20].
Our results have obvious implications for the understanding of network evolution. Further theoretical studies and models of network evolution should consider these variations in γ and the fact that, at least in many cases, heterodimerization emerges from homodimerization.