Protein Interactions from Complexes: A Structural Perspective

By combining crystallographic information with protein-interaction data obtained through traditional experimental means, this paper determines the most appropriate method for generating protein-interaction networks that incorporate data derived from protein complexes. We propose that a combined method should be considered; in which complexes composed of five chains or less are decomposed using the matrix model, whereas the spoke model is used to derive pairwise interactions for those with six chains or more. The results presented here should improve the accuracy and relevance of studies investigating the topology of protein-interaction networks.


INTRODUCTION
The working of living cells is underpinned by an almost overwhelming array of molecular interactions that form a complex and multifaceted network: "the interactome." The explosive growth in both the type and volume of experimental data available to researchers interested in elucidating the properties of the interactome has led to a wide range of studies investigating a number of the core biological networks. Networks of gene regulation [1], metabolism [2][3][4], and protein interactions [5] have all been studied using data from the yeast Saccharomyces cerevisiae. The majority of cellular processes are mediated by protein-protein interactions, including signal transduction pathways and the regulation of gene expression. Thus protein-interaction networks have been afforded the most attention, in the hope that knowledge of their structure and topology will help us to understand their functions and evolutionary history.
Recently, there have been a number of large-scale protein-interaction studies. Efforts have ranged from studies that investigated only a subset of the protein interactome to attempts identifying all protein interactions within the cell [6]. It is clear that none of these studies, when taken individually, constitute a comprehensive picture of the underlying biology. Therefore, in order to derive the greatest benefit from their analyses, researchers wishing to undertake computational studies of the protein-interaction network as a whole must first integrate the results of many individual experiments in order to produce a single unified network representation. Typically, this is done by use of a graphtheoretical approach in which the proteins within the network are depicted as nodes, with interacting proteins connected by undirected links (edges) [7].
Generating this type of network representation is relatively easy for studies employing technologies that identify interactions in a binary fashion, such as the yeast two-hybrid system Y2H [8]. However, for experiments that identify all the proteins within a given complex, the process of defining a set of pairwise protein-protein interactions is more difficult. Protein complexes are often isolated in an affinitypurification experiment in which a single protein (the bait) is provided with a molecular tag (such as FLAG [9] or TAP [10]) that allows the purification of the "bait" together with all of the "prey" proteins that belong to the same multiprotein complex. It is unlikely that every "prey" protein interacts directly with the "bait" protein used to purify the complex; rather, the topology of the complex will include both "bait-prey" and "prey-prey" interactions. The true topology of these experimentally derived complexes cannot be determined from the individual experiments themselves, although some progress towards achieving this goal has been made by combining different datasets [11].
In practice, a model is used, in which pairwise interactions are assigned by applying either a matrix-or spokebased modelling methodology to each complex [12,13] (Figure 1). Recently, we showed that the choice of modelling methodology has a dramatic effect on the topological features of the protein-interaction network [14]. However, it is unclear, at present, which of the two proposed models (if either) should be selected when performing this type of analysis. In this study, we draw on crystallographic data within the protein quaternary structure (PQS) database [15] to determine the actual topology of the protein complexes under investigation.

METHODS
We began by extracting all available protein complexes from the PQS database (≈ 32,000 structures as of 29/07/05). An automated protocol was then used to triage the collection until a set of complexes that fulfilled the criterion for analysis remained. That is, they were at least heterotrimeric in nature (i.e., they contained at least three polypeptide chains, with unique amino-acid sequences of greater than or equal to 30 residues), as analysis of the connectivity of complexes with less than three unique chains is clearly meaningless. This resulted in approximately 900 protein complexes that were then further filtered to identify and remove redundancy. Subcomplexes were collapsed into their parents and redundant structures were identified by performing an all-againstall comparison of each protein complex. Those complexes sharing two thirds or more of their chains with others in the set were then removed. The 133 protein complexes that remained after this triage procedure were then subjected to further analysis. Actual physical interactions between polypeptide chains within the complexes were identified by computational analysis of the crystal structure, using an empirically equivalent algorithm to the full atom contact (FAC) method employed by Gong et al. [16], with an interaction between two chains defined as the occurrence of at least five instances in which Cα atoms within the chains come within 7.5Å of each other. The ability of the matrix model to define the real set of interactions within these complexes was then assessed by plotting the actual number of interactions in a complex (as a percentage of those defined by the matrix model) against the number of unique chains within that complex ( Figure 2).
We then performed a direct comparison between the matrix and spoke models by combining crystallographic data with protein complex data obtained experimentally for the yeast Saccharomyces cerevisiae [9,10,17,18]. First, we further triaged our protein complex dataset to include only (at least) heterotrimeric complexes composed of unique chains that had experimentally determined homologs in yeast. By extracting the protein sequence of each chain from every complex within the PQS and blasting it against the yeast genome (cutoff; e 1e −10 , fraction of conserved residues 35), we were able to identify, and make use of, protein complexes determined in other species that are structural homologs of those found in Saccharomyces cerevisiae. After collapsing subcomplexes into their parents and removing any redundant structures, we were left with a set of thirteen structures that were suitable for use in the final part of our analysis. Crystallographic data for each structure and the FAC method were used to determine the "true topology" of the interactions within a complex and the ability of each model to describe it was assessed by calculating a score based on the following functions: where, for the matrix model (1), TP is the number of interactions observed in both the crystal structure and in the experimental network and FP (3) is the number of false positive interactions between the polypeptide chains, calculated using the formula where N is the number of unique polypeptide chains in the crystal structure. For the spoke model (2), TP is as above, FN is the false-negative count, calculated by considering interactions between proteins that are identified in the crystal structure and that were used as baits in high-throughput proteininteraction studies, but for which no experimental interaction was identified. Finally, FP is the false-positive count; the number of interactions identified between proteins in an experimentally determined complex that do not occur in the actual crystal structure. All self-interactions identified in both the experimental and PQS complexes were excluded.

RESULTS AND DISCUSSION
Analysis of protein connectivity within complexes over a range of sizes revealed that those complexes composed of ≤ 5 unique polypeptide chains are generally appropriately described by the matrix model (as can be seen from the points lying along the red line of the plot in Figure 2). However, as the number of chains within a complex increases, its topology is less and less likely to be described by the fully connected graph specified by the matrix model and the Luke Hakes et al.
"performance" of this model rapidly diminishes. This suggests that for complexes composed of > 5 chains, the application of the spoke model is a more appropriate choice. In order to test this hypothesis, we performed a direct comparison of the two models by applying our scoring functions to the eleven usable structures that we had identified as either being purified from yeast or that we could say had an identifiable yeast homolog. As expected, this analysis revealed that (in every case) the matrix model performed as well or better than the spoke model for complexes containing up to five unique chains ( Table 1). The superior performance of the spoke model and the expected reduction in performance of the matrix model for complexes with higher chain numbers were also observed, supporting the hypothesis that the spoke model provides the appropriate description of these larger complexes. Therefore, we suggest that when using this type of data to construct protein-interaction networks, the optimal method for decomposing the interactions into node-edge relationships is a combined one, with the matrix model used for complexes of five chains or less, and the spoke model for complexes of six chains or more.
Clearly, no single model (either matrix or spoke) can provide the true representation of the actual interactions that occur within protein complexes and we make no attempt here to state that this is the case. Rather, we aim to suggest a method by which experimental data from studies elucidating protein complexes are best processed so that they most accurately depict reality, prior to their incorporation into a global protein-interaction network and its subsequent analysis using graph-theoretical methods.
It has long been assumed that the matrix model provides a relatively poor description of the true interaction space for any given complex, and previous work by Bader et al. demonstrated that the spoke model was more accurate (in agreement with published literature) [12]. The present study is complementary to, and an extension of, earlier work that aimed at validating experimentally determined interactions [19]. Edwards et al. compared the topology of three large protein complexes of known structure to a wide range of proteomics data, in order to estimate the error rate associated with the matrix model. We have taken this analysis a step further, and in addition to assessing the performance of the matrix model using a larger dataset of known structures (133), we also assess the performance of the spoke model and suggest which is likely to have the better performance and in what circumstances. We find that the large complexes studied by Edwards et al. [19] are more appropriately described by the spoke model, suggesting that their estimates of error rates may be pessimistic. This improvement in our understanding of how the individual proteins within complexes interact, and the increase in clarity about how data on protein complexes derived from proteomics studies are best processed should allow us to produce more accurate and meaningful network representations.
Crystal structures of large complexes provide the best way of validating the protein-protein interaction networks and for developing appropriate models for integrating and interpreting data from high-throughput studies that employ techniques like Y2H and TAP-tagging. Although, in general, the solution of protein crystal structures is becoming more automated, the structures of protein complexes must still be solved by careful and painstaking validation of the crystallized complex at each stage. Structural genomic initiatives will often systematically miss these complexes because they generally attempt only to produce, crystallize, and solve the structures of individual proteins. In this light, the lack of overlap between structural data and network interaction data, while striking, is not unexpected.
It should be noted that our analysis methodology regards the structural data deposited within the PQS database as being representative of the "real" biological unit (BU). In reality, the data provided by PQS is a prediction of the BU based on the crystallographic asymmetric unit (ASU). However, as these predictions have been found to be accurate in approximately 75% of tested cases [20], it is unlikely to substantially affect the general trends observed here.
While the paucity of overlap between the structurally solved protein complexes and those determined experimentally precluded any type of rigorous statistical analysis, we believe, given the result shown in Figure 2 (which covers 133 unique protein structures) and the clear trend identified in the direct comparison of the models, that our conclusions are valid.