Protein–DNA Interactions: The Story so Far and a New Method for Prediction

This review describes methods for the prediction of DNA-binding function, and specifically summarizes a new method using 3D structural templates. The new method features the HTH motif, which is found in approximately one-third of DNA-binding protein families. A library of 3D structural templates of HTH motifs was derived from proteins in the PDB. Templates were scanned against complete protein structures and the optimal superposition of a template on a structure was calculated. Significance thresholds in terms of a minimum root mean squared deviation (rmsd) of an optimal superposition, and a minimum motif accessible surface area (ASA), have been calculated. In this way, it is possible to scan the template library against proteins of unknown function to make predictions about DNA-binding functionality.


Background: the story so far
The 3D structures of over 660 proteins bound to DNA molecules have been determined [Nucleic Acid Database (NDB): version 23 April 2003 [4]]. These proteins have diverse structural folds, and achieve binding and recognition of specific sites on nucleic acids in many different ways. Protein-DNA interactions are critical for the flow of biological information from genes to proteins, and have consequently been the focus of considerable research. Much of this has involved the description of specific complexes (for review of recently solved structures, see [1]) and of families of proteins sharing the same DNA binding motif (e.g. [6,2,19]).
With the large number of protein-DNA complexes deposited in the Protein Data Bank (PDB) [5] and curated in the NDB [4], it has been possible to analyse large sets of non-homologous complexes and derive general characteristics of DNA binding sites on proteins [10,14,12]. These sites comprise discontinuous sequence segments forming one or more hydrophilic surfaces capable of direct and water-mediated hydrogen bonds. The extent of the binding site varies widely [618-2833Å² accessible surface area (ASA) per monomer] and most sites are rich in lysine and arginine residues [10,14].
Proteins binding to DNA commonly force structural deformation upon both parts of the complex. The deformation of the DNA, usually described as DNA bending, has been extensively studied (e.g. [15]). Forced bending commonly occurs through specific kinks of the double helix, generally at pyrimidine-purine base steps [7]. In comparing bound and unbound DNA molecules the deformations in bound DNA were observed to be more extreme than those of unbound DNA [10]. The conformational change in the protein can also be substantial with disorder-to-order transitions, domain movements and quaternary changes all documented [14].
With the recent development of structural genomics projects, in which protein structures are solved that have very low sequence identity (and potentially little or no fold similarity) to any currently in the PDB [5], the number of DNA-binding proteins in the PDB is set to increase. This will provide further structures for analysis, but more importantly gives rise to a need for methods that predict the potential DNA-binding function of a new structure that has little or no structural similarity to any currently known.
Methods for the prediction of protein-DNA interactions fall into two categories: the prediction of the DNA sequence bound given a protein binding site, and the prediction of a DNA binding site on the protein given the unbound structure. The first category has been addressed using pairwise potentials that estimate the likelihood of an amino acid making favourable contacts with a DNA base [13,11]. The second category of prediction is more pertinent to the problems faced by structural genomics projects, which require fast and reliable methods for the prediction of protein function, and has only recently been addressed.
The paper by Stawiski et al. [18] presents an automated method for the prediction of DNA-binding proteins, using a combination of features derived for electrostatic patches on the protein surface. The method uses a neural network to discriminate between DNA-binding and non-DNA-binding positive electrostatic patches, using parameters such as surface area, hydrogen bonding potential, amino acid composition, surface concavity and sequence conservation. The method predicts DNA-binding proteins with high accuracy, and is capable of predicting those with novel binding motifs, and those solved in an unbound state. This is the first automated prediction method that has been successfully applied to a large data set.
In contrast to the complex method of Stawiski et al. [18], a relatively simple and fast method is now presented that is based on the assessment of the superposition of 3D structural templates of DNA-binding motifs on complete protein structures [9]. The method uses the HTH motif as a prototype template, but it is envisaged that the method is applicable to other DNA-binding motifs. The simplicity of the method has allowed it to be set up as a web server (http://www.ebi.ac.uk/thorntonsrv/databases/DNA-motifs), which allows users to upload published and proprietary protein structures for the prediction of DNA-binding function.

A new method for prediction using structural templates
The starting point for the new method was a list of 86 non-identical proteins from the PDB known to contain at least one HTH motif. The list was derived from a combination of searches with hidden Markov models from multiple sequence alignments in Pfam [3] and SMART [17] and initial structure database searches [9]. These proteins were clustered into seven fold families (H-level) using CATH [16], and within each family the structure with the highest resolution was taken as a representative.
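The selection of one representative per fold family can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `pick_representatives`, the tuple layout and the placeholder labels are hypothetical, and "highest resolution" is taken to mean the smallest resolution value in Å.

```python
def pick_representatives(entries):
    """Return {family: structure_id}, keeping the best-resolved structure per family.

    entries: iterable of (structure_id, family, resolution_in_angstrom) tuples.
    A smaller resolution value means a better-resolved structure.
    """
    best = {}
    for structure_id, family, resolution in entries:
        if family not in best or resolution < best[family][1]:
            best[family] = (structure_id, resolution)
    return {family: structure_id for family, (structure_id, _) in best.items()}


# Placeholder labels and values, for illustration only (not real PDB entries).
example = [("s1", "famA", 2.1), ("s2", "famA", 1.7), ("s3", "famB", 2.5)]
print(pick_representatives(example))  # {'famA': 's2', 'famB': 's3'}
```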
For each representative an HTH motif template was created. A template is a set of Cα backbone coordinates of an HTH motif, sequentially continuous in terms of residue number, and comprising all the residues from two residues preceding H1 to two residues succeeding H2. The templates were scanned against whole protein structures using an algorithm that computed a gapless optimal superposition. The match of a template on a complete protein was taken as the minimum rmsd obtained from all possible superpositions.
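A gapless scan of this kind can be sketched in a few lines. The text does not say which superposition algorithm was used, so this sketch assumes Horn's quaternion formulation, in which the minimum rmsd follows from the largest eigenvalue of a 4×4 symmetric matrix (found here with Jacobi rotations so that only the standard library is needed). The function names and the list-of-tuples coordinate format are hypothetical.

```python
import math

def _centre(coords):
    """Translate a list of (x, y, z) points so that their centroid is the origin."""
    n = len(coords)
    cx, cy, cz = (sum(p[i] for p in coords) / n for i in range(3))
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

def _largest_eigenvalue(matrix, sweeps=50):
    """Largest eigenvalue of a small symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-18:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return max(a[i][i] for i in range(n))

def superposed_rmsd(p, q):
    """Minimum rmsd between two equal-length point sets over all rigid-body motions."""
    p, q = _centre(p), _centre(q)
    s = [[sum(a[i] * b[j] for a, b in zip(p, q)) for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    key = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    g = sum(x * x + y * y + z * z for x, y, z in p + q)
    msd = (g - 2.0 * _largest_eigenvalue(key)) / len(p)
    return math.sqrt(max(msd, 0.0))

def scan_template(template, structure):
    """Slide the template along the structure without gaps; return (best rmsd, start index)."""
    m = len(template)
    best = (float("inf"), -1)
    for start in range(len(structure) - m + 1):
        rmsd = superposed_rmsd(template, structure[start:start + m])
        best = min(best, (rmsd, start))
    return best
```

Here `template` and `structure` would each be an ordered list of Cα coordinates, and `scan_template` returns the lowest rmsd found over every contiguous window of template length. In practice a linear-algebra library would normally replace the hand-written eigenvalue routine.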
The seven templates were scanned against (a) the 86 non-identical HTH-containing structures (termed HTH × TRUE) and (b) the 8264 non-identical structures in the CATH database that excluded the known HTH proteins (termed HTH × FALSE). In each case the rmsd recorded for each structure was the minimum value calculated from any of the templates (excluding self-matches). The distribution of rmsd values is shown as a histogram in Figure 1. Using these data, a threshold value (below which a protein was predicted to contain a DNA-binding HTH motif) was selected at 1.6Å. At this threshold there are 0.7% (61/8264) false positives, i.e. proteins predicted to include a DNA-binding HTH motif but not known to do so. This threshold also gave 11.6% (10/86) false negatives, i.e. proteins known to include a DNA-binding HTH motif but predicted as not containing one, and 88.4% (76/86) true hits.
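The thresholding step and the quoted percentages can be checked with a short sketch. The counts and the 1.6Å threshold come from the text; the function names and the dictionary mapping template identifiers to rmsd values are hypothetical conveniences.

```python
RMSD_THRESHOLD = 1.6  # Å, the threshold quoted in the text

def best_rmsd(structure_id, rmsd_by_template):
    """Lowest rmsd over all templates, ignoring a template derived from the structure itself."""
    return min(r for tid, r in rmsd_by_template.items() if tid != structure_id)

def predicted_hth(rmsd, threshold=RMSD_THRESHOLD):
    """A structure is predicted to contain an HTH motif when its best rmsd is below the threshold."""
    return rmsd < threshold

def percent(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * count / total, 1)

print(percent(61, 8264))  # false positives in HTH × FALSE -> 0.7
print(percent(10, 86))    # false negatives in HTH × TRUE  -> 11.6
print(percent(76, 86))    # true hits in HTH × TRUE        -> 88.4
```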
Figure 1. Histogram of the distribution of rmsd values for HTH × TRUE and HTH × FALSE. A threshold value is indicated at 1.6Å, below which a protein is predicted to contain a DNA-binding HTH motif (note: the maximum rmsd shown is 2.7Å but the distribution for HTH × FALSE extended to 6.1Å).
The number of false positives was reduced by analysing the accessible surface area (ASA) of the residues comprising the HTH templates using NACCESS [8]. The absolute ASA for the residues in the 86 non-identical HTH templates ranged from 992Å² to 2740Å². A minimum ASA value for a DNA-binding HTH motif was set at 990Å². Using this value, the number of false positive proteins was reduced to 0.5% (38/8264). Of the remaining 38 structures classified as false positive matches, three structures were predicted to include new HTH motifs: polymerase I (1taq), histone acetyltransferase (1fy7) and methyltransferase (1mgt).
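Combining the rmsd threshold with the minimum ASA gives a two-stage filter, sketched below. The two cut-off values are those given in the text. The function name is hypothetical, and the sketch assumes that the ASA of the matched motif residues has already been computed by an external program such as NACCESS.

```python
RMSD_THRESHOLD = 1.6   # Å, a match must fall below this value
MIN_MOTIF_ASA = 990.0  # Å², minimum accessible surface area of the matched motif

def is_dna_binding_hth(rmsd, motif_asa):
    """Predict a DNA-binding HTH motif only if both the rmsd and ASA criteria are met."""
    return rmsd < RMSD_THRESHOLD and motif_asa >= MIN_MOTIF_ASA

# Hypothetical (rmsd, ASA) pairs, for illustration only.
print(is_dna_binding_hth(1.2, 1500.0))  # True: close match on an exposed motif
print(is_dna_binding_hth(1.2, 700.0))   # False: close match, but the motif is buried
print(is_dna_binding_hth(2.3, 1500.0))  # False: exposed, but a poor structural match

# False positives remaining after the ASA filter, as quoted in the text.
print(round(100.0 * 38 / 8264, 1))      # 0.5
```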
To demonstrate the potential of the method, the template library was scanned against 30 structures from the Midwest Center for Structural Genomics (MCSG) Initiative (http://www.mcsg.anl.gov). One structure (target APS048) was predicted to have an HTH motif involved in DNA binding. This target (1mkm) is the structure of T. maritima 0065, a member of the IclR (isocitrate lyase regulator) transcriptional factor family [20]. It is now known that the N-terminal domain of the structure has a DNA-binding function, with an HTH motif comprising H2 and H3 with a four-residue turn between them [20]. This motif is the one matched by a template at position 21-44 of the target.

Discussion
This new method of using 3D structural templates to make predictions about the potential DNA-binding function of proteins has been successfully used to make predictions for structural genomics targets. However, the functionality of any new prediction method will be measured by its independence from overall fold similarity. For the current method the occurrence of matches between templates derived from structures of one fold family and complete structures from a different fold family clearly demonstrates the method's independence of fold similarity (Figure 2).
Methods such as the one described here (and more fully elsewhere [9]) and that recently published by Stawiski et al. [18] are amongst the first to address the issue of predicting DNA-binding function. These, and other new methods, will be an integral part of a larger prediction system that will be capable of making inferences on function, from the presence of binding clefts, and the identification of enzyme active sites and small molecule binding sites.
Figure 2. Wheel diagram depicting the identification of HTH motifs using structural templates. The seven proteins from which motifs were derived are representatives of different fold families. A line joining two PDB codes indicates the successful match of one structure's template against the complete structure of the second protein. A successful match was taken as one where a maximal superposition gave an rmsd < 1.6Å. The diagram effectively shows that the templates are generic, identifying structures from more than one fold family.