Computer-Based Annotation of Putative AraC/XylS-Family Transcription Factors of Known Structure but Unknown Function

Currently, about 20 crystal structures per day are released and deposited in the Protein Data Bank. A significant fraction of these structures is produced by research groups associated with the structural genomics consortium. The biological function of many of these proteins is generally unknown or not validated by experiment. Therefore, a growing need for functional prediction of protein structures has emerged. Here we present an integrated bioinformatics method that combines sequence-based relationships and three-dimensional (3D) structural similarity of transcriptional regulators with computer prediction of their cognate DNA binding sequences. We applied this method to the AraC/XylS family of transcription factors, which is a large family of transcriptional regulators found in many bacteria controlling the expression of genes involved in diverse biological functions. Three putative new members of this family with known 3D structure but unknown function were identified for which a probable functional classification is provided. Our bioinformatics analyses suggest that they could be involved in plant cell wall degradation (Lin2118 protein from Listeria innocua, PDB code 3oou), symbiotic nitrogen fixation (protein from Chromobacterium violaceum, PDB code 3oio), and either metabolism of plant-derived biomass or nitrogen fixation (protein from Rhodopseudomonas palustris, PDB code 3mn2).


Introduction
Due to recent advances in high-throughput structure determination, structural genomics initiatives are proceeding fast. The Protein Structure Initiative (PSI) reports the determination of over 5,500 proteins structures in its Structural Genomics Knowledgebase as of November 2011 [1]. A total of 3,020 proteins (>50%) are of unknown function or only minimally characterized [1]. Hence, there is an urgent need for computational annotation methods of these structures of unknown function. We present here an integrated bioinformatics method for the functional annotation of transcription factors that combines sequence-based relationships and three-dimensional (3D) structural similarity of transcriptional regulators with computer prediction of their cognate DNA binding sequences. We applied this method to the AraC/XylS family of transcription factors.
AraC/XylS is a large family of transcriptional regulators found in many bacteria controlling the expression of genes with diverse biological functions involved in metabolism, stress response, and virulence [2][3][4][5]. Most members of this family are comprised of 250 to 300 amino acids and contain two domains: a conserved DNA binding domain (DBD) of ca. 100 residues found at the C-terminus of most regulators and a variable domain thought to be responsible for effector binding or multimerization [2]. A few exceptions include regulators which are considerably smaller (ca. 150 residues) containing only the DBD domain (e.g., MarA and SoxS) [6], substantially larger (e.g., HrpB, 477 residues), or that contain the DBD at the N-terminus (e.g., Rob) [6] or in the central domain (e.g., Ada) [7].
The DBD assumes a conserved tertiary structure of two helix-turn-helix (HTH) domains, each made up of three regulators involved in carbohydrate metabolism (e.g., ChbR and AraC from E. coli), metabolism of benzene derivatives (XylS from Pseudomonas putida), alkane metabolism (AlkR from Acinetobacter), and amine metabolism (FeaR from E. coli). Regulators of the stress response group are involved in the direct or indirect response to stress factors such as Ada (alkylating agents in E. coli), AdiY (acid resistance in E. coli), MarA (multiple antibiotic resistance in E. coli), SoxS (oxidative stress in E. coli), and RipA from Corynebacterium (iron limitation stress). The third group of pathogenesis or virulence contains family members, such as VirF which regulates the expression of proteins of the type III secretion system (TTSS) in Shigella flexneri, Rns from E. coli which regulates cell adhesion proteins, and Caf1R from Yersinia pestis which is involved in capsule formation [11]. Here, we have included a fourth category, dubbed bacteria-plant interaction, which includes RhrA from Rhizobium meliloti that regulates iron concentration required for nitrogen fixation by production of a siderophore and Y4fK from Rhizobium sp. (strain NGR234), which induces formation of nodules in plant roots where nitrogen fixation takes place.
In a recent update on the AraC/XylS family of transcriptional regulators, Ibarra et al. identified 58 well-characterized members and found a total of 1,974 known and putative members in 149 bacterial genomes [3]. They generated dendrograms from the multiple sequence alignment of the DBD to functionally classify putative family members.
The aim of the present study was the development of a bioinformatics method for the functional annotation of putative transcriptional regulators with solved tertiary structure but unknown function. We applied this method to members of the AraC/XylS family of transcriptional regulators. Similar to Ibarra et al., we analyzed the DBD to perform a functional classification. However, our method is not restricted to sequence information only but rather incorporates primary and tertiary structure information from the transcriptional regulator protein sequence, structure, and DNA binding site. We performed three-dimensional (3D) structure similarity searches against the whole Protein Data Bank (PDB) and identified three putative new members of the AraC/XylS family with uncharacterized function. Detailed structural analyses and sequence comparison between these three proteins and 62 well-characterized AraC/XylS family members suggest that they could be involved in plant cell wall degradation (Lin2118 protein, PDB code: 3oou), symbiotic nitrogen fixation (PDB code: 3oio), and either metabolism of plant-derived biomass or nitrogen fixation (PDB code: 3mn2). Structure-based DNA binding site prediction for these transcription factors is concordant with these functional assignments.

Structural Similarity Search.
Scanning of the complete PDB for target proteins sharing a similar tertiary structure with a given query protein was performed with the web server COPS-TopSearch of the Center of Applied Molecular Engineering (University of Salzburg) available at http://www.came.sbg.ac.at/ [12]. TopSearch is a fast threedimensional (3D) search method that relies on the TopMatch structure alignment software to rank protein structures available in the PDB (target structures t) according to their absolute similarity S(q,t) to a provided query structure q [13,14]. The absolute similarity is defined as the number of structurally equivalent residue pairs or the length of the structural alignment. Starting with MarA from E. coli (PDB code: 1bl0), the top 100 most similar proteins were manually inspected based on their structural alignment with the query and functional annotations available from the PDB and the UniProt database (http://www.uniprot.org/) [15]. The search was repeated for each known and putative AraC/XylS family member identified. Structures were manually modified to contain the DNA binding domain only, where necessary. More than 70,000 protein structures (PDB release: 6 June 2011) were searched for similarity and a total of thirteen (10 known and 3 putative) structures of AraC/XylS-family transcription factors were identified.

Multiple-Sequence Alignment.
Protein sequences that belong to the HTH-AraC family (PFAM family PF00165) were retrieved from the UniProt database [15,18]. Only those entries marked as reviewed by UniProt and with experimental demonstration of their functional role were selected. A total of 62 sequences fulfilled the criteria (Table S1 see in Supplementary Material available online at doi:10.1155/2012/103132.). Sequences from four PDB structures (PDB codes: 3lsg, 3mn2, 3oio, and 3oou) identified from the structural similarity search were also incorporated into this dataset.
A multiple-sequence alignment was constructed with the MAFFT 6 web server available at http://mafft.cbrc.jp/ alignment/server/ [19]. The gap penalty was set to a value of 3.0. Based on the multiple sequence alignment and available structural information, we defined a region of 100 amino acids as the DBD (corresponding to Ile-13 to Thr-112 in the sequence of MarA from E. coli, PDB code: 1bl0). Assignment of biological roles (regulated biological processes) to the sequences was transferred from evidence described in the literature where possible, following a similar nomenclature to that already proposed [2,3].

Dendrogram Generation.
The multiple-sequence alignment obtained as described previously was used to build a dendrogram of HTH-AraC family members with experimental evidence of their biological role. The tree was constructed employing the maximum likelihood optimization criteria with the software MEGA5 [20]. The WAG amino acid substitution model was selected using the ProtTest server and applied as previously described [21,22]. Bootstrap values were calculated with one hundred replicates. Phylogenetic tree images were produced with the iTOL web server tool [23] available at http://itol.embl.de/.

2.4.
Multiple-Structure Alignment. PDB files were manually modified to include only amino acids of the defined 100residue region of the DBD. Then, a multiple-structure alignment of the DBD was constructed with the SALIGN module from the MODELLER version 9.9 software package [16,25]. The SALIGN module reports a table with the number of equivalent C α positions (the alignment length; 3.5Å cut-off), the root mean squared (RMS) distance of equivalent positions, and the sequence identity of equivalent residues for all pairs of proteins, as well as the multiplesequence alignment (MSA) derived from the multiple optimal superposition of protein structures. A dendrogram of the structure-derived MSA was generated as described previously.

Comparative
Modeling. MODELLER version 9.9 was employed to generate full-atom 3D comparative models of protein-DNA complexes [25]. The complex of MarA with the 22-mer DNA binding site mar (PDB code: 1bl0) was defined as template. DNA was treated as rigid body (nucleotides were defined as "block residues") and ten models were initially generated. For each target, a single model with reasonable orientation of key residues of helices three and six was finally selected by visual inspection and energy minimized with the Molecular Operating Environment (MOE) version 2010. 10 [17]. Hydrogen atoms were added, nucleotide atoms were fixed, protein backbone atoms were restrained by a quadratic force term, and the protein part was energy minimized with the AMBER99 [26] force field and a Born solvation term until a gradient of 0.05 kcal/(molÅ) was reached.
2.6. DNA Binding Site Prediction. The prepared comparative models were further processed to construct static protein-DNA models with varied DNA sequence. Briefly, the protein part and the DNA backbone were fixed and DNA bases were permitted to vary, thereby preserving the double-helix structural parameters of the initial model. The software 3DNA [27] was employed to build a new DNA double helix with varied sequence while at the same time retaining the helical parameters of the original DNA structure. For each initial model, 10,000 models with randomized DNA sequences were built. Next, from the Protein-DNA Interface database [28], a nonredundant set of 208 protein-DNA complexes was obtained. This set was used to derive statistical distance-dependent pairwise potentials at the protein-DNA interface. The statistical potential parameters were the same as those previously described [29,30]. These potentials were then utilized to score the observed protein-DNA interactions in the comparative models. From each ensemble of random models, sets of low-energy structures were selected according to the 0.5% lower tail of each distribution of energy scores. Position-weight matrices (PWMs) were finally derived from the low energy score sets, and sequence logos were generated with the software WebLogo [31]. PWMs were converted into linear vectors, the all-against-all Euclidean distances calculated for all possible pairs of them and recorded as a distance table, which was finally used to perform a hierarchical clustering with Ward's method.

Genomic Context of Predicted DNA Binding Sites.
Genomes and annotations were downloaded from the EMBL suite of databases and web servers (http://www.ebi.ac.uk/ genomes/). High-scoring binding sequences were mapped with the software bowtie version 0.12.7 [33], allowing up to three mismatches. SeqMonk version 0.16.0 (http:// www.bioinformatics.bbsrc.ac.uk/projects/seqmonk/) and the EMBL genome browser, Genome Reviews [34], were used to determine the neighboring genes of the putative binding sites. The six genes closest to the binding site, including those overlapping with the binding sites, were selected for further analysis. To establish functional relations for each gene, we employed the databases UniProt (http://www.uniprot.org/) [15], InterPro (http://www.ebi.ac.uk/interpro/) [35], and KEGG (http://www.genome.jp/kegg/) [36].
Structural similarity, as quantified by the number of equivalent residue pairs (length of the structural alignment, cf. Materials and methods), of ten of the thirteen structures is high (Table 2 and Figure 1). The average number of  equivalent C α positions of ten structures without Ada is 89.5 (of a maximum of 100.1), the average root mean squared (RMS) distance is 1.9Å, and the mean sequence identity of equivalent residue pairs is 23%, which indicates high structural similarity and moderate levels of sequence identity. We removed the three structures of Ada from this analysis, since this protein consists of only a single HTH domain joined to an unrelated DNA repair domain and thus has a structural coverage of roughly 50% with the other structures. We scanned the PDB with any of the ten non-Ada structures as query structure and retrieved the remaining nine structures ranked as top hits in all cases. Two of the uncharacterized proteins (PDB codes: 3oou, 3oio) are currently classified as members of the AraC family in the Pfam database [18] and our structural analysis confirms membership to this family. Structure 3mn2 with unknown function was titled as "probable AraC family" by the authors of the crystal structure, and through our structural analysis, we confirm this membership, as well. It should be noted that the gene of the 3oio protein was named argR. The gene product of argR is the transcriptional regulator arginine repressor (ArgR), which controls the expression of operons involved in arginine biosynthesis. However, structural alignment of 3oio with the DNA binding domain of ArgR (e.g., PDB code: 3fhz) reveals only moderate similarity (equivalent residue pairs: 40). ArgR binds DNA as a hexamer and its DBD consists of a single three-helix domain with low sequence identity (15%) to 3oio. Based on this structural analysis, we conclude that 3oio Journal of Biomedicine and Biotechnology 5  is more likely a member of the AraC/XylS family rather than of the ArgR family of transcriptional regulators.

Functional Annotation.
Despite their similar structures, the transcriptional regulators of the AraC/XylS family are known to act on various genes involved in distinct biological processes. To attempt a possible functional classification of the new putative family members on sequence level, we followed an approach similar to that of Ibarra et al. [3]. We compiled a list of 62 well-characterized AraC/XylS family members with known biological roles (Supplementary Table  S1) and generated a maximum likelihood dendrogram (  Stress response   10  20  30  40  50  60  70  80  90  100   V|P16114|RNS_ECOLX  V|P25393|CFAD_ECOLX  V|P43460|CSVR_ECOLX  V|P43464|AGGR_ECOLX  V|P0A2T1|VIRF_SHIFL  V|P23774|FAPR_ECOLX  V|P43459|PERA_ECO27  S+V|P63201|GADW_ECOLI  S+V|P05052|APPY_ECOLI  V|Q8Z4A6|SIRC_SALTI Figure 2: Multiple-sequence alignment of the DNA binding domain of 62 AraC/XylS family members with experimental evidence of their biological role. Four proteins found by the structural similarity search and whose function is unknown or poorly characterized were included in this multiple sequence alignment. Each sequence label contains the general functional category, the UniProt accession code, and the protein name. Functional categories are BPI: bacteria-plant interaction; M: metabolism; S: stress response; V: virulence and Unk: unknown. One or two functional categories were assigned. The sequences are sorted according to the tree order (see Figure 3). Secondary structure elements are given at the top (according to MarA (PDB code: 1bl0)). The alignment was plotted with JalView version 2.6.1 [24]. Color legend: light blue: hydrophobic; green: polar and aliphatic; turquois: polar and aromatic; red: basic; purple: acidic; orange: glycine; yellow: proline.
sequence of the poorly characterized protein YesN from F. nucleatum (PDB code: 3lsg) were added to the MSA and dendrogram in order to permit annotation of their biological roles based on their neighborhood in the tree. The putative AraC/XylS family member Lin2118 protein from Listeria innocua (PDB code: 3oou) is found in a branch together with YesN and YesS from Bacillus subtilis. In the next neighboring branch, we find the sequence of YesN from Fusobacterium nucleatum subsp. Nucleatum (PDB code: 3lsg). YesS is a probable transcription factor regulating a pathway responsible for rhamnogalacturonan depolymerization, which is a carbohydrate product of plant cell wall degradation [41]. YesN is a member of the twocomponent regulatory system yesM/yesN. Its biological role is not well understood. Genes upregulated by YesN, which include yesS, are thought to be involved in rhamnogalacturonan degradation in Bacillus subtilis, and an ortholog of YesN from Paenibacillus sp. is thought to be involved in hemicellulose degradation [42][43][44]. A similar role of the Lin2118 protein (PDB code: 3oou) in plant cell wall degradation seems, therefore, likely. The three proteins, YesN from B. subtilis, YesN from F. nucleatum (PDB code: 3lsg), and Lin2188 from L. innocua (PDB code: 3oou), share a common additional response regulator receiver domain of the CheY-like superfamily at their N-terminus. This domain contains a phosphoacceptor site that is phosphorylated by histidine kinase homologs, for instance, YesM [43]. The common regulatory domain further supports a possible involvement in the biological process of plant cell wall degradation.
The sequence of structure 3oio from Chromobacterium violaceum is found in a common branch together with Y4fK, encoded on the pNGR234a plasmid of Rhizobium sp. (strain NGR234), and RhrA, encoded on the plasmid pSymA of Rhizobium meliloti. Y4fk activates the transcription of nod genes that play a role in the formation of plant root   Figure 3: Dendrogram of AraC/XylS family members annotated with functional categories. A dendrogram of 62 AraC/XylS family members and four proteins found by structural similarity search and whose biological role is unknown or poorly characterized is presented (marked with an asterisk). The tree was constructed using the maximum likelihood optimization criteria and a bootstrap test was conducted with 100 replicates. Bootstrap values greater than 50 are shown as number in the corresponding nodes of the tree. Leaves contain the UniProt or PDB accession code and the protein name. Two classification levels are included in this dendrogram. The inner color strip represents a primary classification (functional category) that contains four general classes as in Figure 2. The outer circle contains several colored shapes that represent a secondary and more specific classification scheme, which provides more detail to the biological process associated with each protein. One or two functional categories were assigned. Journal of Biomedicine and Biotechnology nodules, where symbiotic nitrogen fixation takes place [45]. RhrA activates the expression of a siderophore necessary to maintain low levels of iron, required for efficient nitrogen fixation. Both plasmids are essential for the symbiosis between plant and bacteria in the rhizosphere [46,47]. Based on these similarities, and the fact that the habitat of Chromobacterium violaceum is primarly soil and water, a tentative role in symbiotic nitrogen fixation seems possible. The last putative AraC/XylS member (PDB code: 3mn2) from Rhodopseudomonas palustris is located in between the previously described nitrogen fixation branch of the dendrogram and a branch that contains the regulators ThcR, EutR, and HrpB. ThcR from Rhodococcus erythropolis is responsible for degradation of a thiocarbamate herbicide [48], and EutR from Escherichia coli is involved in metabolism of ethanolamine derived from phosphatidylethanolamine contained in biological membranes. EutR was also linked to virulence in both animals and plants [49]. Finally, HrpB, which is from the phytopathogen Ralstonia solanacearum, activates the expression of a type III secretion system required for plant host cell infection [50,51]. It is known that symbiotic rhizobia and phytopathogenic bacteria share a common mechanism for plant host recognition [52], which could explain the observed proximity of the two branches associated to symbiotic and pathogenic behavior.
A pathogenic role of R. palustris has not been reported. R. palustris is a free living bacterium and is capable of acquiring carbon from many types of green plant-derived compounds. In addition, the bacterium is able to assimilate atmospheric nitrogen, albeit not in symbiosis with plants, as in the neighboring nitrogen fixation branch of the dendrogram [53]. R. palustris is one of the most metabolically versatile bacteria known [53] and one of a few prokaryotes described so far that possess three types of nitrogenases [54]. Prediction of a possible biological role of the putative AraC/XylS transcription factor with PDB code 3mn2 is not straightforward, due to the heterogeneity of the available data. A tentative role in either metabolism of plant-derived biomass or nitrogen fixation seems possible.

Structural Comparison.
Taking advantage of the protein structures available, we extended the sequence-based functional annotation with structural data. All thirteen identified AraC/XylS structures were aligned with the software SALIGN [16] and an MSA was derived from the resulting structural alignment (Figure 4(a)). The structures aligned nicely with a typical fold: two HTH domains of each three α-helices are connected by a longer, central α-helix. The two HTH domains themselves are superimposable. One α-helix of each HTH domain (α-helix three and six) inserts into the major groove of DNA establishing base-specific contacts (Figure 1). The orientation of the C-terminal HTH domain (HTH2) is conserved while the N-terminal domain (HTH1) is more variable. Domain rotation of up to approximately 30 degrees with respect to MarA (PDB code: 1bl0) is observable for HTH1. The position of α-helix two is shifted by up to 3.5Å in some structures and the position of the connecting loops to helices one and three is variable. On sequence level, the average pairwise sequence identity of HTH1 is 17%, while it is 27% for HTH2. These observations are in agreement with the current hypothesis that DNA binding affinity and specific recognition by different regulators are governed by HTH1, while HTH2 might additionally enhance affinity and binding site discrimination through a more conserved mechanism common to regulators of the AraC/XylS family [2,38,55]. The reported variability of HTH1 could further indicate a possible induced fit mechanism upon DNA binding [39].
Conserved residues in the multiple sequence alignment in Figure 4 are mostly related to hydrophobic amino acids stabilizing the hydrophobic core of the two HTH domains or small amino acids at the beginning and end of α-helices. A prominent exception are the three conserved residues Leu-28, Leu-30, and Val-33 (1bl0 numbering, Figure 4(a)). Due to the structural flexibility of α-helix two, these residues are shifted by one or two positions in the MSA in Figure 4 for MarA (PDB code: 1bl0) and Rob (PDB code: 1d5y), in contrast to all other structures. In the case of MarA and Rob, these residues are pointing toward the solvent, whereby they point inside, toward the hydrophobic core in all other structures. It is conceivable that this conformational change is part of the already mentioned induced fit mechanism. MarA and Rob structures were crystallized in complex with DNA, in contrast to the other bipartite HTH structures. However, in structures of Ada (PDB codes: 1u8b, 1zgw), which were solved in complex with DNA, these residues point inward as well. Another possible explanation is that this conformational change relates to the biological role of the respective transcriptional regulator. MarA and Rob are closely related regulators involved in stress response. However, movement of α-helix two seems to be unrelated to both conditions (biological role; complexed with DNA). The observed flexibility could thus also be an artifact of the structure determination experiment.
We further calculated the electrostatic potential surfaces of the AraC/XylS protein structures with the software DelPhi version 4 (Supplementary Figure S1). We observed a general dipolar character of the protein structures, with a positive potential on the side facing the DNA double helix and a negative potential on the opposite side. However, the detailed distribution of potential charges was generally diverse, and any relation to the biological role of the transcription factors was not obvious.
To perform a more systematic structure-based analysis of the relationship of bipartite HTH AraC/XylS family members, we generated a dendrogram from the structurederived MSA (Figure 4(b)). We compared this structurebased dendrogram with the sequence-based dendrogram in Figure 3. Figure 4(c) shows an extract of the latter dendrogram containing the same AraC/XylS members as in Figure 4(b), for easier comparison. The two trees are highly similar with only a minor topology change involving branch GadX/TcpN. Consistency of the structure and sequencebased MSAs and dendrograms provides further support for the provided functional annotation.

DNA Binding Site Prediction.
All analyses up to this point focused on the transcription factor proteins (sequence and structure). We will now turn to their cognate DNA binding sequences. Since the DNA binding sites for most identified AraC/XylS members are unknown, we devised a computational method to predict these sequences. The modeling of transcription factor-DNA complexes puts a strong focus on the DNA binding interface of the proteins and thus highlights key residues of helices three and six involved in specific base recognition.
In this structure-based approach, we generated initial comparative models based on MarA (PDB code: 1bl0) as a template structure and replaced the DNA sequence of the crystallized DNA molecule, while retaining the helical parameters of the original DNA structure. A total of 10,000 models with randomized full duplex DNA sequences were generated for all structures except Ada, which lacks the second HTH domain. A statistical potential was employed to score these protein-DNA models and position weight matrices (PWMs), along with sequence logos, were generated from top-scoring complexes (Figure 5(a)). A dendrogram was produced from the PWMs based on their Euclidean distance ( Figure 5(b)).
We included the PWM generated from 24 known DNA binding sequences of the marA/rob/soxS regulon as a positive control [32]. The results show that our structure-based modeling/scoring approach is capable of reproducing the experimentally determined DNA binding sequence of MarA and Rob reasonably well. As expected, the PWMs of MarA and Rob form a distinct cluster together with the known MarA/Rob/SoxS binding sequences in the dendrogram ( Figure 5(b)). Another cluster is formed by the two putative AraC/XylS family members with PDB codes 3mn2 and 3oio. This result is in agreement with the relative proximity of the two proteins observed in our DBD sequence-based dendrogram (Figure 3). A third cluster contains the remaining structures of YesN (PDB code: 3lsg), Lin2118 (PDB code: 3oou), GadX (PDB code: 3mkl), AraC (PDB code: 2k9s), and TcpN (PDB code: 3gbg). Despite distinct biological roles, these proteins were organized in the same major branch of the sequence-based dendrogram, as well.
A functional classification based only on the binding site prediction was not feasible, mainly due to the small amount of available structures (ten in total) and the small amount structures complexed with DNA (three in total).   4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 TcpN (3gbg)   4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 TcpN (3gbg)  In addition, a more stringent calibration of our structurebased method on available binding site data is required in the future. However, the similar grouping of the transcriptional regulators in the sequence-based, structure-based, and DNA binding site-based dendrograms is promising and increases the confidence of the functional annotation of the three putative AraC/XylS members. It should also be noted that the comparison of DNA binding specificity treats function on a molecular level, whereas the analysis of the sequence-based dendrogram treats function on a level of biological processes. The two functional levels are connected in the next section by matching the predicted binding sites against the genomes of the source organism and analyzing the genetic context of the matched binding sites.
An interesting observation is the preference of thymine in position five of the predicted binding sequences of the Lin2118 protein (PDB code: 3oou). This position corresponds to a cytosine in the crystal structures of MarA (C-32) and Rob (C-7). In these two structures, we observe a specific hydrophobic interaction of a tryptophan residue (Trp-42 in MarA and Trp-36 in Rob) with the base of this cytosine nucleotide in the major groove. This tryptophan residue is conserved in family members related to stress response and in fact seems to be a unique characteristic of this group (Figure 2). This is remarkable as tryptophan has a low statistical propensity to interact with DNA, but its propensity to interact with either cytosine or thymine is similar (Refs: [56] and A. Schüller, unpublished data). In the Lin2118 protein (PDB code: 3oou) this tryptophan residue is substituted by valine (Val-36). Valine has an over threefold increased propensity to interact with thymine in comparison with any of the other bases, and this preference is reflected in the corresponding predicted binding sequences of the Lin2118 protein.

Genomic Context of Predicted DNA Binding Sites.
We further analyzed the predicted DNA binding sites by mapping the predicted nucleotide sequences against the genomes of the respective source organism. We included the three putative AraC/XylS members (PDB codes: 3oou, 3mn2, and 3oio) and the poorly characterized transcriptional regulator YesN (PDB code: 3lsg) in this analysis. A total of 195 topscoring sequences were retrieved for the four structures and, of these, only 6 sequences could be mapped to their respective genomes, allowing up to three mismatches. A detailed analysis of the genomic context of the binding sites (three genes upstream and three genes downstream) is provided in the Supplementary Material ( Figure S2 and Table  S2).
The transcriptional regulator YesN (PDB code: 3lsg; F. nucleatum) is thought to be involved in plant cell wall degradation [43,44]. Mapping 39 high-scoring, predicted DNA binding sites of YesN against the genome of F. nucleatum retrieved a single match. 1,438 base pairs (bp) upstream of the binding site we found the xylose repressor gene xylR. XylR is a transcriptional regulator of the xylose operon that contains genes required for degradation of xylose, the most abundant sugar monomer of hemicelluloses [57]. The vicinity of a gene related to hemicellulose depolymerization is consistent with the proposed biological process of plant cell wall degradation. 64 high-scoring binding sites were generated for the Lin2118 protein from L. innocua (PDB code: 3oou). Based on its vicinity to YesN and YesS in the dendrogram of AraC/XylS family members, we proposed an involvement in plant cell wall degradation for this protein. Of the 64 binding sites, only two matched against the genome of L. innocua. 1,404 and 1,026 bp downstream of the first matched site we found the two genes crcB1 and crcB2, respectively (Supplementary Figure S2). It has been shown in E. coli that overexpression of these crcB homologs (along with crcA and cspE) protected cells from the DNA decondensing agent camphor [58]. Camphor is a terpenoid found in Lauraceae and Lamiaceae families of angiosperms. Although unrelated to plant carbohydrate metabolism, the two genes are involved in the response to an antimicrobial substance produced by certain plants.
The second binding site is located in a cluster of three genes with similarity to proteins of the phosphoenolpyruvate-dependent sugar phosphotransferase system (PTS), which is involved in the uptake and phosphorylation of specific carbohydrates from the extracellular environment. The predicted binding site matched inside the lin2833 gene that encodes a protein similar to domain IIA of enzyme II. Enzyme II is a carbohydrate-specific permease responsible for sugar uptake and a component of PTS [59]. The other two genes have similarity with domains IIB (lin2831 gene, 1,610 bp upstream) and IIC (lin2832 gene, 166 bp upstream) of enzyme II (Supplementary Figure S2). The identified enzyme II domains are of the cellobiose-specific subfamily. Cellobiose is a major component of cellulose found in plant cell walls. These data agree well with a possible role in plant cell wall degradation.
Based on our AraC/XylS-family sequence analysis we predicted a role in symbiotic nitrogen fixation for the uncharacterized protein with PDB code 3oio from C. violaceum. 72 sequences were predicted as high-scoring DNA binding sites, of which only two sequences mapped against the genome of C. violaceum. In the vicinity of the first binding site, we found the ibeB gene (726 bp upstream), which by homology encodes an outer membrane efflux protein (OEP) (Supplementary Figure S2). IbeB is a member of the NodT subfamily of the resistance-nodulation-cell division (RND) type efflux systems involved in Nod factor secretion, which are important for nodulation in species of Rhizobium [60]. Root nodules are specialized plant organs where symbiotic nitrogen fixation takes place. This finding is well in agreement with a proposed biological role in nitrogen fixation.
In the genetic context of the second binding site, we identified the gene locus CV 3010 (1261 bp upstream) (Supplementary Figure S2). By sequence similarity, this gene contains a molybdopterin cofactor binding domain found in a variety of oxidoreductases. Main members of this family are nitrate reductase and sulfite oxidase. The presence of an assimilatory nitrate reductase domain is consistent with a proposed biological role in nitrogen fixation [61].

Journal of Biomedicine and Biotechnology
Twenty high-scoring binding site sequences were predicted for the uncharacterized protein with PDB code 3mn2 from R. palustris, which produced a single match in the genome. Based on our analysis of DBD sequences, we proposed a tentative biological role in either metabolism of plant-derived biomass or nitrogen fixation. The locus Rpal 1214 is found 1,510 bp upstream of the binding site and encodes a DSBA oxidoreductase (Supplementary Figure  S2) that is involved in cellular respiration [62]. Nitrogen fixation is an energetically expensive process. It has been shown, in another strain of R. palustris, that genes involved in the electron transfer of cellular respiration increase their expression under conditions of high nitrogen fixation [63]. Electron transfer has been proposed as rate-limiting for nitrogenase activity [64]. Moreover, between 95 and 295 genes of R. palustris that are not directly associated with nitrogenase synthesis and assembly were induced under nitrogen-fixing conditions [54]. We found a second gene 2,688 bp upstream of the predicted DNA binding site (locus Rpal 1216, Supplementary Figure S2), which encodes a flavoenzyme ferric reductase. This protein is involved in the electron transfer system and might be related to obtaining energy for nitrogen assimilation. However, proteins associated with cellular respiration may be induced by a multitude of pathways and a false positive binding site prediction cannot be ruled out.
In summary, by matching high-scoring predicted binding sites of uncharacterized transcriptional regulators against their host genomes, in five of six cases we were able to identify related genes in the genetic context of the binding sites. It should be noted that the majority of the analyzed genes encode hypothetical proteins identified by sequence similarity/homology lacking experimental validation. However, the fact that database annotations of these genes are consistent with the biological roles that we proposed based on the analysis of AraC/XylS-family DBD sequences is promising. Yet, the presented analysis is, by no means, complete. Of the 4 22 (≈2 × 10 13 ) possible binding site sequences we generated a subset of 10,000 sequences, which is a mere 6 × 10 −8 percent of the possible sequence space. This number was slightly increased by allowing up to three mismatches; however, we expect a large number of false negatives, that is, undetected binding sites. Increasing the sampling rate and switching to an optimized sampling algorithm will improve this limitation in the future.

Conclusions
We presented an integrated bioinformatics method that combined sequence-based relationships, structural similarity, and prediction of DNA binding sites of transcriptional regulators. The advantage of the proposed method is the utilization of information derived from different structural levels (primary and tertiary) and different biological entities (proteins and DNA binding sites). The prediction of DNA binding sequences is a promising approach to complement available information, where experimental data are scarce. Mapping of the predicted binding sites to the genomes of the source organisms and analysis of the genetic context was in good agreement with sequence-derived functional annotations. Although the presented method needs further refinement and validation in the future, the consistency of the presented results is promising. We applied the method to the AraC/XylS family of transcriptional regulators and predicted the biological roles of the three putative new family members with PDB codes 3mn2, 3oio, and 3oou, which originated from structural genomics initiatives. Our bioinformatics analyses suggest that they could be involved in plant cell wall degradation (PDB code 3oou), symbiotic nitrogen fixation (PDB code 3oio), and either metabolism of plant-derived biomass or nitrogen fixation (PDB code 3mn2).
The utilization of combined information derived from structure-based and sequence-based analysis of transcription factor proteins and binding sites is proposed as a convenient way to assign a biological role to structures of unknown function and may guide biologists in conducting proper experimental characterization.