Distribution, structure and diversity of “bacterial” genes encoding two-component proteins in the Euryarchaeota

Summary The publicly available annotated archaeal genome sequences (23 complete and three partial annotations, October 2005) were searched for the presence of potential two-component open reading frames (ORFs) using gene category lists and BLASTP. A total of 489 potential two-component genes were identified. Two-component genes were found in 14 of the 21 Euryarchaeal sequences (October 2005) and in neither the Crenarchaeota nor the Nanoarchaeota. A total of 20 predicted protein domains were identified in the putative two-component ORFs; in addition to the histidine kinase and receiver domains, these include sensor and signalling domains. The detailed structure of these putative proteins is shown, as is the distribution of each class of two-component genes in each species. Potential members of orthologous groups have been identified, as have any potential operons containing two or more two-component genes. The number of two-component genes in those Euryarchaeal species which have them seems to be linked more to lifestyle and habitat than to genome complexity, with most examples being found in Methanospirillum hungatei, Haloarcula marismortui, Methanococcoides burtonii and the mesophilic Methanosarcinales group. The large numbers of two-component genes in these species may reflect a greater requirement for internal regulation. Phylogenetic analysis of orthologous groups of five different protein classes, three probably involved in regulating taxis, suggests that most of these ORFs have been inherited vertically from an ancestral Euryarchaeal species and points to a limited number of key horizontal gene transfer events.


Introduction
Two-component systems are one of the key means by which bacteria respond to environmental changes (Hoch 2000, Stock et al. 2000, Alves and Savageau 2003, Hellingwerf 2005). They are assumed to be of bacterial origin, having radiated into archaea and some eukaryotes by horizontal gene transfer (HGT) (Koretke et al. 2000). Two-component systems consist of a sensor and a response protein. The sensor protein is characterized by a histidine kinase (HK) made up of two main domains, a phosphoacceptor (HisKA) and a histidine kinase ATPase (HATPase); in many cases, other sensory domains are also present (Galperin et al. 2001, Zhulin et al. 2003). The response protein (response-regulator, RR) is characterized by a response regulator domain that has a conserved aspartate residue. The histidine kinase autophosphorylates a conserved histidine residue in response to a signal and the phosphate group is then transferred to the conserved aspartate residue of the response-regulator. This transfer elicits a response, causing a change in taxis, development or gene expression. Histidine kinases and response-regulators are sometimes found together in a single polypeptide known as a hybrid kinase (Hoch 2000, Stock et al. 2000).
The recognition of the Archaea as a distinct division of life has been strengthened by the availability of a number of complete genome sequences, representing three phyla (Euryarchaeota, Crenarchaeota and Nanoarchaeota). This has, in turn, enabled a more rigorous phylogenetic analysis based on the fusion of ribosomal protein sequences (Matte-Tailliez et al. 2002, Brochier et al. 2004, Bapteste et al. 2005, Makarova and Koonin 2005) and clusters of conserved orthologous genes (COGs) (Makarova and Koonin 2003). Analysis of genome sequences has revealed genes of bacterial origin in the genomes of archaea and vice versa (Nelson et al. 1999). The importance of HGT in the evolution of prokaryotes and the implications for phylogeny and definition of species are still being discussed (Ochman et al. 2000, Boucher et al. 2003, Koonin 2003, Kurland et al. 2003, Lawrence and Hendrickson 2003).
For bacteria, it has been shown that the number of two-component genes possessed by an organism is related to the complexity of its genome, its physiology and the changeability of its habitat (Ashby 2004, Galperin 2005). The greater the value of any of those parameters, the greater the need for regulation of cellular activities.
The aim of this study was to analyze the complement of genes in archaeal genomes that could encode two-component proteins. The putative two-component proteins were classified by their domain structure, and the number of each class was determined for each species. Potential orthologous groups and those that may be part of operons, with two or more two-component genes, are indicated. Phylogenetic analysis of possible orthologous groups representing five classes of protein, three associated with taxis, is shown.

Genome sequence data
The list of publicly available Euryarchaeal genome sequences is shown in Table 1, along with brief details of the habitat, physiology, genome size, putative number of open reading frames (ORFs) and the abbreviation used with gene sequences (Makarova and Koonin 2003). The sequences and annotations (up to October 2005) were accessed at the Integrated Microbial Genomes (IMG) server (http://img.jgi.doe.gov/pub/main.cgi) and HaloLex (http://www.halolex.mpg.de/). The identity of potential two-component genes was determined by reference to the published assignments located at http://www.tigr.org/tdb/ (Bult et al. 1996, Smith et al. 1997, Kawarabayasi et al. 1998, Klenk et al. 1997, Ng et al. 2000, Deppenmeier et al. 2002, Slesarev et al. 2002, Galagan et al. 2002, Cohen et al. 2003, Baliga et al. 2004, Falb et al. 2005). This was supplemented by BLASTP (Altschul et al. 1997).
ASHBY ARCHAEA VOLUME 2, 2006
Orthologous groups were determined from the orthology information accessible at IMG (http://img.jgi.doe.gov/pub/main.cgi), which is based on the bidirectional best hits from BLASTP searches of each organism's polypeptides against those of each other organism. This definition is not completely accurate, but it provides a useful approximation, as it is not always possible to know whether the polypeptides arose from a single gene present in the last common ancestor (orthologues) or from a gene duplication within a genome (paralogues).
Alignments for phylogenetic analysis were performed with T-Coffee (Notredame et al. 2000), accessed at the Centre National de la Recherche Scientifique website (http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi), and with ClustalW (Thompson et al. 1994) at the European Bioinformatics Institute (http://www.ebi.ac.uk/clustalw/). Representatives from three bacterial phyla were included in the alignments (chosen by having the best match to one of the archaeal ORFs, either as an orthologue or by BLASTP at IMG).
Phylogenetic analysis by neighbor-joining (Bootstrap 250) was performed using MEGA version 3.0 (Kumar et al. 2004) and by maximum-likelihood (Felsenstein 1996) using Molphy, accessed at the Institut Pasteur biological software website (http://bioweb.pasteur.fr/intro-uk.html#phylo).
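The bidirectional-best-hit criterion used for the orthology assignments can be illustrated with a minimal Python sketch. It is not the IMG implementation: the input format (tuples of query, subject and bit score, as might be parsed from tabular BLASTP output), the function names and the gene identifiers are all assumptions made for illustration.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples, e.g. parsed
    from BLASTP output of one genome searched against another (assumed format).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
a_vs_b = [("A1", "B1", 200.0), ("A1", "B2", 150.0),
          ("A2", "B2", 90.0), ("A3", "B1", 80.0)]
b_vs_a = [("B1", "A1", 198.0), ("B2", "A1", 140.0), ("B2", "A2", 95.0)]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy input A2 has B2 as its best hit, but B2 matches A1 better, so the pair is not reported. This is the kind of case in which paralogues produced by gene duplication can blur an orthology assignment.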

Results and discussion
The structural classification of potential two-component proteins is shown in Table 2. No two-component encoding gene could be found in the Crenarchaeota or Nanoarchaeota (data not shown). Sensor domains are drawn as ellipses, and two-component (HisKA, HATPase_c and response regulator) and output domains are drawn as rectangles. Parentheses followed by figures indicate the number of similar domains that may be found in the proteins listed in each subclass.

Histidine kinases
Different types of histidine kinases (HK) are listed in Table 2A. Histidine kinases contain two domains: a dimerization and phosphoacceptor domain (HisKA or HisKA_2) and a HATPase_c domain (Grebe and Stock 1999). HisKA and HisKA_2 are part of a His kinase A phosphoacceptor domain superfamily that also includes HWE_HK and HisKA_3 (Karniol and Vierstra 2004; Pfam accession CL0025).
HKI Histidine kinase Is are HKs containing HisKA and HATPase domains. There may be other domains in some of these examples which are not currently recognized. The HKI ORFs vary greatly in size, ranging from 175 to 592 amino acids in length.
HKII Histidine kinase IIs are HKs containing sensor GAF and PAS/PAC domains. The GAF domains (cGMP phosphodiesterase, adenylyl cyclases, bacterial transcription factors FhlA) are associated with small molecule binding, in particular cAMP and cGMP (Ho et al. 2000). The GAF domain is usually found in combination with a PAS (Drosophila period clock protein, vertebrate aryl hydrocarbon receptor nuclear translocator and Drosophila single-minded protein) or PAC (PAS-associated C-terminal motif) domain, or both. One class of PAS domains is known to bind cofactors such as heme and FAD (Bibikov et al. 2000, Sardiwal et al. 2005). Sensing of light, oxygen or redox potential by PAS domains requires cofactors, whereas sensing signals such as voltage, xenobiotics and nitrogen availability does not (Aravind 1997, Gilles-Gonzalez and Gonzalez 2004). The PAC domains are proposed to contribute to the PAS domain fold. The shared feature of GAF and PAS/PAC domains is the binding of a diverse set of regulatory small molecules that often remain unidentified; all three domains are common signal transduction system components (Aravind 2001, Zhulin et al. 2003). There is one example containing Cache and one containing SBP_bac_3 (bacterial extracellular solute-binding proteins, family 3). The Cache domain is a signalling domain found in animal calcium channel subunits and it is thought to form an extracellular or periplasmic ligand sensor. SBP_bac_3 is involved in active transport of solutes across the cytoplasmic membrane and in the initiation of signal transduction pathways (Tam and Saier 1994). This is by far the largest subgroup of ORFs, containing 161 of the total of 489 (33%).
See Table 1 for gene name prefixes.
HKIII Histidine kinase IIIs are HKs that possess a HAMP "linker" (histidine kinase, adenylyl cyclase, methyl-accepting chemotaxis protein and phosphatase) domain. The HAMP domain is usually associated with the transmission of a signal across a membrane from periplasmic ligand-binding domains (Aravind and Ponting 1999, Appleman and Stewart 2003, Zhu and Inouye 2004). Eight examples of HKIIIs have an N-terminal putative periplasmic signalling CHASE4 domain and four have an N-terminal periplasmic signalling Cache domain (Aravind 2001, Zhulin et al. 2003). These domains are positioned next to the HAMP domain, presumably for efficient transfer of the signal.
HKVI Histidine kinase IVs are the CheA-like chemotaxis signalling proteins that contain an N-terminal Hpt (histidine phosphotransfer) and Hkd (histidine kinase dimerization) domain and a C-terminal CheW domain. Some contain one or two P2 domains between the Hpt and Hkd. The Hpt domain is involved in mediating phosphotransfer from one receiver domain to another (Hoch 2000). Hkd (H-kinase-dim) is the dimerization domain of CheA and CheW that interacts with methyl-accepting chemotaxis proteins (MCPs), relaying signals to CheY, and thereby affecting flagellar rotation (West et al. 1995). The P2 domain is involved in enhancing the interaction of CheY with the HK (Jahreis et al. 2004, Stewart and van Bruggen 2004). Thermococcus kodakaraensis has two open reading frames, separated by a frame shift mutation, that probably encode a CheA-like protein. All of the HKVI genes discussed are located close to other genes that could be involved with signal transduction and are probably transcribed as single operons (see Table A2).
HATPase_c These contain no dimerization or phosphoacceptor domains currently recognized at INTERPRO.
His_KA There are five groupings that contain His_KA without a discernable HATPase_c domain.
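The class definitions above amount to a set of rules over the domains detected in each ORF, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the domain labels, the order in which the rules are tested and the function name are hypothetical, and the sketch does not reproduce how the assignments in Table 2A were actually made.

```python
# Illustrative domain labels; the exact strings are assumptions, not taken from Table 2A.
PHOSPHOACCEPTOR = {"HisKA", "HisKA_2"}
SENSORS = {"GAF", "PAS", "PAC", "Cache", "SBP_bac_3"}
CHEA_LIKE = {"Hpt", "H-kinase_dim", "CheW"}


def classify_histidine_kinase(domains):
    """Assign a putative HK class from the set of domains found in one ORF.

    The rule order (CheA-like first, then incomplete cores, then HAMP,
    then sensor domains) is an assumption made for this sketch.
    """
    found = set(domains)
    has_phosphoacceptor = bool(found & PHOSPHOACCEPTOR)
    has_atpase = "HATPase_c" in found

    if CHEA_LIKE <= found:
        return "HKVI"          # CheA-like: Hpt, dimerization and CheW domains
    if has_atpase and not has_phosphoacceptor:
        return "HATPase_c"     # ATPase domain only
    if has_phosphoacceptor and not has_atpase:
        return "His_KA"        # phosphoacceptor domain only
    if not has_atpase:
        return "unclassified"  # no histidine kinase core detected
    if "HAMP" in found:
        return "HKIII"         # HK core plus HAMP linker
    if found & SENSORS:
        return "HKII"          # HK core plus GAF, PAS/PAC or other sensor domains
    return "HKI"               # HK core only


print(classify_histidine_kinase(["PAS", "GAF", "HisKA", "HATPase_c"]))
print(classify_histidine_kinase(["Cache", "HAMP", "HisKA", "HATPase_c"]))
```

Testing for HAMP before the sensor domains places a Cache-plus-HAMP kinase in HKIII, consistent with the description of that class. Other precedence choices are possible and would change how mixed architectures are counted.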

Response regulators
Response regulators (RR) are listed in Table 2B. These contain a characteristic receiver (RR/T_reg) domain, which is about 120 amino acids long and contains a conserved aspartate residue about halfway along the molecule that accepts a phosphate group from an HK.

Distribution of putative two-component ORFs
The total number of ORFs within each class of two-component proteins, for each species of Euryarchaeota, is shown in Table 3. Of the complete annotations, Haloarcula marismortui has the largest number of two-component encoding genes, representing 1.93% of the total protein coding capacity of the genome. Methanospirillum hungatei appears to have the largest number of two-component genes at 87, though the annotation of the genome is incomplete (so no percentage is given in Table 3). The HKs form half to two-thirds of the two-component ORFs for each species (except Pyrococcus sp. and Methanospirillum hungatei). The DNA-binding domains (putative) were only detected as part of the RRs in H. marismortui.
The PAS/PAC and GAF sensory domains are found in 293 of the 489 putative proteins surveyed. These sensory domains are absent in the Pyrococcus sp. A total of 18 ORFs were found that contain the HAMP domain that would in most cases be involved in transferring signals from sensor domains detecting information outside the cell.
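The proportions implied by the counts quoted in the text can be checked with a short calculation. The sketch below uses only figures stated above (161 HKII ORFs, 293 ORFs with PAS/PAC or GAF domains and 18 ORFs with HAMP domains, out of 489); the helper name is hypothetical.

```python
TOTAL_TWO_COMPONENT_ORFS = 489  # total putative two-component ORFs, as stated in the text


def percent_of_total(count, total=TOTAL_TWO_COMPONENT_ORFS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


hkii_share = percent_of_total(161)    # HKII subgroup
sensor_share = percent_of_total(293)  # ORFs with PAS/PAC or GAF domains
hamp_share = percent_of_total(18)     # ORFs with a HAMP domain

print(hkii_share, sensor_share, hamp_share)
```

On these counts, about a third of the ORFs fall in HKII and roughly three in five carry a PAS/PAC or GAF domain, whereas HAMP-containing ORFs make up only a few percent.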

Orthologous groups
Potential orthologous groups are shown in Table 4. These results are based on the bidirectional best hits from BLASTPs at IMG. The identification of orthologous groups at IMG may not be correct in all cases, as some groupings may include ORFs that are due to gene duplication and are hence paralogues (in a different organism) rather than orthologues. It is, nevertheless, a useful tool for assigning putative orthologous groups when no functional information is available. The groups have been named with a three-letter acronym for ease of reference (see Table 4). In arr18/19, RRIV-CheB, the grouping was modified from the information at IMG based on the phylogenetic analysis presented in Figure 4 (see below). There are many orthologous groups that contain two or three members, partic-  ments made by ClustalW (data not shown), but the results were not found to differ significantly. Figure 1 contains the ORFs from the two HKII groups, ahk5 and ahk22, with the three closest bacterial ORFs (to MaceMA2890) from Cyanobacteria, Firmicutes and Proteobacteria. Figure 2 is composed of ORFs from the four HKVI "CheA-like" groups, ahk40-43, and the three closest bacterial ORFs (to MaceMA0014) from Thermotogae, Firmicutes and Proteobacteria. Figure 3 is the analysis of a number of RRI orphans, arr7 and arr10 (from putative taxis operons) and arr12 and arr14 (> 200 amino acids), with bacterial ORFs from Thermotogae, Firmicutes and Proteobacteria (closest to MaceMA3068). Figure 4 shows the results for the two RRIV-CheB orthologous groups from taxis operons, arr18 and arr19, with bacterial representatives from Thermotogae, Proteobacteria and Actinobacteria (closest to MaceMA0015). Figure 5 shows results for ahy2 and bacterial representatives from Cyanobacteria, Actinobacteria and Proteobacteria.

Linked genes
Genes that are located close to each other on the genome and transcribed in the same orientation are shown in Table A2.
Most of these are likely to be part of operons. This provides clues to some cognate pairs of HKs, HYs and RRs. All putative HKVI encoding genes are located with other "chemotaxis" genes in "chemotaxis operons," including two such operons for M. acetivorans and M. mazei. Included in these "chemotaxis operons" are the orthologous groups, ahk40-43, arr7, arr10 and arr18 and arr19 (see Table 4).
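The criterion used for linked genes (adjacent on the genome and transcribed in the same orientation) can be sketched as a simple grouping step. The Python below is only an illustration: the `Gene` record, the gene names, the coordinates and in particular the `max_gap` threshold of 150 bp are assumptions, since the text does not state a distance cut-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coordinate on the replicon
    end: int     # rightmost coordinate on the replicon
    strand: str  # "+" or "-"


def linked_gene_clusters(genes, max_gap=150, min_size=2):
    """Group neighbouring genes on the same strand into candidate operons.

    `max_gap` (bp between consecutive genes) is an assumed threshold,
    not a value taken from the study.
    """
    clusters, current = [], []
    for gene in sorted(genes, key=lambda g: g.start):
        previous = current[-1] if current else None
        if (previous is not None
                and gene.strand == previous.strand
                and gene.start - previous.end <= max_gap):
            current.append(gene)
        else:
            if len(current) >= min_size:
                clusters.append(current)
            current = [gene]
    if len(current) >= min_size:
        clusters.append(current)
    return [[g.name for g in cluster] for cluster in clusters]


# Hypothetical coordinates, for illustration only.
example = [
    Gene("g1", 100, 1000, "+"), Gene("g2", 1050, 2000, "+"),
    Gene("g3", 2100, 3000, "-"), Gene("g4", 3020, 4000, "-"),
    Gene("g5", 9000, 9900, "-"),
]
clusters = linked_gene_clusters(example)
print(clusters)
```

Proximity and shared orientation only suggest co-transcription, which is why such clusters are described as likely rather than confirmed operons.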

Distribution of two-component ORFs
Ten species, representing four genera, have at least 17 putative two-component ORFs. Some of these two-component ORFs are quite sophisticated in structure, including the multiple sensor HKIIs, CheA-like HKVIs and the hybrid kinases. The results presented here show that a number of euryarchaeal species have an extensive array of two-component sensory ORFs.
These proteins may sense a number of different internal signals by means of PAS/PAC domains and their associated cofactors (Aravind 1997, Gilles-Gonzalez and Gonzalez 2004). In addition, the potential to sense other small molecules (particularly cNMPs) via the GAF domains (Ho et al. 2000) and extracellular signals, by the CHASE4 and Cache putative sensory domains, via HAMP domains (Aravind 2001, Zhulin et al. 2003) shows that these organisms (in particular H. marismortui, Natronomonas pharaonis, Methanospirillum hungatei and the Methanosarcinales) possess sophisticated and complex sensory networks. As yet, none of these putative two-component genes has a functional name, so functions can be assigned only by similarity. DNA-binding RRs that regulate gene expression are common in bacteria (Ashby 2004, Galperin 2005). However, only three RRs have been identified with putative DNA-binding domains, all in H. marismortui. If regular indiscriminate HGT were taking place, one would expect to see more DNA-binding RRs in archaeal sequences. Presumably the large number of orphan RRs are involved in regulation of cellular activity by interacting directly with other proteins. Transcriptional control is probably maintained by the many DNA-binding domains that have been identified as part of one-component systems in archaea (Ulrich et al. 2005). In these systems the DNA-binding output domain is linked directly to a sensor domain without any phosphotransfer.
Of the species that have the most two-component genes, H. marismortui and Natronomonas pharaonis are halophilic and the Methanosarcinales and Methanospirillum hungatei are mesophiles. The mesophiles coexist with a large and diverse population of bacteria, giving ample opportunity for HGT, whereas the opportunity for HGT in the halophilic organisms would be more restricted. This raises the question of how the distribution of two-component genes seen in the Euryarchaeota arose. Was it through HGT exclusively, or by vertical transfer from a common ancestral euryarchaeal organism coupled with gene duplications?

Phylogeny and inheritance of two-component ORFs
The phylogenies of five different sets of orthologous ORFs, chosen because they are found in most of the species that contain two-component ORFs (Figures 1-5), were found to closely match the published phylogenies for these organisms (Matte-Tailliez et al. 2002, Brochier et al. 2004, Bapteste et al. 2005). For ahk5 and ahk22, shown in Figure 1, the phylogeny of each group agrees with the current phylogeny of these organisms, and the position of the three bacterial examples indicates that the two groups may have arisen through separate HGT events: into an ancestral euryarchaeal species for ahk5 and, possibly, into an ancestral methanogen for ahk22. The results for the CheA-like HKVI ORFs are shown in Figure 2. Ahk40, ahk42 and ahk43 (except Mhun_401793120) cluster together and probably represent vertical inheritance from a single HGT event into an ancestral Euryarchaeota species (one bacterial ORF from T. maritima giving the best match). Ahk41 appears to be a separate group, found in the Methanosarcinales, that clusters on its own and seems to be more closely associated with the Firmicutes and Proteobacterial examples, presumably representing a separate HGT event. The three Methanospirillum hungatei ORFs seem to be due to separate HGT events (Mhun_401793120 probably should not be in ahk43), and Mhun_401784470 and Mhun_401776240 are probably true paralogues. Figure 3 shows the results for four orphan RR groups. The two groups associated with putative taxis operons, arr7 and arr10, group closely together; however, arr10, which is found only in the Methanosarcinales, is probably due to an HGT event into a direct ancestor of this group. The other two orthologous groups, arr12 and arr14, are quite separate from the first two (arr7 and arr10) and probably arose from separate HGT events into the ancestors of methanogens (AfulAF2419 appears to be a distant member of arr12).
Figure 4 shows the results for the two RRIV-CheB orthologous groups associated with taxis. The arr18 orthologous group found in Methanosarcinales groups separately from arr19, being closer to two of the bacterial ORFs. Therefore arr18 appears to be the result of a separate HGT event in an ancestor of the Methanosarcinales, whereas arr19 appears to be the result of an HGT event into an ancestor of Euryarchaeota.
The phylogeny for ahy2 (the biggest hybrid kinase orthologous group) shows that these members probably arose from more than one HGT event. The combined results for the orthologous groups found in potential taxis operons are shown in Table A2.
The operon that contains HKVI (ahk40/42/43), RRI (arr7) and RRIV-CheB (arr19) appears to have arisen as an HGT event that transferred the whole operon into an ancestor of the Euryarchaeota. In contrast, the taxis operon containing HKVI (ahk41), RRI (arr10) and RRIV-CheB (arr18) appears to have arisen from a separate HGT event of the whole operon into a direct ancestor of the Methanosarcinales.
The results presented here suggest that HGT has taken place from bacterial species both into ancestral Euryarchaeota and more recently into the methanogens. However, the large numbers of two-component genes in the mesophilic methanogens and the Halobacteriales probably reflect their well-known metabolic flexibility (Bapteste et al. 2005, Falb et al. 2005). This, in turn, points to an increased requirement for regulation of cellular activity in a changing environment rather than to an increased potential for HGT from bacteria. Most of the two-component ORFs that can be observed in these groups of organisms are probably derived from paralogous gene duplication events; the number of two-component ORFs observed would be driven by the requirement to control cellular activity as the organisms evolve. A limited number of HGT events could be sufficient to account for the diversity of phosphotransfer and sensory domains. Any function of two-component ORFs is inferred by homology to known bacterial genes (e.g. HKVI and chemotaxis) and awaits in situ or in vitro studies, or both. This highlights the importance of collaboration between bioinformaticians and biochemists to plan experiments in an informed way, particularly where orthologues are identified and found in more than one genus and hence may play central roles in cellular regulation.