Pressures in Archaeal Protein Coding Genes: A Comparative Study

Our study of base usage within the codons of 11 completely sequenced archaeal genomes shows that, as we move from species with GC-rich protein-coding genes to species with AT-rich ones, the differences between G and C and between A and T, the purine load (AG content), and the overall persistence (i.e. the tendency of a base to be followed by the same base) within codons all increase almost simultaneously, although the extent of the increase differs among the three codon positions. These findings suggest that the deviations from the second parity rule (through the increasing differences between complementary base contents) and the increasing purine load reduce the chance of forming intra-strand Watson–Crick base-paired secondary structures in mRNAs (synonymous with the protein-coding genes we dealt with), thereby increasing translational efficiency. We hypothesize that archaeal species with AT-rich protein-coding genes might have better translational efficiency than their GC-rich counterparts.


Introduction
The recent availability of complete sequences for several archaeal genomes has led us to a comparative study of sequence features among archaeal species. Archaea share several features with bacteria, viz. large circular genomes, sometimes accompanied by circular plasmids, the absence of nucleosomal structure, a single initiation site for genome replication (Myllykallio et al., 2000), etc. However, archaeal species also show closeness to eukaryotes: there are similarities in DNA and RNA polymerases, in ribosomal RNA and proteins, in several other proteins related to information processing, and in the presence of TATA box binding sites (Pühler et al., 1989; Zillig et al., 1993; Brown and Doolittle, 1997).
The separate taxonomic status of archaea (Woese and Fox, 1977; Woese, 1987; Woese et al., 1990; Olsen and Woese, 1993; Olsen et al., 1994) owes a lot to some unique features, such as the presence of prenyl ether lipids instead of acyl ester lipids, the presence of a tiny large subunit ribosomal protein LX, the absence of the HSP90 chaperone, and the presence of a split in RNA polymerase A (Cavalier-Smith, 2002). Considering the similarities and the differences, it is believed that archaea narrow the gap between bacteria and eukaryotes (Keeling and Doolittle, [...]
[...] were Aeropyrum pernix K1 (Aper), Sulfolobus solfataricus (Ssol) and Sulfolobus tokodaii (Stok). The short names in parentheses are used in the rest of this paper. For each genome we analysed a single large DNA sequence obtained by concatenating all of the protein-coding genes. During concatenation, when we found 'complementary' regions on the GenBank strand, we converted them into the protein-coding genes on the other strand, read in the 5′ to 3′ direction, and considered those genes along with the protein-coding genes on the GenBank strand. We therefore concatenated all the protein-coding sequences from both strands, maintaining the coding direction. Our samples were these 11 concatenated sequences.
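The strand handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function names and the (sequence, strand) input format are assumptions made for the example, and each region is assumed to be given as it is written on the GenBank strand.

```python
# Hypothetical helper names; the (sequence, strand) input format is an assumption.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, i.e. the other strand read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def concatenate_coding_sequences(regions) -> str:
    """Join coding regions so that every gene reads in its coding direction.

    `regions` is an iterable of (sequence, strand) pairs. `sequence` is the
    region as written on the GenBank strand; `strand` is '+' for genes on
    that strand and '-' for regions annotated as 'complementary'.
    """
    parts = []
    for seq, strand in regions:
        seq = seq.upper()
        # Complementary regions are converted to the gene on the other strand.
        parts.append(reverse_complement(seq) if strand == "-" else seq)
    return "".join(parts)


# Toy input: the second region encodes the same gene on the other strand.
regions = [("ATGGCCTAA", "+"), ("TTAGGCCAT", "-")]
sample = concatenate_coding_sequences(regions)  # "ATGGCCTAAATGGCCTAA"
```

In the toy input, both regions yield ATGGCCTAA once the coding direction is restored, so the concatenated sample stays in frame.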
The main aim of this paper was to isolate some of the pressures on these archaeal genes. Violations of the second parity rule, PR2 (Sueoka, 1995, 1999), have been observed by applying PR2 to a single strand of DNA sequences. Here we investigated PR2 on the concatenated gene sequences to study the global violations of PR2 within all the protein-coding genes from both strands taken together. Both G − C (the difference between G and C contents) and A − T (the difference between A and T contents) increased with increasing AT content. We show, however, that other simultaneous pressures, in effect, gave rise to an increase in purine load. This rise in purine load was not uniform over the codon sites: it was spatially differentiated, with the load rising most at the first codon position, followed by the third, while the interspecific variation in purine load was least for the middle position of the codons. Accompanying these changes in base composition was a trend towards increasing persistence within codons with AT richness. While the presence of persistence was consistent with Chargaff's (1963) clustering rule, which suggested that individual bases cluster to an extent larger than random expectation, we found that the major contributions to this clustering came from persistence between the first and third and between the second and third positions within codons. The persistence between the first and second positions did not contribute significantly to this clustering. We show that clustering in archaea was again a spatially differentiated process that did not occur uniformly over all the codon positions.

Usage of bases within codons
We studied the frequencies of occurrence of A, C, G and T in each of the three positions in codons. We also calculated the overall average frequencies of the four bases in all three positions. In each case, we measured the purine load, in terms of AG content, as well as the differences between A and T contents, and between G and C contents.
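As an illustration of these measurements, the following Python sketch tabulates base percentages at each codon position and over all positions, together with the purine load and the two differences A − T and G − C (denoted W and S later in the paper). The function name and the dictionary layout are choices made for this example only; bases other than A, C, G and T are simply ignored, which is an assumption rather than a documented step of the study.

```python
from collections import Counter


def position_stats(seq: str) -> dict:
    """Base percentages, W = A - T, S = G - C and purine load per codon position.

    Keys 1, 2 and 3 are the codon positions; "all" pools the three positions.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    slices = {
        1: seq[0:usable:3],
        2: seq[1:usable:3],
        3: seq[2:usable:3],
        "all": seq[:usable],
    }
    stats = {}
    for key, bases in slices.items():
        counts = Counter(bases)
        total = sum(counts[b] for b in "ACGT")  # ambiguous bases are ignored
        if total == 0:
            raise ValueError("sequence contains no complete codon")
        pct = {b: 100.0 * counts[b] / total for b in "ACGT"}
        stats[key] = {
            **pct,
            "W": pct["A"] - pct["T"],
            "S": pct["G"] - pct["C"],
            "purine_load": pct["A"] + pct["G"],
        }
    return stats


stats = position_stats("ATGGCA")  # codons ATG and GCA
```

For the two-codon toy sequence, position 1 holds only A and G (purine load 100%), while position 2 holds only T and C (purine load 0%).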

Measuring similarity index
We studied the frequencies of occurrence of nearest-neighbour codon pairs. On this basis we developed a 64 × 64 matrix for each species. These matrices carried the footprints of nearest-neighbour selectional influence on codon usage, and we measured the extent of similarity among them. If C_ij(M1) and C_ij(M2) were the values of any particular cell C_ij (where both i and j run from 1 to 64) of matrices M1 and M2, respectively, the similarity index (SI) was given by: [...] C_ij denoted the total number of cells in the 64 × 64 matrix.
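The construction of the 64 × 64 matrix can be sketched as follows. The sketch assumes that "nearest-neighbour codon pairs" means adjacent in-frame codons in the concatenated sequence. The `placeholder_similarity` function is explicitly NOT the SI of this paper, whose formula is not reproduced here; it is a hypothetical stand-in (one minus half the summed absolute difference of relative cell frequencies) included only so that two matrices can be compared in the example.

```python
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # AAA, AAC, ..., TTT
INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_pair_matrix(seq: str) -> list:
    """Count adjacent in-frame codon pairs in a 64 x 64 nested list."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    matrix = [[0] * 64 for _ in range(64)]
    for first, second in zip(codons, codons[1:]):
        if first in INDEX and second in INDEX:  # skip codons with ambiguous bases
            matrix[INDEX[first]][INDEX[second]] += 1
    return matrix


def placeholder_similarity(m1: list, m2: list) -> float:
    """ASSUMED stand-in, not the paper's SI.

    Returns 1 for identical relative frequencies and 0 for matrices that
    share no occupied cell.
    """
    t1 = sum(map(sum, m1))
    t2 = sum(map(sum, m2))
    if t1 == 0 or t2 == 0:
        raise ValueError("each matrix needs at least one codon pair")
    distance = sum(
        abs(a / t1 - b / t2)
        for row1, row2 in zip(m1, m2)
        for a, b in zip(row1, row2)
    )
    return 1.0 - 0.5 * distance


m = codon_pair_matrix("ATGGCAATGGCA")  # codons ATG, GCA, ATG, GCA
```

Here the pair (ATG, GCA) is counted twice and (GCA, ATG) once, giving three codon pairs in total.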

Measuring persistence index
Persistence means that a base tends to be followed by the same base. We looked for the level of persistence within codons. By this definition, AAA, CCC, GGG and TTT are the most persistent codons, while codons such as ACA, TTG and TCC, in which two of the three bases are identical, bring a somewhat lower level of persistence to the sequence. In contrast, GCA, CAT, etc. are antipersistent codons (Chattopadhyay et al., 2002). We computed the square of the number of occurrences of a particular base within each codon along the sequence, and the average of these values gave the persistence index (PI) within codons for that base. PI, therefore, was given by:
$$PI_b^{(3)} = \frac{1}{N}\sum_{k=1}^{N} n_b(k)^2$$
where b was any base (A, C, G, T), n_b(k) was the number of occurrences of b in the k-th codon, N was the number of codons in the sequence, and the superscript (3) highlights the codon length of three bases.
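The prose definition above (the square of the count of a base within each codon, averaged along the sequence) can be sketched in Python as follows. The function name is hypothetical, and the handling of a trailing partial codon (it is dropped) is an assumption of the example.

```python
def persistence_index(seq: str) -> dict:
    """Mean over codons of the squared count of each base within the codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    return {
        base: sum(codon.count(base) ** 2 for codon in codons) / len(codons)
        for base in "ACGT"
    }


pi = persistence_index("AAAGCA")  # codons AAA and GCA
```

Under this definition a codon contributes 9 to a base that fills all three positions, 4 when the base occurs twice, and 1 when it occurs once. For the toy sequence, A contributes 9 from AAA and 1 from GCA, so its index is 5.0.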

Base usage within codons
Halo was the species with the most GC-rich protein-coding genes, followed by Aper, while both Aful and Meth had about 50% GC content. The remaining seven species were AT-rich, with Ssol, Stok and Mjan having the highest AT content. It is worth noting that GC richness in protein-coding genes did not correlate with increasing thermophilicity of the species; e.g. the only mesophilic species dealt with here, Halo, had the most GC-rich genes. When the 11 species were placed in decreasing order of GC content separately for the three codon positions, the decrease was most pronounced for the third position (Figure 1). It is well known that the third codon position is the most susceptible to change over time. Indeed, the codon usage study showed that, with increasing AT richness within genes, the archaeal species opted for increased usage of comparatively AT-rich synonymous codons, mostly differing in the third position (e.g. for glycine, the use of GGC and GGG was gradually overshadowed by GGA and GGT in species with AT-rich genes).
The averages of individual base content over the three codon positions showed that, for all 11 samples, %A exceeded %T; similarly, %G was greater than %C except for Halo (Figure 1d). It was known from earlier studies of bacteria and primates that %A exceeds %T in protein-coding sequences (Mrázek and Kypr, 1994; Bell and Forsdyke, 1999). The archaeal samples were no exception in this regard. Let A − T be denoted by W, and G − C by S. Table 1 shows the %(A + T) content averaged over all three positions within codons for the 11 archaeal species.
When we plotted W and S against increasing %(A + T) content for the 11 species, both the overall W and the overall S increased (Figure 2). For samples with %(A + T) > %(G + C), S exceeded W, and within these AT-rich samples W stayed roughly the same. Taken together, these trends show that C content fell faster than G content. Since W also increased with decreasing G + C, we conclude that A + G, i.e. the purine load, increased with increasing %(A + T) (Figure 3).
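The step from W and S to the purine load can be made explicit. The short derivation below assumes only that the four base contents are expressed as percentages summing to 100, and uses the definitions W = A − T and S = G − C given above:

```latex
\begin{align}
(A + G) - (C + T) &= (A - T) + (G - C) = W + S, \\
(A + G) + (C + T) &= 100, \\
\Rightarrow\quad A + G &= 50 + \tfrac{1}{2}\,(W + S).
\end{align}
```

The purine load therefore rises whenever W + S rises, and it exceeds 50% whenever W + S is positive.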
Interestingly, when we plotted the values of W, S and %(A + G) for the first, second and third positions within codons and for all three positions taken together (Figures 2 and 3), we found that the first position showed the strongest PR2 violations and purine load in protein-coding genes. For the second position, both W and S values were negative in most cases, and the species-wise purine load was lowest, with little interspecific variation except for Mjan. The third position showed some negative W and S values and a weak purine load compared to the first position, but the increase in purine load from Halo (42.677%) to Mjan (55.447%) was extremely conspicuous. The first position always had high W and S values with considerable

Comparison based on codon relatives
A considerable body of literature documents the regular context sensitivity of codon usage in different prokaryotes and eukaryotes (Irwin, Heck and Hatfield, 1995; Karlin and Mrázek, 1996; Berg and Silva, 1997; Antezana and Kreitman, 1999; McVean and Hurst, 2000; Fedorov et al., 2002). We compared the 64 × 64 matrices of nearest-neighbour codon-pair frequencies for each pair of species to obtain the SI for those two species. We noted that the distance of the 10 other species from Halo, based on SI, increased with AT richness; accordingly, the SI between Mjan (the most AT-rich species) and Halo (the most GC-rich species) was the lowest (Figure 4).

[Table 1. %(A + T) content averaged over all three positions within codons for the 11 archaeal species]

Persistence within codons
As expected, moving from GC-rich codon-containing species to AT-rich ones showed an increase in persistence for the A and T channels, while antipersistence increased in the G and C channels (Figure 5a). Interestingly, the rates of increase in persistence and antipersistence were not equal, and the former dominated. Therefore, the average of the individual PI values showed an overall increase in persistence towards AT-rich codon-containing species (Figure 5b).
We then traced which pairs of positions within codons were most responsible for the trend in overall persistence. The study showed that positions 1 and 3, as well as positions 2 and 3, made the major, and roughly equal, contributions to increasing persistence within codons, while positions 1 and 2 contributed the least (Table 2). This suggested a spatially differentiated 1-periodic structure (due only to the second and third positions) and a 2-periodic structure (due to the first and third positions) within codons. However, when we calculated the 1- and 2-periodicities over the entire sequences, we found that the 1-periodicity values, unlike the 2-periodicity values, showed a prominent increasing trend with decreasing GC content and increasing purine load (results not shown).
Again, the extent of persistence between the third position of the previous codon and the first position of the next codon was very low, even lower than that between the first and second positions within codons (Table 2). This implied the dominance of the intra-codon 1-periodicities over the inter-codon 1-periodicity. We noted that the values for Aful in Table 2 were distinctly different from those of the other archaeal species: while its level of persistence between the second and third positions was much smaller than in any other species, its persistence levels between all other pairs of positions were higher than those of any other species.

Table 2. Periodicities as measured from the dinucleotide second moments. (1-2), (2-3) and (1-3) denote the persistence between the first and second, the second and third, and the first and third positions within codons. (3-1) denotes the persistence between the third position of the previous codon and the first position of the next codon.
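To make the four position pairs concrete, the sketch below computes, for each pair, the fraction of cases in which the two positions carry the same base. This match fraction is a simplified proxy chosen for illustration; it is an assumption of the example and not the dinucleotide second-moment measure reported in Table 2. The function name and the pair labels used as dictionary keys are likewise hypothetical.

```python
def pair_match_fractions(seq: str) -> dict:
    """Fraction of identical bases for the pairs (1-2), (2-3), (1-3) and (3-1).

    (3-1) compares the third base of a codon with the first base of the
    next codon; the other three pairs are within a single codon.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if len(codons) < 2:
        raise ValueError("need at least two complete codons")
    within = {"1-2": (0, 1), "2-3": (1, 2), "1-3": (0, 2)}
    result = {
        label: sum(codon[i] == codon[j] for codon in codons) / len(codons)
        for label, (i, j) in within.items()
    }
    # Inter-codon pair: last base of one codon against first base of the next.
    result["3-1"] = sum(
        prev[2] == nxt[0] for prev, nxt in zip(codons, codons[1:])
    ) / (len(codons) - 1)
    return result


fractions = pair_match_fractions("AAAGCGTTA")  # codons AAA, GCG, TTA
```

In the toy sequence, AAA matches at every within-codon pair, GCG matches only at (1-3), and TTA only at (1-2); neither codon boundary carries a repeated base, so the (3-1) fraction is zero.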

Hints for translational efficiency
The W and S values in Figure 2 led us to assess Szybalski et al.'s (1966) transcription-direction rule and purine loading. The mRNA-synonymous DNA sequences, that is, the protein-coding sequences we dealt with here, were purine-rich (Dang et al., 1998). Purine richness in mRNA reduces the formation of double-stranded RNA secondary structure (more precisely, intra-strand Watson–Crick base pairing), which presumably increases translational efficiency (Lao and Forsdyke, 2000). The pressure in the 11 archaeal genomes, as observed here, resulted in an increase of W and S with increasing AT content; thus, the deviations from PR2 increased. These increases in PR2 deviations were important, but not sufficient on their own. When we combined the PR2 deviations with the faster decline of C than of G, and with the increase of W, as G + C decreased, we arrived at the increase in purine load. Overall, this pressure of increasing purine load in archaea reduces the strength of mRNA secondary structure in comparatively AT-rich species, with the strongest footprints in the third position of codons, where the positive correlation between the increase in AT content and the increase in purine load was most conspicuous.
There was another important component of the pressures on archaeal protein-coding genes, which related to the persistence of bases within codons. This persistence, or clustering, was between the first and third positions and between the second and third positions (Table 2). The relative weights of these two contributions were roughly equal, and both were greater than that between the first and second positions. The 'cluster rule' of Chargaff (1963), that individual bases tend to cluster more than expected on a random basis, holds true, but curiously not between the first and second positions in archaeal protein-coding genes. Furthermore, the level of persistence was even lower between the third position of the previous codon and the first position of the next codon, suggesting the dominance of intra-codon 1-periodic persistence over inter-codon 1-periodic persistence.
The overall increase of A + T at the expense of G + C obviously increased the melting flexibility (Ussery, 2001), but it had consequences for mRNA secondary structure as well. The importance of this was emphasized earlier, in that a rigid secondary structure implied a limited coding potential (Salser, 1970; Ball, 1972, 1973). It is also noteworthy that, while the GC content of rRNA correlated positively with the optimum growth temperatures of thermophilic microbes (Dalgaard and Garrett, 1993; Forterre and Elie, 1993; Galtier and Lobry, 1997), no such correlation was found between their mRNA GC content and their optimum growth temperatures (Galtier and Lobry, 1997; Filipski, 1990). Bernardi and Bernardi (1986) suggested that genomic GC might have important roles in more fundamental adaptive processes, and that temperature did not dictate the GC content. On the other hand, genomic DNA might achieve its high thermal stability through its association with polyamines (Oshima et al., 1990) or through relaxation of supercoiling (Friedman et al., 1995). The increase in AT, in this view, might lead to greater translational efficiency. This is in addition to what we have discussed earlier about purine load.