Gene decay in archaea

The gene-dense chromosomes of archaea and bacteria were long thought to be devoid of pseudogenes, but with the massive increase in available genome sequences, whole genome comparisons between closely related species have identified mutations that have rendered numerous genes inactive. Comparative analyses of sequenced archaeal genomes revealed numerous pseudogenes, which can constitute up to 8.6% of the annotated coding sequences in some genomes. The largest proportion of pseudogenes is created by gene truncations, followed by frameshift mutations. Within archaeal genomes, large numbers of pseudogenes contain more than one inactivating mutation, suggesting that pseudogenes are deleted from the genome more slowly in archaea than in bacteria. Although archaea seem to retain pseudogenes longer than do bacteria, most archaeal genomes have unique repertoires of pseudogenes.


Introduction
Genomic studies have allowed for the in-depth analysis of the genetic structure of organisms from all domains of life. Unfortunately, archaea are rather poorly represented among the more than 400 fully sequenced genomes, due in part to the difficulties associated with their cultivation and genetic manipulation (Schleper et al. 2005). Although the Eukarya and the Archaea are sister taxa (Ciccarelli et al. 2006), the overall organization of archaeal genes and genomes is more similar to that of the Bacteria. Like bacteria, archaea usually contain a single circular chromosome and have a high gene density, with genes organized in operons and lacking introns.
The elucidation of complete genome sequences has instigated large-scale experimental and computational analyses that have attempted to identify and annotate all genes encoded in a genome. Despite such efforts, up to 40% of the predicted coding sequences in many archaeal genomes lack a predicted function (Galperin and Koonin 2004, Fricke et al. 2006). It has been suggested that, within virtually all genomes, there are annotated genes that have been mutationally inactivated and can never be assigned a function (Ochman and Davalos 2006). Eukaryotic genomes have long been known to contain large numbers of non-functional genes (Vanin 1985), but the full extent of their pseudogene contents was not evident until whole genome sequences became available. Genome-wide analyses of the nematode (Harrison et al. 2001) and human (Torrents et al. 2003) genomes have detected massive amounts of now-defunct genes, whose numbers likely exceed the number of functional genes in each genome.
Because of the small size and high gene density of bacterial genomes, it was originally thought that prokaryotes would contain few, if any, pseudogenes (Lawrence et al. 2001). In addition, the majority of pseudogenes in higher eukaryotes are generated by retrotransposition (Vanin 1985), a process unknown in bacteria or archaea. Nevertheless, pseudogenes are now known to be a common feature of many bacterial genomes (Lerat and Ochman 2004, 2005) and may constitute nearly half of the annotated coding sequences (CDSs) in the genomes of some pathogens (Andersson et al. 1998, Cole et al. 2001, Toh et al. 2006).
A previous assessment of prokaryotic genomes estimated that up to 5% of the annotated genes in archaeal genomes may, in fact, be pseudogenes (Liu et al. 2004). This analysis, in which the contents of 53 bacterial and 11 archaeal genome sequences were compared, suggested that the pseudogenes originated predominantly from failed horizontal gene transfer events (as opposed to the mutational inactivation of resident genes). However, this analysis included comparisons of genes over broad phylogenetic distances, did not discriminate between orthologous and paralogous genes, and ignored genes without an assigned function. As such, this approach could lead to inaccurate appraisals of the pseudogene contents, particularly in the Archaea, which were not densely sampled and in which the functions of many genes have not been characterized. Moreover, if archaea have more split genes than bacteria, as suggested previously (Snel et al. 2000), the recognition of pseudogenes through inter-domain comparisons becomes less straightforward.
An alternative approach, one in which closely related genome sequences are compared, can enhance the resolution of pseudogenes because larger fractions of the genome are shared and there are fewer problems in assigning orthology. Most bacterial genomes have been found to contain largely unique sets of pseudogenes, suggesting that pseudogenes are constantly formed in, and rapidly eliminated from, the genome (Lerat and Ochman 2004); however, the frequencies with which pseudogenes are generated, maintained and removed from archaeal genomes are unknown. With the current availability of numerous archaeal genome sequences, including multiple members of particular genera, we sought to identify and enumerate the pseudogenes in archaea, and assessed their mechanisms of formation and erosion.

Selection of phylogenetically related species
We used 16S rDNA sequences to determine the phylogenetic relationships of archaea for which genome sequences were available at NCBI as of April 1, 2006. Using the MEGA3.1 software package (Kumar et al. 2004), we applied a Minimum Evolution bootstrap test of phylogeny with pairwise deletion and 10,000 replicates. The relationships, accession numbers and genomic properties of the 15 selected species are presented in Figure 1.

Pseudogene identification
Pseudogenes were identified by comparing the genome contents of sequenced members of the same clade, as shown in Figure 1. Full genome sequences and sets of GenBank annotated proteins for the corresponding genomes were obtained from NCBI (http://www.ncbi.nih.gov). Inclusion thresholds for aligned sequences were as described previously (Lerat and Ochman 2005): E-values of < 10^-15 and sequence identity of > 75% for clades containing species with 16S rDNA identity > 95% (i.e., Pyrococcus spp., Methanosarcina spp., Thermoplasma spp.; highlighted in grey in Figure 1); and E-values of < 10^-10 and sequence identity of > 49% for clades containing species with 16S rDNA identity < 95% (i.e., Halobacterium/Haloarcula, Methanococcus/Methanocaldococcus, Sulfolobus spp.).
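The two-tier inclusion rule can be written as a small filter. The following is a minimal sketch, not the code used in the study: the function and field names are hypothetical, and because the text gives thresholds only for 16S rDNA identity above and below 95%, the handling of exactly 95% (treated here as the more divergent tier) is an assumption.

```python
from typing import NamedTuple


class Thresholds(NamedTuple):
    max_evalue: float     # hits must have an E-value strictly below this
    min_identity: float   # hits must have percent identity strictly above this


def inclusion_thresholds(rdna_identity: float) -> Thresholds:
    """Pick alignment thresholds from the 16S rDNA identity (percent) of a clade."""
    if rdna_identity > 95.0:
        # Closely related clades: the stricter thresholds apply.
        return Thresholds(max_evalue=1e-15, min_identity=75.0)
    # Assumption: exactly 95% falls into the more divergent tier.
    return Thresholds(max_evalue=1e-10, min_identity=49.0)


def passes(evalue: float, identity: float, rdna_identity: float) -> bool:
    """Return True if an aligned hit meets the inclusion thresholds for its clade."""
    t = inclusion_thresholds(rdna_identity)
    return evalue < t.max_evalue and identity > t.min_identity
```

For instance, a hit with an E-value of 1e-12 and 80% identity would be rejected in a clade with 98% 16S rDNA identity but accepted in a clade with 90% identity.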
Within each clade of closely related organisms, the annotated proteins within the genome of one species were queried against the complete nucleotide sequence of another species from the same clade. This was done reciprocally for all crosswise comparisons within each clade using the TBLASTN search tool (Altschul et al. 1997). The Ψ-Φ program suite, developed to recognize truncated and otherwise mutationally altered CDSs (Lerat and Ochman 2004), was applied to the TBLASTN output, returning a list of candidate pseudogenes that was then curated manually. One way in which Ψ-Φ recognizes potential pseudogenes is by identifying internal stop codons in a query gene. For the comparisons of the three Methanosarcina spp., it was necessary to disable this feature for the amber stop codon TAG, which instead codes for pyrrolysine in this genus (James et al. 2001). In addition, all pseudogenes were curated manually for known recoding events (Cobucci-Ponzano et al. 2005).
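The effect of disabling TAG as a stop codon can be illustrated with a simple in-frame scan. This is a sketch of the general idea only and does not reproduce the Ψ-Φ implementation. The function name and the example sequences are hypothetical, and the scan assumes that the input is an in-frame nucleotide sequence whose final codon is the expected terminator.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})

# For Methanosarcina spp., TAG is not treated as a stop codon.
METHANOSARCINA_STOPS = STANDARD_STOPS - {"TAG"}


def internal_stop_positions(cds: str, stops=STANDARD_STOPS) -> list:
    """Return 0-based nucleotide offsets of in-frame stop codons before the last codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    # The final codon is skipped because it is expected to terminate the gene.
    return [i for i in range(0, (n_codons - 1) * 3, 3) if cds[i:i + 3] in stops]


# A made-up sequence with an in-frame TAG at offset 3.
example = "ATGTAGGCTTAA"
flagged_standard = internal_stop_positions(example)
flagged_methanosarcina = internal_stop_positions(example, METHANOSARCINA_STOPS)
```

With the standard stop set, the example is flagged at offset 3; with TAG removed from the set, nothing is flagged, so the gene would not be reported as a candidate pseudogene on this criterion.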
Pseudogenes detected by the comparative analysis of full genome sequences can be either positional homologs or non-positional homologs of CDSs in the reciprocal genomes, based on gene context conservation: positional pseudogenes are those that share at least one neighboring gene with the corresponding functional copy in the related genome; all others are non-positional.
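The positional criterion reduces to a set intersection over neighboring genes. The sketch below is illustrative and rests on two assumptions not stated in the text: that "neighboring" means the immediately flanking genes, and that genes are represented by shared ortholog identifiers (the `og…` labels are made up). For simplicity the gene order is treated as linear rather than circular.

```python
def flanking(gene_order: list, gene: str) -> set:
    """Return the immediately adjacent genes of `gene` in a linear gene order."""
    i = gene_order.index(gene)
    return {gene_order[j] for j in (i - 1, i + 1) if 0 <= j < len(gene_order)}


def is_positional(pseudogene_neighbors, functional_neighbors) -> bool:
    """A pseudogene is positional if it shares at least one neighbor with its functional copy."""
    return not set(pseudogene_neighbors).isdisjoint(functional_neighbors)


# Hypothetical gene orders, written as ortholog identifiers.
genome_a = ["og1", "og2", "og3", "og4"]   # og3 is the pseudogene here
genome_b = ["og9", "og2", "og3", "og7"]   # functional copy, og2 is a shared neighbor
genome_c = ["og5", "og3", "og6"]          # functional copy, no shared neighbor

positional_ab = is_positional(flanking(genome_a, "og3"), flanking(genome_b, "og3"))
positional_ac = is_positional(flanking(genome_a, "og3"), flanking(genome_c, "og3"))
```

In this toy case the pseudogene is positional relative to genome B and non-positional relative to genome C.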
To identify gene-inactivating mutations, the putative pseudogenes were aligned with their counterparts using CLUSTALW 1.83 (Thompson et al. 1994). Gene-inactivating mutations were partitioned into five classes: frameshifts (insertions or deletions of 1 or 2 nucleotides in length), deletions (> 2 nucleotides in length), insertions (> 2 nucleotides in length), truncations (large deletions at either or both ends of a coding sequence), and nonsense mutations. In cases where more than one gene-inactivating mutation was identified in an alignment, the mutation was classified as a combination of two or more of these classes.
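The five-class scheme can be expressed as a short decision rule. This is a minimal sketch under stated assumptions: events are taken to be already extracted from the alignment as (kind, length) pairs, the function names are hypothetical, and the text does not say how repeated events of the same class were labeled, so they are collapsed into one class here.

```python
def mutation_class(kind: str, length: int = 0) -> str:
    """Map one inactivating event to one of the five classes used in the text."""
    if kind in ("truncation", "nonsense"):
        return kind
    if kind in ("insertion", "deletion"):
        if length < 1:
            raise ValueError("indel length must be at least 1 nucleotide")
        # Indels of 1 or 2 nucleotides are frameshifts; longer ones keep their kind.
        return "frameshift" if length <= 2 else kind
    raise ValueError(f"unknown event kind: {kind!r}")


def classify(events) -> list:
    """Return the sorted distinct classes for a pseudogene; more than one means a combination."""
    return sorted({mutation_class(kind, length) for kind, length in events})
```

For example, `classify([("deletion", 1), ("truncation", 0)])` yields `["frameshift", "truncation"]`, which would count as a combination of two classes.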

Pseudogene content analyses in archaeal genomes
We assessed the pseudogene contents of 15 species, representing eight genera of archaea, by comparing the full genome sequences of the most closely related taxa. The fractions of predicted pseudogenes range from 0.3% to 8.6% of the total number of annotated protein coding sequences, including unannotated pseudogenes, i.e., intergenic regions that contain the eroded remnants of genes that have been annotated as CDS in a related genome (Figure 2). Applying identical methods, the range and the average pseudogene fractions are much lower in archaea than in pathogenic bacteria but similar to those of free-living bacteria (Lerat and Ochman 2004, 2005).
There are almost equal fractions of positional (i.e., sharing at least one neighboring gene with its counterpart in a related genome) and non-positional pseudogenes in most genomes, and within both classes, there are large numbers of unannotated pseudogenes (Table 1). Across taxa, the fraction of these unannotated pseudogenes increases with the total number of detected pseudogenes. As expected, large proportions of predicted pseudogenes are annotated as having hypothetical functions (Table 1). Inactivated mobile element-associated genes are more abundant among non-positional pseudogenes, especially in Sulfolobus spp. and Methanosarcina spp. The lists of the predicted pseudogenes are provided in Supplementary Tables S1 and S2.

Mechanisms of gene inactivation
When compared to bacterial pseudogenes, archaeal pseudogenes are more highly decayed, with a larger fraction containing more than one inactivating mutation (Figure 3). In both archaea and bacteria, non-positional pseudogenes show greater gene decay than do positional pseudogenes. Truncations, frameshifts, and combinations thereof, are the most widespread mechanisms by which genes are inactivated in archaea (Supplementary Figure S1).
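The gene decay ratio plotted in Figure 3 is defined in its caption as the number of pseudogenes with multiple inactivating mutations divided by the number with a single mutation. A minimal sketch of that calculation follows; the function name is hypothetical and the input counts in the usage note are made up for illustration, not taken from the study.

```python
def gene_decay_ratio(mutation_counts) -> float:
    """Ratio of pseudogenes with >1 inactivating mutation to those with exactly 1.

    `mutation_counts` holds one integer per pseudogene: its number of
    inactivating mutations.
    """
    counts = list(mutation_counts)
    multiple = sum(1 for n in counts if n > 1)
    single = sum(1 for n in counts if n == 1)
    if single == 0:
        raise ValueError("ratio is undefined without single-mutation pseudogenes")
    return multiple / single
```

With illustrative counts of `[1, 1, 2, 3]` the ratio is 1.0, and with `[1, 1, 1, 1, 2]` it is 0.25; a higher value indicates a more decayed pseudogene set.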
A total of six pseudogenes were inactivated by changes in the lengths of a homopolymeric stretch spanning more than six nucleotides, and all of these frameshifts occurred in A/T tracts (data not shown). Unlike the situation observed in bacterial pathogens, the interruption of genes by insertion sequences is relatively rare in archaea, with only 5 and 28 IS-inactivated genes from a total of 550 positional and 711 non-positional pseudogenes, respectively. Among the non-positional pseudogenes, 22 of the 28 IS-inactivated genes occur in S. solfataricus, which contains exceptionally large numbers of mobile elements (She et al. 2001).
Because pseudogenes are under no functional constraint and are evolutionarily neutral, such regions can divulge the underlying rate and pattern of mutations within a genome (Li et al. 1981, Mira et al. 2001). In general, deletions occur more frequently than insertions in archaeal pseudogenes (Figure 4). Thermoplasma volcanium is the only archaeal species in which the cumulative indel length is positive (i.e., more DNA is inserted than deleted in the detected pseudogenes), and this is due to a single 88-bp duplication in the NADH ubiquinone oxidoreductase gene. Although their overall indel lengths are negative, in the non-positional pseudogenes of species within the Methanosarcina clade, insertions outnumber deletions, with most cases involving the insertional inactivation of transposases within mobile elements.
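The distinction between the number of indel events and their cumulative length can be made concrete with a short calculation. The sketch below is illustrative: the function name is hypothetical, and apart from the 88-bp duplication mentioned above, all event lengths are made-up values.

```python
def cumulative_indel_length(insertions, deletions) -> int:
    """Net nucleotides gained: total inserted length minus total deleted length."""
    return sum(insertions) - sum(deletions)


# One large insertion can make the net length positive even when
# deletions are more numerous (deletion lengths here are invented).
net_single_duplication = cumulative_indel_length(insertions=[88], deletions=[3, 10])

# Conversely, insertions can outnumber deletions while the net length
# stays negative (all lengths here are invented).
net_many_small_insertions = cumulative_indel_length(insertions=[4, 5, 6], deletions=[40])
```

The first case gives a net gain of 75 bp from a single insertion; the second gives a net loss of 25 bp even though three insertions are set against one deletion.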

GC content of archaeal genes with pseudogene counterparts
Previous analyses have reported a significant difference between the nucleotide composition of genes that have a pseudogene counterpart and those that do not (Lerat and Ochman 2004). We searched for this in archaea for the functional counterparts of both positional and non-positional pseudogenes (Supplementary Table S3) and found no consistent trend. Genes with positional or non-positional pseudogene counterparts have higher GC contents than genes without pseudogene counterparts in Methanocaldococcus jannaschii, Methanococcus maripaludis and the Sulfolobus spp. genomes. Genes with non-positional pseudogene counterparts have a significantly lower GC percentage in the Methanosarcina clade. In this case, the difference in GC contents is due to the many inactivated mobile elements, which typically have a different GC content than that of their hosts.

Shared pseudogenes between closely related archaeal genomes
The occurrence of only a single inactivating mutation in the vast majority of bacterial pseudogenes implies that pseudogenes are rapidly generated in, and removed from, these genomes. This process has resulted in genomes containing largely nonoverlapping pseudogene inventories, even among strains averaging only 1% in sequence divergence (Lerat and Ochman 2004). Although archaea seem to retain pseudogenes longer than do bacteria (as evident from the higher incidence of pseudogenes containing multiple inactivating mutations), most archaeal genomes have unique repertoires of pseudogenes. Among Pyrococcus spp., which average 30 positional pseudogenes, not more than three are shared between any two strains. Higher fractions of shared pseudogenes are observed in Sulfolobus and Methanosarcina, which can be ascribed, in part, to shared inactivated transposases that occur in multiple copies (Supplementary Tables S1 and S2).

Discussion
Comparative analyses of full genome sequences show that pseudogenes can occur at high frequencies and often outnumber the functional copies of genes. Even among bacteria, which have relatively small and streamlined genomes, up to half of the genome of some facultative pathogens, such as Rickettsia prowazekii and Mycobacterium tuberculosis (Andersson et al. 1998, Cole et al. 2001), is relegated to pseudogenes. In our analyses of 15 archaeal genomes representing eight genera, we found that the predicted fractions of pseudogenes range from 0.3% to 8.6%, corresponding to 4 to 260 pseudogenes per genome. These numbers are lower than the pseudogene contents found in most bacteria by the same approach; however, previous studies focused primarily on bacterial pathogens, which are known to have more severely degraded genomes than non-pathogenic bacteria. The pseudogene contents of archaea are likely to be more similar to those of non-pathogenic, free-living bacteria, although relatively few bacterial genomes have been evaluated by these methods.
The available annotations of many of the completed archaeal genomes identify no pseudogenes. Although the original publications of the genomes of M. jannaschii (Bult et al. 1996), S. solfataricus (She et al. 2001), S. acidocaldarius (Chen et al. 2005), T. acidophilum (Ruepp et al. 2000) and T. volcanium (Kawashima et al. 2000) each describe certain genes as being inactivated, they are not classified as such in the most commonly used public databases (e.g., NCBI, www.ncbi.nlm.nih.gov). Among the genomes that we analyzed, only the annotations of M. mazei and M. barkeri contain pseudogenes (112 and 136 compared with the 90 and 153 that we identified in these genomes, respectively). Similar to our observations for S. solfataricus, She et al. (2001) identified 43 partial transposases.
In contrast to NCBI, the IMG database (img.jgi.doe.gov) (Markowitz et al. 2006) provides re-annotated archaeal genomes and lists somewhat different numbers of total genes as well as pseudogenes for these genomes. Because the specific annotation can impact the identification of pseudogenes (IMG predicts on average 2.5% more protein coding genes than does NCBI), it is not possible to compare directly the numbers of pseudogenes listed at IMG with those detected in the present study. Perrodou et al. (2006) have pointed out an association between different genome annotation approaches and pseudogene prediction, noting that, in some genomes, pseudogenes might be attributable to sequencing errors. Because a minority of the pseudogenes is formed by the most common sequencing errors, such artifacts have probably contributed little to the sets of pseudogenes that we recognized.
Although it has been found that, in Pyrobaculum aerophilum, a putative deficiency in mismatch repair genes has created a variety of long and variable mononucleotide runs (Fitz-Gibbon et al. 2002), intraspecific variation in homopolymeric tract lengths is not a major source of gene inactivation in the archaeal genomes considered in this study. Liu et al. (2004) also applied a comparative approach to identify the pseudogenes in 64 sequenced genomes, including 11 archaeal species. Their analyses compared species from different domains (e.g., the Archaea versus the Bacteria), which can make assignments of orthology difficult. To refine their analyses and to safeguard against the inclusion of annotation artifacts, they limited their comparisons to protein coding genes with assigned functions, thereby excluding all those annotated as "hypothetical." Such restrictions, though valid, preclude accurate assessments of the full complement of pseudogenes in a genome. On one hand, the numbers of pseudogenes detected by such methods might be overestimated because orthologous genes from divergent taxa might remain functional despite extreme length or sequence differences. This may be further complicated by the occurrence of more split genes in archaea than in bacteria (Snel et al. 2000). On the other hand, most pseudogenes detected through comparisons of closely related genomes are functionally annotated as "hypothetical," which is not surprising since expendable genes are less likely to be among those whose functions have been assigned. Therefore, excluding these genes would result in underestimates of the actual number of pseudogenes. Despite these caveats, the pseudogene contents of archaea reported by Liu et al. (2004) are qualitatively similar to those identified with Ψ-Φ; both approaches indicate that archaeal genomes contain fewer pseudogenes than do pathogenic bacteria, and both indicate that archaeal pseudogenes are more decayed than bacterial pseudogenes. Also, in both our and Liu et al.'s analyses, S. solfataricus contains the largest number of pseudogenes of all surveyed archaeal genomes.
In addition to cataloging archaeal pseudogenes, our analyses provide insights into the mutational processes that occur within these genomes. In bacteria, strand slippage in mononucleotide repeats is considered a common mutagenic mechanism; however, few archaeal pseudogenes are attributable to variation in mononucleotide repeats. The association of strand slippage with immune evasion in bacterial pathogens (van der Woude and Baumler 2004) may explain the lower incidence of this type of frameshift in archaea. Strand slippage is evident in some of the mobile element-associated genes in S. acidocaldarius (IS1-family pseudogenes) and S. solfataricus (IS1048-family pseudogenes), many of which were modified by frameshifts in adenine mononucleotide repeats larger than six residues. Because mobile elements have recently been shown to create different functional proteins by strand slippage (Baranov et al. 2006), such events might not always indicate an inactivated gene. In addition, recent evidence has indicated that recoding may occur in Sulfolobus (Cobucci-Ponzano et al. 2003, 2005).
We find that the primary mechanism of gene inactivation in archaea is by truncation, which is responsible for the formation of over 30% of all pseudogenes. Among pseudogenes that have been inactivated by a single truncation event, unannotated pseudogenes are, on average, shorter than annotated pseudogenes, and non-positional pseudogenes are shorter than positional pseudogenes. Defunct transposable elements contain more inactivating mutations than other pseudogenes in S. tokodaii, S. solfataricus, M. acetivorans and M. barkeri. Whether the higher decay observed in transposable elements is caused by unsuccessful transposition events is unclear, although previous studies in Sulfolobus have shown that the rate of precise excision of mobile elements was low (Blount and Grogan 2005).
Pseudogene contents, as predicted by most comparative studies, still represent rather conservative estimates of the actual numbers of inactivated genes within a genome. Several classes of pseudogenes, such as those caused by missense mutations that abolish protein function as well as regulatory mutations that disrupt gene expression, will go undetected by this approach. Such comparative analyses also ignore strain-specific genes (i.e., ORFans) for which there are no homologous sequences available for comparison. Although many ORFans are thought to be functional (Daubin and Ochman 2004), they are unlikely to be essential to cell function and are prone to inactivation and loss.
Pseudogene detection by comparative analyses relies on the quality of the genome annotation, which can deviate substantially among different approaches (Brenner 1999). Increases in the availability of genome sequences from closely related species, which are still in short supply for archaea, have greatly facilitated genome annotation. As more genome sequences become available, we suspect that there will be less need to rely on experimental evidence to make accurate functional predictions about the majority of genes in a genome.

Figure 1. Relationships and features of archaeal genome sequences used in this study. Numbers at the nodes of the tree represent bootstrap values, with asterisks indicating 100% bootstrap support. The column labels indicating the numbers of pseudogenes represent total numbers, with the numbers in brackets indicating the positional pseudogenes. Horizontal lines separate different clades, and members within each clade were compared to detect pseudogenes.

Figure 3. Differences in the gene decay ratios (calculated as the ratio of pseudogenes eroded by multiple inactivating mutations to pseudogenes inactivated by a single mutation) for archaeal (filled bars) and bacterial (open bars) pseudogenes.

Figure 4. Cumulative size of insertion (open bars) and deletion (filled bars) events in positional (P) and non-positional (NP) pseudogenes of archaeal genomes. Numbers above the bars indicate numbers of pseudogenes on which the analyses are based.

Table 1. Numbers, proportions and characteristics of pseudogenes in the Archaea. Numbers depicted are the total number of predicted pseudogenes, with bracketed numbers indicating the number of positional pseudogenes.