The Complete Mitochondrial Genome of Yarrowia lipolytica

We report here the complete nucleotide sequence of the 47.9 kb mitochondrial (mt) genome from the obligate aerobic yeast Yarrowia lipolytica. It encodes, all on the same strand, seven subunits of NADH: ubiquinone oxidoreductase (ND1-6, ND4L), apocytochrome b (COB), three subunits of cytochrome oxidase (COX1, 2, 3), three subunits of ATP synthetase (ATP6, 8 and 9), small and large ribosomal RNAs and an incomplete set of tRNAs. The Y. lipolytica mt genome is very similar to the Hansenula wingei mt genome, as judged from blocks of conserved gene order and from sequence homology. The extra DNA in the Y. lipolytica mt genome consists of 17 group 1 introns and stretches of A+T-rich sequence, interspersed with potentially transposable GC clusters. The usual mould mt genetic code is used. Interestingly, there is no tRNA able to read CGN (arginine) codons. CGN codons could not be found in exonic open reading frames, whereas they do occur in intronic open reading frames. However, several of the intronic open reading frames have accumulated mutations and must be regarded as pseudogenes. We propose that this may have been triggered by the presence of untranslatable CGN codons. This sequence is available under EMBL Accession No. AJ307410.


Introduction
Mitochondrial genomes from different eukaryotic lineages show an astonishing degree of diversity in size, gene content and genome organization. In recent years, systematic sequencing of protist and fungal mitochondrial genomes, mainly through the efforts of the Organelle Megasequencing Program (Gray et al., 1998) and the Fungal Mitochondrial Genome Project (Paquin et al., 1997) has yielded valuable information on ancestral and derived mitochondrial (mt) genomes and their evolutionary relationships. While ancestral mt genomes, as found in the jakobid flagellate Reclinomonas americana (Lang et al., 1997), contain a large number of additional genes encoding components of their own transcription/translation machinery, derived mt genomes, as found in animals and higher fungi, are characterized by a much reduced gene content, which typically consists of the genes encoding hydrophobic subunits of respiratory chain complexes, large and small ribosomal RNAs and a full or partial set of tRNAs (Gray et al., 1999).
Among the mt genomes of ascomycetous fungi, there is some variability with respect to gene content. Seven genes encoding hydrophobic subunits of NADH : ubiquinone oxidoreductase (complex I) are present in most cases but absent in species that lack this enzyme, such as Saccharomyces cerevisiae (Foury, 1998) and Schizosaccharomyces pombe (Lang, 1993). The ATP9 gene is present in the mt genome of most species, as in S. cerevisiae (Foury, 1998), but not in Podospora anserina (Cummings et al., 1990). Other fungi, such as Neurospora crassa and Aspergillus nidulans, have both a mitochondrial and a nuclear gene for ATP9, with the nuclear gene being the active one (van den Boogaart et al., 1982; Brown et al., 1985). Some fungal mt genomes also contain additional genes, encoding accessory ribosomal proteins such as VAR1 in S. cerevisiae (Hudspeth et al., 1982), Torulopsis glabrata (Ainley et al., 1985) and H. wingei (Okamoto et al., 1994; Sekito et al., 1995) or SP5 in N. crassa (Collins, 1993), or the RNA component of RNase P in S. cerevisiae (Foury, 1998). The mould mitochondrial genetic code appears to be well conserved, with the UGA codon being read as tryptophan as the only exception from the universal genetic code. Two notable exceptions are mitochondria from S. cerevisiae, where AUA is read as methionine and CUN is read as threonine, and from Sz. pombe, where, as in plant mitochondria, the universal genetic code is used (Dirheimer and Martin, 1990; Lang, 1993; Paquin et al., 1997).
Yarrowia lipolytica is an obligate aerobic, ascomycetous yeast which can efficiently be manipulated by classical and molecular genetic techniques. It can utilize a range of unusual hydrophobic carbon sources, including alkanes like n-hexadecane (Barth and Gaillardin, 1996, 1997). Y. lipolytica's proficiency in secreting high amounts of an alkaline extracellular protease encoded by the XPR2 gene has been exploited for the production of heterologous proteins under the control of XPR2 hybrid promoters (Madzak et al., 1999, 2000).
In contrast to S. cerevisiae, which is adapted to alcoholic fermentation (Lagunas, 1986), mitochondrial respiration is essential for Y. lipolytica. Also in contrast to S. cerevisiae, proton-translocating NADH : ubiquinone oxidoreductase (complex I) is present in the respiratory chain of Y. lipolytica mitochondria (Kerscher et al., 1999). Owing to its genetic versatility and superior protein stability, Y. lipolytica has recently been established as a powerful new model system for the analysis of complex I (Ahlers et al., 2000; Djafarzadeh et al., 2000). In all eukaryotes containing complex I, seven of its hydrophobic subunits are mitochondrially encoded. Since so far only a 6.6 kb SpeI-BglII fragment, containing the genes for ATP8, ATP6, COX3, ND4 and several tRNAs, has been sequenced (GenBank Accession No. L15359) and functionally characterized (Matsuoka et al., 1994a, 1994b), we set out to analyse the complete sequence of the mt genome from Y. lipolytica.

Materials and methods
The Y. lipolytica mt DNA sequence was generated from two closely related strains, wild-type isolate W29 and laboratory strain E150 (MatB,. Owing to the fact that the predecessors of strain E150 had been made isogenic by several rounds of backcrossing to strains derived from W29 (Barth and Gaillardin, 1996), the mt DNA sequences of W29 and E150 were found to be virtually identical. There is up to 1% divergence in the non-coding part of introns and in intergenic regions, much less in exonic sequences. The sequence published here is the one found in strain W29. Shotgun sequencing of genomic DNA from wild-type strain W29 was as described in Artiguenave et al. (2000) and Tekaia et al. (2000). This resulted in six large contigs containing up to 200 sequence reads and covering almost 96% of the mt genome. Gaps were closed by polymerase chain reaction (PCR) sequencing of PCR products generated with W29 total DNA.
Mitochondrial DNA from strain E150 was isolated from total DNA by caesium chloride density gradient centrifugation in the presence of bisbenzimide (Hoechst 33258), as described in (Kraiczy et al., 1996). Individual HindIII fragments were subcloned into pBluescriptSK x (Stratagene) and sequenced using the BigDye 2 Terminator Cycle Sequencing Ready Reaction Kit (PE Applied Biosystems) and analysed on an ABI 310 genetic analyser (PE Applied Biosystems); 88% of the Y. lipolytica mt genome was sequenced in this way.
For N-terminal sequencing of Y. lipolytica mt proteins, mt membranes were prepared as described (Kerscher et al., 1999) and subjected to preparative two-dimensional electrophoresis (blue native polyacrylamide gel electrophoresis, BN-PAGE; denaturing polyacrylamide gel electrophoresis, SDS-PAGE). Individual proteins were electroblotted onto Immobilon P membranes, incubated for 24 h at 37°C in a 1 : 1 mixture of trifluoroacetic acid and methanol for deformylation and sequenced directly using a 473 protein sequencer (Applied Biosystems) as described (Arnold et al., 1998).

Composition
The mitochondrial genome of Y. lipolytica strain W29 consists of a circular molecule with a size of 47.9 kb. This is intermediate between the compact 20 kb and 27.7 kb mt genomes of T. glabrata (Clark-Walker, 1992) and H. wingei (Okamoto et al., 1994; Sekito et al., 1995) and the large 100.3 kb mt genome of P. anserina (Cummings et al., 1990). All genes are encoded on the same strand, as commonly observed among ascomycetous fungi, with the notable exception of S. cerevisiae, where a single tRNA gene (thr1) resides on the opposite strand (Foury, 1998).
The coding strand has a purine content of 53.0%. A BglII site upstream of the rnl gene was chosen as the start position for sequence numbering (see Figure 1).
Of the Y. lipolytica mt genome, 26.1% codes for 14 subunits of respiratory chain complexes (exonic ORFs only) and 13.2% codes for the large and small ribosomal RNAs and a total of 27 functional tRNAs. The G+C content is 22.7% for the whole mt genome and 25.8% for the exonic ORFs.
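The composition figures quoted above (G+C content, purine content of the coding strand) are simple base-frequency ratios. The following minimal sketch shows one way such values could be computed; the function name and the toy sequence are illustrative assumptions and are not taken from the deposited sequence or from the original analysis.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Return G+C and purine (A+G) content of a DNA strand, in percent.

    Only unambiguous bases (A, C, G, T) are counted; anything else is ignored.
    """
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return {
        "gc_percent": 100.0 * (counts["G"] + counts["C"]) / total,
        "purine_percent": 100.0 * (counts["A"] + counts["G"]) / total,
    }


# Toy sequence for illustration only (hypothetical, not from AJ307410).
print(base_composition("AATTAAGGCATATAAATA"))
```

Applied to the coding strand of the full sequence, a calculation of this kind would be expected to give values comparable to those reported in the text.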

Protein encoding genes
The Y. lipolytica mt genome has a typical gene content, since it holds the 14 hydrophobic subunits of respiratory chain complexes commonly encoded in the mt genome of ascomycetous fungi. ND1-6 and ND4L are subunits of NADH : ubiquinone oxidoreductase (complex I), COB is the apocytochrome of ubiquinone : cytochrome oxidoreductase (complex III), COX1-3 are subunits of cytochrome oxidase (complex IV), and ATP6, ATP8 and ATP9 are subunits of ATP synthetase (complex V). Additional genes found in the mt DNA of other fungi, such as genes encoding ribosomal proteins or the RNA component of RNAse P, could not be detected. The sequence coordinates of the protein encoding genes in the Y. lipolytica mt genome are summarized in Table 1. All proteins have homologous counterparts in the mt genome of other ascomycetous fungi. Identity and similarity scores are summarized in Table 2. It should be noted, however, that due to the possibility of horizontal transfer of mitochondrial DNA between species (Marinoni et al., 1999) and due to the unusual mode of mitochondrial inheritance (Piškur, 1994), nuclear and mitochondrial genomes may not always share the same evolutionary histories.
Sequence alignments also strongly supported the notion that the usual mould mitochondrial genetic code is used in the Y. lipolytica mt genome. This was further confirmed by the N-terminal sequences of COB, COX1, COX2, COX3 and ATP8 (see Table 3), which showed translation of a CUU codon to leucine and translation of an AUA codon to isoleucine. COX2 was found to be proteolytically processed by removal of six amino acids from the N-terminus. Translation of COX1, COX3 and ATP8 starts at a standard AUG codon. This is probably true for the other genes as well. In the case of ND6, ND1, ATP6, COX3, ND4, ATP9, ND4L, COB, ND2 and ND3, the initiator AUG is preceded by a short stretch rich in adenine. In terms of conserved gene order, the Y. lipolytica mt genome is most similar to the H. wingei mt genome. Five gene clusters, namely ND6/ND1, COX1/ATP8/ATP6, ND4/ATP9/COX2, ND4L/ND5 and COB/ND2/ND3, are conserved between these two organisms. Interestingly, the COX1/ATP8/ATP6 cluster is also present in the genus Saccharomyces, where it represents a transcription unit (Groth et al., 2000). This cluster appears to represent an ancient organization, presumably being present already in the common progenitor of Yarrowia and Saccharomyces.
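As an illustration of the genetic code discussed above, the sketch below translates a nucleotide sequence with a codon table that differs from the universal code only in reading UGA as tryptophan (this appears to correspond to NCBI translation table 4). The table construction, the function name and the toy input are assumptions made for this example, not part of the original analysis.

```python
from itertools import product

BASES = "TCAG"
# Universal genetic code in TCAG order, with the single change TGA -> W
# (tryptophan), as described in the text for the mould mt code.
AMINO_ACIDS = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
MOULD_MT_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate an in-frame DNA or RNA sequence up to the first stop codon.

    Codons containing ambiguous bases are rendered as 'X'.
    """
    dna = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = MOULD_MT_CODE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy ORF (hypothetical): AUG UGA CUU AUA UAA
print(translate("AUGUGACUUAUAUAA"))
```

In this toy example UGA is read as tryptophan, CUU as leucine and AUA as isoleucine, in line with the codon assignments reported above.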

Introns and intronic ORFs
Fungal mt introns fall into two groups. Group 1 introns are characterized by a common secondary structure, which is due to base pairings between conserved internal sequence elements and the unique mechanism of their splicing reaction, which involves the attack of a guanine nucleotide at the 5′ end of the intron. Many group 1 introns contain ORFs that may be contiguous with the upstream exon or may be free-standing. These encode maturase or maturase/endonuclease proteins which may assist in the splicing reaction or allow homing into cognate intronless alleles. Most group 1 endonucleases belong to one of four families, characterized by their possession of sequences homologous to the LAGLIDADG, GIY-YIG, H-N-H or His-Cys box motifs. The mobility of group 2 introns depends on intron-encoded proteins with maturase, endonuclease and reverse transcriptase activity. For a review, see Belfort and Perlman (1995).
In the Y. lipolytica mt genome, the genes for ND1, COX1, COX3, ND5 and COB are interrupted by a total of 17 introns, all of which belong to group 1 (see Table 4). Splice junctions were deduced from the resulting exonic protein sequences. All conform to the rule that invariably the last base of the upstream exon is a pyrimidine and the last base of the intron is a guanine (Davies et al., 1982), with the notable exception of the sixth intron in COX1. In this case, a different splice acceptor located 12 bp upstream of the proposed one might be used, which would result in the addition of the amino acid sequence YKIL between exons six and seven.
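The splice junction rule cited above can be stated as a simple check on two bases. The sketch below is only an illustration of that rule; the function name and the toy sequences are hypothetical, and the splice junctions in this work were deduced from the exonic protein sequences rather than by such a test.

```python
def follows_junction_rule(upstream_exon: str, intron: str) -> bool:
    """Check the group 1 splice junction rule (Davies et al., 1982):
    the last base of the upstream exon is a pyrimidine (C or T/U)
    and the last base of the intron is a guanine.
    """
    if not upstream_exon or not intron:
        raise ValueError("exon and intron sequences must be non-empty")
    last_exon_base = upstream_exon[-1].upper().replace("U", "T")
    last_intron_base = intron[-1].upper()
    return last_exon_base in "CT" and last_intron_base == "G"


# Toy sequences (hypothetical): the exon ends in T, the intron ends in G.
print(follows_junction_rule("ATGGCT", "AAATTG"))
```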
There is no intron in the rnl gene for the large ribosomal RNA that would correspond to the ω intron of S. cerevisiae mitochondria (Dujon, 1980). This had already been observed in Southern hybridization experiments (Jacquier and Dujon,   Table 4).
Interestingly, many of the intronic ORFs have accumulated mutations that are either frameshifts, in-frame stop codons or small insertions and must be regarded as pseudogenes (see Figure 1). The functionality of two intronic ORFs that are not contiguous with the upstream exon (COX1-I6 and COX3-I1) is doubtful. It appears that the maturase function for the introns containing defective ORFs must be provided in trans by intron-encoded proteins that are still functional. Group 2 introns could not be found. Consequently, intron loss, which is believed to depend on the reverse transcriptase activity of group 2 intron-encoded proteins (Belfort and Perlman, 1995), should not be feasible in the mt genome of the Y. lipolytica strains studied here.

rRNAs and tRNAs
The genes for the large and small ribosomal RNAs (rnl, nt position 72-3042 and rns, nt position 9333-10923) and their 5′ and 3′ boundaries were identified by their homology to their counterparts from H. wingei, P. anserina and other ascomycetous fungi. Twenty-seven tRNA genes could be detected in the mt genome of Y. lipolytica. These are depicted in Figure 2. Interestingly, tRNA Cys from Y. lipolytica mitochondria, like tRNA Cys from Sz. pombe mitochondria (Dirheimer and Martin, 1990), can form a highly abnormal cloverleaf structure. A putative tRNA pseudogene, termed ψSer3 (see Figure 3), is found in the third COB intron, destroying the intronic ORF for a LAGLIDADG endonuclease.
The set of functional tRNAs in the Y. lipolytica mt genome is larger than the minimal set of 24 required for the translation of the mould mt genetic code, assuming that, as first proposed for N. crassa mitochondria, an unmodified U in the first anticodon position ('wobble position') can pair with all four bases, while G can pair with U or C and a modified U can pair with A or G (Heckman et al., 1980; Dirheimer and Martin, 1990). Apart from the two methionine tRNAs, the first of which by virtue of its homology to S. cerevisiae tRNA f-Met (Canaday et al., 1980) seems to be specific for the initiation codon, there is only one additional pair (Leu2, Leu3) with identical anticodon sequences. Since tRNA Leu2 lacks several hydrogen bonds in the D-, anticodon- and Ψ-stems, its functionality is uncertain. Unlike the corresponding tRNA(UAG) from S. cerevisiae, where an insertion mutation has produced an unusual eight nucleotide anticodon loop and changed the aminoacylation specificity from leucine to threonine (Dirheimer and Martin, 1990), both tRNA Leu2 and tRNA Leu3 from Y. lipolytica possess the normal seven nucleotide anticodon loops. This is consistent with their function as leucine acceptors. In two cases, extra tRNAs are present, whose anticodons in the wobble positions differ from the expected sequence (Tyr2, Lys2). The case of tRNA Tyr2 is particularly puzzling, since if, as commonly found in fungal mt tRNAs (Dirheimer and Martin, 1990), the A in the first anticodon position were deaminated to I, this would result in a tRNA able to read through stop codons UAA and UAG. We conclude that Y. lipolytica mt tRNA Tyr2, like the S. cerevisiae tRNA Arg specific for CGN codons, is one of the rare examples of an organellar tRNA containing an unmodified A in the first anticodon position. Another remarkable species is tRNA Ile2, which has a UAU anticodon. Although this tRNA is far from being perfectly base-paired, it appears to be functional.
This finding is in contrast to the situation seen in S. cerevisiae mitochondria, where AUA is read as methionine by a tRNA(CAU), and in Sz. pombe mitochondria, where AUA is read as isoleucine by a tRNA(CAU) that probably contains a modified C in the first anticodon position allowing C : A wobble (Dirheimer and Martin, 1990). It should also be noted that the tRNA Ile2 described here is different from the putative AUA-specific tRNA proposed to lie between the genes for ATP6 and ATP8 (Matsuoka et al., 1994b). This sequence can indeed form a cloverleaf-like structure but, judged from the most unusual length and composition of its Ψ-loop, it seems unlikely that it can function as a tRNA.
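The wobble rules assumed above for the first anticodon position can be sketched as a small lookup. In the following example, the pairing of C with G only and of unmodified A with U only is plain Watson-Crick pairing added as an assumption; the `modified_u` flag and the function name are hypothetical. Special cases discussed in the text, such as inosine or a modified C, are deliberately not modelled.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Third codon bases read by each base in the first anticodon position.
WOBBLE = {
    "U": "ACGU",   # unmodified U pairs with all four bases
    "U*": "AG",    # modified U pairs with A or G
    "G": "CU",     # G pairs with U or C
    "C": "G",      # assumption: Watson-Crick pairing only
    "A": "U",      # assumption: unmodified A, Watson-Crick pairing only
}


def codons_read(anticodon: str, modified_u: bool = False) -> list[str]:
    """Return the codons (5'->3') read by an anticodon written 5'->3'."""
    ac = anticodon.upper().replace("T", "U")
    if len(ac) != 3 or set(ac) - set("ACGU"):
        raise ValueError("anticodon must consist of three of A, C, G, U")
    first, second, third = ac
    # Codon and anticodon pair antiparallel: anticodon position 3 pairs
    # with codon position 1, anticodon position 1 with codon position 3.
    prefix = COMPLEMENT[third] + COMPLEMENT[second]
    key = "U*" if (first == "U" and modified_u) else first
    return [prefix + base for base in WOBBLE[key]]


print(codons_read("UAG"))  # anticodon of the tRNA(UAG) mentioned in the text
```

Under these simplified rules, an anticodon UAG with an unmodified U reads the whole CUN family, whereas an anticodon CAU reads only AUG.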
Most interestingly, the set of tRNAs in the Y. lipolytica mt genome appears to be incomplete. No tRNA Arg able to recognize CGN codons could be detected. We therefore analysed codon usage within the protein coding exons and the presumably functional intronic ORFs, i.e. those that are not pseudogenes and are contiguous with the upstream exon (see Table 5). Generally, codon usage in intronic ORFs is slightly less biased towards codons ending in A or U. The most striking observation, however, was that while CGN codons could not be found in exonic ORFs, they do occur in three intronic ORFs. CGC is found once in COX1-I5, while CGA is found once in ND5-I1 and three times in ND5-I2. We assume that the presence of 'forbidden' codons makes these intronic ORFs untranslatable, further reducing the number of functional intronic ORFs in the Y. lipolytica mt genome (see Figure 1). Similarly, it has been reported that in the mt genome of S. cerevisiae, CGN codons are not found in the genes COB, COX1, COX2, COX3, ATP6 and ATP9 (Bonitz et al., 1980). This is also true for ATP8, while in VAR1 (Hudspeth et al., 1982), CGG and CGU are used once each. In intronic ORFs, arginine may be encoded as CGN, but such codons are much less frequent than AGR codons. A tRNA specific for CGN codons is present in the mt genome of S. cerevisiae, but appears to be a minor species compared to the tRNA that decodes AGR (Dirheimer and Martin, 1990). Notably, the anticodon of S. cerevisiae mt tRNA Arg is ACG, raising the question of how it interacts with CGA, CGG and CGC codons. Possibly, ORFs of the S. cerevisiae mt genome containing such codons can be translated with low efficiency only, if at all. This is reminiscent of the fact that in Sz. pombe mitochondria, out of a total of three intronic ORFs, two contain UGA codons that normally function as stop codons in this organism (Dirheimer and Martin, 1990; see also http://megasun.bch.umontreal.ca/People/lang/FMGP/FMGP.html).
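The search for CGN codons described above amounts to counting in-frame codons that begin with CG. The following minimal sketch illustrates this; the function name and the toy ORF are hypothetical, and the sketch is not a description of the software actually used for Table 5.

```python
from collections import Counter


def cgn_codon_counts(orf: str) -> Counter:
    """Count in-frame CGN (arginine) codons in an ORF.

    The sequence is read in frame from its first base; CG dinucleotides
    that span two codons are therefore not counted.
    """
    dna = orf.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return Counter(c for c in codons if c[:2] == "CG" and c[2] in "ACGT")


# Toy ORF (hypothetical): ATG CGA AGA CGC TAA
print(cgn_codon_counts("ATGCGAAGACGCTAA"))
```

An ORF for which the returned counter is empty contains no CGN codons, as reported here for all exonic ORFs.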
The absence of a tRNA specific for CGN codons in the mt genome of Y. lipolytica could in theory be compensated by import of a nuclear-encoded tRNA into mitochondria, as has been reported for a variety of lower fungi and also the basidiomycete Schizophyllum commune (Paquin et al., 1997), or by post-transcriptional editing of tRNAs, which has been demonstrated to occur in plants (Weber et al., 1990) and in marsupials (Börner et al., 1996). However, since CGN codons only occur in intronic ORFs in the Y. lipolytica mt genome, there is no need to invoke such hypothetical mechanisms. Rather, we propose that the loss of a tRNA able to read CGN codons initially rendered many of the intronic ORFs in the Y. lipolytica mt genome nonfunctional and that some of these subsequently were turned into pseudogenes.

Intergenic regions, genome instability
Most of the DNA outside exonic and intronic ORFs consists of A+T-rich sequences of low complexity. For example, in the single intron in ND1, six direct repeats of the tetranucleotide sequence AATT (starting at position 7064) are followed by ten direct repeats of the pentanucleotide sequence TATGT. There is evidence for plasticity of the latter repeat from comparison with strain E150, where it is found nine times only. The A+T-rich intergenic regions of the Y. lipolytica mt genome are interspersed with GC clusters, many of which are arranged as inverted repeats (see Table 6). Similar to those found in the mt genome of S. cerevisiae (Foury, 1998) and other fungi, these may function as mobile genetic elements. Two 43 bp long, almost perfectly palindromic GC-rich clusters with identical sequences are found in the second intron of the COB gene, where they destroy the ORF for the intron-encoded maturase. Interestingly, both of these putative minitransposon sequences are flanked by adenine pairs, which may indicate that small direct repeats are generated at the target site.
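Direct repeats and palindromic clusters of the kind described above can be located with simple string operations. The sketch below shows one possible approach; the function names and the toy sequence, which merely mimics the repeat pattern described for the ND1 intron, are illustrative assumptions and do not reproduce the actual sequence.

```python
import re


def longest_tandem_run(seq: str, unit: str) -> tuple[int, int]:
    """Return (0-based start, copy number) of the longest run of
    consecutive copies of `unit` in `seq`, or (-1, 0) if none is found."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    best = (-1, 0)
    pattern = f"(?:{re.escape(unit.upper())})+"
    for match in re.finditer(pattern, seq.upper()):
        copies = len(match.group(0)) // len(unit)
        if copies > best[1]:
            best = (match.start(), copies)
    return best


def is_palindromic(seq: str) -> bool:
    """True if the sequence is identical to its reverse complement."""
    s = seq.upper()
    return s == s.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Toy sequence (hypothetical): six AATT units followed by ten TATGT units.
toy = "GG" + "AATT" * 6 + "TATGT" * 10 + "CC"
print(longest_tandem_run(toy, "AATT"), longest_tandem_run(toy, "TATGT"))
print(is_palindromic("GGCGCC"))
```

Comparing the copy numbers returned for two strains would reveal a difference such as the one reported between W29 and E150 for the TATGT repeat.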
Rearrangements involving tRNA genes may also represent a source of genome instability. This is evidenced by a direct repeat involving the last 23 bp of tRNA Tyr1 and 62 bp of downstream DNA (positions 5045-5129). After a 187 bp long spacer, this sequence is repeated with only one mismatch within 85 bp.

Future perspectives
Little is known about the sequence motifs that are involved in replication and expression of the Y. lipolytica mt genome. It has been shown by primer extension analysis (Matsuoka et al., 1994b) that transcription of a polycistronic mRNA containing the genes for ATP8, ATP6, COX3 and ND4 starts at the nonanucleotide motif ATATAAATA,
Table 6. GC-rich clusters in the mt genome of Y. lipolytica. Inverted repeats are marked with arrows.