A Re-Annotation of the Saccharomyces cerevisiae Genome

Discrepancies in gene and orphan number indicated by previous analyses suggest that S. cerevisiae would benefit from a consistent re-annotation. In this analysis three new genes are identified and 46 alterations to gene coordinates are described. 370 ORFs are defined as totally spurious ORFs which should be disregarded. At least a further 193 genes could be described as very hypothetical, based on a number of criteria. It was found that disparate genes with sequence overlaps over ten amino acids (especially at the N-terminus) are rare in both S. cerevisiae and Sz. pombe. A new S. cerevisiae gene number estimate with an upper limit of 5804 is proposed, but after the removal of very hypothetical genes and pseudogenes this is reduced to 5570. Although this is likely to be closer to the true upper limit, it is still predicted to be an overestimate of gene number. A complete list of revised gene coordinates is available from the Sanger Centre (S. cerevisiae reannotation: ftp://ftp/pub/yeast/SCreannotation).


Background
The publication in 1996 of the first complete eukaryotic genome sequence, that of Saccharomyces cerevisiae, heralded a new era in biology (Goffeau et al., 1996). This resource not only benefited those investigating S. cerevisiae, but also enabled inferences from the functional data to be transferred to a diverse range of other organisms. Unexpectedly, a significant proportion (56%) of annotated genes had not been studied previously, despite more than 50 years of traditional biochemistry and genetics (Oliver et al., 1992; Oliver, 1996; Mewes et al., 1997). This observation stimulated the application of functional genomics technologies to characterise these genes and their products, either gene-by-gene in small laboratories, or on a larger scale in some research institutes (Hieter and Boguski, 1997).
In the five years since the S. cerevisiae genome was sequenced, the majority (70%) of the predicted genes have been assigned an initial functional characterisation in the Yeast Protein Database (YPD, Proteome Inc. http://www.proteome.com/databases/index.html). Establishing the functional inter-relationships between all the genes in a genome requires, in the first instance, the assignment of genes to preliminary functional classes. These initial assignments will authenticate predicted genes as coding entities and partition the data into categories for subsequent biological analyses. However, it will be difficult to assess when this milestone has been reached since the exact number of genes in S. cerevisiae is still unclear: the Munich Information Center for Protein Sequences (MIPS) database has a protein complement of 6368 (http://www.mips.biochem.mpg.de/proj/yeast), the Saccharomyces Genome Database (SGD) has 6310 (http://genome-www.stanford.edu/Saccharomyces/), and YPD has 6142 as of 26 January 2001.
It is likely that a significant cause of the discrepancies between these gene numbers is the presence of small, fortuitously occurring ORFs (open reading frames), which are notoriously difficult to distinguish from real genes (Dujon, 1996). In the original S. cerevisiae annotation, only ORFs greater than 100 amino acids in size were considered. This threshold was imposed in order to reduce the chance of missing small proteins without overprediction due to the statistically expected frequency of small ORFs (Sharp and Cowe, 1991). Those without assigned function or homologues were designated sequence orphans (Dujon, 1996). As genome sequencing proceeded, the ratio of orphans to ORFs with homologues increased rapidly, so much so that this phenomenon was termed 'The mystery of orphans' (Dujon, 1996).
The existence of relatively high numbers of orphans can only be attributed to one or a combination of the following: 1. They may simply be spurious ORFs. In S. cerevisiae, a number of predicted genes are also completely or substantially overlapping with defined coding features and should therefore be disregarded. 2. They may arise due to the acquisition of novel species-specific functions. 3. They escape functional characterisation by homology because they are rapidly evolving. 4. Identifiable homologues in other organisms exist, but these have not yet been sequenced.
The question of S. cerevisiae gene number has been addressed many times, with differing outcomes. Mackiewicz et al. (1999) estimated the total number of protein coding ORFs to be 4800, based on their sequence properties. Zhang and Wang (2000) calculated the likely number to be ≤5645, based on the assumption that unknown genes have similar statistical properties to known genes. As part of the Genolevures project, Blandin et al. (2000) performed a consistent re-annotation of the S. cerevisiae genome using uniform criteria, revealing 50 possible novel genes and 26 gene extensions. They proposed a protein coding gene set of at least 5600 genes. As part of the same initiative, Malpertuy et al. (2000) estimated that the S. cerevisiae genome contains 5651 actual protein coding genes (including the 50 new predictions), and that the public databases contain 612 predicted ORFs that are not protein coding.
The availability of an additional yeast genome, that of Schizosaccharomyces pombe (fission yeast), which has 99.5% of its coding sequence annotated and deposited in the EMBL database (manuscript in preparation), will allow the comparison of the complete genomes and proteomes of two well-studied unicellular eukaryotes, which diverged around 330 million years ago (Berbee and Taylor, 1993).

Aims
The discrepancies in gene and orphan numbers proposed by previous analyses suggested that S. cerevisiae would benefit from a consistent reannotation, applying new analytical methods and incorporating the data which have become available over the last four years. In doing so, we wished to achieve: 1. The refinement of gene complement. 2. The classification of orphans into hypothetical, very hypothetical, and spurious ORFs which should be disregarded. 3. The identification of gene prediction errors. 4. The identification of new genes.
The Sz. pombe genome annotation effort has benefited immensely from the availability of the complete genome of S. cerevisiae. The analysis methods used for the Sz. pombe genome combine ab initio gene prediction algorithms and homology search results with rigorous manual inspection of biological context (Xiang et al., 2000). In addition, consistency checks using available cDNAs and ESTs have been routinely performed, and new experimental data from the fission yeast community immediately incorporated into the dataset. We believe that these methods provide an accurate, detailed gene set for this organism. The Sz. pombe analysis procedure has been applied to S. cerevisiae in order to define an up-to-date non-redundant gene set with consistent annotations and a new estimate of orphan numbers.

DNA sequences
The sequences of the 16 S. cerevisiae chromosomes, and associated ORF translations, were downloaded from SGD on the 16th November 2000. ORF coordinates were then converted into EMBL feature table format and imported into the Artemis sequence analysis and annotation tool (Rutherford et al., 2000).
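The conversion of ORF coordinates into EMBL feature table lines can be pictured with a minimal sketch. It assumes each ORF is supplied as a name, 1-based inclusive start and end positions, and a strand; the `Orf` record, the function name and the choice of the `/gene` qualifier are illustrative assumptions and are not the scripts actually used for this analysis.

```python
from dataclasses import dataclass


@dataclass
class Orf:
    """Hypothetical ORF record; field names are assumptions for illustration."""
    name: str    # systematic ORF name
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def to_embl_feature(orf: Orf) -> str:
    """Format one ORF as a CDS entry in EMBL feature table layout.

    The feature key starts at column 6 and the location (or qualifier)
    starts at column 22, as in EMBL flat files.
    """
    location = f"{orf.start}..{orf.end}"
    if orf.strand == "-":
        # Reverse-strand features are written with the complement operator.
        location = f"complement({location})"
    key_line = "FT   " + "CDS".ljust(16) + location
    qualifier_line = "FT" + " " * 19 + f'/gene="{orf.name}"'
    return key_line + "\n" + qualifier_line


if __name__ == "__main__":
    # Placeholder names and coordinates, not real annotation.
    print(to_embl_feature(Orf("ORF1", 335, 649, "+")))
    print(to_embl_feature(Orf("ORF2", 1807, 2169, "-")))
```

Lines in this layout can be appended to a sequence entry and opened in Artemis alongside the DNA sequence.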

V. Wood et al.

Analysis procedure
A number of standard analysis tools were used to assist the interpretation of the sequence data (as applied to the Sz. pombe genome) (Xiang et al., 2000). Searches were performed against public databases (SWISS-PROT and TrEMBL (Bairoch and Apweiler, 1999), EMBL (Stoesser et al., 1999), Pfam (Bateman et al., 1999), and PROSITE (Bairoch, 1994)) using standard software (BLAST (Altschul et al., 1990), MSPcrunch (Sonnhammer and Durbin, 1994), tRNAscan (Lowe and Eddy, 1997), FASTA (Pearson and Lipman, 1988) and Genewise (Birney et al., 1996)), to complete a series of automated analyses. This enabled annotated DNA and protein features to be confirmed. Other elements not included in the SGD annotation (experimentally identified snoRNAs and other cellular RNAs, omitted LTRs, and protein domains), were also mapped onto the sequence using in-house Perl scripts. De novo gene predictions were not performed as part of this analysis.

New genes
In the Sz. pombe genome, more than 300 genes have been identified which are conserved at the protein level in other organisms, but absent from the S. cerevisiae dataset (manuscript in preparation). Some of these were small genes (70-150 amino acids); TBLASTN searches were conducted to determine whether these small genes had been omitted from the initial S. cerevisiae gene predictions.

New gene coordinates
Within the annotation tool Artemis (Rutherford et al., 2000), FASTA alignments were performed on existing gene predictions, to assess their accuracy. Overlapping ORFs were subject to systematic manual inspection to determine whether the correction of frameshifts or sequencing errors could extend homology, by merging existing genes or increasing their length.

Disregarded spurious ORFs, overlapping with real genes
ORFs which have all, or the majority of their translation overlapping with other annotated features, were individually assessed for similarity to all organisms, as described in New gene coordinates above, together with experimental data if available.
For ORFs to be considered as spurious, they had to meet all of the following criteria: 1. Small size (35-250 amino acids). 2. Absence of similarity to known proteins. 3. Absence of functional data which could not have been generated by the real overlapping gene. 4. Greater than 25% overlap at the N-terminus or 50% overlap at the C terminus with another coding feature; overlap with another feature at both ends; or ORF containing a tRNA.
Transposon fragments were also removed.
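The four criteria above can be expressed as a single predicate. The sketch below is only an illustration under stated assumptions: the `OrfEvidence` record and its field names are hypothetical, overlaps are assumed to be expressed as a fraction of the ORF's own length, and both thresholds are treated as strict ("greater than"). In the analysis itself each ORF was assessed individually by manual inspection, which a predicate like this cannot replace.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    """Hypothetical summary of the evidence for one ORF (names are assumptions)."""
    length_aa: int
    has_similarity: bool               # similarity to known proteins
    has_independent_function: bool     # functional data not explicable by the overlapping gene
    n_term_overlap: float              # fraction of the ORF overlapped at its N-terminus (0 to 1)
    c_term_overlap: float              # fraction of the ORF overlapped at its C-terminus (0 to 1)
    contains_trna: bool = False


def is_spurious(orf: OrfEvidence) -> bool:
    """Return True only if all four criteria for a spurious ORF are met."""
    small = 35 <= orf.length_aa <= 250                      # criterion 1
    no_similarity = not orf.has_similarity                  # criterion 2
    no_function = not orf.has_independent_function          # criterion 3
    overlapping = (                                         # criterion 4
        orf.n_term_overlap > 0.25
        or orf.c_term_overlap > 0.50
        or (orf.n_term_overlap > 0 and orf.c_term_overlap > 0)
        or orf.contains_trna
    )
    return small and no_similarity and no_function and overlapping


if __name__ == "__main__":
    candidate = OrfEvidence(120, False, False, n_term_overlap=0.4, c_term_overlap=0.0)
    print(is_spurious(candidate))
```

Requiring all four conditions keeps the rule conservative: an ORF with any similarity to a known protein, or with functional data of its own, is never discarded by this test.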

Very hypothetical ORFs
In Sz. pombe, 177 ORFs which are considered unlikely to be coding but cannot yet be dismissed as spurious have been assigned as very hypothetical according to the following criteria: 1. Small size (100-250 amino acids). 2. Absence of similarity to other known proteins. 3. Overlap with other features, particularly at the N-terminus, where they might interfere with promoters (the overlaps in these cases are smaller than those observed in disregarded ORFs). 4. Extreme GC content.
The annotation of Sz. pombe adequately discriminates between very hypothetical proteins and real genes and this approach has been applied to a re-annotation of the S. cerevisiae genome.

New genes identified
Three new genes were identified: 1. YBL071W-a, a hypothetical conserved protein (simultaneously identified by Blandin et al.). 2. YAL044W-a, the homologue of Sz. pombe uvi31. 3. YDL085C-a, the homologue of the human 4F5S disease-associated gene. The new genes and coordinates are listed in Table 1.

New coordinates (merged or extended genes)
The complete list of 46 proposed alterations to gene coordinates is presented in Table 1. Some of these changes have already been confirmed experimentally and deposited in the SWISS-PROT database (Bairoch and Apweiler, 1999) but may correspond to mutations in the sequenced strain. However, fragments pertaining to the same sequence should be represented as a single feature in the public databases. In addition to increased homology, data from YPD indicate identical phenotypes and expression patterns for some of these proposed merges. For example, PRM7+YDL038+YDL037 have the same transcript profile (repressed by methylmethanesulphonate). Some of these proposed

Disregarded (spurious) ORFs
Using the criteria described in Methods, 370 ORFs were disregarded (Table 2; see also http://www.sanger.ac.uk/Projects/S_cerevisiae/spurious.shtml). In agreement with Blandin et al. (2000), the ORFs which correspond to SAGE tags within LTRs have been reclassified as spurious.

Orphans-very hypothetical
The discrimination between S. cerevisiae very hypothetical proteins and orphans which are more likely to be coding suggests 193 S. cerevisiae CDS should be described as very hypothetical ORFs (after the removal of ORFs which should be disregarded). Of these, 72 exhibit an overlap with another CDS (Table 3; see also http://www.sanger.ac.uk/Projects/S_cerevisiae/veryhypothet.shtml).
The G+C content range and mean were calculated for the fully partitioned ORFs on chromosomes I-V. ORFs were partitioned as: Real (characterised or well-conserved)=R; Sequence Orphans (possibly coding)=O; and Very Hypothetical (unlikely coding)=V. The mean G+C contents for the partitioned ORF sets R : O : V are 40.24 : 40.37 : 38.84 respectively, which indicates there may be compositional differences between them. Even though the range of G+C for the very hypothetical proteins is smaller than for real (23.29 vs. 27.07), the sample standard deviation is greater (V=5.06; R=3.47).
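The summary statistics quoted for each ORF class (mean, range and sample standard deviation of G+C content) can be computed as in the sketch below. The function names are illustrative assumptions, and the short sequences in the usage example are placeholders rather than yeast ORFs.

```python
import statistics
from typing import Dict, Iterable


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def summarise(gc_values: Iterable[float]) -> Dict[str, float]:
    """Mean, range (max minus min) and sample standard deviation of G+C values."""
    values = list(gc_values)
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "sample_sd": statistics.stdev(values),  # n - 1 denominator
    }


if __name__ == "__main__":
    # Placeholder sequences standing in for one partitioned ORF class.
    orfs = ["ATGGCGTTAA", "ATGAAATTTA", "ATGCCCGGGA"]
    print(summarise(gc_percent(s) for s in orfs))
```

The range depends only on the two most extreme ORFs in a class, whereas the standard deviation reflects how all values spread around the mean, so a class can show a narrower range and still have a larger standard deviation.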

Novel genes
The three novel genes predicted by this analysis have now been incorporated into the MIPS database (M. Muensterkoetter, MIPS, pers comm).
Blandin et al. predicted 49 additional novel genes using interspecies sequence conservation, but some of these proposed new genes are spurious and others could be labelled very hypothetical using the criteria outlined in Methods. Some of these are predicted due to other non-CDS features. Others are extensions to existing genes. For example, YMR013wa is overlapped completely by a cellular RNA, YGL258w is part of VPS5, and YER039ca is part of HGV1. Other gene predictions from this dataset extend beyond the newly proposed coding region, and may correspond to regulatory regions, or to as yet undiscovered cellular RNAs. For example, YDL159wa is predicted to code for a 43 amino acid peptide (129 base pairs corresponding to the largest ORF) but the region of high similarity extends over 391 base pairs. Some predictions are derived from translations between 28 and 99 amino acids in length, and correspond to low complexity DNA sequence, often with only one species homologue. There are attendant risks in defining a CDS solely from an ORF and a statistically significant BLAST score (particularly with closely related organisms), as this may not always be biologically significant, or may pertain to a non-CDS feature. These predicted ORFs have been added to the Sanger annotation as miscellaneous features and will require further analysis before inclusion in the protein set.

Merged and extended genes
Of the 46 alterations we propose, eight belong to subtelomeric duplicated elements and are possibly pseudogenes. For the remainder, those sequences not already corrected or confirmed by the sequencing of genomic DNA will require resequencing for verification. However, frameshifts may still persist in the sequenced strain due to mutations.

Disregarded spurious ORFs
Of the 370 ORFs proposed here to be disregarded, 227 were also predicted as unlikely to be coding by Zhang and Wang (2000). However, the Zhang and Wang analysis did not adequately differentiate between coding and non-coding sequences when applied to ORFs which were not in the questionable category of the MIPS database. Here, 18 of the 46 ORFs predicted to be non-coding for chromosomes I and II are now either functionally characterised (YPD) or conserved in distantly related organisms. Malpertuy et al. (2000) propose that 91 of the ORFs annotated by MIPS as questionable (because they largely overlap other features) are actually real, based on similarity to the recently sequenced hemiascomycetes. We propose all of these ORFs should be disregarded as they will generate apparently significant, but spurious, TBLASTX matches to alternative frames of the real gene (anti-sense or sense-different reading frame).

ORF        Feature        ORF        Feature    ORF        Feature
YDR269C    CCC2           YGR114C    SPT6       YBR224W    YBR223C
YDR271C    CCC2           YGR115C    SPT6       YBR226C    YBR225W
YDR278C    TRNA           YGR122C-A  LTR        YBR232C    PBP2
YDR290W    RTT103         YGR137W    YGR136W    YBR266C    YBR267W
YDR327W    SKP1           YGR151C    RSR1       YBR277C    DPB3
YDR340W    TRNA and LTR   YGR160W    NSR1       YCL022C    KCC4
YDR355C    NUF1           YGR164W    TRNA       YCL023C    KCC4
YDR360W    VID21          YGR176W    ATF2       YCL041C    PDI1
YDR366C    LTR            YGR190C    HIP1       YCL042W    GLK1
YDR396W    NCB2           YGR219W    MRPL9      YCL046W    YCL045C
YDR401W    DIT2           YGR226C    AMA1       YCL074W    ty fragment
YDR413C    YDR412W        YGR228W    SMI1       YCL075W    ty fragment
YDR417C    RPL12B         YGR242W    YAP1802    YCL076W    ty
Of the ORFs previously defined as questionable but now proposed to be coding by Malpertuy et al. (2000), and retrievable from the Genolevures website (http://cbi.labri.u-bordeaux.fr/Genolevures/Genolevures.php3), at least 107 out of 136 occur in overlapping pairs. These pairs have two significant TBLASTX hits when the ascomycete DNA is compared to the S. cerevisiae predicted protein set; the best score belonging to the real coding sequence and a lower score generated by the overlap with the spurious ORF and an alternative translation of the closely related organism's DNA. This is illustrated by the three pairs of overlapping genes YGR220C/YGR219w, YOR054c/YOR055w, and YDR443C/YDR442w in Table 4 (data from the Genolevures website). The correct reading frame should also be apparent if levels of synonymous and nonsynonymous nucleotide substitution are calculated for the aligned regions.
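The pattern described above, in which the real gene of an overlapping pair receives the better TBLASTX score, suggests a simple ranking step. The sketch below is an illustrative assumption rather than a procedure used in this analysis: the function name is hypothetical, the ORF names and scores in the usage example are placeholders, and a higher score is taken to mean a better hit. Score alone is not proof of coding status; substitution patterns and manual inspection remain necessary.

```python
from typing import Dict, Optional, Tuple


def rank_overlapping_pair(scores: Dict[str, float]) -> Optional[Tuple[str, str]]:
    """Order two overlapping ORFs as (likely real, likely spurious) by best hit score.

    Returns None when the scores are equal and cannot separate the pair.
    """
    if len(scores) != 2:
        raise ValueError("expected exactly two overlapping ORFs")
    (orf_a, score_a), (orf_b, score_b) = scores.items()
    if score_a == score_b:
        return None
    return (orf_a, orf_b) if score_a > score_b else (orf_b, orf_a)


if __name__ == "__main__":
    # Placeholder identifiers and scores, not values from Table 4.
    print(rank_overlapping_pair({"orfA": 250.0, "orfB": 90.0}))
```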
After the merging of sequences identified in Merged and extended genes (Table 1), only eight genes of known or inferred function in the entire S. cerevisiae genome remain overlapping. The overlaps and their orientations are listed in Table 5. The longest overlaps observed were 55 and 34 amino acids, which are possibly attributable to sequencing anomalies, or deletions; the other six are 10 amino acids or less, and predominantly C-terminal.
Overlapping CDS features are also rare in Sz. pombe. Of 4189 genes which are characterised or conserved, only three pairs have an overlap greater than 10 amino acids in length, none of which were at the N terminus. Moreover, since the completion of the S. cerevisiae genome, no function or biologically significant similarity to any other sequenced organism has been observed for any of the largely overlapping ORFs designated here as spurious. This is despite the major efforts of EUROFAN and other functional genomics studies to determine the function of every yeast gene, and the exponential increase in protein sequences deposited in the public databases.
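Overlap lengths of the kind reported here can be derived from feature coordinates. The following sketch assumes 1-based inclusive nucleotide coordinates with start not greater than end, unspliced features, and an approximate conversion to amino acids by integer division by three; the function names are hypothetical. It also reports which terminus of a gene is overlapped, since the N-terminus lies at the low-coordinate end only for forward-strand genes.

```python
from typing import Optional


def overlap_aa(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Approximate overlap between two features, in whole amino acids."""
    overlap_nt = min(a_end, b_end) - max(a_start, b_start) + 1
    return max(0, overlap_nt) // 3


def overlapped_terminus(start: int, end: int, strand: str,
                        other_start: int, other_end: int) -> Optional[str]:
    """Report which end of a gene is overlapped by another feature.

    Returns "N", "C", "both", "internal", or None if there is no overlap
    of at least one amino acid.
    """
    if overlap_aa(start, end, other_start, other_end) == 0:
        return None
    covers_low = other_start <= start <= other_end
    covers_high = other_start <= end <= other_end
    if covers_low and covers_high:
        return "both"
    if not covers_low and not covers_high:
        return "internal"
    low_is_n_terminus = strand == "+"
    if covers_low:
        return "N" if low_is_n_terminus else "C"
    return "C" if low_is_n_terminus else "N"


if __name__ == "__main__":
    # Placeholder coordinates: a 30 nt overlap at the high-coordinate end.
    print(overlap_aa(1, 300, 271, 600))
    print(overlapped_terminus(1, 300, "+", 271, 600))
```

A filter on the result of `overlap_aa` (for example, greater than 10) would then pick out the rare pairs discussed in the text.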
Considering the rarity of overlapping genes in both yeasts, and the absence of unequivocal functional evidence in support of the coding integrity of any of the spurious ORFs which wholly or largely overlap real genes, the likelihood that any of them encode proteins is minimal. Therefore, it would be prudent to remove them completely from the genome totals and label them accordingly in the public databases.

Very hypothetical proteins
One advantage of discriminating between sequence orphans likely to be coding, and very hypothetical orphans, is that these regions of DNA can be easily partitioned as a subset, facilitating the identification of other features by bioinformatics analyses.

Implications for post genomics
Many of the spurious overlapping ORFs included in the public databases, and proposed as disregarded ORFs by this analysis, have associated functional genomics data which could be artefacts. The original yeast microarrays (using PCR products) were not strand specific with respect to the probes (DeRisi et al., 1997), and opposite strand transcripts could hybridise to these array spots (D. Vetrie, pers comm). Positive signals may also result from overlapping UTRs. Not unexpectedly, many of the disregarded ORFs have transcript profiles similar to the overlapping characterised gene. Gene knockouts of spurious ORFs may give phenotypes, particularly if they affect overlapping strand ORFs, promoters, or other cellular RNAs. It has been observed that some of the knockouts of overlapping ORFs have essentially the same, or a similar, phenotype to the real adjacent gene. These transcript and phenotype artefacts, attached to the database entries, lend these predictions false credibility as proteins. The inclusion of spurious ORFs may therefore affect the accuracy of any previous global analysis of transcription or redundancy.
Table 3. Very hypothetical proteins with no homology and low coding potential

New gene number estimate
Our analysis provided a new estimate of gene number for each S. cerevisiae chromosome. These are provided in Table 6.
When the S. cerevisiae genome was first published, 6275 ORFs were predicted; 390 of these were proposed to be spurious, giving a probable gene number of 5885 (Goffeau et al., 1996). The data used for our analysis (SGD) consisted of 6282 ORFs, of which 370 have been disregarded, giving a new maximum upper limit of 5804. The removal of 42 pseudo or frame-shifted sequences, and 193 very hypothetical proteins, further reduces this total to 5570. This is likely to be closer to the true upper limit, because the criteria used for the determination of very hypothetical proteins are quite conservative. There is a possibility that a small number of the very hypothetical proteins may eventually be determined to be coding, but size distribution (unpublished) indicates that we may still be overestimating the number of small ORFs.
Malpertuy et al. predicted a gene number of 5651. In addition, using two different statistical methods, they estimated that the actual number of protein coding ORFs should be either 5542 or 5552, but do not account for the differences between their predicted number of 5651 and the statistical calculations. The statistical calculations are closer to the number of genes predicted by our analysis (5570 or fewer). The discrepancies could be due to the inclusion of novel genes which are in fact spurious (see Discussion, Novel Genes), or the inclusion of genes previously defined as questionable, but proposed by this analysis to be disregarded (see Discussion, Spurious ORFs).

What are the remaining orphans?
Data obtained by Gaillardin et al. (2000) demonstrated that ascomycete specific genes are highly represented in the functional classes of cell wall organisation, extracellular/secreted proteins, and transcriptional regulators, suggesting that they diverge more rapidly than other classes of genes. In Sz. pombe, many remaining orphans are low complexity or repetitive proteins, e.g. serine-rich with low similarity to alpha-agglutinins and other cell surface proteins, or proteins with basic charged regions which may correspond to transcription factors. It may be that most orphans correspond to genes which have diverged so much that they are unrecognisable, rather than novel genes. It is therefore possible that the majority of orphans are genes which have diverged more rapidly and that the number of truly species specific genes is very small.
A comparison of the refined orphan sets of Sz. pombe and S. cerevisiae will aid the detection of subtle homologies and physical similarities between the sequence orphans themselves, or orphans and previously characterised genes. For example, the final subunit of Sz. pombe RNA polymerase III (the homologue of S. cerevisiae RPC31) was identified due to similarity in amino acid length and the presence of an acidic C terminus, despite a low similarity score (Richard Maraia, NICHD, NIH, Bethesda and George Shpakovski, Russian Academy of Sciences, Moscow, pers comm). Global comparison of the remaining orphans will facilitate definition of the sets of genes necessary for unicellular eukaryotic life. However, to do this effectively, it is important that a distinction is first made between orphans and spurious ORFs (Malpertuy et al., 2000).

Conclusions
A substantial proportion of small orphans are probably not protein coding, yet may define other genome features (regulatory regions, cellular RNAs or even gene-free regions which may be involved in higher order chromosome structure). These may contain spurious ORFs which, if defined as CDS, appear to generate matches at the protein level. Spurious gene predictions, with associated artefactual functional genomics data, will exclude these regions of DNA from being inspected for non-CDS features. Attaching a suitable annotation to these would facilitate the detection of authentic features.
It is important to differentiate firstly between orphans and disregarded spurious ORFs, and secondly, between likely real orphans and very hypothetical orphans. Refinement of the orphan sets of sequenced genomes will enable the detection of more subtle homologies and other physical similarities between the real orphans.
As the number of orphans is gradually eroded by the removal of non-coding ORFs and the detection of distant homologues, it will become easier to determine how many truly species specific genes exist in the Sz. pombe and S. cerevisiae genomes.
The annotation of an ORF's status within the public datasets is important for both functional genomics and bioinformatics. The costs of reagents, labour and curation of 370 ORFs which should be disregarded in functional genomics analyses are not trivial: they account for roughly 5% of the total effort. Bioinformatics analyses of proteome data to examine amino acid composition, charge, etc. require accurate datasets (perhaps with different confidence levels attributed). Integration of contextual information on a gene-by-gene basis to determine status will enable the targeting of future research toward genes which are more likely to be coding.
As more analyses are performed, we should get closer to the absolute gene number. Taken in combination, previous analyses and the interpretation of the biological context of the ORF should enable better estimates of probable gene and orphan number for this yeast.

Data availability
Updated EMBL format sequences (containing nearly 12000 annotations), which can be examined in Artemis, and a one-gene one-protein FASTA format protein translations database are available from the Sanger Centre ftp site (ftp://ftp/pub/yeast/SCreannotation).
Table 6. Predicted S. cerevisiae gene numbers, by chromosome
The EMBL entries will continue to be maintained (and will be resubmitted to EMBL with permission of the original authors). Further refinement of the datasets described will include: