The Tomato Sequencing Project, the First Cornerstone of the International Solanaceae Project (SOL)

The genome of tomato (Solanum lycopersicum) is being sequenced by an international consortium of 10 countries (Korea, China, the United Kingdom, India, The Netherlands, France, Japan, Spain, Italy and the United States) as part of a larger initiative called the ‘International Solanaceae Genome Project (SOL): Systems Approach to Diversity and Adaptation’. The goal of this grassroots initiative, launched in November 2003, is to establish a network of information, resources and scientists to ultimately tackle two of the most significant questions in plant biology and agriculture: (1) How can a common set of genes/proteins give rise to a wide range of morphologically and ecologically distinct organisms that occupy our planet? (2) How can a deeper understanding of the genetic basis of plant diversity be harnessed to better meet the needs of society in an environmentally friendly and sustainable manner? The Solanaceae and closely related species such as coffee, which are included in the scope of the SOL project, are ideally suited to address both of these questions. The first step of the SOL project is to use an ordered BAC approach to generate a high quality sequence for the euchromatic portions of the tomato genome as a reference for the Solanaceae. Due to the high level of macro- and micro-synteny in the Solanaceae, the BAC-by-BAC tomato sequence will form the framework for shotgun sequencing of other species. The starting point for sequencing the genome is a set of BACs anchored to the genetic map by overgo hybridization and AFLP technology. The overgos are derived from approximately 1500 markers from the tomato high density F2-2000 genetic map (http://sgn.cornell.edu/). These seed BACs will be used as anchors from which to radiate the tiling path using BAC end sequence data. Annotation will be performed according to SOL project guidelines. All the information generated under the SOL umbrella will be made available in a comprehensive website. The information will be interlinked with the ultimate goal that the comparative biology of the Solanaceae, and beyond, achieves a context that will facilitate a systems biology approach.


Introduction
The Solanaceae, also called nightshades, are a large family of more than 3000 species, including the tuber-bearing potato, a number of fruit-bearing vegetables (tomato, eggplant, peppers), ornamental plants (petunias, Nicotiana), plants with edible leaves (Solanum aethiopicum, S. macrocarpon) and medicinal plants (e.g. Datura, Capsicum) [1]. The Solanaceae are the third most economically important plant taxon, and the most valuable in terms of vegetable crops. They are also the most variable of crop families in terms of agricultural utility. The closely related coffee (Rubiaceae) is a highly valuable commodity worldwide. In addition to their role as important food sources, many solanaceous species play a role as model plants, such as tomato and pepper for the study of fruit development [2][3][4][5][6][7][8][9], potato for tuber development [10,11], petunia for the analysis of flavonoids, and tomato and tobacco for plant defence [12][13][14][15][16]. The nightshades have also attracted interest because they produce a number of secondary metabolites, some of which have medicinal properties. The Solanaceae are remarkable in that the gene content of the different species is similar despite the markedly different phenotypic outcomes, making the Solanaceae an excellent model for the study of adaptation to natural and agricultural environments [17]. Many Solanaceae share a basic set of 12 chromosomes and are diploid, indicating an absence of large genome duplications and polyploidizations during the evolutionary history of this family. These intrinsic features of the Solanaceae and closely related species such as coffee make them well suited to address fundamental questions in plant biology.
In November 2003, the International Solanaceae Project (SOL) was initiated at a meeting near Washington, where goals were established for Solanaceae research for the next decade (http://sgn.cornell.edu/solanaceae-project/). The ultimate aim of the SOL project is to address two of the most significant questions in plant biology and agriculture: (a) how can a common set of genes/proteins give rise to a wide range of morphologically and ecologically distinct organisms that occupy our planet?; and (b) how can a deeper understanding of the genetic basis of plant diversity be harnessed to better meet the needs of society in an environmentally friendly and sustainable manner? To answer these questions in the context of the Solanaceae, it is necessary to link traits to sequence, requiring both extensive phenotyping [18] and sequence information. It would be desirable to obtain as many full solanaceous genome sequences as possible for direct comparison. Due to the high cost of sequencing complete genomes at high quality, this is not feasible at present. The alternative is to fully sequence a high quality reference genome, and to map onto it 'cheaper' sequence data, such as ESTs, methyl-filtered sequences [19][20][21] or low Cot sequences [22,23] from other species. The availability of good comparative maps [24][25][26] between many solanaceous plants and the large numbers of EST sequences already available [27] is a great benefit in this approach. Thus, sequencing the gene-rich regions of the tomato genome will be the first cornerstone of the SOL project. After sequencing two rosids, Arabidopsis [28] and Medicago [29], sequencing a solanaceous plant will shed light on a genome of the more distantly related asterid clade, which will permit comparisons between genomes at longer evolutionary distances and thereby help define a larger view of plant evolution.
All the information generated under the SOL umbrella will be made available in a comprehensive website, where all information will be interlinked such that, ultimately, the comparative biology of the Solanaceae will become available in a context that will facilitate a systems biology approach to understanding genome evolution, plant development and plant responses to the environment.

Tomato genome structure and sequencing method
Approximately three-quarters (730 Mb) of the 950 Mb tomato genome exists as pericentromeric heterochromatin. The remaining one-quarter (220 Mb) of the tomato genome consists of the distal, euchromatic segments of the chromosomes. The DNA found in heterochromatin is rich in repetitive sequences and poor in genes, making it difficult to sequence. The euchromatin is thought to contain mostly single copy sequences and includes more than 90% of the genes, making it relatively easy to sequence. Therefore, the strategy is to sequence only the euchromatic portion of the genome to cover most of the gene space. Sequencing 220 Mb of the tomato genome is therefore a little less than twice the effort of sequencing the Arabidopsis genome at 150 Mb [30].
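The proportions quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the sizes given in the text; the variable names are illustrative and not part of the project.

```python
# Genome sizes quoted in the text, in megabases (Mb).
GENOME_MB = 950
HETEROCHROMATIN_MB = 730
EUCHROMATIN_MB = 220
ARABIDOPSIS_MB = 150  # Arabidopsis genome size as quoted in the text

# Fractions of the tomato genome in each chromatin class.
hetero_fraction = HETEROCHROMATIN_MB / GENOME_MB
eu_fraction = EUCHROMATIN_MB / GENOME_MB

# Sequencing effort relative to Arabidopsis, using size as a crude proxy.
effort_ratio = EUCHROMATIN_MB / ARABIDOPSIS_MB

print(f"heterochromatin: {hetero_fraction:.0%}")      # 77%
print(f"euchromatin:     {eu_fraction:.0%}")          # 23%
print(f"effort vs Arabidopsis: {effort_ratio:.2f}x")  # 1.47x
```

Genome size is only a rough proxy for effort here; repeat content and BAC overlap would change the real cost.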
The ordered BAC approach was chosen for sequencing the tomato genome. All sequencing will be done using standard BAC libraries: the well-characterized HindIII library [31], plus two additional libraries that the US part of the sequencing project will provide; all libraries are or will be deep BAC end-sequenced by the US project, for a total of 400 000 BAC end sequences. All sequence will be derived from Solanum lycopersicum var Heinz 1706, as this was the basis for the original HindIII library. Sequencing will be based on the F2-2000 map (http://sgn.cornell.edu/), which has been used as the basis for anchoring 1500 markers by overgo (overlapping oligo) hybridization [32]. Currently, more than 650 unambiguous anchor points are available, but further analysis will increase this number to an estimated 800-1000 anchor points. In addition, the F2-2000 map is being combined with an AFLP map, containing more than 1200 markers, generated at Keygene in The Netherlands, providing even more anchor points. For each anchor point, one of the many anchored BACs will be selected as a 'seed' and sequenced. The tiling path will then be generated by walking out from this seed in both directions, using the deep BAC end-sequencing data available for the BAC libraries. Information from the fingerprint contig (FPC) map available for the HindIII library will also be used. FISH analysis will be used to confirm chromosome mappings and delineate the euchromatin/heterochromatin boundaries [33]; anchored BACs will also be mapped on IL lines to verify location. Other methods available for full genome sequencing have also been assessed. Full genome shotgun sequencing is not a cost-effective way to sequence a fraction of the genome. Methyl and Cot filtering both bias sampling towards coding sequence, and therefore towards euchromatic sequence. None of these methods by itself provides gene order, which will be critical for a reference genome.
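The seed-and-walk strategy described above can be illustrated with a toy model. The sketch below is not the project's pipeline. It assumes, purely for illustration, that each candidate BAC already has known start and end coordinates, whereas in practice overlaps would be inferred from BAC end-sequence matches and FPC fingerprints. The `Bac` type, the coordinates and the greedy rule (always take the overlapping clone that extends furthest) are all hypothetical.

```python
from typing import List, NamedTuple


class Bac(NamedTuple):
    name: str
    start: int  # hypothetical coordinate in kb
    end: int    # hypothetical coordinate in kb


def walk(seed: Bac, candidates: List[Bac], direction: int) -> List[Bac]:
    """Greedily extend from the seed in one direction (+1 right, -1 left)."""
    path = []
    current = seed
    while True:
        if direction > 0:
            # Clones that overlap the current clone and reach beyond its end.
            ext = [b for b in candidates
                   if b.start <= current.end and b.end > current.end]
            if not ext:
                break
            current = max(ext, key=lambda b: b.end)
        else:
            # Clones that overlap the current clone and reach before its start.
            ext = [b for b in candidates
                   if b.end >= current.start and b.start < current.start]
            if not ext:
                break
            current = min(ext, key=lambda b: b.start)
        path.append(current)
    return path


def tiling_path(seed: Bac, candidates: List[Bac]) -> List[Bac]:
    """Walk out from the seed in both directions and return an ordered path."""
    left = walk(seed, candidates, -1)
    right = walk(seed, candidates, +1)
    return list(reversed(left)) + [seed] + right


# Toy data: one anchored seed and six candidate clones (all made up).
seed = Bac("seed", 100, 220)
candidates = [
    Bac("A", 0, 120), Bac("B", 60, 150), Bac("C", 200, 330),
    Bac("D", 210, 300), Bac("E", 320, 450), Bac("F", 500, 600),
]

print([b.name for b in tiling_path(seed, candidates)])
# ['A', 'seed', 'C', 'E']
```

Clone F is never reached because nothing overlaps it. In this toy model such a gap ends the walk, and covering the region beyond it would need a walk from another anchored seed.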
Currently, 10 countries are involved in sequencing the tomato genome. The 12 chromosomes have been split up between the countries as follows (see Figure 1): Korea (chromosome 2), China (3), UK (4), India (5), The Netherlands (6), France (7), Japan (8), Spain (9), Italy (12) and USA (1, 10, 11). The US project will also generate the additional BAC libraries, perform BAC end-sequencing, and build a data repository as a resource for the entire project. Argentina will sequence the mitochondrial genome, and Italy the chloroplast genome. The organellar genome sequences will be important for distinguishing genomic insertions of organellar sequences from organellar contamination contained in the BAC libraries.

Bioinformatics
In such a large project, bioinformatics plays a crucial role. In particular, the sequence quality standards and the annotation have to be uniform across the different chromosomes generated through different national projects, and efficient data access has to be provided to the scientific community. A SOL bioinformatics committee composed of representatives of the participating sequencing countries is working on a bioinformatics guideline document that should be followed by all project members to ensure that results are comparable across the entire genome. The guidelines are available from http://sgn.cornell.edu/solanaceae-project/. In addition, the following important conventions have been adopted: (a) the original data has to be stored for all data types, such as all chromatograms and assembly files from the BAC sequencing, all supporting data for gene calls, etc.; (b) all data has to be traceable to the submitters and data generators; and (c) a unified web resource will be created that makes all data accessible to the users without delay and in an intuitive format ('one-stop-shop'). Several centres will work together to implement the portal, including MIPS (http://mips.gsf.de/), VIB Ghent, Wageningen University and Research Centre, KRIBB (http://www.kribb.re.kr), Kazusa (http://www.kazusa.or.jp/), SGN and centres in other countries. All sequence data will be submitted to GenBank, and to the Solanaceae Genomics Network (SGN; http://sgn.cornell.edu/), which will serve as a repository and access point for the data.
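Conventions (a) and (b) amount to a provenance requirement on every submitted record. As a rough illustration only, such a requirement could be checked as sketched below. The record layout, field names and file names are hypothetical and are not taken from the SOL guideline document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SequenceSubmission:
    """Hypothetical record for one sequenced BAC; all fields are assumptions."""
    bac_id: str
    chromosome: int
    submitter: str                    # convention (b): who submitted the data
    data_generator: str               # convention (b): who generated the data
    raw_files: Tuple[str, ...] = ()   # convention (a): chromatograms, assemblies


def convention_problems(record: SequenceSubmission) -> List[str]:
    """Return a list of convention violations; an empty list means none found."""
    problems = []
    if not record.raw_files:
        problems.append("no original data files stored")
    if not record.submitter.strip():
        problems.append("submitter missing")
    if not record.data_generator.strip():
        problems.append("data generator missing")
    return problems


# Made-up example records.
complete = SequenceSubmission(
    bac_id="BAC-example-1", chromosome=6,
    submitter="Example Centre", data_generator="Example Sequencing Lab",
    raw_files=("reads.ab1", "assembly.ace"),
)
incomplete = SequenceSubmission(
    bac_id="BAC-example-2", chromosome=6,
    submitter="", data_generator="Example Sequencing Lab",
)

print(convention_problems(complete))    # []
print(convention_problems(incomplete))
# ['no original data files stored', 'submitter missing']
```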

Conclusions
Sequencing the tomato genome offers exciting new perspectives and opportunities for plant biology. The sequence will be compared to other sequenced genomes such as Arabidopsis and rice to inform us of the evolutionary history of these plants. In conjunction with sequence data for other solanaceous plants, the tomato sequence will be the basis for investigating the phenotypic diversity and comparative biology in the Solanaceae, one of the main aims of the SOL project. This will shed light on mechanisms of gene regulation, evolution, signalling, disease resistance and defence, and fruit development and quality, and finally contribute to the improvement of agriculture through the phenotypic diversity found in nature.

[Figure 1: chromosome assignments by country: Korea (2), China (3), UK (4), India (5), The Netherlands (6), France (7), Japan (8), Spain (9), Italy (12) and USA (1, 10, 11). This overview is available from the SGN website (http://sgn.cornell.edu/help/about/tomato sequencing.html) and will be continuously updated as sequencing progresses.]