Construction, Characterization, and Preliminary BAC-End Sequence Analysis of a Bacterial Artificial Chromosome Library of the Tea Plant (Camellia sinensis)

We describe the construction and characterization of a publicly available BAC library for the tea plant, Camellia sinensis. Using modified methods, the library was constructed with the aim of developing public molecular resources to advance tea plant genomics research. The library consists of a total of 401,280 clones with an average insert size of 135 kb, providing an approximate coverage of 13.5 haploid genome equivalents. No empty vector clones were observed in a random sampling of 576 BAC clones. Further analysis of 182 BAC-end sequences from randomly selected clones revealed a GC content of 40.35% and low chloroplast and mitochondrial contamination. Repetitive sequence analyses indicated that LTR retrotransposons were the predominant sequence class (86.93%–87.24%), followed by DNA transposons (11.16%–11.69%). Additionally, we found 25 simple sequence repeats (SSRs) that could potentially be used as genetic markers.


Introduction
Since 1994, bacterial artificial chromosome (BAC) libraries have become an invaluable resource tool to initiate genomics research in the areas of genome sequencing, physical mapping, positional cloning, complex analysis of targeted genomic regions, and analysis of gene structure and function [1][2][3][4][5][6][7][8]. More recently, physical map tiles of BACs have been shown to be suitable for next generation sequencing of chromosomal regions or whole genomes [9].
BAC-end sequencing (BES) is a powerful tool that enhances the value of BAC libraries as a genomic resource by providing partial sequence information that can be used to understand genome content and architecture and to develop genetic markers [10,11]. Fingerprinted BAC clones, together with their associated BAC-end sequences, can be used to construct BAC fingerprint-/BES-based physical maps [1,12], which can be aligned to reference genome sequences; to sequence genomes by "walking" from one clone to the next [13]; to anchor whole genome shotgun sequence data; and to integrate genetic linkage maps with physical maps [14].
There are different varieties or types of tea, that is, green tea, black tea, oolong tea, white tea, and so on [20] (see also http://jingtea.com/tea-knowledge/tea-varieties). Oolong tea is a special variety of tea and has been studied for its effects on diabetes, eczema, allergies, bacterial infections, dental caries (cavities), obesity, cardiovascular disease, and cancer, and for its antioxidant properties [21][22][23][24][25][26]. It has been shown that consumption of oolong tea stimulates both energy expenditure and fat oxidation in normal weight men [21]. Oolong tea is semifermented, a process that occurs when the leaves are gently rolled during processing [20], and this gives oolong tea a unique appearance and flavor. Chin-shin oolong is one of the most widely cultivated oolong tea varieties worldwide [27].
It has been suggested that functional genomic research should be a major emphasis of tea genetics and breeding in the future [28]. Although rapid progress in gene identification and isolation from tea plants has been made in the past several years [29], the study of the tea plant genome lags far behind that of other crop species due to the lack of good genomic research tools. This is most likely due to the difficulties of preparing high-quality tea DNA and to the distinctness of the tea plant from other taxa (i.e., perennial nature, high inbreeding depression, unavailability of distinct mutants for different biotic and abiotic stresses, and a large genome size of 4 Gb [30]). One key genomic tool that is completely lacking for tea is a high-quality, deep-coverage BAC library. In this paper, we report the construction of a high-quality, publicly available BAC library of tea plant from the variety Chin-shin oolong. We generated and analyzed a limited data set of BAC-end sequences from this library, which provided an early glimpse into the sequence composition of the tea genome.

Germplasm and Plant Tissue Processing. Sixty 6-year-old plants, derived from a clonally propagated single mother plant of a Camellia sinensis cultivar Chin-shin oolong, were selected as the plant germplasm source. Healthy young shoot tips and the uppermost two leaves were collected (similar to harvesting top-quality tea), then washed quickly to remove debris, and immediately frozen by submersion in liquid nitrogen followed by short-term storage at −80 °C. All plant growth and tissue selection was kindly provided by Dr. Francis Zee (USDA, Hilo, Hawaii).

Preparation of High Molecular Weight (HMW) Tea Plant DNA in Agarose Plugs. Tea genetic resources have lagged behind those of other important plants because extracting nuclear DNA of adequate quality is complex and difficult. However, a detailed manuscript describing the experiments that led to the successful isolation of high-quality, high molecular weight, nuclear DNA, suitable for BAC library construction, was recently made available [31]. The method for tea plant BAC library construction utilized the standard Arizona Genomics Institute (AGI) protocol [32] and the methods of Luo and Wing [3] and Ammiraju et al. [2]. Modifications necessary to achieve successful construction of a tea plant BAC library are described as follows.
Twenty grams of frozen tissue was homogenized and transferred to a flask containing 200 mL of prechilled extraction buffer (10 mM Tris-HCl, pH 8.0, 10 mM EDTA, pH 8.0, 100 mM KCl, 0.5 M sucrose, 4 mM spermidine, 1 mM spermine, 0.10% w/v L-ascorbic acid, 2.00% w/v PVP-40, 0.13% w/v sodium diethyldithiocarbamate trihydrate; PVP-40 was only about 50 percent dissolved) and 400 μL β-mercaptoethanol. The homogenate was filtered into a flask that contained 200 mL of the same prechilled extraction buffer and 400 μL β-mercaptoethanol. The homogenate was filtered a second time into a fresh flask that contained 45 mL of prechilled extraction buffer with 10.00% v/v Triton X-100. The mixture was centrifuged for 15 min at 3250 rpm at 4 °C. The resulting pellet was washed several times with the same buffer and resuspended in 1 mL of prechilled extraction buffer, incubated in a 45 °C water bath for 5 minutes, and gently mixed with one-third volume of 1.00% low melting temperature agarose (in extraction buffer) at 45 °C. The mixture was transferred to plug molds and allowed to solidify. Twenty-four plugs were transferred into a 50 mL Falcon tube containing 40 mL of standard proteinase K solution (1.00% w/v N-lauroylsarcosine (sodium salt), 0.1 mg/mL proteinase K, dissolved in 0.5 M EDTA, pH 9.4) and incubated in a hybridization oven at 50 °C with gentle rotation for 24 h. After the plugs were incubated in fresh proteinase K solution for an additional 24 h, they were washed in two additional solutions: first twice with 40 mL T10E10 containing 1 mM PMSF (phenylmethanesulfonyl fluoride) and then twice with 40 mL TE, each time for about 1 h at room temperature with gentle shaking. The plugs were stored in 70.00% ethanol at −20 °C (for long-term storage).

Restriction Digestion of HMW DNA and Isolation of Size-Selected Fragments. Two and a half DNA plugs were used to establish optimal HindIII partial digestion conditions. Formal partial digestion, using 5 DNA plugs, was performed using 5 U of HindIII restriction enzyme added to each sample (per half plug). Digested samples were loaded onto a 1.00% agarose gel and subjected to pulsed-field gel electrophoresis (PFGE). DNA was visualized, and agarose fragments, containing specific DNA sizes, were cut from the gel slabs. A second and a third PFGE run of the fragments were performed to further purify the DNA and remove small DNA fragments. After finishing the third size selection, the gel fractions containing different sized fragments (B2, B1, A2) were recovered and stored at −20 °C in 70.00% ethanol (for long-term storage).

Ligation of Sized DNA Fragments.
For each size-selected fraction, the high molecular weight genomic DNA was electroeluted from the agarose at 4 °C. Pipet tips (with cutoff tips) were used when manipulating high molecular weight genomic DNA to avoid mechanical shearing. The DNA concentrations were estimated, and 120 ng-200 ng DNA was used to ligate to the linearized (HindIII site) and dephosphorylated vector pIndigoBAC536 SwaI, commonly known as pAGIBAC1 [2]. Separately, ligations were mixed well by tapping and then incubated in a water bath at 16 °C for 19 hours. Ligation samples were transferred into 0.1 M glucose/1.00% agarose cones to desalt for 1.5 h on ice. Ligations were transferred into fresh microcentrifuge tubes and stored at 4 °C until transformation tests were completed. [...] Technologies). Electroporation was performed on ice at 327 DC V with fast charge rate at a low resistance (4 kΩ) and a capacitance of 330 μF. The cells were transferred into 3 mL of SOC media and incubated at 37 °C for 1 h with shaking at 250 rpm, followed by the addition of an equal volume of sterile glycerol and gentle shaking for 3 minutes. These mixtures were immediately frozen by submersion into liquid nitrogen followed by long-term storage at −80 °C.
To evaluate these transformation tests, 300 μL of each (containing cells, SOC, and glycerol) were spread on Petri dishes (containing LB-X-gal-IPTG agar with 12.5 μg/mL of chloramphenicol, 80 μg/mL X-gal, and 100 μg/mL IPTG) and incubated at 37 °C overnight. Five hundred seventy-six white recombinants (positive for insert), 192 from each of the three sublibraries (B2, B1, A2), were randomly selected and grown overnight at 37 °C in 1.2 mL LB broth (with chloramphenicol 12.5 μg/mL) with shaking at 220 rpm. BAC DNAs were isolated and digested with NotI to release the BAC insert. Digestions were separated by PFGE at 6 V/cm, switch time from 5 to 15 s, angle 120°, and run for 16 h followed by staining, destaining, and visualization.
Transformed E. coli from the B2 ligation, selected to contain the largest insert sizes, were prepared for arraying into 384-well microtiter plates. Five mL of the mixture from the B2 ligation were spread on LB-X-gal-IPTG-CM agar Q-trays (22.5 × 22.5 cm) and incubated at 37 °C overnight. The white (insert positive recombinant) clones were robotically picked (using a Genetix Q-Bot; Genetix Ltd.) into forty-eight 384-well microtiter plates containing freezing media and stored at −80 °C. Hybridization screening filters were printed from library copies to facilitate future experiments.
To further evaluate this sublibrary (B2 ligation), 576 BACs were randomly selected from these forty-eight plates and grown overnight at 37 °C with shaking at 220 rpm in 1.2 mL LB supplemented with chloramphenicol (12.5 μg/mL). BAC DNAs were isolated and digested with NotI. The digested clones were separated by PFGE and analyzed. Additionally, 192 BAC clones were subjected to BAC clone end sequencing (BES) using the method of Kim et al. [12], and the resultant sequences were analyzed for chloroplast and mitochondrial genome contamination and for repeat content using BLASTN searches with settings that required 98% identity over lengths of at least 51 bp.
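The identity and length thresholds described above can be illustrated with a short filter over tab-separated BLASTN hits. This is a minimal sketch, not the authors' pipeline: it assumes the standard 12-column tabular BLAST output (query ID, subject ID, percent identity, alignment length, ...), it treats both thresholds as inclusive, and the read and reference names in the example are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Set


class Hit(NamedTuple):
    query: str           # BES read identifier
    subject: str         # database sequence identifier
    pct_identity: float  # percent identity of the alignment
    aln_length: int      # alignment length in bp


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated BLAST hits; only the first four columns are used."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), int(fields[3])))
    return hits


def passes_threshold(hit: Hit, min_identity: float = 98.0, min_length: int = 51) -> bool:
    """True if a hit meets the identity and length cutoffs (assumed inclusive)."""
    return hit.pct_identity >= min_identity and hit.aln_length >= min_length


def flagged_reads(hits: Iterable[Hit]) -> Set[str]:
    """Return the query IDs that have at least one hit passing the cutoffs."""
    return {h.query for h in hits if passes_threshold(h)}


# Hypothetical example rows (truncated to the four columns used here).
EXAMPLE = [
    "bes_001\tchloroplast_ref\t99.2\t120",  # passes both cutoffs
    "bes_002\tchloroplast_ref\t98.5\t40",   # alignment too short
    "bes_003\tmito_ref\t91.0\t300",         # identity too low
]

print(sorted(flagged_reads(parse_tabular(EXAMPLE))))
```

A read is flagged only when a single alignment satisfies both cutoffs at once; a long low-identity hit or a short high-identity hit is ignored.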
The successful C. sinensis BAC library, composed of E. coli transformants from each of the three ligations, was named CSBCBa.

Construction and Characterization of the Tea BAC Library.
Camellia sinensis cultivar Chin-shin oolong was chosen to construct the BAC library since it is one of the most widely cultivated oolong tea varieties. To avoid contamination with small, trapped DNA fragments and improve the size and uniformity of the inserts, the high molecular weight (HMW) genomic DNA was partially digested with HindIII and three separate size fractions were collected. Following ligations into the HindIII site of the pAGIBAC1 vector, the three size fractions were transformed, and the new tea BAC library, consisting of 401,280 BAC clones, was named CSBCBa; see Table 1. The overall average insert size of the total library was calculated to be 135 kb with inserts ranging from 8 to 314 kb, and the total library coverage was estimated to equal 13.54 haploid genome equivalents based on a genome size of 4,000 Mb. Sublibrary 1 (Ligation B2) was composed of 40,320 clones with an average insert size of 140 kb; sublibrary 2 (Ligation B1) was composed of 160,896 clones with an average insert size of 140 kb; sublibrary 3 (Ligation A2) was composed of 200,064 clones with an average insert size of 130 kb. Five hundred seventy-six BACs, 192 from each of the three sublibraries, were randomly selected and analyzed by NotI restriction enzyme digestion in order to evaluate the library (Figure 1). In this sample, no empty vector clones were observed. Over 74.40% of the BAC clones were shown to carry DNA inserts greater than 130 kb, with a reasonable fraction (22.30%) carrying inserts larger than 170 kb (Figure 2). BAC clones from sublibrary 1 (Ligation B2) were spread on agar dishes, 18,432 clones were robotically picked and arrayed into forty-eight 384-well plates, and hybridization screening filters were printed to facilitate future experiments.
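The coverage figure quoted above follows from simple arithmetic: total clones multiplied by average insert size, divided by haploid genome size. The sketch below reproduces that calculation from the sublibrary numbers reported in the text; the function and variable names are illustrative only.

```python
# Sublibrary name -> (number of clones, average insert size in kb), as reported in the text.
SUBLIBRARIES = {
    "B2": (40_320, 140),
    "B1": (160_896, 140),
    "A2": (200_064, 130),
}

GENOME_SIZE_KB = 4_000 * 1_000  # 4,000 Mb haploid genome size, expressed in kb


def total_clones(sublibs):
    """Sum the clone counts over all sublibraries."""
    return sum(n for n, _ in sublibs.values())


def weighted_mean_insert_kb(sublibs):
    """Clone-weighted average insert size across sublibraries."""
    total_kb = sum(n * size for n, size in sublibs.values())
    return total_kb / total_clones(sublibs)


def genome_equivalents(n_clones, mean_insert_kb, genome_kb):
    """Library coverage in haploid genome equivalents."""
    return n_clones * mean_insert_kb / genome_kb


n = total_clones(SUBLIBRARIES)
mean_kb = weighted_mean_insert_kb(SUBLIBRARIES)
print(n)                                                      # total clones
print(round(mean_kb))                                         # average insert size, kb
print(round(genome_equivalents(n, 135, GENOME_SIZE_KB), 2))   # coverage with the 135 kb average
print(48 * 384)                                               # clones arrayed from sublibrary B2
```

Using the rounded 135 kb average or the clone-weighted average gives the same coverage to two decimal places.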

Analysis of Repetitive Sequences from Pilot BAC-End Sequence of C. sinensis. To examine the quality of the library and to obtain a preliminary view of the major repetitive element content of the C. sinensis genome, unidirectional end sequencing was performed on 192 random BAC clones which produced 182 reads with an average length of 581 high-quality bases; the BESs are available at the URL: http://www2.genome.arizona.edu/files/camellia BES.txt. Of the 182 random BESs, the total number of nucleotides was 111,695 bp and the GC level was 40.35% (Table 2).
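A pooled GC level of this kind can be computed by counting G and C bases over all unambiguous bases in the read set. The following is a minimal sketch under stated assumptions: reads are supplied as FASTA text, ambiguous bases such as N are excluded from the denominator, and the two example records are invented for illustration (they are not tea BESs).

```python
from typing import Dict, Iterable


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def gc_percent(sequences: Iterable[str]) -> float:
    """Pooled GC percentage over all A/C/G/T bases; other symbols are ignored."""
    gc = 0
    total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / total


# Invented toy records, used only to exercise the functions.
TOY_FASTA = """>read_1
GGCCAATT
>read_2
ATATNNGC
"""

reads = parse_fasta(TOY_FASTA)
print(round(gc_percent(reads.values()), 2))
```

Pooling bases across reads, rather than averaging per-read percentages, weights longer reads more heavily; which of the two the reported 40.35% reflects is not stated in the text.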
The 182 random BAC-end sequences were also evaluated using BLAST searches against the transposable element databases of Arabidopsis (Arabidopsis thaliana) at the Genetic Information Research Institute and of rice (Oryza sativa) at the Arizona Genomics Institute. This preliminary analysis of repetitive sequences from pilot BAC-end sequences indicated that 97 to 98 percent of the BESs contained sequences related to transposable elements. LTR retrotransposons were the predominant class of repeat elements in C. sinensis (Arabidopsis 86.93% and rice 87.24%; Table 2). The second major class of transposable elements was DNA transposons (Arabidopsis 11.16% and rice 11.69%). LINE and non-LTR elements were also observed, though in much lower amounts.
BLASTN searches against the chloroplast and mitochondrial genomes of Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) using the 182 random BESs revealed no significant homology, thus indicating that the BAC library contained very low levels of chloroplast or mitochondrial DNA contamination. Visualization of the inserts from the random 576 BACs by pulsed-field gel electrophoresis also showed no patterns typical of organelle contamination (data not shown). Further repeat analysis of the 182 random BESs revealed 25 different microsatellite repeats (Table 3). The 25 simple sequence repeats (SSRs) were found in 17 different BAC clones (from the forward and reverse reads), thus suggesting that eight BACs represented genomic regions that contained high repeat content. Interestingly, one BES read, > camellia g1-12-3-07 D02 .g1, contained two different dimer SSRs and one pentamer SSR.
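A simple way to screen reads for SSRs such as the dimer and pentamer repeats mentioned above is a regular-expression scan for tandemly repeated short motifs. The sketch below is illustrative only: the text does not state which tool or thresholds were used, so the minimum repeat counts in `MIN_REPEATS` are assumptions, and the test sequence is invented.

```python
import re
from typing import List, Tuple

# Assumed minimum number of tandem copies per motif length (not taken from the source).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """False if the motif is itself a repeat of a shorter unit (e.g. 'ATAT' or 'AA')."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(seq: str) -> List[Tuple[int, str, int]]:
    """Return (start, motif, copy_number) for each perfect SSR found in seq."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                copies = len(match.group(0)) // unit_len
                found.append((match.start(), motif, copies))
    return sorted(found)


# Invented sequence containing one dimer SSR and one pentamer SSR.
TOY_READ = "GGC" + "AT" * 7 + "CCG" + "AAGTC" * 3 + "TT"
for start, motif, copies in find_ssrs(TOY_READ):
    print(start, motif, copies)
```

The `is_primitive` check keeps a dinucleotide run from being reported a second time as a tetranucleotide repeat and discards mononucleotide runs. Only perfect repeats are detected, so interrupted or compound SSRs would need a dedicated tool.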

Discussion
Generally, the world's agriculture and food systems derive from the tremendous biological diversity encompassed in over 200 independently (and perhaps convergently) domesticated species of Angiosperms [33]. Paradoxically, much of the genomic research on these taxa has centered on a few model species, or taxonomic families, that represent only a very small part of this diversity. In the past decade, prominent plant genome initiatives have resulted in genomic resources of enormous magnitude. These include molecular/genetic maps, transcript databases, large-insert libraries, five complete genome sequences (rice [14], Arabidopsis [34], poplar [35], grape [36], and maize [37]), and several others scheduled for completion. The ultimate goals of these projects are to continue to discover new ways to meet future world agricultural and food needs while simultaneously providing an understanding of functional systems biology. However, the expected dramatic advances in theoretical and applied plant biology for other taxa, following these innovations, have been critically slow due to two major obstacles: a lack of adequate genomic tools and resources to efficiently and effectively transfer existing genomic knowledge to other diverse, economically and agriculturally important species, and a lack of representation, in the sequenced model species, of novel genes and regulatory networks that underlie key traits of agriculture and ecology. Despite the significant economic impact of tea and other similar commodities (e.g., coffee, cocoa), a lack of adequate public genomic resources, especially access to large-insert libraries, has contributed to a lack of advanced genetic knowledge usable for modern breeding and improvement.
Recently published tea plant studies involving molecular tools have presented an AFLP and RAPD marker-based linkage map [38], identified cDNAs involved in secondary metabolism [39], described sequence analysis of 4320 tea ESTs [40], and used inter-simple sequence repeats to analyze genetic variability of somaclonal embryo-derived tea plants [41]. A recent review described the use of molecular resources for tea cultivar classification, the identification of parentage of Mulberry scale resistance, and more advanced linkage maps with SSR markers [42]. While these and other reports provide important insight into the tea plant, BAC resources have so far been unavailable and unused.
Here, we report the construction and public availability of a high-quality, deep-coverage, large-insert, bacterial artificial chromosome (BAC) library of cultivated tea plant (Camellia sinensis) variety Chin-shin oolong. This BAC library resource is publicly available in the form of whole libraries, filters, and individual clones, through the AGI BAC/EST Resource Center (http://www.genome.arizona.edu/orders), and we expect it to be extensively used worldwide for the analysis of genome evolution and organization, positional cloning, and eventual gap closure of a C. sinensis reference sequence.
Journal of Biomedicine and Biotechnology
This tea BAC library was made possible after significant methodological improvements were accomplished in the preparation of purified high molecular weight (HMW) DNA [31] and subsequent partial enzymatic digestions and ligations. We showed that, despite expected difficulties in obtaining HMW DNA of reliable quality from tissues known to contain compounds deleterious to established extraction techniques, our ligations produced a BAC library with low organellar (chloroplast and mitochondrial) contamination, yet with high transformation efficiencies and large insert sizes. Though the genome size of tea is quite large, 4 Gb, the described BAC library has an average insert size of 135 kb and provides over 13x genome coverage to allow for adequate utility. It was divided into three sublibraries: sublibrary 1 (Ligation B2) composed of 40,320 clones with average insert sizes of 140 kb, sublibrary 2 (Ligation B1) composed of 160,896 clones with average insert sizes of 140 kb, and sublibrary 3 (Ligation A2) composed of 200,064 clones with average insert sizes of 130 kb.
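One common way to judge whether a given coverage is adequate is the Clarke and Carbon estimate, P = 1 − (1 − L/G)^N, the probability that a particular single-copy sequence is present in a library of N clones with mean insert size L from a genome of size G. This formula is not part of the analysis reported here; the sketch below simply applies it to the library figures given in the text, under the usual assumption of random, unbiased cloning.

```python
import math


def probability_of_recovery(n_clones: int, insert_kb: float, genome_kb: float) -> float:
    """Clarke-Carbon probability that a given sequence is in at least one clone."""
    # log1p/expm1 keep precision when insert_kb / genome_kb is very small.
    return -math.expm1(n_clones * math.log1p(-insert_kb / genome_kb))


def clones_required(probability: float, insert_kb: float, genome_kb: float) -> int:
    """Number of clones needed to reach the requested recovery probability."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - probability) / math.log1p(-insert_kb / genome_kb))


GENOME_KB = 4_000_000   # 4 Gb genome size used in the text
INSERT_KB = 135         # average insert size of the library
N_CLONES = 401_280      # total clones in the library

p = probability_of_recovery(N_CLONES, INSERT_KB, GENOME_KB)
print(f"{p:.6f}")
print(clones_required(0.99, INSERT_KB, GENOME_KB))
```

Real libraries deviate from this idealized model because restriction sites and clonability are not uniformly distributed, so the result is best read as an upper bound on representation.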
Fresh tea leaves are rich in both volatile secondary compounds such as tea polyphenols and carbohydrate matrices such as tea polysaccharides. It has been shown that tea leaves contain 20-33% dry weight of polyphenols [43], and polysaccharides have been shown to be present at 7.02% dry weight in oolong tea [44]. These types of compounds interfere [32] with the handling of high molecular weight (HMW) DNA necessary for construction of large-insert genomic clone libraries. Tea polyphenols must be prevented from interacting with the nuclear DNA, and tea polysaccharides must be prevented from trapping nuclei in the process of tissue homogenization. In this study, to prepare high-quality HMW DNA, the frozen tissue was homogenized in double volumes of a buffer (prepared by combining ingredients from two methods, with the PVP-40 left partially undissolved), followed by additional filtration. Double volumes of buffer were found to dilute the polyphenols and polysaccharides, as evidenced by the absence of sticky nuclear pellets, thus allowing for increased removal during the centrifugation steps. Combining the nuclei with the low melting temperature agarose at a lower ratio of 1:3 (instead of 1:1) was required to concentrate the nuclei but was performed in the presence of the undissolved PVP-40. To lower the organellar DNA contamination and to further eliminate tea polysaccharides and tea polyphenols in the process of preparing tea plant HMW DNA plugs, additional washing steps were performed [45].
Removing small restriction fragments is vital for construction of a high-quality BAC library [46,47]. Tea plant genomic plugs prepared by the method used in this paper contained abundant HMW DNA and produced satisfactory restriction fragments when digested with HindIII. To avoid contamination with small, trapped DNA fragments and improve the size and uniformity of the inserts, we performed three separate size selections. Usually two size selections are sufficient, but visual examination of the DNA fragments after the second size selection revealed that an additional selection was required (data not shown). A detailed manuscript describing the experiments for the appropriate isolation of high-quality, high molecular weight DNA that led to the successful construction of this tea plant BAC library (tea plant nuclei isolation, buffer compositions, HMW DNA, large-insert ligations, etc.) was recently published [31], and this method should also yield tea DNA of sufficient quality for other purposes, such as next generation sequencing (NGS).
Pulsed-field gel electrophoresis analysis of 576 random BAC clone plasmids showed that the majority of C. sinensis cloned DNA inserts were present as single NotI fragments. This indicated that the C. sinensis genome apparently contains few NotI sites, a feature commonly observed in the genomes of other dicot plant species and contrary to the results obtained with monocot species [48].
Long terminal repeats (LTRs), a component of genome repeat analysis [49][50][51], are sequences of DNA that are repeated hundreds or thousands of times in the genome. They are often found in retrotransposons and flanking functional genes [51,52]. LTR retrotransposons constitute a significant portion of most eukaryote genomes and in plants have been suggested to be causative for the dramatic differences of genome sizes (and polyploidization) and for disruptions of genome organization and structure [2,51,53,54]. While insertions of DNA have been attributed to amplifications of retrotransposons [51,54-57], deletions have been suggested to involve all DNA sequence class types and may be the result of homologous recombination and/or illegitimate recombination [56]. Recent analysis of BES from different Oryza species [2] found good correlation between flow cytometric genome sizing and repeat content. Our preliminary analysis of tea repetitive sequences from pilot BAC-end sequences indicated that up to 98 percent of the tea genome could be repetitive. We found that LTR retrotransposons are the predominant class of repeat elements in C. sinensis, followed by DNA transposons. These results support the correlation of large genome size with proliferation of repeat content since the tea genome is quite large, 4,000 Mb, which is 1.6X maize and 0.8X barley, whose genomes are approximately 80-95 percent repetitive. Therefore, it is not surprising that the majority of tea BESs contained sequences highly enriched with transposable elements.

Conclusion
A high-quality, deep-coverage, HindIII BAC library of tea plant (Camellia sinensis) has been constructed. The library, named CSBCBa, is publicly available from the Arizona Genomics Institute Resource Center (http://www.genome.arizona.edu/orders/). The average insert size of the library was 135 kb, it contained very low organellar contamination, and it provides 13.54x genome equivalent coverage from a total of 401,280 clones. Analysis of BAC clone end sequences revealed that the repetitive fraction of the tea genome was highly enriched for LTR retrotransposons and DNA transposons. This public resource will provide a useful platform for genomics research, such as genome sequencing, DNA fingerprinting and physical mapping, gene identification, isolation and regulation, as well as complex analysis of targeted genomic regions. Research on the genomic information and gene expression patterns of the tea plant has advanced slowly, probably owing to the lack of an available high-quality BAC library as a key genomic tool.
Since this BAC library has been made available to the public, we expect that the most advanced work with DNA tools, which has been initiated by other researchers, will move forward and deepen genomics research on the tea plant. Specific genes of the tea plant, such as the genes involved in the oxidative pathways of catechins by polyphenol oxidase [20], which are involved in the pathways of the powerful antioxidant epigallocatechin-3-gallate (EGCG), may be revealed by use of this resource. Previous work on tea plant genetics, such as linkage mapping, genetic integrity of somaclonal variants [38,41,58,59], and tea plant functional genes and biopathways [39,40,60,61], may be integrated and/or expanded with this library to help unravel the tea plant genome.