Generation and Analysis of Expressed Sequence Tags from Olea europaea L.

Olive (Olea europaea L.) is an important source of edible oil that originated in the Near East. In this study, two cDNA libraries were constructed from young olive leaves and immature olive fruits to generate ESTs, discover novel genes, and investigate the function of unknown olive genes. A total of 3840 randomly selected colonies from the two libraries were sequenced for EST collection. The 2228 readable sequences from olive leaf and the 1506 from olive fruit were assembled into 205 and 69 contigs, respectively, whereas 2478 were singletons. Putative functions of all 2752 differentially expressed unique sequences were assigned by gene homology based on BLAST and annotated using BLAST2GO. While 1339 ESTs showed no homology to database entries, 2024 ESTs showed homology (under 80%) to hypothetical, putative, expressed, and unknown proteins in NCBI GenBank. A further 635 ESTs were identified by over 80% homology to genes of known function in other species that had not previously been described in the Olea family. Only 3.1% of the total ESTs showed similarity to olive sequences already present in NCBI. The EST data and consensus sequences generated here were submitted to NCBI as a valuable resource for functional genomics studies of olive.


Introduction
The Oleaceae family comprises 600 species in 24 genera and is distributed all around the world. The olive, Olea europaea L., one of the first domesticated agricultural tree crops in the family Oleaceae, is cultivated mainly for edible oil and table olives. The domestication of Olea europaea is thought to have taken place some 5700-5500 years ago in the Near East [1]. Anatolia is therefore one of the most important areas of olive origin, and over 86 varieties of O. europaea are present in Turkey (Anatolia). Olive is known to be native to coastal areas of the Mediterranean region such as Spain, Italy, Greece, France, Turkey, Algeria, and Morocco. It is the most extensively cultivated fruit crop, with orchards covering about 9.8 million ha worldwide. According to statistics published by FAO, Turkey is the fourth largest producer of olive oil in the world, after Spain, Italy, and Greece. Turkey is the leading producer of black table olives in the world, and cv. Gemlik represents 80% of black table olive production in Turkey. Because of the economic importance of Gemlik, many research centers in Turkey continue their molecular and classical breeding programs for this cultivar.
Most genetic studies in cultivated plants focus on understanding genetic mechanisms and improving product quality and quantity. With the improvement of DNA-sequencing technology, large-scale single-pass cDNA sequencing is commonly used to obtain large expressed sequence tag (EST) collections, which are generated from genes expressed at a particular stage and/or in a particular tissue of an organism. The sequenced cDNAs provide direct information on the mature transcripts of the coding part of the genome, so EST databases are very useful tools for gene and marker discovery, gene mapping, and functional studies.
After the completion of genome projects in different species, the number of ESTs has increased rapidly, and they have become available in databases for further applications. EST libraries for over 40 plant species are currently available, providing a valuable resource for functional genomics studies [2][3][4][5][6][7][8][9].
By using information from these EST databases, the possible functions of many genes can be deduced from homologies to known genes.
Although many molecular markers have been developed in olive [10][11][12][13][14][15][16][17][18][19], EST studies for olive remain insufficient. By the end of 2008, around one thousand ESTs had been generated to study olive fruit development and deposited in the NCBI database [20]. Before we submitted our olive EST collection, only around 1126 sequences were available in GenBank (February 2009). In this paper, we report a rich EST collection from two separate cDNA libraries constructed from freshly emerged leaves and immature fruits of the Turkish olive cultivar Gemlik. A total of 2304 clones were sequenced from the leaf cDNA library and 1536 clones from the fruit cDNA library. After removal of low-quality ESTs, the resulting 3734 high-quality olive ESTs were analyzed using Phred-Phrap and Contig Assembly Program 3 (CAP3) software and were submitted to GenBank (dbEST). Annotation was performed using BLAST and BLAST2GO.

Materials and Methods
The olive breeding line G 20/1 of O. europaea cv. Gemlik was used as plant material in this study. Plant material was supplied by the Ataturk Central Horticultural Research Institute (ACHRI).

Library Construction.
Total RNA was isolated from 10 g of freshly emerged leaves and immature olive fruits with the RNeasy Plant Miniprep kit (Qiagen) and pooled. mRNA was purified from total RNA using the Oligotex Spin-Column Protocol (Oligotex mRNA Mini Kit, Qiagen, Valencia, CA). The mRNAs were pooled and the final amount of mRNA was adjusted to 1-3 μg. Two separate cDNA libraries were established from 1.5 μg of leaf mRNA and 3 μg of immature olive fruit mRNA, respectively. The cDNA libraries were constructed with the CloneMiner cDNA Library Construction Kit according to the manufacturer's instructions (Invitrogen, Carlsbad, CA, USA). Double-stranded cDNA was cloned into the pDONR222 vector and transformed into E. coli strain DH5 (Invitrogen, Carlsbad, CA, USA). Each cDNA library was plated onto LB-kanamycin agar medium, and individual colonies were picked into 384-well plates containing SOB medium and grown overnight. After the addition of glycerol (10% v/v), the library was stored at −80 °C.

Plasmid DNA Purification and DNA Sequencing.
Plasmid DNA was isolated from sixty randomly selected clones by the alkaline lysis method [21,22]. Isolated DNA was digested with Bgl1701 and analyzed by 1% agarose gel electrophoresis to determine insert size. A total of 3840 randomly selected clones were used as templates for PCR amplification of the cloned cDNA with M13 universal primers. Sequencing was performed on an automated high-throughput pipeline using the ABI 3730 capillary sequencer (PE Applied Biosystems, Foster City, CA) at the Genome Sequencing Center, Washington University in St. Louis (WUSTL).
The leaf ESTs, the fruit ESTs, and the total EST set were each assembled separately into contigs using Contig Assembly Program 3 (CAP3) [25,26]. Default values were used for all parameters. The assembly result was also checked with the Consed/Autofinish software [27,28]. Plausible functions for the resulting contigs were assigned by gene homology based on BLAST. The biological meaning of the unique sequences was investigated according to gene ontology (GO) terms based on BLAST definitions using BLAST2GO, a comprehensive bioinformatics tool for functional annotation and analysis of gene or protein sequences [29,30].
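As an illustration of how the contig and singlet totals of an assembly can be tallied, the following minimal Python sketch counts FASTA records. The two record sets are hypothetical toy data standing in for the contig and singlet FASTA output of an assembler; the sequence names are invented, and this is not the pipeline used in the study. The number of unique sequences is simply contigs plus singlets.

```python
def count_fasta_records(lines):
    """Count sequences in FASTA text by counting header lines ('>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Hypothetical toy stand-ins for the contig and singlet FASTA files
# produced by an assembly; names and sequences are invented.
contigs_fasta = """>Contig1
ATGGCTTCAGGT
>Contig2
TTGACCGATAAC
""".splitlines()

singlets_fasta = """>EST_0007
GGCATTACGATC
>EST_0042
CCGATTAGCAAT
>EST_0101
ATTTGGCCAAGG
""".splitlines()

n_contigs = count_fasta_records(contigs_fasta)
n_singlets = count_fasta_records(singlets_fasta)
# Unique sequences are the contigs plus the unassembled singlets.
n_unique = n_contigs + n_singlets
print(n_contigs, n_singlets, n_unique)
```

Applied to real assembly output, the same function could be fed the lines of each file, for example via `open(path)`.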

Quality of cDNA Libraries and Clustering of ESTs.
Two separate cDNA libraries were constructed independently from RNA pools extracted from young leaves and from fruits. In the leaf cDNA library, which consisted of 2.4 × 10^6 clones, the insert size ranged from 200 to 2500 bp with an average insert length of 1.6 kb. In the immature olive fruit cDNA library, the average insert size was 1.1 kb (min 70 bp to max 1500 bp) and the library consisted of 2.2 × 10^5 clones. After construction of the cDNA libraries, 2304 clones were sequenced from the leaf library and 1536 clones from the fruit library, giving a total of 3840 EST sequences. Raw EST sequence data were processed and base called using Phred. The olive EST sequences were trimmed at both ends on the basis of trace quality to remove vector, adapter, and low-quality bases, using the default value of 0.05. After this process, 106 clones were removed, and the average length of the remaining 3734 ESTs was 874 bp. For contig assembly, the 2228 high-quality leaf ESTs and the 1506 high-quality fruit ESTs were analyzed with CAP3 both individually and as a combined set. The 2228 leaf ESTs were assembled into 205 contigs ranging in length from 514 to 1924 bases and containing 2 to 33 ESTs each; the 1506 fruit ESTs were assembled into 69 contigs ranging in length from 461 to 1909 bases and containing 2 to 385 ESTs each (Table 1). When the two libraries were assembled together, ESTs from genes expressed in both leaf and fruit formed new contigs, increasing the total contig number of the assembled libraries to 299. Likewise, some singlets from the leaf and fruit libraries joined new contigs in the combined assembly, decreasing the total singlet number of the joint library by 100 to 2368.
All 3734 EST sequences and 249 high-quality consensus sequences were submitted to GenBank (dbEST). The ESTs can be accessed through accession numbers GO242703-GO246436, and the olive consensus sequences through accession numbers EZ421546-EZ421794.
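To make the quality-trimming step more concrete, the sketch below shows one common way an error-probability cutoff such as 0.05 can be applied. It assumes a Mott-style rule in which each base scores the cutoff minus its error probability and the highest-scoring contiguous window is kept. The text states only that the default value of 0.05 was used, so the algorithm, the function names, and the example quality values are illustrative assumptions rather than a description of the exact procedure.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def trim_by_quality(quals, cutoff=0.05):
    """Return (start, end) of the retained window, end exclusive.

    Assumed Mott-style rule: each base scores cutoff - P(error), and the
    contiguous window with the largest summed score is kept. Returns
    (0, 0) when no base is good enough to keep.
    """
    best_sum = 0.0
    best = (0, 0)
    run_sum = 0.0
    run_start = 0
    for i, q in enumerate(quals):
        run_sum += cutoff - phred_to_error(q)
        if run_sum <= 0:
            # The running window is no longer worth keeping; restart after it.
            run_sum = 0.0
            run_start = i + 1
        elif run_sum > best_sum:
            best_sum = run_sum
            best = (run_start, i + 1)
    return best


# Hypothetical per-base qualities: poor ends around a good core.
example_quals = [5, 8, 30, 35, 40, 30, 6, 4]
start, end = trim_by_quality(example_quals)
print(start, end)
```

With a cutoff of 0.05, bases below roughly Q13 contribute negative scores, which is why the low-quality ends of a read are dropped.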
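The accession ranges can be checked against the reported sequence counts with simple arithmetic. The short Python sketch below is an illustration only; the helper name `accession_count` is hypothetical, and the code assumes that accessions in a range share a letter prefix and are numbered consecutively.

```python
import re


def accession_count(first, last):
    """Count accessions in an inclusive range that share one letter prefix."""
    pattern = re.compile(r"([A-Z]+)(\d+)")
    m_first = pattern.fullmatch(first)
    m_last = pattern.fullmatch(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same letter prefix")
    return int(m_last.group(2)) - int(m_first.group(2)) + 1


est_count = accession_count("GO242703", "GO246436")
consensus_count = accession_count("EZ421546", "EZ421794")
print(est_count, consensus_count)  # 3734 249
```

Both ranges agree with the 3734 ESTs and 249 consensus sequences reported as submitted.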

Identification of ESTs' Putative Function.
The 3734 ESTs were annotated using the database search algorithms BLASTN for nucleic acids and BLASTX for proteins on the National Center for Biotechnology Information (NCBI) web server.
Among the 3734 ESTs, 682 (18.2%) showed significant sequence similarity to putative genes registered in NCBI, with a score of ≥80 bits or an e value ≤10^−10, according to a BLASTN similarity search against the nucleotide collection database (last verified in July 2010). A further 1647 ESTs (44.1%) produced hits with weak similarity scores (40-79 bits); of these, 896 ESTs (23.9%) had a score between 60 and 79 bits and 751 ESTs (20.2%) had a score between 40 and 59 bits. The remaining 1405 ESTs (37.7%) either gave hits with very low similarity scores (0-39 bits) or gave no hits at all because they have no similarity to existing sequences in the databases, and were therefore classified in the "No hit" category. Some of the low-scoring hits in the weak similarity category could arguably also be considered no hits, but since the algorithms returned hits for them, we kept them in the weak similarity category. BLASTN analysis of our ESTs against the Olea sequences in the NCBI nucleotide collection database showed that only 116 ESTs had similarities, and 38% of these (45 ESTs) had 80% or higher homology (score ≥80 bits). Thus 96.9% of the ESTs generated in this study differ from the olive sequences already present in NCBI. In a BLASTN analysis against the EST database, only 81 ESTs had similarities to Olea ESTs in NCBI, and 29% of these had 80% or higher homology (score ≥80 bits).
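The score-based categories described above can be expressed as a small classification rule. The Python sketch below is a simplified illustration that bins best hits by bit score only (the e-value criterion is not modelled); the function names and example scores are hypothetical, while the category counts are those reported in the text.

```python
from collections import Counter


def classify_hit(bit_score):
    """Assign a BLASTN best hit to a similarity category by bit score.

    bit_score is None when the search returned no hit at all.
    """
    if bit_score is None or bit_score < 40:
        return "no hit"
    if bit_score < 60:
        return "weak (40-59 bits)"
    if bit_score < 80:
        return "weak (60-79 bits)"
    return "significant (>=80 bits)"


def tally(best_scores):
    """Count how many best hits fall into each category."""
    return Counter(classify_hit(score) for score in best_scores)


# Hypothetical best-hit bit scores for seven ESTs.
example_scores = [152.0, 80.0, 79.9, 61.3, 40.0, 39.5, None]
counts = tally(example_scores)

# Category counts as reported in the text; they should cover every EST.
reported = {
    "significant (>=80 bits)": 682,
    "weak (60-79 bits)": 896,
    "weak (40-59 bits)": 751,
    "no hit": 1405,
}
total_reported = sum(reported.values())
print(counts, total_reported)
```

The reported category counts sum to 3734, so every high-quality EST falls into exactly one category.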
According to the BLASTN results, 13 different total-contig sequences have similarities with Olea europaea EST sequences in GenBank (Table 2). These include "mannitol dehydrogenase1", an oxidoreductase acting on the CH-OH group of donors with NAD+ or NADP+ as acceptor; "photosystem II 10 kDa polypeptide mRNA", a polypeptide involved in photosystem II; "glycolate oxidase-like FMN-binding domain protein mRNA"; "plant lipid transfer protein mRNA", responsible for shuttling phospholipids and other fatty acid groups between cell membranes and also able to bind acyl groups; "ribulose-1,5-bisphosphate carboxylase/oxygenase activase mRNA", the activase of the enzyme most commonly known as RuBisCO, which catalyzes the first major step of carbon fixation in the Calvin cycle, the process by which the atoms of atmospheric carbon dioxide are made available to organisms in the form of energy-rich molecules such as sucrose; "beta-glucosidase (bglc) mRNA", an enzyme that acts on β1→4 bonds linking two glucose or glucose-substituted molecules; "tonoplast intrinsic protein (tip) mRNA", a vacuolar membrane protein in plants; "polyubiquitin OUB2 mRNA", involved in transmitting signals and in binding a large family of proteins; and "glyoxisomal malate dehydrogenase mRNA", a protein involved in gluconeogenesis, the synthesis of glucose from smaller molecules, as well as some sequences previously identified in olive.
In addition to the BLAST results, gene ontology (GO) annotations of the leaf, fruit, and total contig sequences of Olea europaea L. cv. Gemlik were obtained using Blast2GO. The software performed a BLASTX similarity search against the GenBank nonredundant protein database, retrieved GO terms for the top 20 BLAST results, and annotated the sequences based on default criteria [29,30]. GO terms were distributed among the biological process, molecular function, and cellular component categories, as follows.
The biological process category refers to a biological objective to which a gene contributes, but it does not identify pathways. Like the molecular function results, the biological process results were obtained with the BLAST2GO program. Results were similar for all three contig groups. In particular, "carboxylic acid metabolic process", "biosynthetic process", "response to stress", "transport", "biopolymer metabolic process", and "nucleobase, nucleoside, nucleotide and nucleic acid metabolic process" were common to all three. There were nevertheless many GO terms that differed between groups. For instance, the fruit-contig GO terms "phosphorus metabolic process", "biological regulation", "cellular carbohydrate metabolic process", "cellular protein metabolic process", and "response to inorganic substance" were not seen in leaf contigs, whereas GO terms such as "response to chemical stimulus", "response to endogenous stimulus", "cellular lipid metabolic process", "glycolysis", "proteolysis", and "protein-chromophore linkage" were not seen in fruit contigs. All the observed differences and similarities between contig groups are summarized earlier in the paper. As shown in Figure 1, the most frequently observed biological process GO terms were transport, response to chemical stimulus, and response to stress for leaf contigs; translation, electron transport, and glycolysis for total contigs; and cellular protein metabolic process, carboxylic acid metabolic process, and response to stress for fruit contigs. The appearance of different GO terms in the total contigs reflects the fact that distinct sequences from the leaf and fruit contigs form new consensus sequences when assembled together.
The final GO term category identifies the locations in the cell where the gene products are found. The Olea europaea gene products were generally associated with cellular components in the intracellular space or in organelles such as the mitochondrion, cytoskeleton, vacuolar membrane, peroxisome, and ribosome. Although the most represented cellular component GO terms across all contigs were "integral to membrane" and "mitochondrion", photosystem II was, as expected, also among the most frequently observed GO terms for the leaf.
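The comparison of GO terms between contig groups amounts to set operations. The Python sketch below illustrates this using only the biological process terms named in the paragraph above; it is not the full annotation, and the variable names are hypothetical.

```python
# Terms named in the text as common to the contig groups.
common_terms = {
    "carboxylic acid metabolic process",
    "biosynthetic process",
    "response to stress",
    "transport",
    "biopolymer metabolic process",
    "nucleobase, nucleoside, nucleotide and nucleic acid metabolic process",
}

# Terms named in the text as seen in fruit contigs but not leaf contigs.
fruit_only_named = {
    "phosphorus metabolic process",
    "biological regulation",
    "cellular carbohydrate metabolic process",
    "cellular protein metabolic process",
    "response to inorganic substance",
}

# Terms named in the text as not seen in fruit contigs.
leaf_only_named = {
    "response to chemical stimulus",
    "response to endogenous stimulus",
    "cellular lipid metabolic process",
    "glycolysis",
    "proteolysis",
    "protein-chromophore linkage",
}

# Illustrative term sets per tissue, built only from the named terms.
leaf_terms = common_terms | leaf_only_named
fruit_terms = common_terms | fruit_only_named

shared = leaf_terms & fruit_terms   # terms present in both tissues
leaf_only = leaf_terms - fruit_terms
fruit_only = fruit_terms - leaf_terms
print(len(shared), len(leaf_only), len(fruit_only))  # 6 6 5
```

With complete Blast2GO output, the same intersection and difference operations would give the full lists of shared and tissue-restricted terms.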

Discussion
ESTs provide valuable information about gene expression patterns at a given stage of an organism. They have been used for gene discovery [31,32], tissue- or stage-specific gene expression [33], and alternative splicing [34]. In this project, we aimed to obtain more information about the olive genome by producing a large EST collection for Olea europaea L., for which only a limited number of ESTs are available in databases. To create a larger and richer collection, we constructed two different cDNA libraries, from leaves and from fruits, to increase our chance of capturing different genes.
Among the assembled leaf contigs, some specific putative genes were observed, such as asparagine synthetase (AS), germacrene D synthase, desacetoxyvindoline 4-hydroxylase-like (D4H), plastid transketolase 1, ABC transporter family protein, glutamate synthase 1, chloroplast ferredoxin I, glyceraldehyde-3-phosphate dehydrogenase, chlorophyll a/b-binding protein, malate dehydrogenase, alcohol dehydrogenase, and mannitol dehydrogenase 1. Similarly, the assembled fruit contigs contained putative genes different from those in leaves, such as SDH2-1, UDP-glucuronate decarboxylase 3, cytoplasmic ribosomal protein, aspartic protease, S-RNase-binding protein, chloroplast oxygen-evolving protein, elongation factor 1 alpha subunit, myb-related transcription factor, Tic20-like protein, and Ca2+ antiporter/cation exchanger. Since less than 10% of olive genes were tagged in each tissue in this study, GO terms occurring in one tissue but not in the other could be due to the limited representativeness of the ESTs obtained or to sampling variation, and may not indicate tissue-specific genes.
This is the largest EST collection constructed to date for Olea europaea L. cv. Gemlik. The number of Olea europaea ESTs in NCBI is 4860 (last verified in May 2010), and 3734 of these were generated in this study. This project has dramatically increased the number of olive ESTs in the NCBI GenBank database, which is a very useful resource for scientists working on the olive genome or on comparative genomics. In future research, more ESTs should be generated and annotated in order to increase the number of identified expressed olive genes available for functional analysis.