Global Transcriptional Analysis of Olfactory Genes in the Head of Pine Shoot Beetle, Tomicus yunnanensis

The most important proteins involved in olfaction include odorant binding protein (OBP), chemosensory protein (CSP), olfactory receptor (OR), and gustatory receptor (GR). Despite that the exhaustive genomic analysis has revealed a large number of olfactory genes in a number of model insects, it is still poorly understood for most nonmodel species. This is mostly due to the reason that the small antenna is challenging for collection. We can generally isolate one or few genes at a time by means of the traditional method. Here, we present the large-scale identifying members of the main olfactory genes from the head of Tomicus yunnanensis using Illumina sequencing. In a single run, we obtained over 51.8 million raw reads. These reads were assembled into 57,142 unigenes. Nearly 29,384 of them were functionally annotated in the NCBI nonredundant database. By depth analysis of the data, 11 OBPs, 8 CSPs, 18 ORs, and 8 GRs were retrieved. Sequences encoding full length proteins were further characterised for one OBP and two CSPs. The obtained olfactory genes provide a major resource in further unraveling the molecular mechanisms of T. yunnanensis chemoperception. This study indicates that the next generation sequencing is an attractive approach for efficient identification of olfactory genes from insects, for which the genome sequence is unavailable.


Introduction
Olfaction plays a role in almost every aspect of insect life, which represents one of the key interfaces between insects and the environment. It is used by insects to recognize a huge variety of airborne molecules for providing them with information about food, predators, and potential mates [1]. Olfaction is a complex process requiring the interaction of numerous proteins to generate a neuronal signal. The first step in the recognition of chemical signals is the odorants that enter the insects' antennae and other sensory organs via pores and travel across the hydrophobic space to the chemosensory receptors, where the response of the insect to the signal is initiated [2]. Chemoreceptor of insects is mainly formed by the olfactory receptors (ORs) and gustatory receptors (GRs), which are located in the dendritic membrane of neurons [3,4]. As common odorants are hydrophobic molecules [5], it is difficult for them passing the aqueous barrier of the sensillar lymph surrounding the dendrites of neuronal cells [6]. During the passing process, odorants are thought to be translocated from the air to the chemoreceptors by a variety of protein mainly existing in the sensillar lymph, including odorant-binding proteins (OBPs) and chemosensory proteins (CSPs) [7][8][9]. Studying the genes that code for olfactory proteins could provide valuable insight into the molecular mechanisms of olfactory function.
Since the first discovery of insect OBP in the male antennae of the giant moth Antheraea polyphemus [7], a large number of olfactory genes have now been identified in numerous insect species through molecular cloning, cDNA library sequencing, and genome-wide analyses [10][11][12]. With regard to molecular cloning, olfactory genes are assigned to divergent gene families with relatively low sequence identity (overall 20% protein sequence identity), which results in difficulty in designing appropriate primers for successful cloning [13,14]. To some extent, cDNA library sequencing is also unable for large scale identifying olfactory genes. This is partly due to the fact that it is difficult to collect enough sensory tissues that olfactory genes restrictedly expressed for constructing the tissue targeted cDNA library because of the tiny size of sensory tissues. In addition, sequencing of limited number of randomly selected cDNA clones often have insufficient coverage of less abundant transcripts [15]. Among the available insect olfactory genes, the majority of them have been obtained by means of bioinformatic approaches based on the characteristic features of the protein families from the completed genome sequences [9,16].
Over the past several years, the next generation sequencing technology has emerged as a cutting edge approach for high-throughput sequence determination, promptly improved the efficiency and speed of gene discover, and provided fascinating opportunities in the life sciences with dramatically reduced time, labor, and cost in nonmodel organisms [17][18][19]. We here took advantage of this technology to present the global transcriptome characterization of the head of Tomicus yunnanensis, a serious pest of Pinus yunnanensis occurred in southwestern China [20], with focus on comprehensively revealing the olfactory genes using Illumina sequencing. In a single run, we identified 51,822,228 raw reads, which were assembled into 57,142 unigenes. From the transcriptome database, many genes encoding major olfactory proteins were mined. These data provide an invaluable resource as a first step for grounding future studies investigating chemosensory processes. Furthermore, this study presents the simple method to efficiently uncover the divergent olfactory gene family from tiny insects.

Insect.
Adult beetles identified based on morphological characters [21] as T. yunnanensis by their trunk attacking phase on P. yunnanensis were collected from Qujing city, Yunnan province, China.

RNA Extraction and Sequencing.
Head of both female and male adults were cut off and transferred to an Eppendorf tube. After grinding them under liquid nitrogen, total RNA was extracted by using TRIzol Reagent (Invitrogen) according to the manufacturer's protocol. Total RNA quantity and quality were assessed using the Agilent 2100 Bioanalyzer (Agilent technologies) with a minimum RNA integrated number value of 8. The samples for transcriptome analysis were prepared using Illumina's kit following manufacturer's recommendations. Briefly, mRNA was selected using oligo(dT) probes and then fragmented using divalent cations. cDNA was synthesized using random primers, modified and enriched for attachment to the Illumina flowcell, and then sequenced on the Illumina GA II platform. Raw data have been deposited in the NCBI Short Read Archive (SRA).

Bioinformatics.
Prior to assembly, adaptor sequences, empty reads, and low quality sequences (reads with unknown sequences "N") were filtered from the short raw reads. De novo assembly of the short reads was performed using SOAPdenovo [22] at default parameters. The generated unigenes were analyzed by searching the nonredundant (NR)  [24] was used to conduct multiple sequence alignments. The presence and location of signal peptide cleavage sites in amino acid sequences was predicted by SignalP 4.0 server [25]. The software package MEGA5 [26] was used for phylogenetic analysis. Bootstrap analysis was performed using Neighbor-Joining and the Poisson correction model with 1000 replicates. Positions containing alignment gaps and missing data were eliminated with pairwise deletion.

PCR.
To validate and extend fragments of several candidate gene sequences to encode full proteins, 5 gene-specific primers (3 -TTCTCAATCACACACTTCAAG-AAAC-5 and 3 -TTGGTAGAATATTGGTCATCAGGT-C-5 ) were designed based on the sequences of Uni-gene41600 YNNT and Unigene26497 YNNT. They were used to clone the 5 ends in conjunction with adapter primer provided in the SMART RACE cDNA amplification kit (Clontech). Two specific primer pairs (3 -ATGAAA-ACATTCGTGCTTGTTGCTT-5 and 3 -TTACAATTT-TAAACCTTCTTTTTCG-5 ) were designed according to the sequence of Unigene55391 YNNT. They were employed for amplifying the open reading frame of this putative gene. cDNA template was synthesized with a SMART RACE cDNA amplification kit (Clontech) using the total RNA extracted above. PCR was performed using Advantage 2 Polymerase Mix (Clontech) according to recommended conditions and following the manufacturer's instructions.

Overview of the Head Transcriptome.
After cleaning and quality checks, we generated over 4,600 million bases sequence information. The total number of reads was 51,822,228 with an average length of 90 bp ( Table 1). The GC percentage of the reads is 42.53%. The number is comparable with other insect and eukaryote sequencing projects that have GC content between 38.7% and 56.5% [27]. Using SOAPdenovo, these reads were assembled into 42,678 contigs, 140,311 scaffolds, and 57,142 unigenes. The average length of contigs, scaffolds and unigenes were 113 bp, 226 bp, and 355 bp, respectively. The longest unigene had 4,762 bases. The lengths of the 5,136 unigenes were ≥700 bp ( Figure 1).
After assembly, all unigenes were aligned to NCBI NR databases with a cut-off E-value of 10 −5 for annotation. 29,384 unigenes had at least one significant alignment to existing gene model in Blastx searches. The remaining 27,758 unigenes with no matches with any known sequences and likely represent novel. This might be due to the relatively short length of distinct gene sequences [17] and lack of genetic information in Tomicus. In addition, this might be resulted partly by the transcripts derived from the cDNA of untranslated regions, chimerical sequences (assemblage errors) and nonconserved areas of proteins where homology is not detected [28]. The species distribution of the best match result for each sequence is shown in (Figure 2). The T. yunnanensis sequences produced 20,896 hits to Tribolium castaneum, 1,396 hits to Apis mellifera, and 1,363 hits to Drosophila. Overall, a strong preference for matches is against T. castaneum genes, composing of 71.11%. It might be due to the relatively near evolutionary relationship of these two species, belonging to the same order Coleoptera, and the available complete genome sequence of T. castaneum.
For functional comparisons, all unigenes were assigned for GO terms based on BLAST matches with sequences whose function is previously known. A total of 54,427 unigenes were able to map to GO terms. These transcripts were assigned for biological process (25,348 sequences), cellular component (17,537 sequences), and molecular function (11,542 sequences) ( Figure 3). Cellular process (19.58%) and metabolic process (15.77%) were the main subcategories of biological process, indicating the important metabolic activities in T. yunnanensis head. Under the category of cellular component, cell (31.16%), cell part (31.16%), and organelle (17.23%) were among the most highly represented subcategories. The molecular function category was mainly comprised of proteins involved in binding (46.44%) and catalytic activities (36.54%).

Discovery of Genes Encoding Olfactory Proteins.
Of particular interest to detect the sequences encoding olfactory proteins, the head transcriptome of T. yunnanensis was analyzed more in detail. Unigenes with at least one match to olfactory proteins with an E-value of 10 −5 or lower were  selected. As shown in Table 2, we identified many sequences with homology to OBP, CSP, OR, and GR. Overall, 45 potential olfactory genes out of 57,142 unigenes were identified. The proportion is similar to what was recorded in antennal transcriptome of Manduca sexta and in the combined EST and transcriptome data of Solenopsis invicta [19,29]. Among these unigenes, 11 encoded for OBP, 8 for CSP, 18 for OR, and 8 for GR. The data in this study presents some information on the molecular basis of olfaction of a Coleoptera species besides T. castaneum. Analysis of fully sequenced genome has identified 49 OBPs, 20 CSPs, 299 ORs, and 220 GRs in T. castaneum [1,30]. Compared with the gene number of olfactory genes reported from this genome, the current number of T. yunnanensis is at the lower end of the range of T. castaneum. Additional olfactory genes may await discovery due to their absence from the current transcriptomic dataset. The remainders not obtained from this dataset might be due to the following reasons: (1) only a single run carried out in this study, which results in the dataset that do not cover all genes expressed in the head of T. yunnanensis, and (2) the tissues excluded the head that some olfactory genes abundantly located are not included as the target tissue. Increasing the number of olfactory genes can discovery from transcriptome of the samples that are greatly restricted to the main olfactory organs. In addition, as the ability of Illumina sequencing is for short reads (<120 bp) [31], only one unigene is appeared to complete in all the identified putative olfactory transcripts. More candidates encoding  full sequence could be determined by 454 pyrosequencing, which can generate longer sequence reads (∼450 bases) than those obtained by Illumina sequencing [32,33]. Using 454 sequencing, the majority of olfactory genes identified from the transcriptomic database of M. sexta antennae and S. invicta encoded the full length proteins [19,29]. However, the latter technology needs much more cost.

Validation of Several Detected Olfactory Genes.
Based on the fragments derived from the transcriptome database,  primers were designed to clone the full coding sequences for one OBP (Unigene41600 YNNT) (designated as TyunOBP1) and one CSP (Unigene26497 YNNT) (designated as TyunCSP1) by RACE-PCR, and assure the acute of one CSP sequence (Unigene55391 YNNT) (designated as TyunCSP2) that appeared to be complete after raw reads assembly by RT-PCR. We have successfully performed the amplification.
TyunOBP1 contained a 384 bp open reading frame (ORF) for a polypeptide of 128 amino acids. It was with a molecular mass of 14,15 Da and an isoelectric point (pI) of 5.15. The deduced protein sequence was in accordance with other OBPs, generally composed of subunits of about 14 kDa as small hydrophilic proteins with acidic isoelectric points [34]. Insect OBPs have also been classified into long chain (∼ 160 aa), medium chain (∼120 aa), and short chain (∼110 aa) classes, which relates to their potential structure and function [35]. According to this hypothesis, TyunOBP1 belongs to the medium chain class. A signal peptide is a common characteristic of OBPs [36]. Similar to that of other insects, the initial 25 amino acids of TyunOBP1 were predicted as a signal peptide. In addition, OBPs are characterized by conserved cysteines that are believed to form three disulfide bridges, and the hydrophobic domains [37]. Multiple amino acid sequence alignment revealed that the six cysteine residues are highly conserved in TyunOBP1 (Figure 4). Moreover, it indicated that TyunOBP1 displayed low similarity to that of other insects, which are typically around 20% or below, and the lengths of the N-and C-termini of OBPs are highly divergent. This highly diverse insect gene family is divided into nine major subclasses: Classic, Minus-C, Plus C, Dimer, PBP/GOBP, ABPI, ABPII, CRLBP, and D7 [16,38]. Among the currently available genomes, the dipterans consistently have large expansions of these genes [39], and the body louse, Pediculus humanus, appears to possess the smallest set of OBPs [40]. A phylogenetic tree was constructed using the amino acid sequences of TyunOBP1 and T. castaneum OBP (TcasOBP) ( Figure 5). According to the phylogenetic tree and the features of OBP subclasses, TyunOBP1 is near to TcasOBP42, belonging to the Classic clad.
The ORF for TyunCSP1 and TyunCSP2 were 390 bp. Both of them consisted of 130 amino acids. TyunCSP1 was with a calculated molecular weight of 15.11 kDa and pI of 5.10, whereas TyunCSP2 was with a molecular mass of 15.17 Da and a pI of 8.02. All CSPs appear to have a hydrophobic N-terminus of about 16-20 amino acids that are predicted to encode a signal peptide [41]. The deduced amino acid sequences for both proteins possessed a putative signal sequence of 18 amino acids. A typical feature for CSP is the remarkably conserved 4-cysteine motif that forms disulphide bridges in folded proteins [42]. A comparison between the sequences of TyunCSP1, TyunCSP2, and those of other insects clearly showed that a conserved four-cysteine signature is present at the characteristic positions within the CSP family ( Figure 6). CSPs are more conserved than OBPs across evolution, with about a 50% identity even between members of phylogenetically distant species [43]. These two TyunCSPs shared 54% identity, but were with 26-58% identity to CSPs from other species used in the multiple alignments. To characterize the molecular evolution relationships between the CSPs of T. yunnanensis and those of others insects in Coleoptera, we constructed a phylogenetic tree including 2 CSPs of T. yunnanensis identified in this study, and 20 CSPs of T. castaneum (Figure 7). The tree showed that five proposed groups of CSPs were resolved .  TcasOBP22  TcasOBP23  TcasOBP33  TcasOBP24  TcasOBP14  TcasOBP15  TcasOBP13  TcasOBP12  TcasOBP2  TcasOBP3   TcasOBP10  TcasOBP6  TcasOBP5  TcasOBP7  TcasOBP4  TcasOBP9  TcasOBP43  TcasOBP34  TcasOBP44  TcasOBP1  TcasOBP8  TcasOBP18  TcasOBP16  TcasOBP17  TcasOBP19  TcasOBP20  TcasOBP21  TcasOBP35  TcasOBP36  TcasOBP32  TcasOBP26  TcasOBP27  TcasOBP31  TcasOBP28  TcasOBP25  TcasOBP30  TcasOBP29  TcasOBP48  TcasOBP45  TcasOBP42  TyunOBP1  TcasOBP38  TcasOBP40  TcasOBP41  TcasOBP39  TcasOBP37  TcasOBP46  TcasOBP47   TcasOBP11   37   3

Conclusion
Historically, identification of a comprehensive list of candidate olfaction genes for understanding the molecular mechanisms of olfaction has been mostly limited by the difficulty in collecting antenna because it is small and encased in a hard cuticle, and successful cloning as the olfactory gene families are divergent [44]. Next generation short-read DNA sequencing has made it possible to explore genomelevel questions in nonmodel organisms, regardless of their phylogenetic proximity to model species [45,46]. This study produced a large amount of transcriptome data from T. yunnanensis head by Illumina sequencing. An extensive list of candidate genes involved in olfactory signal transduction has been generated from this valuable data platform. This is an important resource for further study of the chemical communication between T. yunnanensis and the environment. This study demonstrated that the Illumina sequencing technology could be applied as a particularly rapid, cost-effective, and fruitful approach to accelerate our understanding of the molecular basis of chemosensory pathways in nonmodel insects that have no previous genome data.