Comprehensive Transcriptome Study to Develop Molecular Resources of the Copepod Calanus sinicus for Their Potential Ecological Applications

Calanus sinicus Brodsky (Copepoda, Crustacea) is a dominant zooplanktonic species widely distributed in the margin seas of the Northwest Pacific Ocean. In this study, we utilized an RNA-Seq-based approach to develop molecular resources for C. sinicus. Adult samples were sequenced using the Illumina HiSeq 2000 platform. The sequencing data generated 69,751 contigs from 58.9 million filtered reads. The assembled contigs had an average length of 928.8 bp. Gene annotation allowed the identification of 43,417 unigene hits against the NCBI database. Gene ontology (GO) and KEGG pathway mapping analysis revealed various functional genes related to diverse biological functions and processes. Transcripts potentially involved in stress response and lipid metabolism were identified among these genes. Furthermore, 4,871 microsatellites and 110,137 single nucleotide polymorphisms (SNPs) were identified in the C. sinicus transcriptome sequences. SNP validation by the melting temperature (T m)-shift method suggested that 16 primer pairs amplified target products and showed biallelic polymorphism among 30 individuals. The present work demonstrates the power of Illumina-based RNA-Seq for the rapid development of molecular resources in nonmodel species. The validated SNP set from our study is currently being utilized in an ongoing ecological analysis to support a future study of C. sinicus population genetics.


Introduction
Copepods are an extremely ancient and diverse arthropod group. They have evolved into more species than any other multicellular animal group [1,2]. To date, more than 12,000 copepod species have been recognized all over the world [1] and over 700 species are identified in the China Seas. They pervade various aquatic habitats and show local adaptation to rapid environmental changes. Copepod adults are very small in size, which is mainly limited to 1-4 mm. However, they show great diversity in their morphology, physiology, and life strategies, which makes them very suitable for studying a variety of fundamental biological processes. Although copepods are critical species for the world's aquatic ecosystems, the available genomic resources are still limited, and sequencing efforts have been carried out for a small number of well-studied species [2]. Of these, the parasitic copepods sea lice have received great attention because of their adverse effect on the global salmon aquaculture industry [3]. The developed expressed sequence tags (ESTs) enabled studies to investigate host-parasite interactions at the molecular level and provided promising targets for vaccine development [4]. Moreover, a large number of ESTs are available for Calanus finmarchicus [5][6][7], a key planktonic species from the North Atlantic Ocean. This genomic information supported the development of a cDNA microarray, which was utilized to investigate the physiological responses to environmental variations [8]. Several functional genes with important physiological roles were identified by mining EST and 454 pyrosequencing data in C. finmarchicus [9][10][11]. Just recently, Ning et al. [12] performed the first large-scale transcriptome sequencing for Calanus sinicus to identify putative transcripts involved in growth, lipid metabolism, molting, and diapause process. Although more than 50,000 high-quality ESTs were obtained, more transcriptome data are needed to present a full view of this transcriptome organization and provide complete gene sets to facilitate future genomic and genetic studies in this species.
Calanus sinicus Brodsky is a planktonic copepod widely distributed in the shelf ecosystem of East Asia [13] and it dominates the mesozooplankton in the shelf water of Bohai Sea, Yellow Sea, East China Sea, and Inland Sea of Japan. Its adults, larvae, and eggs are the main food source for many commercially important fish, such as sardine and anchovy. Therefore, it is recognized as a key secondary producer that links phytoplankton and higher trophic level organisms and its population dynamics greatly impact the entire marine ecosystem. The spatial distribution of C. sinicus has changed in the continental shelf waters of China Seas as a result of global climate change [14,15], raising concerns about future climate-driven shifts in the geographical distribution of C. sinicus. These shifts in the biogeography of C. sinicus call for a better understanding of organism-environment interactions. However, the way that this organism responds physiologically to environmental variations is not well known, as well as the adaptive capacity of this species to elevated temperature and ocean acidification induced by climate change.
With the rapidly declining cost of next generation sequencing, RNA sequencing (RNA-Seq) approaches have been more widely applied to population genetic and molecular ecology studies of nonmodel species [16,17]. In this study, we describe the utilization of RNA-Seq to capture a significant portion of the C. sinicus transcriptome (expressed portion of the genome), stress-related expression signatures, and thousands of potential molecular marker loci in an Illumina HiSeq next generation sequencing run. Our results significantly deepen the pool of molecular resources available for this taxon and serve as a guide for similar studies in related copepod taxa.

RNA Extraction, Library Construction, and Illumina
Sequencing. Total RNA was extracted from a pool of about 50 individuals using RNeasy Mini Kit (Qiagen, Germany) following the manufacturer's instruction. The concentration of total RNA was determined by NanoDrop (Thermo Scientific, USA) and the RNA integrity value was checked with a RNA 6000 Pico LabChip on an Agilent 2100 Bioanalyzer (Agilent, USA). High-quality RNA was then provided for RNA-Seq library construction and Illumina sequencing. A cDNA library was constructed with ∼5 g initial DNasetreated total RNA following the protocols of the TruSeq RNA sample preparation kit (Illumina). After poly(A) mRNA was enriched by beads with Oligo (dT), a fragmentation buffer was added for shearing mRNA to short fragments (200-700 bp). Taking these short fragments as templates, a random hexamer-primer was used to synthesize the firststrand cDNA, and then the second-strand was amplified. The double-stranded cDNA was purified with the Qiagen PCR extraction kit, and the short fragments were connected with sequencing adaptors. The library was PCR amplified and the final library had the yields of ∼30 L of 19.8-21.4 ng/ L with an average length of ∼270 bp. After KAPA quantitation and dilution, the library was sequenced on an Illumina HiSeq 2000 instrument with 100 bp paired-end (PE) read chemistry by Majorbio Biotechnology Corporation (Shanghai, China).

De Novo Assembly and Transcriptome Analysis.
Transcriptome raw sequences were subjected to a series of assembly and annotation programs ( Figure 1). De novo assembly of short reads originating from Illumina sequencing was performed using Trinity [18]. Before assembly, raw reads were trimmed by stripping the adaptor sequences and ambiguous nucleotides using SeqPrep (https://github.com/jstjohn/SeqPrep) and Sickle (https://github.com/najoshi/sickle). Reads with quality scores less than 20 and lengths below 20 bp were removed. The resulting cleaned and filtered high-quality sequences were used in the subsequent assembly with the default settings including a fixed k-mer size of 25 as suggested.
2.4. Gene Annotation, Ontology, and Pathway Analysis. The resulting transcripts from Trinity assembly were used as queries for ORF prediction. A set of utilities included in the Trinity software were employed to extract the likely coding regions from Trinity transcripts. Gene annotation was then performed on the protein sequences with predicted ORF and the nucleotide sequences without predicted ORF, respectively. Sequence homology searches of the protein sequences with predicted ORF were performed using BLASTP program against sequences in NCBI nonredundant (nr) protein database, STRING database (http://string-db.org/), and KEGG GENES database (http://www.genome.jp/kegg/genes.html), while the contigs without predicted ORF were used as queries for BLASTX searches. The cutoff Expect value (E-value) was set at 1 − 5 and only the top hit result against known sequences was assigned as the annotation. Gene ontology (GO) analysis for biological process, molecular process, and cellular component was processed with Blast2GO 2.5.0 [19], which is an automated tool for the assignment of gene ontology   terms. The final annotation file was generated after gene-ID mapping, GO term assignment, annotation augmentation, and generic GO-slim processes. All the annotated contigs were categorized with regard to biological process, cellular component, and molecular function at level 2. They were used to determine the GO term, COG term, and further KEGG pathway analysis.

Identification of Microsatellites and Single Nucleotide
Polymorphisms (SNPs). The microsatellite mining was performed using the program Msatcommander [20]. The search criteria were set based on the number of repeat motifs: 7 for dinucleotides, 5 for trinucleotides, tetranucleotides, and pentanucleotides, and 4 for hexanucleotides. We implemented the SNP discovery process using Samtools (http://samtools.sourceforge.net/). Briefly, the Trinityassembled transcripts were used as reference sequences. SNPs were determined as superimposed nucleotide peaks where two or more reads contained polymorphisms at the variant allele with the default parameter. With the aim of avoiding false positive SNPs due to sequencing errors (which may therefore be monomorphic loci), only both variants with a minimum variant count of 2 high-quality (HQ) bases and a minimum site depth of 8 (HQ bases) were considered as putative SNPs.

Validation of SNP Markers with Melting Temperature-( -) Shift Method.
The -shift method [21] was used to genotype SNPs. For each SNP locus, the primer set included one common reverse primer (CR) and two forward allelespecific primers (AS1 and AS2), with the 3 terminal base of each specific primer matching one of the SNP alleles. The common primer was typically placed no more than 20 bp downstream from the SNP for favoring allele discrimination. GC tails of different lengths, 14 bases for one primer and 6 for the second, were added to each of the two allele-specific primers to discriminate melting curve of amplification products. As a rule, the long tail was attached to the allele-specific primer with the higher base (G or C) at its 3 end, and the short tail was attached to the other allele-specific primer with lower base (A or T). For SNP polymorphism analysis, 30 wild individuals of C. sinicus were collected from the Northern Yellow Sea. Total DNA was extracted from each single individual using genomic DNA isolation kit (Foregene, China). Allele-specific PCR was carried out in a final volume of 25 L containing 10 ng DNA, 1 × PCR SYBR Premix Ex Taq buffer (Takara), and 0.2 M each of the 3 primers. The PCR program was as follows: initial denaturation at 95 ∘ C for 30 s, followed by 40 cycles of 3-step amplification profile of 5 s at 95 ∘ C for denaturation, 30 s at 60 ∘ C for annealing, and 20 s at 72 ∘ C for extension. Melting curves were obtained using ABI 7500 realtime thermal cycler with the default "dissociation step" to measure the fluorescence intensity of the PCR product in a linear denaturation ramp from 65 ∘ C to 95 ∘ C. POPGENE 32 [22] was used to calculate observed and expected heterozygosities (H o and H e ).

Sequencing of Short Expressed Reads from Calanus sinicus Transcriptome.
Illumina-based RNA-Seq was conducted, generating a total of 58.9 million 100 bp paired-end (PE) reads. After trimming of low-quality reads (quality scores <20) and short read sequences (less than 20 bp), a total of 57.7 million high-quality sequences (98.0%) were obtained (Table 1). These sequences were selected for further analysis. All of the sequences with raw read data were deposited at the NCBI sequence read archive (SRA) database (SRP032493).

De Novo Assembly.
Assembly of the 57.7 million cleaned short reads using Trinity resulted in approximately 69,751 contigs with N50 of 1,127 bp (Table 1)

Gene Annotation.
All the assembled Trinity contigs were used as queries in BLASTP and BLASTX searches. A total of 58,885 assembled contigs had significant (E-value ≤ 1 − 5) hits against the nr protein database, representing 43,417 unique proteins. After the initial BLAST searches, BLAST2GO analysis was conducted to categorize the known genes into the level 2 functional groups ( Figure 3). A total of 60 GO terms were assigned to 13,639 unigenes, including 23 (38.3%) biological process terms, 19 (31.7%) cellular component terms, and 18 (30.0%) molecular function terms. For biological process, the genes involved in the cellular and metabolic processes were highly represented. For the cellular component, the cell was the most represented GO term, followed by the cell part. Molecular function mainly included binding and catalytic activity. GO annotation identified 1,934 unigenes involved in response to stimulus (GO: 0050896) ( Table 2; see Table S1 in Supplementary Material available online at http://dx.doi.org/10.1155/2014/493825). This category may be of interest to ecotoxicology researchers, since the responses to environmental stress can be used as biomarkers to evaluate the biological effects of different types of pollutants in aquatic animals.
KEGG pathway analysis was also carried out in addition to GO analysis for the annotated unigenes, which is an alternative approach to categorizing gene functions with an emphasis on biochemical pathways. Enzyme commission (EC) numbers were assigned to 14,553 unigenes involved in 324 different pathways. Summary of the sequences of these pathways is shown in Table 3. Among the 14,533 genes with KEGG annotation, 45.5% were classified into the genetic information processing (GIP) group with most of them involved in replication and repair, folding, sorting and degradation, transcription, and translation. Sequences classified into the metabolism accounted for 42.8% of the KEGG annotated sequences. The well-represented metabolic pathways were enzyme families, carbohydrate metabolism, amino acid metabolism, and energy metabolism. Cellular processes were represented by 18.3% of the KEGG annotated sequences. Cell motility, cell growth and death, immune system, and endocrine system were well represented. Furthermore, 15.2%  of the sequences were classified into environmental information processing (EIP) including signal transduction, signaling and interaction molecules, and membrane transport. Lipids play an important role in the lifecycle of the copepod; therefore, the genes involved in lipid metabolism were also identified ( Table 2, Table S1). Similarly, COG-annotated putative proteins were classified functionally into at least 25 molecular families, such as the cellular structure, biochemistry metabolism, molecular processing, signal transduction, gene expression, and immune defense. All of these families correspond to the categories observed in GO analysis (Figure 4).

Microsatellite and SNP Identification.
A total of 4,871 microsatellites were identified. Most microsatellites were trinucleotide (92.4%) and dinucleotide (4.8%) repeats (Table 4) Tr an sp or te r ac tiv ity C he m oa tt ra ct an t ac tiv ity C at al yt ic ac tiv ity Tr an sl at io n re gu la to r ac tiv ity R ec ep to r ac tiv ity El ec tr on ca rr ie r ac tiv ity R ec ep to r re gu la to r ac tiv ity Pr ot ei n bi nd in g tr an sc ri pt io n fa ct or ac tiv ity C ha nn el re gu la to r ac tiv ity Pr ot ei n ta g M et al lo ch ap er on e ac tiv ity En zy m e re gu la to r ac tiv ity M ol ec ul ar tr an sd uc er ac tiv ity St ru ct ur al m ol ec ul e ac tiv ity N uc le ic ac id bi nd in g tr an sc ri pt io n fa ct or ac tiv ity M or ph og en ac tiv ity A nt io xi da nt ac tiv ity  common and C/G (7.3%) being the least common. The frequency of SNPs in the transcriptome was 3.01 per 1 kb. Most (83,270, 75.6%) SNPs occurred at the third position in a codon and were often referred to as synonymous SNPs, which do not alter the translated amino acid residue.
According to the frequency of mutation and the conservation of flanking sequences, 51 putative SNPs were selected for validation with -shift primers. Of the 51 primer pairs, 6 (11.8%) did not amplify any product and 29 (56.9%) failed because of amplification failure of one allele-specific primer. Sixteen (31.4%) were successful and showed biallelic   (Table 5). In Figure 6, we show an example of -shift genotyping assay for the locus CsSNP02 based on allele-specific PCR.

Discussion
Genomic tools play an important role in advancing our knowledge of biology at all levels, from genes to ecosystems. Generally, ecological studies often focus on nonmodel species, which lack genomic information. The development of next-generation sequencing techniques provides an exciting opportunity to explore the physiological ecology of organisms of interest. The pelagic copepod C. sinicus is a key zooplankton species in the shelf ecosystem of Northwest Pacific Ocean. Previous biological and ecological studies suggested a considerable diversity of physiological responses of C. sinicus to different environment conditions as well as distributional variations associated with monsoon, coastal currents, and temperature [23][24][25]. There are significant gaps in our understanding of the physiological mechanisms driving this diversity thus far. In ecological studies, a transcription-level assessment of physiological state can contribute important information about individuals in a population. Molecular methods allow researchers to identify gene expression levels involved in any physiological responses and measure sublethal effects on the gene level. A wide array of biochemical, cellular, and wholeorganism markers have been applied to evaluate the biological effects of different types of pollutants in aquatic animals and assess the status of marine ecosystems. The identification of several stress and immune-related genes is of great interest to ecologists due to their potential as biomarkers for environmental contamination. In the C. sinicus transcriptome developed in this study, we identified several types of heat shock protein (HSP) genes, including HSP90, HSP70, HSP60, HSP40, and HSP10. HSP family plays an important role in thermal tolerance, which is necessary for protein folding and regulation of the heat shock response [26,27]. The characterization of these genes provides an opportunity to understand the molecular signals involved in the thermal tolerance of planktonic copepods and will help understand the effects of global climate change on marine species with an extensive geographical distribution range. Another important gene family identified in this study was cytochrome P450 (CYP). CYPs are one of the major phase I-type classes of detoxification enzymes, which can catalyze the oxidation of a wide variety of exogenous compounds or xenobiotics [28]. We also identified a large number of glutathione S-transferases (GSTs), which are a superfamily of multifunctional phase II enzymes primarily involved in the detoxification of endogenous electrophiles. Superoxide dismutases (SODs) are an ubiquitous family of enzymes that  function to efficiently catalyze the dismutation of superoxide anions [29]. Cu/Zn SOD and Mn-SOD have been characterized in the copepod Tigriopus japonicus, and expression level could be inducible by heavy metals and B[ ]P, which indicated their potential as biomarkers for the risk assessment of these environmental pollutants [30]. Further studies in this direction can help understand the changes in the expression of CYPs, GSTs, and SODs under toxic stressors and explore the relation between gene expression and oxidative activity. True diapause is an important life strategy shared by many copepods. To survive long periods of low food availability, copepods undergo an ontogenetic vertical deep migration to delay their development to adulthood [31]. Diapause is a unique physiological process characterized by persistently reduced metabolism, increased stress resistance, and arrested development at a specific life stage [32]. Prior to entering diapause, lipids are sequestered in the form of wax esters in an oil sac [33] and constitute an important energy source. In the transcriptome of C. sinicus, several genes involved in fatty acid metabolism were identified by KEGG analysis. They are essential for lipid synthesis, transport, and storage, which are key components of preparation for diapause. Tarrant et al. [34] detected more highly expressed elongation of very long-chain fatty acids (ELOV) and fatty acid binding protein 8 BioMed Research International   GC tails of different lengths were added to allele-specific primers. Samples homozygous for allele A or T will be amplified with the short GC-tailed primer and show lower temperature peak. Samples homozygous for allele G or C will be amplified with the long GC-tailed primer and show higher temperature peak. Samples heterozygous will show both temperature peaks.
(FABP) genes in the bodies of active C. finmarchicus than those of diapausing individuals. ELOV is a member of a family of enzymes involved in the regulation of fatty acid elongation in both animals and plants [35]. Since some form of elongase is necessary for the synthesis of storage lipids, ELOV enzymes probably function in the synthesis of wax esters before entering diapause in C. sinicus. FABP belongs to a family of carrier proteins for fatty acids and other lipophilic substances such as retinoids [36]. These proteins are involved in the transfer of fatty acids between extra-and intracellular membranes. In C. sinicus, FABPs may function in facilitating the transport of wax esters to oil sac and the transport of lipophilic hormones, such as retinoids. Future studies are needed to identify the full complement of lipid metabolism genes in C. sinicus and particularly the mechanisms that regulate diapause. Transcriptome sequencing provides an important resource for rapid and cost-effective development of molecular markers. The application of molecular markers will aid in clarifying the population genetic diversity, evaluating the genetic differentiation among geographical populations, and elucidating the impact of environmental elements on the genetic structure and geographical differentiation in marine ecology studies [37]. Pelagic marine organisms are expected to have great potential for gene flow owing to lack of physical barriers for genetic exchange in the "open" oceans. The planktonic copepod C. sinicus is the main contributor to zooplankton biomass in the shelf ecosystem of the Northwest Pacific Ocean and has a wide range of distribution, large population size, and prolific fecundity. It shows great geographical diversity in many biological and ecological phenotypes, such as the number of generations, timing of reproduction, vertical distribution, seasonal patterns of abundance, and other life history traits [38][39][40]. Recently, our marine biodiversity monitoring program revealed that the temporal and spatial distributions of C. sinicus varied obviously in the China Seas, which may be a consequence of global climate warming since this species is a warm-temperate one. Insights into the dispersal capabilities gained from population genetic studies will be crucial in predicting the response of C. sinicus populations to future climate change.
Although high-throughput SNP genotyping systems have become available with the development of large-scale sequencing technology [41,42], these systems remain cost intensive. In this study, we found that -shift analysis is an efficient, cost-effective, and reliable method for SNP validation, especially for projects focused on a limited number of loci. The -shift method involves a single allele-specific PCR reaction followed by melting curve analysis. Wang et al. [21] indicated that up to 10,000 samples can be genotyped per day using a single 384-well real-time thermal cycler at a high accuracy (>99.9%). In the present study, 16 primer pairs could amplify target products and showed biallelic polymorphisms out of 51 primer pairs tested. The amplification failed in most cases (56.9%, 29/51) due to their monomorphisms, suggesting that the next generation sequencing (NGS) data could suffer from high SNP detection error rates by base-calling and alignment errors [43]. The uneven height of melting peaks in some primer pairs could make genotype scoring difficult when one primer amplified substantially more efficiently than the other. This issue was resolved by adding the more efficient primer at half of its original concentration (0.1 M). This is not always necessary as the genotypes can be identified even under the original conditions.