Analysis of Polygala tenuifolia Transcriptome and Description of Secondary Metabolite Biosynthetic Pathways by Illumina Sequencing

Radix polygalae, the dried roots of Polygala tenuifolia and P. sibirica, is one of the most well-known traditional Chinese medicinal plants. Radix polygalae contains various saponins, xanthones, and oligosaccharide esters and these compounds are responsible for several pharmacological properties. To provide basic breeding information, enhance molecular biological analysis, and determine secondary metabolite biosynthetic pathways of P. tenuifolia, we applied Illumina sequencing technology and de novo assembly. We also applied this technique to gain an overview of P. tenuifolia transcriptome from samples with different years. Using Illumina sequencing, approximately 67.2% of unique sequences were annotated by basic local alignment search tool similarity searches against public sequence databases. We classified the annotated unigenes by using Nr, Nt, GO, COG, and KEGG databases compared with NCBI. We also obtained many candidates CYP450s and UGTs by the analysis of genes in the secondary metabolite biosynthetic pathways, including putative terpenoid backbone and phenylpropanoid biosynthesis pathway. With this transcriptome sequencing, future genetic and genomics studies related to the molecular mechanisms associated with the chemical composition of P. tenuifolia may be improved. Genes involved in the enrichment of secondary metabolite biosynthesis-related pathways could enhance the potential applications of P. tenuifolia in pharmaceutical industries.


Introduction
As a part of commonly known traditional Chinese medicinal plants, Radix polygalae (RP) can be obtained from Polygala tenuifolia Willd. and P. sibirica L., which are listed in the Chinese Pharmacopoeia (2010); RP elicits sedative, antipsychotic, expectorant, and anti-inflammatory effects. High consumption and decreased regeneration of wild resources of RP have prompted researchers to domesticate and cultivate it and a large-scale planting has been performed since early 1980s. Thus far, RP from P. tenuifolia has been the most commonly used variety, so P. tenuifolia was used in this study. P. tenuifolia is a perennial herb mainly distributed in northeast, north, and northwest of China. This herb contains various medicinal components, such as saponins, xanthones, and oligosaccharide esters, which exhibit tonic, sedative, antipsychotic, expectorant, and other pharmacological effects [1]. In previous studies, chemical and pharmacological properties of triterpenoid saponins, one of the most important medicinal components of P. tenuifolia, were investigated [2]. However, studies on the biosynthetic pathway of triterpenoid saponins in P. tenuifolia have been rarely conducted compared with those in licorice, ginseng, notoginseng, quinquefolius, and salvia [3][4][5][6][7]. For instance, fourteen nucleotide sequences have been deposited in the NCBI GenBank database until August 2015; among these sequences, six are related to the biosynthetic pathway of triterpenoid saponins in P. tenuifolia. The lack of sequence data has limited extensive and intensive 2 International Journal of Genomics studies on P. tenuifolia; nevertheless, genomic research related to this medicinal plant is feasible. With the medicinal and economic importance of P. tenuifolia, genomic data sources of this plant species should be investigated to discover genes and develop further functional studies.
DNA sequencing technology was developed by Frederick Sanger in 1977. The "first-generation sequencing" was laborious and costly. After years of improvement, next-generation sequencing (NGS) has been developed in terms of speed and accuracy with reduced cost and manpower. Typical examples of NGS include SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche [8]. Among these techniques, Illumina HiSeq 2000 yields the largest output and entails the lowest reagent cost; this technique has been commonly used in deep sequencing of model and nonmodel organisms [9]. For example, Illumina HiSeq 2000 has been successfully applied in a wide range of species, including model plants (Arabidopsis, rice, tomato) [10][11][12], and crops with large genomes, including field pea, castor, and Sorghum propinquum [13][14][15]. This technique could promote studies on not only the genomics of these plants but also the biosynthetic pathways of the main active ingredients.
Triterpenoid saponins, xanthones, and oligosaccharide esters are currently considered as the main active ingredients of P. tenuifolia [16]. However, small amounts of secondary metabolites, such as isoquinoline alkaloids, lignins, and flavonols, are present in P. tenuifolia. Indeed, Illumina sequencing should be applied to determine the number of enzymes implicated in the biosynthetic pathway of secondary metabolites in P. tenuifolia. To date, the biosynthetic pathway of triterpenoid saponins, particularly pentacyclic triterpenoid saponins in P. tenuifolia, has been extensively studied [2]. The formation of a primary cucurbitane skeleton via the isoprenoid pathway is before the cyclization of 2,3-oxidosqualene of terpenoid biosynthesis in plant. Mevalonate acid (MVA) pathway located in cytoplasm and endoplasmic reticulum, and 2-C-methyl-d-erythritol-4phosphate/1-deoxy-d-xylulose-5-phosphate (MEP/DOXP) pathway located in plasmid, which are likely to participate in the formation of a presenegenin skeleton in P. tenuifolia. These two pathways are present in the different cellular parts of green plants [17,18]. In addition, phenylpropanoid biosynthetic pathways in P. tenuifolia are yet to be reported. Phenylalanine is an end product of the shikimate pathway, in which the aromatic amino acids tyrosine and tryptophan are also produced [19]. With phenylalanine as a precursor, lignin biosynthesis proceeds via a series of side-chain modifications, ring hydroxylations, and O-methylations to yield lignin monomers [20]. In this pathway, other small molecules, such as flavonoids, coumarins, hydroxycinnamic acid conjugates, and lignans, are also produced. Many of these molecules could be or have been the main focus of studies [21]. These two metabolic pathways are mainly found in P. tenuifolia. Therefore, whole transcripts for complete gene expression profiling should be identified by transcriptome sequencing because of limited genomic sources and information regarding the biosynthetic pathways of secondary metabolites in P. tenuifolia. It will support further studies on these two metabolic pathways by using high-throughput Illumina deep-sequencing techniques.
In this study, a cDNA library was constructed to obtain detailed and general data from P. tenuifolia by using a highthroughput Illumina deep-sequencing technique. We could determine genes encoding the enzymes involved in whole biosynthetic pathways of triterpenoid saponins, phenylpropanoids, and other secondary metabolites, and results could promote further analysis. Furthermore, our results could provide direct experimental data of P. tenuifolia not only for guiding genetic breeding but also for subsequent triterpenoid biosynthesis cloning and functional verification of key genes. Therefore, this transcriptome sequencing may help improve future genetics and genomics studies on molecular mechanisms associated with the chemical compositions of P. tenuifolia. Eventually, we can control the quality of P. tenuifolia from germplasm sources and thus provide important theoretical and practical significance.

Plant Materials.
Fresh roots of one-, two-, and threeyear-old P. tenuifolia were sampled and collected in Xinjiang County, Shanxi Province, China. Another batch of two-yearold fresh roots of P. tenuifolia was collected in Anguo City, Hebei Province. All of the samples were intact and healthy. The samples were snap-frozen in liquid nitrogen and stored at −80 ∘ C before use.

RNA Extraction, cDNA Library Construction, and Illumina Sequencing.
According to the manufacturer's protocol, a plant RNA isolation kit (AutoLab) was used to extract the total RNA from fresh tissues and then purified by an RNeasy MiniElute cleanup kit (Qiagen). In brief, mRNA was purified with the combination of oligo-dT-containing beads and poly-A-containing mRNA and then used methods of chemicals and high temperature to fragmentate the mRNA and get fragments of 150 bp. The next step was to synthesize the double-stranded cDNA and repair the DNA fragments with protruding terminal through the function of 3 -5 exonuclease and polymerase. In order to guarantee DNA fragments and adapters can combine with "A" and "T" complementary pair connection and prevent the inserted DNA fragments connecting with each other, single-base "A" was imported to the 3 end of the blunt DNA and single-base "T" was imported to the 3 end of the adapters. Adapters of the tag-containing and the DNA fragments were incubated and connected by the function of ligase. The free adapters and nonadapters DNA fragments were purified by using AMPureXP beads. DNA fragments with adapters were enriched selectively by PCR, and the PCR amplification with fewer number of cycles was operated in order to avoid errors in the library.
The methods of quantitative of library were Pico green and fluorescence spectrophotometer (Quantifluor-ST fluorometer, Promega, E6090; Quant-iT PicoGreen dsDNA Assay Kit, Invitrogen, P7589), and the quality control of the enrichment of the PCR fragments and validation of the size and distribution of DNA fragments in the library

De Novo Assembly and Functional Annotation.
Sequencing of the original data sequence through quality analysis after removal of low quality and subsequences gets the sequences available for subsequent analysis. A Perl program was written to remove reads with adapters, the reads with the average quality of less than Q20 for its 3 end to the 5 end, the reads with the final length of less sequence than 50 bp, and the reads with the uncertainty bases. High-quality reads were used for de novo assembly by the Trinity software to construct unique consensus sequences. Trimmed solexa transcriptome reads were mapped onto unique consensus sequences by using Bowtie [22] (Bowtie parameters: -v3 -allbest -strata). The functional annotation of unigenes was compared with National Center for Biotechnology Information (NCBI), nonredundant protein database (Nr), nonredundant nucleotide databases (Nt), UniProt/Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), Clusters of Orthologous Groups (COG), Gene Ontology (GO), and Interpro databases with Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nlm.nih.gov/). Unigenes were identified by comparing sequence similarity against SWISS-PROT (SWISS-PROT downloaded from European Bioinformatics Institute; ftp://ftp.ebi.ac.uk/pub/databases/swissprot/), COG [23,24], and KEGG database [25] with BLAST at values ≤ 1 −10. The KO information retrieved from blast results by a Perl script and the pathways between databases and unigenes were established. InterProScan [26] was used to annotate the InterPro domains. Then, the functional assignments of InterPro domains were mapped onto GO and the GO classification and tree were performed by WEGO [27].

Sequence Assembly.
The RNA extracted from all the samples was mixed for Illmina sequencing in order to get the whole range of transcript diversity. In total, 58.88 million raw reads and 11.77 gigabase pairs were sequenced with an average GC content of 43.9%, ≥ 20, and no ambiguous "N" (Table S1, in Supplementary Material available online at http://dx.doi.org/10.1155/2015/782635). A total of 55,432,632 high-quality reads were assembled, and 316,703 contigs and 145,857 transcripts with the N50 values were 1,636 bp. All the transcripts were subjected to cluster and assembly analyses, yielding 39,625 unigenes with a mean length of 1,378 bp and the N50 values were 1,971 bp ( Figure 1). The assembly statistics of contigs, transcripts, and unigenes were shown in Table S2.

Functional Annotation and Classification.
In the present library, approximately 67.2% of the unigenes were annotated by BLASTX and BLASTN search with a threshold of 10 −5 against seven public databases, including Nr, Nt, UniProt/ Swiss-Prot, KEGG, COG, GO, and Interpro databases ( Table 1).

GO Annotation.
A total of 41,832 unigenes for GO annotation were divided into three major categories, including cellular location, molecular function, and biological process. Among 34 functional groups with GO assignments, the dominant biochemistry, cell apoptosis, and metabolism were annotated (Figure 2).  In cellular location, eight classifications were clustered by the matched unique sequences and the larger subcategories were "cell" and "cell part." In terms of molecular function, 11 classifications were categorized, including the represented cellular components, such as "binding" and "catalytic activity." According to biological processes, 16 classifications were divided, including the represented biological processes, such as "cellular process" and "metabolic process." However, few genes were assigned to the categories "nutrient reservoir activity" and "viral reproduction," and no genes were found in the clusters of "developmental process."

COG Annotation.
Unigenes for the COG classifications were annotated in order to further evaluate the effectiveness of the annotation process ( Figure 3). Class definition, number, and percentage of this class were shown in Figure 3. The proteins in the COG categories were assumed to have the same ancestor protein, and these proteins were also assumed to be paralogs or orthologs. The largest category was general functional prediction only with 24.84%; the second categories were posttranslational modification, protein turnover, and chaperon with 8.97%; and the third category included translation, ribosomal structure, and biogenesis with 7.24%. Cell motility (24, 0.09%) and nuclear structure (10, 0.04%) represented the smallest category.

KEGG Annotation.
The KEGG database can be used to categorize gene functions with an emphasis on biochemical pathways. The KEGG database can also be used to systematically analyze inner cell metabolic pathways and functions of gene products. Pathway-based analyses help further determine the biological functions of genes. To gain further insights into the biological pathways in P. tenuifolia, we performed a BLASTX search against the KEGG protein database on the assembled unigenes. A total of 72,636 (62.9%) unigenes had blast hits and 42,702 unigenes were assigned to five KEGG biochemical pathways ( Table 2) including metabolism (17,217 unigenes), genetic information processing (14,593 unigenes), environmental information (2719 unigenes), cellular processes (3919 unigenes), and human diseases (4254 unigenes). The pathways with the highest unigene representation were spliceosome (ko03041, 1431 unigenes), chromosome (ko03036, 1232 unigenes), and ubiquitin system (ko04121, 1060 unigenes).
As a Chinese traditional medicinal plant, P. tenuifolia undergoes highly active metabolism. The higher expression of metabolism results in primary metabolite biosynthesis, such as "carbohydrate metabolism" (3702 unigenes), "amino acid metabolism" (2168 unigenes), and "energy metabolism" (1762 unigenes). However, the most active ingredients of P. tenuifolia were secondary metabolites. Therefore, we listed the classifications of terpenoids, polyketides, and other secondary metabolites (Figure 4) with high expression levels.

(1) Characterization and Expression Analysis of the Genes Involved in the Biosynthetic Pathway of Putative Terpenoid Backbone.
Dimethylallyl pyrophosphate (DMAPP) and isopentenyl pyrophosphate (IPP) are recognized as a precursor of the biosynthesis of terpenoid saponins in green plants and the compounds of "activity of isoprene" in vivo. DMAPP and IPP are alletic isomers, which are synthesized by MVA and MEP biosynthesis pathways mentioned above [28]. The MVA pathway starts with acetyl-CoA which is synthesized through the triterpene derivatives captured CO 2 in cytoplasm and the MEP pathway starts with pyruvate and glyceraldehydes-3phosphate in plastid [29]. In these databases, the biosynthetic pathways of MVA and MEP were identified ( Figure 5(a)). In most cases, two or more unique sequences were labeled as the same enzyme and these unique sequences could represent single gene or different fragments of gene.
The process of MVA pathway was as follows. Firstly, two molecules of acetyl-CoA were catalyzed into acetoacetyl CoA by AACT (EC 2.3.1.9, 13 unigenes) and then catalyzed into 3-hydroxy-3-methylglutaryl-CoA ( [30]. We have found that two types of OSC genes of P. tenuifolia are present in the dataset; these types are cycloartenol synthase (16 unigenes) and beta-amyrin synthase (54 unigenes), respectively. Moreover, except for the FPPS, another five nucleotide sequences (SQS, SE, CAS (2), cycloartenol synthase, and beta-amyrin) related to the triterpenoid saponins biosynthesis and reported in NCBI were found in our sequencing data.     (Table 3). A/T (1539) was the most abundant repeat type, followed by AT/GA (207) and GAA/TCA (69). However, the number of tetranucleotide, pentanucleotide, and hexanucleotide motifs was <2% in the unigene sequences. In this transcriptome database, SSRs did not all exist in the unigenes and the complex SSR motifs were not detected.

Discussions
The advantages of high-throughput, accuracy, and reproducibility lead transcriptome sequencing to become a powerful technology [31]. Illmina sequencing 2000 has been widely HMGS (18) HMGR (18) MVK (2) PMK (4) PMD (3) DXS (17) DXR (7) MCT (7) CMK (4) MDS (1) HDS (9) HDR ( (4) HCT (4) CCoAoMT (13) CCR ( Triterpenoid saponins, especially pentacyclic triterpenoid saponins, are one of the main active ingredients of P. tenuifolia, whose biosynthetic pathways are becoming more and more important. The upstream and midstream biosynthetic pathways of triterpenoid saponins have been studied very clearly in the past few years. Many researchers now mainly focus on the downstream biosynthetic pathways, which is also the difficult part of the studies [17]. In the downstream of biosynthesis of natural plant products, hydroxylation of CYP450s and glycosylation of UGTs play important roles in stabilizing the product and altering triterpenoid saponin bioactivity of dammarane-type and oleanane-type aglycones [32]. In this study, 466 and 157 unigenes are annotated as CYP450s and UGTs in the transcriptome of P. tenuifolia (data not shown), respectively. However, because of the quantity and complexity of CYP450s and UGTs, the enzymes of the downstream biosynthetic pathway of P. tenuifolia are still unknown, which can determine the specific steps in the accumulation of triterpenoid saponins.
CYP450s is a family of enzymes involved in the biosynthesis of lignins, terpenoids, sterols, fatty acids, hormones, pigments, and defense-related phytoalexins [33]. Only two CYP450s involved in triterpene saponin biosynthesis are functionally characterized until now, including CYP88D6 from Glycyrrhiza uralensis belonging to the CYP85 family [34] and CYP93E1 from Glycine max belonging to the CYP71 family [35]. Recently, several CYP450s and UGTs were identified as candidate genes in many studies of medicinal and nonmedicinal plants. One CYP450 (contig00248) was selected as a candidate gene, which is most likely to be involved in ginsenoside biosynthesis in Panax quinquefolius by MeJA-inducible and tissue-specific expression patterns [6]. Moreover, Pn02132 and Pn00158, closely related to CYP88D6 and CYP93E1, were selected as candidate CYP450s involved in triterpene saponin biosynthesis in P. notoginseng [5]. Furthermore, ten unigene sequences were identified corresponding to the seven different genes with a high homology to CYP79s, and four unigenes were identified corresponding to the two CYP83 genes in core structure biosynthesis of the glucosinolate metabolism of radish [36]. Our previous study once suggested that a combination of ultraperformance liquid chromatography coupled with electrospray ionization quadrupole time-of-flight mass spectrometry based on metabolomics and gene expression analysis can effectively elucidate the mechanism of biosynthesis of triterpenoid saponin, and three genes of CYP450s (CYP88D6, CYP716B1, and CYP72A1) putatively expressed in biosynthesis pathway of triterpenoid saponin in P. tenuifolia were identified by quantitative real-time PCR (qRT-PCR) analysis [2]. The examples mentioned above clarified that most identified CYP450s belonging to the CYP71 and CYP 88 families, and few studies were documented about the other CYP450 families until now.
UGTs are another large multigene family in plants and play an important role in the last step of biosynthesis of triterpenoid saponin. Glycosylation, the transfer of activated saccharides to an aglycone substrate, is the predominant modification in triterpenoid saponins biosynthesis. Regarding CYP450s genes, some UGT genes have also been identified. The genes UGT74M1 and UGT73K1 from Saponaria vaccaria [37] and UGT71G1 from Medicago truncatula [38] have been previously identified. Flavonoids related to UGTs (UGT78D1, UGT78D2, UGT73C6, and UGT89C1) were also identified in previous studies [39,40]. Furthermore, four UGTs (con-tig01001, contig14976, contig15451, and contig16321) were selected as candidate genes in P. quinquefolius, which are most likely to be involved in ginsenoside biosynthesis by MeJA-inducible and tissue-specific expression patterns [6]. Pn13895, closely related to UGT71G1, was regarded as a lead candidate UGT, which is responsible for triterpene biosynthesis in P. notoginseng [5]. Six putative flavonol glycosides were identified in Isatis indigotica [41]. Thirteen unigenes were identified as UGT74s, including UGT74B1, UGT74C1, UGT74F1, and UGT74F2, in the core structure biosynthesis of the glucosinolate metabolism of radish [36]. In our previous study mentioned above [2], three genes of UGTs (UGT74B1, UGT73B2, and UGT73C6) putatively expressed in triterpenoid saponin biosynthetic pathways were also identified by qRT-PCR in P. tenuifolia. The examples mentioned above clarified that most identified UGTs belonged to the UGT74 and UGT73 families, and few studies have been documented about the other UGT families until now.
In other words, the genes identified in CYP450s and UGTs add little information to the knowledge of downstream triterpenoid saponin biosynthesis and thus we will continue with our in-depth studies. The large number of CYP450 and UGTs candidates could provide not only a potential gene pool for the identification of special CYP450s and UGTs involved in triterpenoid biosynthesis in P. tenuifolia but also a convenient method to characterize the roles in triterpenoid biosynthesis in future studies.

Conclusion
RP, one of the predominant Chinese medical plants, contains various medicinal ingredients with the elicits tonic, sedative, antipsychotic, expectorant, and other pharmacological effects. In this study, a large-scale unigene investigation of P. tenuifolia was performed by Illumina sequencing. Our results showed that many transcripts encoded by putative genes are involved in the biosynthesis of triterpene saponins and phenylpropanoid. The data we obtained presented the most abundant genomic resource and provided comprehensive information on gene discovery, transcriptome profiling, transcriptional regulation, and molecular markers of P. tenuifolia. This study will improve the production of active compounds through marker-assisted breeding or genetic engineering for P. tenuifolia, as well as other medicinal plants in the Polygalaceae family. However, many candidate CYP450s and UGTs are likely involved downstream of the biosynthetic pathway of triterpene saponins, but these candidates should be further investigated.