Transcriptome Sequencing of Gynostemma pentaphyllum to Identify Genes and Enzymes Involved in Triterpenoid Biosynthesis

G. pentaphyllum (Gynostemma pentaphyllum), a creeping herbaceous perennial with many important medicinal properties, is widely distributed in Asia. Gypenosides (triterpenoid saponins), the main effective components of G. pentaphyllum, are well studied. FPS (farnesyl pyrophosphate synthase), SS (squalene synthase), and SE (squalene epoxidase) are the main enzymes involved in the synthesis of triterpenoid saponins. Considering the important medicinal functions of G. pentaphyllum, it is necessary to investigate the transcriptomic information of G. pentaphyllum to facilitate future studies of transcriptional regulation. After sequencing G. pentaphyllum, we obtained 50,654,708 unigenes. Next, we used RPKM (reads per kilobases per million reads) to calculate expression of the unigenes and we performed comparison of our data to that contained in five common databases to annotate different aspects of the unigenes. Finally, we noticed that FPS, SS, and SE showed differential expression of enzymes in DESeq. Leaves showed the highest expression of FPS, SS, and SE relative to the other two tissues. Our research provides transcriptomic information of G. pentaphyllum in its natural environment and we found consistency in unigene expression, enzymes expression (FPS, SS, and SE), and the distribution of gypenosides content in G. pentaphyllum. Our results will enable future related studies of G. pentaphyllum.

Gypenosides are secondary metabolites in the synthesis pathway of triterpenoids. Mevalonate or isoprenoid are the precursors in the beginning of the pathway of triterpenoid synthesis, which we refer to as the MVA (mevalonic acid) or MEP (methylerythritol phosphate) pathway ( Figure 1) [14][15][16]. The synthesis pathway of the triterpenoids can be decomposed into three parts: (1) the synthesis of IPP (isopentenyl pyrophosphate) or DMAPP (dimethylallyl pyrophosphate); (2) the synthesis and cyclization of the squalene; and (3) the functionalization reaction that proceeds with complexity of the squalene (Figure 1). FPS (farnesyl pyrophosphate synthase), SS (squalene synthase), and SE (squalene epoxidase) were previously identified as the main enzymes involved in the synthesis of triterpenoid saponins [17][18][19][20] and SE are required for the synthesis and cyclization of the squalene that combines two sesquiterpenoids into one triterpenoid (C15 + C15 = C30) [16,21]. After this step, triterpenoids can be transformed into many isoform types like protosteryl type (chair-chair-chair-boat conformations), dammarenyl type (chair-chair-chair-boat conformations), cadinyl type (chair-chair-chair-boat conformations), and hopene and tetrahymanol (chair-chair-chair-chair conformations or chair-chair-chair-boat conformations) [21]. Like the ginsenosides, gypenosides (triterpenoids) in G. pentaphyllum have various and vital applications in medicine and health [22]. However, gypenosides showed much higher heterogeneity when compared with ginsenosides and more than 169 kinds of gypenosides were found in G. pentaphyllum International Journal of Genomics 3 [23][24][25][26][27][28]. In other words, more than five times the number of triterpenoid saponins was found in G. pentaphyllum relative to Panax ginseng (P. ginseng). It was interesting that G. pentaphyllum has such diversity in triterpenoids compared to other plants [26]. This is likely related to different expression of genes and enzymes involved in the synthesis pathway of triterpenoids. Nowadays, transcriptomic sequencing (RNA sequencing) is a more and more popular tool to explore transcriptomic process [29][30][31][32][33][34][35][36][37], because RNA sequencing has several advantages relative to DNA sequencing like lower fee, higher efficiency, more advanced features, and so forth [36][37][38][39]. In 2011, Sathiyamoorthy Subramaniyam analyzed the transcriptome of G. pentaphyllum related to the synthesis pathway of triterpenoids. However, this article had two important limitations. First, the samples used for sequencing only included two tissues (leaves and roots) and G. pentaphyllum that was sampled and sequenced was planted in water and not in its natural environment [32]. This is important because G. pentaphyllum exhibits great phenotypic diversity in different environments because of its strong adaptability [33,[40][41][42][43][44]. Additionally, although the author showed sequencing data in the paper, no association analysis among unigenes expression, enzyme expression, or the distribution of gypenosides content of G. pentaphyllum was determined. In 2015, Zhao et al. identified EST-SSR makers by analyzing the sequencing data of two species of Gynostemma (Cucurbitaceae) [33]. In that article, the tissues were natural and complete, but the three tissues (young leaves, flowers, and immature seeds) from each kind of G. pentaphyllum were mixed up together to extract RNA for constructing cDNA to sequence. In other words, the sequencing data and related information in that article were a mixed result and these results could not be classified by tissues. Therefore, to address this deficiency of knowledge, we collected G. pentaphyllum in natural environment and sequenced its transcriptome separately by Illumina's NextSeq 500.

Illumina
Sequencing. The plant tissue sample of G. pentaphyllum was sent to Personalbio Company (Shanghai City, China) for transcriptome sequencing using Next-Generation Sequencing (NGS) technology based on the sequencing platform of Illumina's NextSeq 500. First, the mRNA was cleaved into little segments after treatment with chemical reagents and high temperature. Next, the segments were used to construct a cDNA library that was sequenced by paired end (PE) reads.

Unigene
Assembly. Trinity (r20140717, k-mer 25 bp) professional software was used to assemble the RNA sequence [45]. First, high-quality sequences were constructed into a short-sequence library with length of k-mer. Next, primary contig sequences were obtained by the extension of the shortsequence library using overlaps with a k-mer-1 length. Next, primary contig sequences were categorized by their overlaps and categorical contigs were constructed into the De Bruijn graph. Based on the recognition rate of reads in each category, transcript sequences were restored by the contigs. After assemblage by Trinity, BLAST (version 2.2.30+) was used to compare the assembled sequences with reference sequences in NCBI (National Center for Biotechnology Information) nonredundant protein (NR) sequences to determine the best comparison results. Finally, sequences with the same gi number were classified as the same unigene and the longest sequence was regarded as the representative sequence of that unigene [46].

Analysis of Unigene Expression
. RPKM (reads per kilobases per million reads) was used to calculated unigene expression of G. pentaphyllum and the calculation method of RPKM is described below [47]. Before we calculated the unigene expression, we need to process the read count of unigenes with Bowtie 2 (2.2.4, default setting) [48]. The RPKM density distribution generally reflected the pattern of gene expression. Typically, unigenes with medium expression cover the majority of the area under the curve (AUC) in the density distribution of RPKM (as drawn with the density function of software R). Oppositely, unigenes with higher or lower expression occupy the minority of AUC. DESeq (version 1.18.0) software was used to analyze the differential expression of unigenes in our study [49]. The expression of unigenes was compared by the fold change (fold change > 2) and its significance ( value < 0.05). The final result was displayed by Venn diagram.

Analysis of the Distribution of Gypenosides Content in
G. pentaphyllum. First, dried powder of the sample (about 15 mg) was mixed with 10 mL of extraction solvent (ethanol containing 5.0% pure water) and was processed by a continued supersonic treatment for 30 minutes. Second, the mixed solvent was evaporated to dryness using a rotary evaporator. The dry gypenosides were redissolved in 5.0 mL

Result of Annotation.
We used five databases, NR, GO, KEGG, eggNOG, and Swiss-Port, to annotate unigenes for functions ( Figure S6-S9). The overview of annotation is listed in Table 2. The result of each annotation is provided in the support file. Generally speaking, eggNOG showed an identification rate of 96.57% and KEGG displayed the lowest identification rate of 8.77% when comparing unigenes with the reference sequences. Based on GO annotation, the unigenes were categorized into different categories based on different functions and GO Slim displayed general characteristic about the distribution of the unigenes. Then, we used the eggNOG database to explore the biological function of protein in more detail because eggNOG classifies different protein sequences into a more detailed directory. Similarly, Swiss-Port was also used for annotation of the protein sequences, and Swiss-Port builds on eggNOG and provided more detailed structural information about the protein. Finally, KEGG is the last but most important database we used to annotate enzymes, since the KEGG pathway annotation showed us the network of the intermolecular reaction. This allows determination of the enzymes that are located in the synthesis pathway of triterpenoid saponins.

Expression of Unigenes.
RPKM is a normalization method to calculate gene expression and we used the density distribution of the RPKM to show the expression of unigenes. The map of the density distribution showed that our unigenes expression conformed to standards because unigenes with mid-range expression occupied the majority of AUC (area under the curve) and unigenes with lower or higher expression were the minority of AUC ( Figure 2). In the density distribution, unigenes in fibrous roots showed much higher expression than in the stems and leaves. Although stems and leaves showed similar unigene expression, stems had slightly higher unigene expression than the leaves. We analyzed the result of expression of the unigenes of FPS, SS, SE, and -AS (beta-amyrin synthase) in Table 3. Noticeably, the unigenes that encoded FPS, SS, SE, and -AS showed the highest expression in leaves and the lowest expression in fibrous roots. In the unigene expression of -AS, the leaves showed almost 125 times higher expression than in the fibrous roots. The unigenes of FPS, SS, SE, and -AS showed higher expression in the stems than in the fibrous roots. Based on this sequencing data, we detected the differential expression of unigenes using the software DESeq. We obtained the upregulated and downregulated unigenes in the pairwise comparison among the data from the fibrous roots, stems, and leaves (Table 4). We also determined the unigenes that showed differential expression in all samples. We used Venn International Journal of Genomics 5   diagram function in software R to describe the general distribution of unigenes with differential expression (Figure 3). Combining the data in Table 4 and Figure 3, we concluded that 10832 unigenes showed differential expression. Additionally, 699 unigenes displayed differential expression in all samples, while 6512 unigenes showed differential expression in the pairwise comparison of samples.

Content Distribution of Gypenosides in G. pentaphyllum.
A UV-Vis Spectrophotometer was used to detect the content   (Table 5), and the content of gypenosides in stems (0.365%) or fibrous roots (0.172%) was much lower than the leaves. A correlation coefficient ( 2 ) of 0.996 in the standard curve indicates that our result was accurate and reliable ( Figure  S10).

Expression of Enzymes in the Synthesis Pathway of Triterpenoids.
Based on the KEGG Pathway, a more detailed result about the enzymes expression in the triterpenoids synthesis was obtained. We categorized related enzymes into three parts according to the synthesis pathway of the triterpenoids: (1) enzymes involved in the synthesis of IPP or DMAPP, (2) enzymes involved in the synthesis and cyclization of squalene, and (3) enzymes in the squalene functionalization reaction (Figure 1). We focused on FPS, SS, and SE because these three enzymes play an important role in the synthesis and cyclization of triterpenoids ( Figure 4) [17][18][19]. The results are shown in Table 6 and Figure 5. We found three noticeable results. First, in the comparison of FPS, only one comparison between the fibrous roots and leaves was found. Leaves showed higher expression of FPS compared to the fibrous roots. Second, in the comparison of SS, two comparisons were found and the result showed that leaves had higher expression than the fibrous roots and stems. Third, in the comparison of SE, leaves showed higher expression than the stems and fibrous roots. The fibrous roots showed higher expression of SE than the stems.

Discussion
G. pentaphyllum is a creeping herbaceous perennial with medicinal properties used in traditional Chinese medicine. Triterpenoid saponins, the main effective components of G. pentaphyllum, have been widely studied [2-13, 23, 24]. In this study, we obtained transcriptome information using RNA sequencing, a technique that exhibits higher efficiency and is less expensive than DNA sequencing [38]. We analyzed the associations of unigene expression, enzyme (the output of the unigenes) expression, and the content distribution of gypenosides (enzyme output) after functional annotation (Table 2), RPKM calculating (Table 3), and measurement of gypenosides content (Table 5). We found that the expressions of unigenes and enzymes were positively associated with the distribution of gypenosides content. Generally speaking, unigenes and enzyme expression (FPS, SS, and SE) in the samples (fibrous roots, stems, and leaves) determined to the distribution of gypenosides content. Higher expression of unigenes and enzymes (encoded by unigenes) caused the higher content of enzymes' production (gypenosides), and the lower expression of unigenes and enzymes caused less enzyme production. This consistent result could facilitate future studies of other secondary metabolites in G. pentaphyllum. Since G. pentaphyllum has wide applications for health and medicine, it is essential to identify the secondary metabolites with various and vital medicinal functions.
One interesting finding was the observed differences between general expression and individual expression of unigenes ( Figure 2 and Table 3). A group of specific unigenes   International Journal of Genomics (FPS, SS, and SE) located in a special pathway like triterpenoids synthesis could increase their expression to a much higher level as required for a special physiological activity like triterpenoid synthesis. This obvious difference in expression may result from changes in the regulation of transcription. It is currently a hot topic to study the differential expression of the transcriptome and the regulation of transcription of plants in response to stresses of a special environment or other genetic factors [57][58][59][60][61][62][63]. Another interesting observation was that the transcriptome sequences of G. pentaphyllum determined in our study showed high similarity to the transcriptome sequences related to bitterness in cucumber as reported previously [64]. We downloaded the available mRNA sequences of related enzymes from that article and blasted them against the unigene sequences of G. pentaphyllum. The BLAST result was surprising, as all thirteen available sequences related to bitterness in that article showed a high degree of similarity to the specific unigene sequences of G. pentaphyllum. This high similarity predicted that genes related to the biosynthesis, regulation, and domestication of bitterness in cucumber may also be present in G. pentaphyllum. G. pentaphyllum also has two tastes (sweet and bitter) and this difference of taste may be caused as in cucumber. The taste of G. pentaphyllum from bitter to sweet predicts that G. pentaphyllum may change in response to domestication and contain a similar mutation. A further genetic exploration of the domestication of G. pentaphyllum may provide understanding of its changed taste.
Although this study was not the first report of the transcriptome information for G. pentaphyllum using sequencing, we corrected the limitations of the previous study and enriched the analysis of triterpenoid synthesis of G. pentaphyllum. G. pentaphyllum used for analysis in our study was planted in a natural environment and three common kinds of plant tissues (fibrous roots, stems, and leaves) were used to provide tissue samples for sequencing. Together with the sequencing, an exploration of the distribution of gypenosides content was performed to confirm the analysis of enzymes and unigenes. We found a positive association of unigene expression, enzyme expression, and the distribution of gypenosides. Our study will facilitate more genetic studies examining the regulation of transcription and the change of bitterness in G. pentaphyllum.

Conclusions
To provide more complete and high-quality transcriptional information of natural G. pentaphyllum, we used RNA sequencing technology to sequence the transcriptome of G. pentaphyllum. We found a positive association of unigene expression, enzyme expression, and the distribution of gypenosides. Our results will enable future related studies of G. pentaphyllum.