Transcriptome Characterization and Identification of Molecular Markers (SNP, SSR, and Indels) in the Medicinal Plant Sarcandra glabra spp.

Sarcandra glabra has significant metabolically active bioingredients of pharmaceutical importance. The deficiency of molecular markers for S. glabra is a hindrance in molecular breeding for genetic improvement. In this study, 57.756 million pair-end reads were generated by transcriptome sequencing in S. glabra (Thunb.) Nakai and its subspecies S. glabra ssp. brachystachys. A total of 141,954 unigenes with 646.63 bp average length were assembled. A total of 25,620 simple sequence repeats, 726,476 single nucleotide polymorphisms, and 42,939 insertions and deletions were identified, and the associated unigenes and differentially expressed genes were characterized. This work enhanced the molecular marker resources and will facilitate molecular breeding and gene mining in S. glabra spp.


Background
Sarcandra glabra (Chloranthaceae family) is an evergreen, smooth shrub, and grows up to one-meter height. It is originated in tropical climate of south Asia. It especially grows in China, Japan, Philippines, Vietnam, Korea, Cambodia, Malaysia, India, Sri Lanka, and other areas including areas of North, Central, and South America and some areas of the Pacific [1]. It has been known for its medicinal values in the treatment of joint and cancer problems [2]. S. glabra extract (SGE) contains high amount of powerful antiinflammatory compounds including isofraxidin, terpenoid saponins, fumaric acid, caffeic acid, rosmarinic acid, and caffeoylquinic acid [2]. SGE is widely used in Traditional Chinese Medicines to help with arthritis, pain, swelling, and redness [3]. It may have multifarious medicinal values as anti-inflammatory [4], antitumor [5], anti-infection [1], and antioxidant [6] bioactivities as described in recent studies. SGE also has the ability to attenuate hyperglycemia [7], which could be useful to recycle the industrial wastes for beneficial by-products [2]. S. glabra seeds can be dried and roasted for human consumption. Besides, it can also be used as an in-or outdoor ornamental plant. Owing to its significant chemical, rheological, and pharmacological importance [1,2], it is important to improve of S. glabra varieties with wider acceptance and enriched with bioactive components.
The marker-based breeding techniques have been widely adapted for plant improvement and genetic conservation [8]. The information about the availability of molecular markers for genetic mapping and map-based gene cloning is always a basic need for any breeding or genomic research. Among DNA markers, single nucleotide polymorphisms (SNPs), simple sequence repeats (SSRs), and insertions and deletions (InDels) are the most simple and valuable markers [9]. The microsatellites are tandem repeats of various mono-, di-, tri-, tetra-, penta-, and hexanucleotide sequence motifs of variable lengths [10], which have been identified and manipulated in various crops [11]. SSR markers have always been preferred for their user-friendly properties and for their random genome distribution, simplicity of use, high level of polymorphism, high clarity, low operational cost, reproducibility, hypervariability, ease of multiplexing, amenability to automation, and use with low quality DNA [11]. Recently, high-throughput sequencing methods have widely been adapted for rapid identification of SNP, SSR, and InDel markers, especially in nonmodel plants [12].
The next-generation sequencing of cDNA pool obtained from extracted RNA of plant tissues is generally known as RNA-seq. The RNA-seq or transcriptomics approach can be used to obtain a huge number of expressed sequences which can further be used to design molecular markers. RNA-seq-based SSRs, SNPs, and InDels are associated with protein coding genes and their relevant translated regions [13]. The transcriptome-based markers facilitate the understanding of their link with phenotypic variation and/or functional genes [14]. They are transferable among closely related species [13] because of their highly conserved domains [15]. Furthermore, transcriptomic SSRs may provide important information about evolution and conservational genetic variation of plant species [8,11,16]. Recently, different reports based on genomics [17], transcriptomics [18], metabolomics [18], and phenomics [19] have been reported in S. glabra. Various researches for characterization of extrachromosomal (chloroplast) genome [20][21][22] and to find the evolutionary relationship of various S. glabra subspecies [23] have also been presented. Furthermore, the amplified fragmentlength polymorphism (AFLP) and microsatellite establishment [24,25] and SNP marker manipulation studies [26] have also been carried out for molecular evaluation in S. glabra. Nonetheless, the identification and characterization of molecular markers are still lacking in S. glabra which greatly impair establishment of breeding programs.
In this study, we generated transcriptome sequences of S. glabra (Thunb.) Nakai and its subspecies S. glabra ssp. brachystachys. We mined the transcriptome data to identify and characterize the SSR, SNP, and InDel markers in order to enhance the genetic resources of this pharmacologically important plant. We also reported the differentially expressed genes and their functional annotations in both subspecies.

Transcriptome Sequencing and De Novo
Assembly. Two S. glabra species with contrasting morphological and physiological traits (Table 1) were selected for de novo transcriptome assembly. Using the Illumina hiseq xten platform, we generated 52.327 and 57.756 million raw reads for S. glabra (Thunb.) Nakai (SG) and the subspecies S. glabra ssp. brachystachys (CSH), respectively (Table 2). After cleaning the reads, 7,847.65 and 8,661.75 million bases with 46.03% and 45.77% GC contents (Additional Figure 1) and 92.72% and 92.64% Q > 30 were retained for CSH and SG, respectively ( Table 2).
KEGG pathways analysis showed that 21,192 (14.93%) unigenes governed the 5 main and 33 subcategories of KEGG database (Additional Figure 4). The unigenes in "carbohydrate metabolism" were further evaluated for best functional orthologous group. In total, 187 unigenes were found for glutathione S-transferase (GST) as the top functional ortholog.

Expression and Pathway Enrichment
Analyses. The nonstandard normal distribution was observed by transcript expression level based on FPKM values and densitydistribution pattern for each sample (Additional Figure 7A). Transcripts with similar expression levels were clustered together (Additional Figure 7C & 7D). Pearson correlation analysis of the samples showed two separate groups for each subspecies (Additional Figure 7D), strong genetic similarity among biological replicates (r 2 ≥ 0:9), while high dissimilarities were observed between CHS and SG. Among the mapped-reads, 14,584 unigenes were differentially expressed by more than twofold between GS and CHS (Additional Figure 7B). The DEGs were enriched in "chloroplast envelope," "calmodulin binding," and "carbohydrate binding" biological pathways ( Figure 2). Furthermore, KEGG pathway analysis of the DEGs revealed "plant-pathogen-interaction," "endocytosis," and "plant-  (Table 4 and Additional Table 2). The SSR density in the transcriptome was 385.91 SSRs/Mb. Among the SSRs, mononucleotide microsatellites were the most abundant (47.77%) with 16,920 bp length, followed by di-(33.85%), tri-(16.59%), tetra-(1.1%), hexa-(0.45%), and pentanucleotide (0.25%) types. The mononucleotide motifs A (50%) and T (48%) were highly abundant repeats. Among the total, 1,968 dinucleotide motifs GA (17.63%) and AG (16.61%) were the most detected types, while only 10 GC and 5 CG motifs were available in the transcriptome data. Among the trinucleotide, ATG (6.77%) followed by TCT (6.25%), and GAA (6.03%) motif types were identified, while only 8 CCG and 9 GCG were observed. An average microsatellite length of S. glabra was 23.7 bp. The repeat motif size significantly affects the length variation of microsatellites. Hexanucleotide motifs with an average length of 34.8 bp were the longest motif followed by penta-(27.72 bp), di-(24.27 bp), tetra-(21.92 bp), and trinucleotide (19.7 bp) motifs. The mononucleotides showed the shortest average motif length of 12.56 bp. A trinucleotide motif 87 bp length and 29-fold repetition was the longest microsatel-lite observed (Table 4 and Additional Table 2). Other than these motifs, a compound motif type denoted as "c" for which two types of motifs separated by a few nonrepeating nucleotides, and "c * " motifs with two different types of repeats without any separation were observed also. The 9.08% (3218) and 0.25% (90) of total SSRs were "c" and "c * " type compound motifs, respectively, and showed an average length of 76.67 bp and 47.1 bp. Among the compound motifs, the longest was of 481 bp with 4 mononucleotide and 4 hexanucleotide motifs with multiple repeats. In the top ortholog group of carbohydrate metabolic pathway (K00799), a total 14 SSRs were identified for 19 assembled unigenes.

SNPs and InDel Variant Identification, Distribution, and
Their Selection Signature. We detected in the unigenes a total of 726,476 single nucleotide polymorphism sites (SNP) with an average of one SNP in 92.32 bp. These SNP markers were situated in 65,539 unigenes. The total number of transition and transversion mutations was 1,588,169 (63%) and 928967 (37%) ( Table 5 and Additional Table 3). A total of   (Table 5 and Additional Table 4). In total, 89 SNP markers for 89 unigenes were related to 9 GO terms including sulfur compound metabolic process, response to stress, signal transduction, molecular functions, oxidoreductase-activity, transferase-activity transferringalkyl or -aryl (other than methyl) groups, cellular components, cytoplasm, and nucleus, in the top ortholog group (K00799) of carbohydrate metabolic pathway (Additional Table 5).
The genetic variant (SNPs and Indels) distribution and the effect on gene functions were investigated. Totally, 36.894% (275,669) variants were observed in exon regions while 20.726% (154,861) variants were in intergenic regions. The variant numbers in 5 ′ UTR and 3 ′ UTR were 19.327% (144,409) and 23.051% (172,232), respectively. Only 22 vari-ants were found in the splice site regions. The majority of variants (60.33%) have modifying effect in genome, which consisted of 3 ′ UTR, 5 ′ UTR, and intergenic variants. In contrast, 19.934% and 19% variants could exert low and moderate effects, respectively ( Table 5). The missense variants were observed to be moderate effect variants. The start-and stopcodon-loss-in-function and stop-codon-gain-in-function variants were observed to have the high effect in the genome. The variants within the coding area mostly missense (52.116%) or silent (46.482%) while only 1.42% mutations were nonsense variations.
2.6. Validation of SSR Markers. Most of the two ends of SSRs are conservative single-copy sequences. The primers were designed as per complementary sequences of the two ends of the SSRs. A total of sixteen SSR markers including two markers for each type of SSRs were randomly selected for validation. Fifteen SSR markers were successfully amplified in 26 to 30 tested genotypes, while one marker was validated in 19 genotypes. The length polymorphism of SSR sites was displayed by electrophoresis of PCR products. One to 17 alleles could be identified. The description of validated markers, their PCR products, and polymorphic alleles are presented in Additional Table 6.

Discussion
The genetic improvement of S. glabra as a pharmacologically important plant is required, which could be accelerated by breeding strategies employing molecular markers. Two SG species (S. glabra ssp. brachystachys and S. glabra (Thunb.) Nakai) have various morphological differences including plant height, leaf shape, and appearance of inflorescences between them [27], and both are considered as model in previous studies for this family [28]. A molecular plant breeding program is always a fast and cost-effective tool in genomic improvement but it demands a rich molecular marker resource that is deficient for S. glabra. Development of such markers is not easy as its genome has not been sequenced yet. The transcriptome-based markers characterized in the study can be used to identify the linked and/or cosegregating loci for linkage and association-based genome mapping of various traits of interest in S. glabra as reported in various other plant species [29].
High-throughput sequencing of transcriptome is a fast and cost-effective way to characterize the genes and identify polymorphic molecular markers in nonmodel plants [12]. In the current study, we obtained the transcriptome data of S. glabra and revealed the available unigenes by de novo assembly. The high N50 and ExN90 values revealed the excellent sequencing quality. The coverage of functional genes was expressed by the BUSCO percentage. The relatively low percentage of complete BUSCOs may indicate the possibility that all genes in this species were not captured. It is normal as only the leaf samples and a single genotype per species were used in this study. It is known that the gene content and expression are different across tissues and genotypes. Furthermore, the transcriptome sequencing reveals only expressed functional genes, while a lot of genes present in the genome are missing. On the basis of transcriptome assembly, we characterized the molecular (microsatellite, SNP, and InDel) markers. The results indicated the availability of microsatellites in about one-fifth of the transcripts (18.04%), which is quite higher than the estimation of 2-5% in other reported studies [30]. Besides SSRs, we also detected SNPs and Indels which are considered very valuable markers. Recently, the development, characterization, and utilization of these markers in model crops as rice [31] and nonmodel crops as guar potato [11,32] have been reported. The density of InDel and SNP identification was higher than the other studies as 17.10 SNPs per Kb in Cyamopsis tetragonoloba L. Taub [2] and one SNP per 17.08 Kb in Guar [32]. Collectively, the molecular markers obtained in this study will be helpful for future genomics studies and markers assisted selection in S. glabra. The leaf samples were employed for transcriptome characterization in this study because leaves are the main source of bioactive ingredients in SG. Nonetheless, mixing different tissues for transcriptome analysis may be a better approach in order to identify more transcripts and potentially more molecular markers as described in previous studies [33,34]. The identified molecular markers in this study have tagged functional genes governing important metabolic activities including the ion binding. Various reports have shown the roles of ion channels in cancer from two principal standpoints: examining how specific ion channels are involved in certain cancer-related cellular behaviors such as proliferation, apoptosis, migration, or angiogenesis or examining the specific expression and functional profiles of various channel characteristic of certain human cancers [35]. Among the annotated pathways, the carbohydrate metabolism was observed to be on top hit. The evaluation of functional ortholog of this pathway indicated 187 unigenes for the glutathione S-transferase (GST) functional ortholog (k00799 in Map-05200) known as important genes in medic-inal plants [22]. The glutathione S-transferases belong to a supergene family of seven cytosolic enzymes that catalyze the conjugation of glutathione (GSH) to a variety of electrophiles including arene oxides, unsaturated carbonyls, organic halides, and other substrates [35]. In humans, polymorphism in GST genes has been associated with susceptibility to various diseases. Previous studies described the associations between GST mutants (GSTM1 null and GSTT1 null) and location of colorectal cancers in individuals with mutation 1 in the MLH1 mismatch repair gene in Finnish kindreds. It is proposed that the GST genes are among other detoxicating enzymes that act as modifying genes and affect the phenotype in monogenic colorectal cancer [35].
Even though the establishment of AFLP and microsatellite markers have been reported in recent studies [24], and important molecular marker resources of S. glabra have been generated in this study, it is still less sufficient as compared to the well-studied model plants. Further works based on longread transcriptome sequencing as well as whole genome sequences are needed to accelerate the genetic improvement of this important crop.

BioMed Research International
with DNase I for purification. The step by step process as total RNA sample detection, mRNA enrichment, doublestrand cDNA-synthesis, end repair, splice selection and PCR-based amplification library quality detection, and computer sequencing was performed at Wuhan Benagen Tech Solutions Company Limited, Wuhan, China, to generate the paired-end reads.

Transcriptome
Sequencing, Cleaning, and Assembly. The preparation of cDNA library and sequencing was performed by using three independent biological replicates for SG and CSH. The total RNA from each sample was extracted, and cDNA library was prepared. The original files of image data were obtained by high-throughput sequencing (Illumina hiseq xten) and were transformed into pair-end raw reads by base-calling analysis. As per machine's sequencing strategy, 150 bp average read length was maintained. The raw reads with joint sequences and/or <5 mass value, ≥50% proportion rate, more than or equal to 5% N base (the base with undetermined information), containing Poly-A, were filtered out to get the cleaned reads. The sequencing quality was estimated by QPhred values. The clean reads were assembled into a fastq format [36]. The Trinity v 2.6.6 program [37] was used for transcriptome assembly and to get the unigenes with default parameters. The accuracy and effectiveness of the assembly results were ensured by estimation of N50, ExN50, and BUSCO [38].     [39] were used to compare the unigene sequences and protein sequences against UniProt database [40] with default parameters. At the same time, diamond v 0.8.36 program blast X was used to compare the unigene sequence with NR database to find the closely related species. Functional annotation was further performed by mapping unigenes to UniProt/Swiss port, Kyoto Encyclopedia of Genes and Genomes (KEGG) [41], EggNog [42], and Gene Ontology (GO) [43,44] databases. The Hmmscan v3.1 (parameter: E value 0.01) was used for protein domain prediction on the basis of Pfam values.

Identification of SNPs and InDels.
To identify the putative single nucleotide polymorphic sites (SNPs) and insertion and deletion events (InDels) in the transcripts, the highquality transcriptome assembled sequences were aligned and compared by STAR v2.7.0 d program [46] with default parameters, and the BAM format comparison files were obtained. The Picard tool (version: 1.93) software package was employed to compare and process the results with default parameters. The SNP calling and low-quality filtering were performed by "haplotypecaller" in GATK (version: 4.1.4) [47] using default parameters. The SNP annotation was performed by "snpeff (version 4.3)" [48] using default parameters. Furthermore, the manual characterization of obtained SNP and InDel markers was performed.

Expression Analysis and Mining the Differentially
Expressed Genes. The bowtie2 v 2.3.4 [49] with zero mismatch parameter was used to compare the reads to the   [50]. The number of read count on each gene was obtained from each sample, and the gene expression level was estimated by the FPKM method. The difference gene expression was analyzed by comparing read count data from the two subspecies using the DESeq2 program with the Q value <0.05 and log 2FC > 1 as a threshold [35].

Data Availability
The RNA-seq data has been submitted to NCBI SRA: PRJNA671629.

Disclosure
The funder has no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Supplementary Materials
Additional Figure 1: the base distribution of sample's reads. The first half is the base content distribution of the first end of the double end sequencing sequence, and the second half is the base content distribution of the other end. Additional Figure 2: simplified annotation information of Gene Ontology classification of identified transcripts for S. glabra. Additional Figure 3: the gene classification on the bases of similarity on cluster of indigenous groups of protein database for S. glabra. Additional Figure 4: the gene classification on the bases of KEGG metabolic pathways for S. glabra. Additional Figure Table 1: expression-based frequency distribution of transcripts extracted from transcriptome data of S. glabra. Additional Table 2: characterization of transcriptome SSR markers in S. glabra. Additional Table 3: characterization of transcriptome SNP markers in S. glabra. Additional Table 4: characterization of transcriptome InDel markers in S. glabra. Additional Table 5: the SSR, SNP, and InDel markers in transcriptome-based unigenes of S. glabra with the respective annotations in GO, KEGG, and COG databases. Additional Table 6: list of sixteen randomly selected and validated SSR markers, their primers, and PCR products. (Supplementary Materials)