Chloroplast Genome Evolution and Species Identification of Styrax (Styracaceae)

The genus Styrax L. consists of approximately 130 species distributed in the Americas, eastern Asia, and the Mediterranean region. The phylogeny and evolutionary history of this genus are not clear. Knowledge of the phylogenetic relationships and the method for species identification will be critical for the evolution of this genus. In this study, we sequenced the chloroplast genome of 17 Styrax samples and added 17 additional chloroplast genome sequences from GenBank. The data were used to investigate chloroplast genome evolution, infer phylogenetic relationships, and access the species identification rate within Styrax. The Styrax chloroplast genome contains typical quadripartite structures, ranging from 157,641 bp to 159,333 bp. The chloroplast genome contains 114 unique genes. The P distance among the Styrax species ranged from 0.0003 to 0.00611. Seventeen small inversions and SSR sites were discovered in the Styrax chloroplast genome. By comparing with the chloroplast genome sequences, six mutation hotspots were identified, and the markers ycf1b and trnT-trnL were identified as the best Styrax-specific DNA barcodes. The specific barcodes and superbarcode exhibited higher discriminatory power than universal barcodes. Chloroplast phylogenomic results improved the resolution of the phylogenetic relationships of Styrax compared to previous analyses.


Introduction
Chloroplasts are involved in photosynthesis and energy transformation in plants [1,2]. Its own genome is known as the chloroplast genome, plastid gene, or plastome, which commonly occurs in multiple copies within the organelle. The important role of the chloroplast genome is functioning of the photosynthesis and other metabolic processes. The chloroplast genome is 120-160 kb in length [2,3] and has a highly conserved quadripartite circular organization. This organization contains two single-copy regions (LSC and SSC) separated by two copies of inverted repeat (IR) regions [4]. The chloroplast genome encodes approximately 80 protein-coding genes, four rRNAs, and 30 tRNA genes [1,2].
Advances in DNA sequencing technology have provided scientists with high efficiency and low cost to obtain complete chloroplast genome sequences. The chloroplast genomes are mostly inherited uniparentally, lack recombination, have compact size; thus, they effectively expand genetic information. Although the genome structure is conserved, mutational events, including indels, SSRs, and singlenucleotide substitutions (SNPs), are frequently occurring even in related species [5,6]. These mutational resources provide rich information to infer evolutionary patterns [7], establish relationships among the plants [8][9][10], and provide effective genetic markers to resolve complex evolutionary histories [11,12]. Moreover, plant DNA barcodes rely heavily on chloroplast genome sequences. Chloroplast genome markers, including rbcL, matK, and trnH-psbA, have been used as core DNA barcodes for plants [13]. Comparison of complete chloroplast genome sequences also provides an opportunity to identify specific plant DNA barcodes [14,15]. Whole chloroplast genome sequences have been used as superbarcodes for plants in recent years [16,17].
The genus Styrax L. consists of approximately 130 species distributed in the Americas, eastern Asia, and the Mediterranean region [18]. The Styrax species have important medicinal, ornamental, and economic values. The seed oil or resin of several species is a valuable medicinal ingredient and raw material for the manufacture of aromatic oils.
The most taxonomic treatment of Styrax is from Fritsch [18] who conducted a phylogenetic analysis based on 34 morphological characters. In this treatment, Styrax was divided into section Valvatae (predominantly tropical evergreen species) and section Styrax (north-temperate deciduous species). The section Styrax was divided into two series: Cyrta and Styrax. Series Cyrta were distributed in eastern Asia and eastern North America with serrated leaf margins and included 31 species. Series Styrax is distributed in western North America and western Eurasia with entire leaf margins and included three species. The Valvatae section also included two series. Series Valvatae was a strictly neotropical clade including about 77 species, and the series Benzoin was a strictly paleotropical clade including nine species.
To better understand the chloroplast genome evolution in Styrax and identify the variable markers to species identification within Styrax, we sequenced the chloroplast genome of 17 samples of Styrax and added published data from Gen-Bank. Specifically, we attempted to (1) elucidate the chloroplast genome evolution and (2) determine whether the variable chloroplast markers or whole chloroplast genome data can be effective for Styrax species identification.

Plant Materials and DNA Extraction.
A total of 17 species of the genus Styrax were obtained from the field and the DNA Bank of China, Institute of Botany, Chinese Academy of Sciences, and the DNA Bank of China has been permitted obtaining from the materials of the specimens in PE (Institute of Botany, Chinese Academy of Sciences). The details of the 17 species are shown in Table S1. Fresh leaf tissues from each accession were immediately dried with a silica gel before DNA extraction. Total DNA was extracted using a modified CTAB DNA extraction protocol (mCTAB) [26]. In addition to the newly collected material for DNA sequencing, publicly available complete chloroplast genome sequences (17 accessions, Table S1) of Styrax were also included. In total, the dataset of sequenced samples and GenBank accessions consisted of 34 individuals representing 29 Styrax species.
2.2. Chloroplast Genome Sequencing, Assembly, and Annotation. Chloroplast genome sequencing was performed at Novogene (Beijing) using the Illumina HiSeq X-ten platform. Total DNA was sheared to 350 bp fragments using an ultrasonicator. A rapid library was prepared using the NEBNext® Ultra™ DNA Library Prep Kit. Each sample yielded approximately 4 GB of data.
Illumina data were filtered using Trimmomatic v0.36 [27] to remove the adaptors and low-quality reads with Q − value ≤ 20. The parameters were set as follows: leading: 20; trailing: 20; sliding window: 4 : 15; MIN LEN: 36; and AVG QUAL: 20. The clean data were used to assemble the chloroplast genome using GetOrganelle [28], and the k -mer length was set to 85, 95, and 105. Complete chloroplast genomes were annotated using Plann [29], and the published chloroplast genome sequences of S. obassis (GenBank  [7]. Inversions form a stem-loop structure, including inversion sequences and inverted repeats at the opposite flanking end [7].
The mVISTA program was used to analyze the variation in the Styrax chloroplast genomes [31], for which sequence annotation of S. agrestis was used as the reference. The 47 Styrax chloroplast genomes were aligned using MAFFT 3 BioMed Research International v7.0 and then adjusted manually using Se-Al v2.0 [32]. To explore the sequence divergence with the whole chloroplast genome in the 27 Styrax species, genetic P distances were calculated with MEGA X [33].

Mutation Hotspots Identified and DNA Barcoding
Analysis. Three factors, including nucleotide diversity (π), mean distance (D), and the proportion of zero pairwise genetic distances (Z) for each species in the matrix, were used to explore the mutation hotspots in the Styrax chloroplast genome. Nucleotide diversity was calculated using the software DnaSP v6 [34]. Mean window distance and the proportion of zero pairwise genetic distances for each species in the matrix were calculated using the slideAnalyses function of the SPIDER package [35] in R. The window length was set to 600 bp, with a 50 bp step size.
Nucleotide diversity and variable and parsimonyinformative sites were used to evaluate marker variability. The three universal chloroplast DNA barcodes, rbcL, matK, and trnH-psbA, were also used in this analysis. Nucleotide diversity was determined using the DnaSP v6 software [34], and variable and parsimony-informative sites were calculated using MEGA v7 software [36].
Distance and tree-building methods were used to assess the marker discriminatory power. The distance method uses the nearneighbor function of SPIDER. The tree-based method was applied using ML. ML analysis was conducted using RAxML nonparametric bootstrapping and 1000 ML   pseudoreplicates. The best-fitting models were selected using ModelFinder [37].

Phylogenetic Analyses.
Phylogenetic analysis was conducted to elucidate the interspecific phylogenetic relationships within Styrax. Two datasets were created to infer the Styrax phylogeny. The first data were whole chloroplast genome sequences of 34 Styrax samples with Huodendron tibeticum and H. biaristatum used as the outgroup. The second dataset was the concatenation of the 80 coding genes. Maximum likelihood (ML) and Bayesian inference (BI) methods were used to infer phylogenetic relationships. All phylogenetic analyses used the best-fitting models of nucleotide substitution selected in ModelFinder [37] under the Bayesian information criterion. Maximum likelihood (ML) analyses were performed in RAxML-NG [38] with 500 bootstrap replicates. The BI tree was inferred to be MrBayes v3.2 [39]. The BI analysis was run with two independent chains and prior for 20 million generations, with sampling every 1000 generations. The initial 25% of the sampled trees were discarded as burn-ins. Stationarity was considered to have been reached when the average standard deviation of the split frequencies remained below 0.01.  (Table S2). All 17 newly sequenced chloroplast genomes were assembled entirely, and their sequence lengths and structures were very similar (Table 1, Figure 1). The chloroplast genome length ranged from 157,641 bp (S. japonicus) to 159,333 bp (S. suberifolius). The chloroplast genome has a quadripartite structure typical of angiosperms composed of an LSC region (87,250-88,656 bp), SSC region (17,993-18,412 bp), and two IR copies (26, 017-26,352 bp). The overall G/C content was approximately 37%. The Styrax chloroplast genome encodes 114 genes, including 80 proteincoding genes, 30 transfer RNA (tRNA) genes, and four ribosomal RNA (rRNA) genes. The mVISTA results revealed collineation, no rearrangement, and high sequence similarity across the Styrax chloroplast genomes ( Figure S1).

Repeats and Small Inversions.
A total of 61-74 SSRs were found in the Styrax chloroplast genomes. Mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs were identified ( Figure 2). The majority of SSRs were mononucleotide repeats in all Styrax species, followed by trinucleotide repeats. Pentanucleotide repeats were limited to one occurrence in S. ramirezii. Most mononucleotide repeats were composed of A/T with minimal G/C. The LSC region contained the most significant SSRs (76.91%), with 14.49% identified in the SSC region and 8.61% in the IR region.
Seventeen small inversions were identified in the Styrax chloroplast genome ( Table 2). All inversions and their inverted repeating flanking sequences formed stem-loop structures. The inversion length was 4 to 164 bp, and the flanking repeats ranged from 6 bp to 28 bp. The longest inversion occurred in the trnS UGA -psbZ region. Except for the two inversions, the others were all located in the LSC region. All inversions were located in noncoding regions, including 14 in space and three in intron regions. The trnF GAA -ndhJ region included three inversions, and ycf3-trnS GGA had two inversions. Seven inversions (trnC GCA -petN, ycf3-trnS GGA 01, trnT UGU -trnL UAA , trnF GAA -ndhJ 01, trnF GAA -ndhJ 02, clpP, and trnR ACG -trnN GUU ) were specific to one species (Table S3). For example, the inversion in trnC GCA -petN was specific to S. duclouxii. The inversion in trnS GCU -trnG GCC and psaJ-rpl33 occurred in only one sample of S. agrestis, whereas inversions in trnS GCU -trnG GCC and petN-psbM occurred in some samples of S. tonkinensis. This suggests that these three inversions are polymorphic in one species.

Universal DNA Barcodes of Styrax.
Three universal candidate DNA barcodes, rbcL, matK, and trnH-psbA, were analyzed to test the species discrimination power of Styrax (Table 3). The core barcode of rbcL had an aligned length of 695 bp, with 14 variable sites and nine parsimonyinformative sites. Nucleotide diversity was 0.00297. The matK barcode was more variable than rbcL, with an aligned length of 878 bp and 25 variable sites in Styrax. The trnH-psbA barcode is an intergenic space region with an aligned length of 486 bp. It contained 25 variable sites and 14 parsimony-informative sites. According to the nucleotide Using the distance-based species identification methods, the three universal DNA barcodes had 18.4% (rbcL), 44.4% (matK), and 48.8% (trnH-psbA) discriminatory power in Styrax. Combining rbcL and matK, the success rate was 48.15%, and the success rate of the combined three barcodes was 70.37%. The results obtained using the treebased method are shown in Figure 3(a). The phylogenetic tree had a lower resolution and lower support values. Four individuals of S. tonkinensis did not form a monophyletic clade.
The percentage of variable sites among these six regions ranged from 4.36 to 5.85, and the parsimony-informative sites ranged from 1.61 to 3.85. According to the π values, ycf1b showed the highest variability in Styrax, followed by petA-psbJ, rps16-trnQ, rpl32-trnL, trnT-trnL, and ndhC-trnV. Using the distance methods, ycf1b had 92.56% discriminatory power, followed by trnT-trnL and petA-psbJ. Combined with ycf1b and trnT-trnL, all Styrax species were successfully distinguished. The tree-based results are presented in Figure 3(b). Compared to universal DNA bar-codes, the combination of ycf1b and trnT-trnL had a higher resolution. Thus, ycf1b and trnT-trnL were chosen as Styrax-specific chloroplast DNA barcodes. The primers designed for the two regions are listed in Table S4, and the primers were tested to work well.
3.5. Superbarcode of Styrax. The 47 whole Styrax chloroplast genomes had an aligned length of 163,099 sites with 3,160 variable sites (1.94%) and 1,481 parsimony-informative sites (0.91%). The mean nucleotide diversity was found to be 0.00231 (Table 4). The genetic P distance of the Styrax species is shown in Figure 5. The mean genetic distance was 0.00244, the lowest divergence (0.0003) was between S. macrocarpus and S. zhejiangensis, and the largest sequence divergence (0.00611) was between S. casearifolia and S. ramirezii. The discriminatory power of the whole chloroplast genome as a DNA barcode was assessed using distanceand tree-based methods. Compared to the universal DNA barcodes or the six newly specific DNA barcodes, the whole chloroplast genome data exhibited the highest discriminatory power (Table 4 and Figure 6).
3.6. Phylogenetic Inference. The phylogenetic tree inferred from the chloroplast genome and 80 coding gene datasets was similar to the phylogenetic relationships of Styrax species ( Figure 6). The best-fit model GTR+G from ModelFinder was used for the ML and BI analyses. The topologies of the ML and BI trees and the two datasets were nearly identical. All Styrax species formed a monophyletic clade (BS = 100/PP = 1), and some notes had shortened branches, indicating low divergence among some Styrax species. Three lineages were formed in the phylogenetic tree. Styrax ramirezii and S. argenteus were the first diverging branches (series Valvatae) and were sisters to the remaining species. Styrax chinensis and S. suberifolius formed the second lineage (series Benzoin). The remaining species formed the third lineage (series Cyrta) with a 100% bootstrap value.

Styrax Chloroplast Genome Evolution.
The Styrax chloroplast genomes were similar to other angiosperms, indicating that the chloroplast genome was a quadripartite structure, including a large single-copy, a small single-copy, and a pair of inverted repeats. The Styrax chloroplast genomes have highly similar genome structures, genome sizes, and gene contents (Figure 1), and the single-copy regions and noncoding regions are more variable than the IRs and coding regions ( Figure S1). SSRs, which consist of tandemly repeated motifs of six base pairs (bp) or less, are important markers for population genetics and germplasm management [5,40,41]. In the chloroplast genome, SSRs are dominated by mono-and dinucleotide repeats and A/T bases are the most common [7,42,43]. This was consistent with pre-vious findings that the chloroplast genome is usually composed of polyA and polyT repeats [44]. A total of 61-74 SSRs were found in the Styrax chloroplast genomes (Figure 2), which were more abundant than other species of Styracaceae [45].
Small inversions have been found in most related species [46][47][48]. All of the inversions formed stem-loop structures, and there was no correlation between the lengths of inversions and inverted repeats [7]. Many small inversions are generated by parallel or back mutation events during chloroplast genome evolution [49,50]. The inversion in trnS GCU -trnG GCC and petN-psbM occurred in some samples of S. tonkinensis, and trnS GCU -trnG GCC and psaJ-rpl33 occurred in only one sample of S. agrestis. These inversions did not show phylogenetic signals (Table S3). Recent studies suggest that some small inversions are valuable for phylogenetic relationships in some groups [51][52][53][54].     [55]. DNA barcoding is a new effective tool widely used in species identification since 2003 [56]. Selecting a DNA marker as a universal DNA barcode is essential for the diversity of organisms [57]. However, the selection of universal barcode(s) in plants is more complex than other organisms. The CBOL Working Group recommended three chloroplast markers (rbcL, matK, and trnH-psbA) and nuclear ITS as universal DNA barcodes for higher plants [58]. More evidence has shown that these markers have lower variability and discrimination power [15,[59][60][61]. This study assessed the three chloroplast markers in Styrax to evaluate their suitability for species resolution. Using the distance-and tree-based methods, their discrimination power was barely satisfactory (Table 3). Combining the two core DNA barcodes (rbcL and matK) had a resolution rate of less than 50%. ITS is regarded as a powerful phylogenetic marker at the species level, showing high interspecific divergence [62]. However, phylogenetic resolution using ITS data was also limited in Styrax [23].
The chloroplast genome sequence mutations (SNPs and indels) were not random and clustered into mutation   [5,59,60,63]. Specific DNA barcodes revealed a higher resolution rate than universal DNA barcodes. For example, Oryza chloroplast genomes were compared and five or six specific DNA barcodes for Oryza were identified [14,64]. Using comparison of oak species, two intergenic regions matK-trnK-rps16 and trnR-atpA and two coding regions ndhF and ycf1b were developed as specific DNA barcodes [15]. This study identified six variable markers (rps16-trnQ, trnT-trnL, ndhC-trnV, petA-psbJ, rpl32-trnL, and ycf1b). These markers had higher variable and species resolutions than the universal DNA barcodes (Table 3). According to the success discrimination rate, two markers (trnT-trnL and ycf1b) were selected as Styrax-specific chloroplast DNA barcodes. TrnT-trnL is an intergenic spacer region and has been frequently used in plant phylogeny [65,66]. Dong et al. were the first to report ycf1a and ycf1b markers, located in the second-longest gene ycf1 [60,61]. ycf1b was more variable than the two core DNA barcodes, rbcL and matK [61]. Combining ycf1b and trnT-trnL significantly improved the identification success rate, and these two markers were chosen as the Styrax-specific DNA barcodes.
The advent of next-generation sequencing (NGS) technologies has led to a decrease in the cost of genome sequencing. Genomic data have extended the concept of the DNA barcoding approach, referred to as "superbarcoding" [14,16], "ultrabarcoding" [67], or "plant barcoding 2.0" [17]. Compared to the nuclear and mitochondrial genomes, the chloroplast genomes were easily sequenced using genomic skimming [68][69][70], and the chloroplast genome has sufficient sequence variation in closely related species [6]. More studies showed that the chloroplast genome had a sufficiently high mutation rate which enables species identification and it may be best suited as superbarcodes for plants [70]. For example, using the chloroplast genome sequences, all 20 sampled Olea species had been successfully distinguished and even some subspecies of O. europaea can be identified [70]. Wu   9 BioMed Research International genome can be used to effectively differentiate Fritillaria species [71]. Moreover, chloroplast genome data have been widely used in plant phylogenetics at different taxonomic levels [8,72,73]. In this study, the chloroplast genome showed sufficient information for Styrax species identification (Table 4 and Figure 6). In addition to assembling the chloroplast genomes, clean reads from NGS could further be used to retrieve nuclear genome sequences, giving the possibilities for accurate species identification and phylogenetic relationship reconstruction.

Conclusions
The analyzed 34 Styrax chloroplast genomes have a similar structure, gene number, and gene order. SSR sites and small inversions were also identified. Comparisons of the Styrax chloroplast genome sequence divergences revealed that rps16-trnQ, trnT-trnL, ndhC-trnV, petA-psbJ, rpl32-trnL, and ycf1b were variable markers. Furthermore, ycf1b and trnT-trnL were suggested as Styrax-specific DNA barcodes. The whole chloroplast genome is potentially available as a superbarcode for Styrax species. This study demonstrated the potential of chloroplast genome data to improve the phylogenetic resolution.

Data Availability
The 17 Styrax chloroplast genomes are available in the GenBank database (accession numbers: MZ285733 to MZ285749). Table S1: taxa included in this study with locality, voucher, and GenBank accession numbers. The bold font samples were sequenced in this study. Table S2: summary of the sequencing data for 17 Styrax samples. Table S3: the information of small inversions. Table S4: the primers used for amplification the specific Styrax DNA barcodes. Figure S1: visualization of genome alignment of the Styrax chloroplast genomes. The x-axis showed the coordinate between the chloroplast genomes. (Supplementary Materials)