Complete Chloroplast Genome Sequence of Coptis chinensis Franch. and Its Evolutionary History

Coptis chinensis Franch. is an important medicinal plant in the order Ranunculales. We used next-generation sequencing technology to determine the complete chloroplast genome of C. chinensis. The genome is 155,484 bp long with a GC content of 38.17%. Two inverted repeats, each 26,758 bp long, separate the genome into a typical quadripartite structure. The C. chinensis chloroplast genome consists of 128 gene loci, including eight rRNA gene loci, 28 tRNA gene loci, and 92 protein-coding gene loci. Most of the SSRs in C. chinensis are poly-A/T. Mononucleotide SSRs are fewer in C. chinensis and other Ranunculaceae species than in Berberidaceae species, while dinucleotide SSRs are more numerous than in the Berberidaceae. C. chinensis diverged from other Ranunculaceae species an estimated 81 million years ago (Mya). The divergence between Ranunculaceae and Berberidaceae was ~111 Mya, while the Ranunculales and Magnoliaceae shared a common ancestor during the Jurassic, ~153 Mya. Position 104 of the C. chinensis ndhG protein was identified as a positively selected site, indicating possible selection for the photosystem-chlororespiration system in C. chinensis. In summary, the complete sequencing and annotation of the C. chinensis chloroplast genome will facilitate future studies on this important medicinal species.


Introduction
Chinese goldthread, Coptis chinensis Franch., is an important medicinal plant in the Ranunculaceae. C. chinensis is native to China and has been used in traditional Chinese medicine for centuries [1,2]. The major active compounds of C. chinensis are protoberberine alkaloids [1], such as berberine, palmatine, jatrorrhizine, coptisine, columbamine, and epiberberine. These compounds have antiviral, anti-inflammatory, and antimicrobial activity, and they dispel dampness, remove toxicosis, and aid detoxification [3-6]. Despite the prominent role of C. chinensis in medicine, understanding of its biology and evolution is limited by a lack of genomic resources.
Chloroplast genomes in angiosperms are mostly circular DNA molecules ranging from 115 to 165 kb in length [7]. They exhibit a conserved quadripartite structure consisting of one large single copy (LSC) region, one small single copy (SSC) region, and two copies of an inverted repeat (IR). Because their rates of recombination and substitution are low compared with those of nuclear genomes, plant chloroplast genomes are valuable sources of genetic markers for phylogenetic analyses. Over 21 complete chloroplast genomes of species within the Ranunculales have been sequenced and deposited in the NCBI database (as of August 2016), and these data can be used to study chloroplast genome evolution in the Ranunculales.
Improvements in next-generation sequencing (NGS) technologies have made sequencing entire chloroplast genomes cheaper [8] and have led to the extensive use of chloroplast genomes in molecular marker and molecular phylogenetic studies. In this study, we assembled the complete C. chinensis chloroplast genome from sequencing data generated on the Illumina HiSeq platform. Genome annotation revealed both conserved and variable features of the C. chinensis genome relative to other Ranunculales species. The phylogenetic and molecular dating analyses also deepen our understanding of the evolutionary history of the order Ranunculales.

Plant Material and Library Preparation.
C. chinensis was collected from Shizhu, Chongqing City, China. DNA extraction and library preparation followed the methods described by He et al. [8]. Fresh leaves were used to extract total chloroplast DNA with the Tiagen Plant Genomic DNA Kit (Beijing, China). DNA fragments of 300 bp were obtained by shearing the extracted genomic DNA with a Covaris M220 Focused-Ultrasonicator (Covaris, Woburn, MA, USA). A sequencing library was constructed with the NEBNext® Ultra™ DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA) according to the manufacturer's manual.

DNA Sequencing, Data Preprocessing, and Genome Assembly.
Cluster generation was performed using the TruSeq PE Cluster Kit (Illumina, San Diego, CA, USA), and 2 × 100 bp reads were generated on an Illumina HiSeq 2500. FASTX-Toolkit (2016a) was used to remove adaptor-contaminated reads, reads dominated by low-quality bases (quality score <20 or ambiguous nucleotide), and short reads (<20 bp). The remaining reads were called "clean reads." Velvet v1.2.07 [9] was used for the de novo assembly of these clean reads, with the parameters described by He et al. [10].
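The quality and length criteria described above can be sketched in a few lines of Python. This is an illustration only and not the FASTX-Toolkit commands used in the study: the function names are hypothetical, Phred+33 quality encoding is assumed, and the 50% cutoff for a read "dominated" by low-quality bases is an assumption because the text does not give a fraction. Adaptor removal is not covered.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def is_clean_read(sequence, quality_string, min_quality=20, min_length=20,
                  max_bad_fraction=0.5):
    """Return True if a read passes the length and quality criteria.

    max_bad_fraction is a hypothetical cutoff: the source does not state
    what fraction of low-quality bases makes a read "dominated" by them.
    """
    if len(sequence) < min_length:
        return False
    scores = phred_scores(quality_string)
    # A base is "bad" if its score is below the threshold or it is ambiguous (N).
    bad = sum(1 for base, q in zip(sequence, scores)
              if q < min_quality or base.upper() == "N")
    return bad / len(sequence) <= max_bad_fraction
```

For paired-end data, a pair would be kept only if both mates pass, for example `is_clean_read(seq1, qual1) and is_clean_read(seq2, qual2)`.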
To determine the order and orientation of the contigs, the 43 Velvet contigs were aligned to the Megaleranthis saniculifolia chloroplast genome [11] (NCBI RefSeq accession NC_012615.1; a species in the Ranunculaceae). Then, five pairs of primers linking adjacent contigs were designed and used for PCR amplification of the unassembled regions, and the PCR products were sequenced with the Sanger method. Finally, using the Lasergene SeqMan program from DNASTAR (Madison, WI, USA), the Sanger reads and the Velvet contigs were assembled into a high-quality complete chloroplast genome (NCBI GenBank accession: KY120323).

Genome Annotation.
The C. chinensis chloroplast genome was annotated with DOGMA (Dual Organellar GenoMe Annotator) [12] and then manually reviewed to remove duplicated annotations and to check start and stop codons. The predicted genes were also searched with BLAST [13] against the NCBI nonredundant protein sequence database, the KEGG database [14], and the COG database [15]. The graphical map of the circular plastome was drawn using GenomeVx [16]. To compare the function of chloroplast proteins from Ranunculales species, we annotated these proteins from Supplementary

SSR Identification.
MISA (MIcroSAtellite identification tool, 2016b) was used to identify simple sequence repeats (SSRs) in the C. chinensis chloroplast genome together with 23 other chloroplast genomes. The settings were as follows: more than 10 repeats for mononucleotide SSRs, six repeats for dinucleotide SSRs, and five repeats each for trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide SSRs. Compound SSRs were defined as two SSRs separated by fewer than 100 nt.
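To make the thresholds concrete, the following is a minimal Python sketch of a perfect-SSR scan. It is not MISA itself, and all names are hypothetical. The thresholds are treated as minimum repeat counts (10, 6, 5, 5, 5, 5), which is how MISA definitions are usually written; that reading of "more than 10 repeats" is an assumption.

```python
# Minimum repeat count per motif length (assumed reading of the settings above).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_reducible(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return any(n % k == 0 and motif[:k] * (n // k) == motif
               for k in range(1, n))


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return perfect SSRs as (start, end, motif, repeats); 0-based, end exclusive."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_count in sorted(min_repeats.items()):
        i = 0
        while i + unit_len <= len(seq):
            motif = seq[i:i + unit_len]
            # Skip ambiguous bases and motifs that belong to a shorter class.
            if set(motif) - set("ACGT") or is_reducible(motif):
                i += 1
                continue
            j = i + unit_len
            while seq[j:j + unit_len] == motif:
                j += unit_len
            count = (j - i) // unit_len
            if count >= min_count:
                hits.append((i, j, motif, count))
                i = j
            else:
                i += 1
    return sorted(hits)


def compound_pairs(hits, max_gap=100):
    """Pair neighbouring SSRs separated by fewer than max_gap nucleotides."""
    return [(a, b) for a, b in zip(hits, hits[1:]) if b[0] - a[1] < max_gap]
```

Skipping reducible motifs keeps a run such as A×12 from also being reported as a dinucleotide (AA×6) or trinucleotide (AAA×4) repeat.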

Phylogenetic Tree Reconstruction and Divergence Time Estimation.
The chloroplast genome annotation data for the species listed in Supplementary Table S3 were downloaded from NCBI. Genes present in all 24 chloroplast genomes were then extracted, leaving a total of 42 genes. The protein sequences of each gene were aligned using MUSCLE (version v3.8.31, default parameters) [17]. The CDS alignments were obtained by converting the corresponding protein alignments into codon alignments with PAL2NAL [18] and were then concatenated into a supermatrix. Using this CDS alignment dataset, the phylogenetic tree was reconstructed with RAxML [19] under the GTR + Ι + Γ substitution model, and the divergence times were estimated with the MCMCTree program from the PAML4.7 package [20] following the methods described by He et al. [8]. The split between the Magnoliaceae and the Ranunculales was constrained with lower and upper bounds of 125 and 193 Mya, respectively [21].
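The step from protein alignments to a concatenated CDS supermatrix can be illustrated with a small Python sketch. It is a simplified stand-in for what PAL2NAL does, not its actual implementation: the function names are hypothetical, and the sketch does not verify that each codon translates to the aligned residue.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread an unaligned CDS onto one gapped row of a protein alignment.

    Each residue consumes one codon and each gap becomes '---'.
    A trailing stop codon in the CDS, if present, is ignored.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    codons = (cds[i:i + 3] for i in range(0, 3 * n_residues, 3))
    return "".join("---" if aa == "-" else next(codons)
                   for aa in aligned_protein)


def concatenate_alignments(gene_alignments, taxa):
    """Join per-gene codon alignments (dicts of taxon -> row) into a supermatrix."""
    return {taxon: "".join(gene[taxon] for gene in gene_alignments)
            for taxon in taxa}
```

Aligning at the protein level and then mapping back to codons keeps gaps in multiples of three, so the reading frame is preserved for the codon-based analyses that follow.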

Identification of Positively Selected Genes (PSGs).
The CDS alignments of the 42 genes were used to identify positively selected genes. The ω (Ka/Ks) ratios of the codons retained after filtering were calculated for the 42 genes using the branch-site model of CODEML in PAML4.7a [20], with C. chinensis set as the foreground branch and the other species as background branches. The null hypothesis was that the ω of each site was either equal to 1 or less than 1, while the alternative hypothesis allowed the ω of particular sites on the foreground branch to be larger than 1. Likelihood ratio tests (LRTs) were then performed, and the p values were used to guard against violations of model assumptions. A gene was considered to have undergone positive selection on the foreground branch if it showed a statistically significant LRT and if positively selected sites on that branch were identified in the Bayes empirical Bayes (BEB) analysis.

Results and Discussions
3.1. Genome Sequencing and Assembly. We generated 2.13 GB of paired-end reads (2 × 100 bp) using the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). Clean reads were obtained by removing adaptors and low-quality read pairs. In total, we obtained 10,624,225 clean read pairs, which were assembled into 43 contigs with an N50 length of 47,033 bp using the Velvet assembler (Table 1). To determine their order and orientation, the Velvet contigs were aligned to the Megaleranthis saniculifolia chloroplast genome [11] (Supplementary Table S1), and the gaps between adjacent contigs were then closed with Sanger reads (Supplementary Figure S1; primers are listed in Supplementary Table S2). The final complete C. chinensis chloroplast genome is 155,484 bp long with a guanine-cytosine content of 38.17%, which falls within the range typical of angiosperm chloroplast genomes. By comparing it with the M. saniculifolia chloroplast genome, we confirmed synteny and the absence of inversions or rearrangements in the genome.
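For readers unfamiliar with the N50 statistic quoted above, a minimal Python sketch of its usual definition follows: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are hypothetical and are not the study's contigs.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Made-up example: total = 28, half = 14, cumulative sums are 10, 18, ...
example_n50 = n50([10, 8, 5, 3, 2])
```

An N50 of 47,033 bp for a genome of about 155 kb means that a few long contigs already cover at least half of the assembled sequence.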

Genome Annotations.
Consistent with the general quadripartite structure of plant chloroplast genomes, the C. chinensis chloroplast genome has two inverted repeat regions (IRa and IRb), each 26,758 bp long, which split the circular genome into a small single copy (SSC) region of 17,383 bp and a large single copy (LSC) region of 84,585 bp. The guanine-cytosine content of the LSC and SSC regions (36.4% and 32.1%, respectively) is lower than that of the IR regions (43%). The relatively higher GC content in the IR regions may be attributable to the tRNA and rRNA genes, which is consistent with the results from Pogostemon cablin [10]. The chloroplast genome of C. chinensis was predicted to contain 128 gene loci, including 8 rRNA gene loci, 28 tRNA gene loci, and 92 protein-coding gene loci (Figure 1, Table 2). These loci represent 107 unique genes: 80 protein-coding genes, 23 tRNA genes, and 4 rRNA genes. Each IR region contained five tRNA genes (trnI-CAT, trnL-CAA, trnV-GAC, trnR-ACG, and trnN-GTT), nine protein-coding genes (ten loci), and all 4 rRNA genes. Extensions of the IR into the genes rps19 and ycf1 were identified (Figure 1), resulting in pseudogenized copies due to incomplete duplication. Of the 92 protein-coding gene loci, nine are duplicated (Table 2). The ycf15 gene has four copies in the C. chinensis chloroplast genome, two in each IR region. The rps12 gene has three copies, one each in IRa, IRb, and the LSC region (Table 2).
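The region lengths and GC percentages reported above can be cross-checked with a few lines of Python. The sketch uses only the numbers given in the text; the function names are hypothetical, and the regional GC values are rounded, so the length-weighted average is expected to agree with the genome-wide figure only approximately.

```python
# Region lengths (bp) and GC percentages as reported in the text.
REGIONS = {
    "LSC": (84_585, 36.4),
    "SSC": (17_383, 32.1),
    "IRa": (26_758, 43.0),
    "IRb": (26_758, 43.0),
}


def genome_length(regions):
    """Total genome length as the sum of the four regions."""
    return sum(length for length, _ in regions.values())


def weighted_gc(regions):
    """Length-weighted mean GC percentage across regions."""
    total = genome_length(regions)
    return sum(length * gc for length, gc in regions.values()) / total


def gc_content(sequence):
    """GC percentage of a DNA string, counting only A, C, G, and T."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt
```

The four regions sum to 155,484 bp, matching the reported genome length, and the length-weighted GC content comes out at about 38.2%, in line with the reported 38.17%.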
We mapped the proteins to the NR, Clusters of Orthologous Groups (COG) [15], and Kyoto Encyclopedia of Genes and Genomes (KEGG) [14] databases. A total of 76 proteins were aligned to orthologs in the KEGG database, whereas only 56 proteins could be assigned to COG orthologs (Supplementary Tables S5-S6). Homologs of all 92 proteins except the two proteins encoded by ycf15 were identified in the NR database (Supplementary Table S4), indicating high-quality annotation. Most of the proteins are involved in photosynthesis, energy metabolism, and ribosome-related functions, as indicated by annotations from the NR and KEGG databases. Consistent with other species from the same order, the COG classification placed most of these proteins in two categories: Category J (translation, ribosomal structure, and biogenesis) and Category C (energy production and conversion) (Supplementary Tables S7-S8).

Identification of Simple Sequence Repeats (SSRs).
We identified perfect SSRs in the C. chinensis chloroplast genome, as well as in the chloroplast genomes of several other species in the Ranunculales. Both the numbers and the types of chloroplast SSRs vary among species (Table 3 and Supplementary Tables S10-S11). The most abundant SSRs in all species were mononucleotide repeats, with numbers varying from 16 to 71. Moreover, most mononucleotide SSRs are composed of polyadenine and polythymine, which is consistent with the results of other studies [10]. In addition, mononucleotide SSRs were fewer in C. chinensis and other species of the Ranunculaceae than in species of the Berberidaceae, while dinucleotide SSRs were relatively more common.

Phylogenetic Tree Construction and Divergence Time Estimation.
To determine the evolutionary history of C. chinensis within the Ranunculales, we used the 42 genes present in all 24 chloroplast genomes, including 21 sequenced chloroplast genomes from species in the Ranunculales and two species from the Magnoliaceae as an outgroup. Phylogenetic analysis shows that the six Ranunculaceae species and the 12 Berberidaceae species form two distinct clades, whereas the other four species are relatively divergent and early-branching within the Ranunculales (Figure 2). Divergence times were estimated using the MCMCTree program in the PAML4.7a package [20] (Figure 2), and all estimated times matched well with the data deposited in TIMETREE, a public knowledge base of divergence times among organisms, supporting the reliability of the molecular clock dating strategy. C. chinensis branches early within the Ranunculaceae and diverged from the other Ranunculaceae species about 81 million years ago (Mya). The divergence between Ranunculaceae and Berberidaceae was about 111 Mya, whereas Ranunculales and Magnoliaceae shared a common ancestor during the Jurassic period, around 153 Mya.
To detect potential positive selection affecting specific sites along the C. chinensis lineage, the branch-site model implemented in PAML [20] was applied (Figure 3). The results suggest that ndhG (NADH dehydrogenase subunit 6) evolved under positive selection in the C. chinensis lineage (Supplementary Table S3). The test statistic (2ΔL) for the ndhG gene was 5.79, and the p value was 0.008. BEB analysis identified position 104 of this protein as positively selected in C. chinensis, with a posterior probability of 0.994. ndhG is one of the 11 NADH dehydrogenase genes, and the ndhG subunit associates with nuclear-encoded subunits to form the NADH dehydrogenase-like complex in angiosperm chloroplasts. This protein complex associates with photosystem I to form a supercomplex, which mediates cyclic electron transport [22], produces ATP to balance the ATP/NADPH ratio, and facilitates chlororespiration [23]. Therefore, the signal of selection identified in C. chinensis indicates positive selection on elements of the photosystem-chlororespiration system.
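The relation between the reported test statistic and p value can be illustrated with a short Python sketch. The text does not state which null distribution was used. The sketch assumes the 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom that is often applied to the branch-site test, because under that assumption 2ΔL = 5.79 gives a p value close to the reported 0.008. The function names are hypothetical.

```python
import math


def chi2_sf_df1(x):
    """P(X >= x) for a chi-square variable with one degree of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def branch_site_pvalue(two_delta_lnl, mixture=True):
    """p value of the LRT from the statistic 2*(lnL_alt - lnL_null).

    mixture=True halves the chi-square(1) tail probability, which
    corresponds to the assumed 50:50 mixture null distribution.
    """
    if two_delta_lnl <= 0:
        return 1.0
    p = chi2_sf_df1(two_delta_lnl)
    return p / 2.0 if mixture else p


p_mixture = branch_site_pvalue(5.79)               # about 0.008
p_plain = branch_site_pvalue(5.79, mixture=False)  # about 0.016
```

Under a plain chi-square distribution with one degree of freedom the same statistic gives about 0.016, so the result would be significant at the 0.05 level under either choice.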