Complete Chloroplast Genome Sequence of Sonchus brachyotus Helps to Elucidate Evolutionary Relationships with Related Species of Asteraceae

Sonchus brachyotus DC. possesses both edible and medicinal properties and is widely distributed throughout China. In this study, the complete cp genome of S. brachyotus was sequenced and assembled. The total length of the complete S. brachyotus cp genome was 151,977 bp, including an LSC region of 84,553 bp, SSC region of 18,138 bp, and IR region of 24,643 bp. Sequence analyses revealed that the cp genome encoded 132 genes, including 87 protein-coding genes, 37 tRNA genes, and 8 rRNA genes. The GC content was 37.6%. One hundred mononucleotide microsatellites, 4 dinucleotide microsatellites, 67 trinucleotide microsatellites, 4 tetranucleotide microsatellites, and 1 long repeat were identified. The SSR frequency of the LSC region was significantly greater than that of the IR and SSC regions. In total, 175 SSRs and highly variable regions were recognized as potential cp markers. By analyzing the IR/LSC and IR/SSC boundaries, structural differences between S. brachyotus and 6 other species were detected. According to phylogenetic analyses, S. brachyotus was most closely related to S. arvensis and S. oleraceus. Overall, this study provides complete cp genome resources for S. brachyotus that will be beneficial for identifying potential molecular markers and evolutionary patterns of S. brachyotus and its closely related species.

pneumoniae, Salmonella enterica, Staphylococcus aureus, and Micrococcus luteus; this is especially true in the case of Escherichia coli. In addition, functional antioxidant components of S. brachyotus, including caffeic acid, rutin, orientin, and luteolin, can scavenge free radicals [12]. Although the chemical composition of S. brachyotus has been reported, S. brachyotus and S. arvensis are similar in morphology and difficult to distinguish, and their phylogenetic relationships are not very clear.
The chloroplast is a plastid that supplies the energy needed for growth via photosynthesis and plays vital roles in the physiology and development of plants. Chloroplasts, as semiautonomous organelles, possess their own genetic information expression system. In contrast to nuclear DNA, chloroplast (cp) DNA exhibits uniparental inheritance. The cp genome is more conserved than the mitochondrial and nuclear genomes in terms of gene type, genome organization, and genome structure [13], so it has become an important tool for reconstructing the phylogenetic relationships among plant species [14][15][16][17][18][19][20][21]. With advances in bioinformatics analysis and sequencing technology, studies on the evolution of species using cp genome sequences are increasing.
In this study, we sequenced and analyzed the complete cp genome of S. brachyotus and reconstructed the phylogeny of Compositae based on the cp genomes of 42 species. The following questions were addressed: (1) what are the features of the cp genome of S. brachyotus? (2) How many potential microsatellite markers can the cp genome provide? (3) Which types of structural variation events have occurred across the cp genomes in the Sonchus genus?

DNA Extraction, Genome Sequencing, and Annotation.
Total genomic DNA was extracted from 100 mg of fresh leaves of S. brachyotus using the CTAB (cetrimonium bromide) method [22]. A genomic library was constructed and sequenced on the Illumina NovaSeq 6000 platform following the standard Illumina paired-end (PE) protocol. The raw reads were trimmed using NGS QC Toolkit v2.3.3 [23]. After removal of low-quality reads and adapter sequences, the clean reads were aligned to the reference genomes of Lactuca sativa (NC_007578.1) and S. arvensis (NC_054161) from the NCBI GenBank database using the Burrows-Wheeler Aligner (BWA) [24], and the reads matching chloroplast genomes were selected from the clean sequence data. The matched PE reads were assembled using SPAdes v3.10.1 [25]. The reference sequences of the genomes were compared for collinearity and rearrangements using MUMmer v3.23 [26]. 25 was applied to compare coding sequences (CDSs) with those of chloroplasts in the NCBI database; the preliminary draft annotation was then examined and adjusted manually by comparison with the reference cp genome to obtain the final gene annotation. The rRNA and tRNA annotations were obtained by using HMMER v3.1b2 and Aragorn v1.2.38 to compare against the rRNA and tRNA sequences of chloroplasts in the NCBI online database. The annotated cp DNA sequence was submitted to the NCBI database via BankIt under GenBank accession number MT850048. OGDRAW v1.1.1 [27] was then used to draw the cp genome map of S. brachyotus from the assembly results.

Phylogenetic Analyses.
A total of 43 cp genomes were used to infer the phylogenetic relationships, including the newly sequenced S. brachyotus genome and those of 42 published Compositae species available in GenBank (Table S1). Multiple alignments were performed using complete cp genomes based on the conserved structure and gene order of the chloroplast genomes. All the nucleotide sequences were aligned using MAFFT v7.308 [32] to assess the taxonomic and phylogenetic relationships of S. brachyotus. Two methods were employed to construct phylogenetic trees: maximum parsimony (MP) and Bayesian inference (BI). MP analyses were performed using MEGA 11.0 software [33], with 1,000 addition-sequence replicates in the heuristic search. BI analyses were conducted using MrBayes v3.2.6 [34] based on the GTR+G model inferred from Modeltest 3.7 [35]. The first 25% of the trees generated were discarded as burn-in, and the remaining trees were used to construct a majority-rule consensus tree with posterior probability (PP) values for each node.
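The burn-in and consensus step described above can be illustrated with a minimal sketch. It is not the MrBayes implementation: it assumes each sampled tree has already been reduced to a set of clades (each clade a frozenset of taxon names), and the function name majority_rule_clades and its parameters are hypothetical. The posterior probability of a clade is taken as the fraction of post-burn-in trees that contain it.

```python
from collections import Counter

def majority_rule_clades(sampled_trees, burn_in=0.25, threshold=0.5):
    """Return {clade: posterior probability} for clades above the threshold.

    sampled_trees: list of trees in sampling order; each tree is a set of
    clades, and each clade is a frozenset of taxon names (an assumed,
    simplified representation chosen for this sketch).
    """
    # Discard the first `burn_in` fraction of the sampled trees.
    kept = sampled_trees[int(len(sampled_trees) * burn_in):]
    # Count in how many of the retained trees each clade occurs.
    counts = Counter(clade for tree in kept for clade in tree)
    # Keep clades found in more than `threshold` of the retained trees.
    return {c: n / len(kept) for c, n in counts.items() if n / len(kept) > threshold}

# Toy sample of 8 trees over hypothetical taxa A-D.
ab = frozenset({"A", "B"})
cd = frozenset({"C", "D"})
ac = frozenset({"A", "C"})
trees = [{ac}, {ac}, {ab, cd}, {ab, cd}, {ab, cd}, {ab}, {ab}, {ac}]
consensus = majority_rule_clades(trees)
print(consensus)
```

In this toy sample the first two trees are discarded, the clade {A, B} occurs in 5 of the 6 remaining trees and is retained, and the clade {C, D} occurs in exactly half of them and is therefore dropped.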

Results
Chloroplast Genome Features, Sequencing, and Assembly of S. brachyotus.
After trimming of low-quality reads and adapter sequences, 24,858,121 clean reads with a total length of approximately 7.5 Gb were obtained from the Illumina NovaSeq 6000 platform. Based on a combination of de novo and reference-guided assembly, the cp genome of S. brachyotus was obtained. The complete cp genome sequence of S. brachyotus was submitted to the NCBI database under GenBank accession number MT850048. The total length of the cp genome of S. brachyotus was 151,977 bp (Table 1, Figure 1). The cp genome contained four characteristic regions: a large single-copy (LSC) region of 84,553 bp, a small single-copy (SSC) region of 18,138 bp, and a pair of inverted repeats (IRa and IRb) of 24,643 bp each. The base composition of the complete cp genome sequence was analyzed and found to be 31.3% T, 31.1% A, 18.7% C, and 18.9% G. The overall GC content was 37.6%, which is very close to those of other Sonchus species. Furthermore, the GC contents were unevenly distributed across regions of the cp genome and were found to be 35.71%, 31.44%, and 43.08% for the LSC, SSC, and IR regions, respectively. The S. brachyotus cp genome included 132 genes, 1 or 2 more than the other 6 Sonchus genomes, comprising 87 protein-coding genes, 8 rRNA genes, and 37 tRNA genes (Table 1). Eight protein-coding genes (ndhB, rpl2, rpl23, rps7, rps12, ycf2, ycf15, and ycf1), 7 tRNA genes (trnI-CAU, trnL-CAA, trnV-GAC, trnI-GAU, trnA-UGC, trnR-ACG, and trnN-GUU), and 4 rRNA genes (rrn16, rrn23, rrn4.5, and rrn5) were duplicated in the IR regions of the cp genome. There were 113 unique genes, of which 16 (trnK-UUU, rps16, rpoC1, atpF, trnG-UCC, trnL-UAA, trnV-UAC, rps12, petB, petD, rpl16, rpl2, ndhB, trnI-GAU, trnA-UGC, and ndhA) contained 1 intron, whereas 2 protein-coding genes (ycf3 and clpP) contained 2 introns (Table 2). The majority of these intron-containing genes were located in the LSC region.
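As a quick consistency check on the figures reported above, the following sketch recomputes the total genome length from the region lengths, the overall GC content as a length-weighted mean of the regional GC contents, and the number of unique genes. The function names are hypothetical, and the unique-gene count assumes that every duplicated gene is present in exactly two copies.

```python
# Region lengths (bp) and regional GC contents (%) as reported in the text.
LSC, SSC, IR = 84_553, 18_138, 24_643
GC = {"LSC": 35.71, "SSC": 31.44, "IR": 43.08}

def quadripartite_length(lsc, ssc, ir):
    """Total length of a cp genome with one LSC, one SSC, and two IR copies."""
    return lsc + ssc + 2 * ir

def weighted_gc(lsc, ssc, ir, gc):
    """Overall GC content as the length-weighted mean of the regional values."""
    total = quadripartite_length(lsc, ssc, ir)
    return (lsc * gc["LSC"] + ssc * gc["SSC"] + 2 * ir * gc["IR"]) / total

total_length = quadripartite_length(LSC, SSC, IR)
overall_gc = weighted_gc(LSC, SSC, IR, GC)
# 132 genes minus one extra copy for each of the 8 + 7 + 4 duplicated genes
# (assumption: each duplicated gene occurs exactly twice).
unique_genes = 132 - (8 + 7 + 4)

print(total_length)          # 151977
print(round(overall_gc, 1))  # 37.6
print(unique_genes)          # 113
```

All three values agree with those stated in the text (151,977 bp, 37.6%, and 113 unique genes).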

Simple Sequence Repeats and Large Repeat Sequences.
In this study, we explored the presence of various microsatellites (mono-, di-, tri-, tetra-, penta-, and hexanucleotides) in the cp genome of S. brachyotus. A total of 175 microsatellites were detected, and the most common simple sequence repeats (SSRs) were mononucleotides (notably A/T), with 100 occurrences accounting for 57% of the SSRs. The second most abundant motif type was the trinucleotide type, especially TAA, with a total of 67 (approximately 38%). The proportions of the other SSR types were relatively low (approximately 2% each for dinucleotides and tetranucleotides). Intriguingly, the SSRs in S. brachyotus were chiefly distributed in coding regions (46.5%), followed by intergenic regions (41%), with a much lower number in introns (12.6%). The SSRs were unevenly distributed across the quadripartite structure of the cp genome, with the largest number situated in the LSC region, followed by the IR and SSC regions (Figure 2(a)).
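To make the notion of an SSR concrete, the following is a minimal regex-based sketch of a microsatellite scan. It is not the tool used in this study: the minimum repeat counts in MIN_REPEATS are hypothetical placeholders (the thresholds applied here are not stated in this section), and the function name find_ssrs is an assumption for illustration.

```python
import re

# Hypothetical minimum number of repeat units for motif lengths 1-6.
MIN_REPEATS = {1: 8, 2: 5, 3: 3, 4: 3, 5: 3, 6: 3}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (0-based start, motif, repeat count) tuples."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A motif of length k followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA" or "AAA"),
            # since those runs are already reported under the shorter motif.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

example = "GC" + "A" * 10 + "GC" + "TAA" * 4 + "G"
print(find_ssrs(example))  # [(2, 'A', 10), (14, 'TAA', 4)]
```

Because the scan for each motif length is non-overlapping, this sketch can miss an SSR that overlaps a skipped degenerate motif; dedicated SSR tools handle such cases more carefully.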
Repeat motifs are valuable for phylogenetic reconstruction. Consequently, we examined the forward, palindromic, complementary, and reverse repeats in the S. brachyotus cp genome (Figure 2(b)). Overall, 35 pairs of repeat sequences were identified, comprising 16 palindromic repeats and 19 forward repeats; no complementary or reverse repeats were found. The lengths of the repeats ranged from 30 to 24,643 bp, and the most common repeat length was 30 bp (approximately 34%), followed by repeats of 43 bp (11%) and 31-42 bp (less than approximately 10%), while those of 43-24,643 bp (approximately 2%) were comparatively rare. The repeats were mainly distributed in noncoding regions, including intergenic spacers (IGSs) and introns. However, several protein-coding and tRNA genes, such as ycf2, ycf3, psbN, psaB, psaA, ndhA, rpl16, and trnS, also contained repeat sequences.
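The four repeat categories can be distinguished by how the second copy of a repeat relates to the first. The sketch below classifies a pair of equal-length sequences under the simplifying assumption of exact matches (repeat-finding tools typically also allow a small number of mismatches); the function name classify_repeat is hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def classify_repeat(first, second):
    """Classify the relationship of `second` to `first` (exact matches only)."""
    first, second = first.upper(), second.upper()
    if second == first:
        return "forward"        # same sequence, same orientation
    if second == first[::-1]:
        return "reverse"        # same strand, reversed order
    if second == first.translate(_COMPLEMENT):
        return "complement"     # complemented, not reversed
    if second == first.translate(_COMPLEMENT)[::-1]:
        return "palindromic"    # reverse complement
    return "none"

print(classify_repeat("AAGC", "AAGC"))  # forward
print(classify_repeat("AAGC", "GCTT"))  # palindromic
```

Under this scheme, the two IR copies of a cp genome form a palindromic repeat, which is consistent with the longest repeat reported above having the same length as the IR region (24,643 bp).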

Expansion and Contraction of Border Regions.
The expansion and contraction of the borders and adjacent genes of cp genomes give rise to genome size variations among various plant lineages. Hence, the borders and adjacent

Sequence Divergence and Hot Spots.
To clarify the level of genomic differences, the cp genome sequences of S. brachyotus and 6 other Sonchus species were compared via Mauve. The locally collinear blocks (LCBs) identified by Mauve showed high sequence similarity among the 7 Sonchus cp genomes, which indicated that the genome structure was quite conserved at the gene sequence level (Figure 4). As anticipated, the SC

Table 2: List of genes found in the chloroplast genome of S. brachyotus. Columns: category of genes, group of genes, and names of genes.

The IR region was more conserved than the LSC and SSC regions, with fewer variable sites, and 5 loci were highly variable: ycf3, matK, rpl36, ndhF, and ycf1. Three of these loci (ycf3, matK, and rpl36) were located in the LSC region, and 2 (ndhF and ycf1) were located in the SSC region (Figure 5). These 5 divergence hotspots, which were the most variable regions (Pi > 0.02), could be used as potential molecular markers for phylogenetic studies of Sonchus species.
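The nucleotide diversity (Pi) used to flag these hotspots can be sketched as the average proportion of differing sites over all pairs of aligned sequences, computed in sliding windows. The code below is an illustration only: the window and step sizes are hypothetical defaults (the values used in this study are not given in this section), and the function names are assumptions. Only the Pi > 0.02 cutoff comes from the text.

```python
from itertools import combinations

def nucleotide_diversity(aligned):
    """Average pairwise proportion of differing sites (Pi).

    aligned: list of at least 2 equal-length aligned sequences; alignment
    columns where either sequence of a pair has a gap ("-") are skipped.
    """
    pairs = list(combinations(aligned, 2))
    total = 0.0
    for a, b in pairs:
        sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if sites:
            total += sum(x != y for x, y in sites) / len(sites)
    return total / len(pairs)

def sliding_pi(aligned, window=600, step=200):
    """Return (0-based window start, Pi) for each window along the alignment."""
    length = len(aligned[0])
    result = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in aligned]
        result.append((start, nucleotide_diversity(chunk)))
    return result

# Toy alignment of three sequences; a real analysis would use the 7 Sonchus genomes.
toy = ["ACGT", "ACGA", "ACGT"]
windows = sliding_pi(toy, window=2, step=2)
hotspots = [start for start, pi in windows if pi > 0.02]
print(hotspots)  # [2]
```

In the toy alignment only the second window contains a variable site, so it is the only one exceeding the cutoff.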

Phylogenetic Analysis.
On the basis of the phylogenetic analysis of the cp genomes of S. brachyotus and 42 representative Compositae plants, the taxonomic status and evolutionary relationships of S. brachyotus were determined (Figure 6). The evolutionary tree revealed clear phylogenetic relationships for 43 species in 14 genera of Compositae, which were clustered into 3 branches. The first branch consists of 18 species in 4 genera, Lactuca, Mulgedium, Taraxacum, and Sonchus, all belonging to Lactuceae. The second branch consists of 11 species from 4 genera, Atractylodes, Cirsium, Carthamus, and Saussurea, all of which are members of Cynareae. The third branch consists of 14 species of 6 genera, Chrysanthemum, Artemisia, Leontopodium, Aster, Anaphalis, and Helianthus: Chrysanthemum and Artemisia belong to Anthemideae; Leontopodium and Anaphalis belong to Inuleae; Aster belongs to Astereae; and Helianthus belongs to Heliantheae. Sonchus is located on the first branch of the phylogenetic tree. Within Sonchus, S. brachyotus clusters most closely with the small clade formed by S. arvensis and S. oleraceus, so it can be inferred that these species are its closest relatives.

Discussion
As the second largest family in the plant kingdom, Compositae consists of approximately 1,620 genera and more than 23,600 species [36,37]. Nevertheless, few cp genomic sequences for members of this family have been stored in GenBank, with the first sequence being that of L. sativa [38,39]. Although the advancement of high-throughput sequencing techniques has enabled several additional Compositae cp genomes to be sequenced [40][41][42][43], the cp genome of S. brachyotus has remained unexplored. In this study, we sequenced the complete cp genome of S. brachyotus by using Illumina high-throughput sequencing technology.
The structure and genes of the cp genome of S. brachyotus were found to be highly conserved through comparative analysis with closely related species, and they exhibited the same protein-coding genes, tRNAs, and rRNAs. Nevertheless, there was a difference in genome size (Table 1), indicating genetic differences. We found that this phenomenon may be due to contractions and expansions of boundary regions [44][45][46][47][48]. The length of the cp genome sequence is related to the contraction and expansion of noncoding regions. Recent studies have revealed that the IRb/SSC and IRa/LSC regions are mainly responsible for length differences in cp genome sequences, and such regions have been discovered in numerous angiosperm cp genome sequences [49]. Cho et al. [1,50] carried out a boundary analysis of the LSC, SSC, and IR regions of the cp genomes of 5 Sonchus plants and found some slight differences in the position or length of the rps19, rpl2, trnH, ndhF, and ycf1 genes. Although the whole genome structure, including both gene number and order, was found to be nearly identical, the cp genome of S. brachyotus and the 6 published cp genomes of Sonchus (S. oleraceus, S. boulosii, S. canariensis, S. acaulis, S. webbii, and S. arvensis) showed obvious deviations at the IRb/SSC and IRa/LSC borders.
Microsatellites can be divided into mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats. Depending on their location, SSRs have functional roles in the genome, including roles in gene regulation, development, and evolution. As shown in a genomewide analysis of polymorphisms related to height, microsatellite markers can be powerful tools for measuring genetic diversity in populations and addressing genetic issues, such as gene origin, gene flow, and species group configuration, at the level of both intraspecific and interspecific variation [51]. A previous analysis of the SSRs of the cp genomes of 5 species of Sonchus found that the SSRs were mainly distributed in coding regions and LSC regions. In our study, 175 SSRs were found in S. brachyotus, and most of them were likewise located in the LSC region.
Previous studies show that multiple sequence alignments used for interspecies discrimination can reveal mutational hotspots [57,58] and be applied in phylogenetic or phylogeographic studies [59,60]. Some studies have also shown that markers derived from chloroplast genomes can be used in phylogenetic studies [61]. In several studies, the LSC and SSC regions were less conserved than the IR region [61][62][63], as was also observed in this study. Numerous variable sites (e.g., ycf3, matK, rpl36, ndhF, and ycf1) were confirmed by calculating and comparing the nucleotide diversity value (Pi). Among them, ycf1 and ycf3 have been demonstrated to be useful markers for phylogenetic studies of Sonchus [1,50]. These markers were also found to be useful for analyzing the intraspecific variation of S. brachyotus. According to the results of the present study, the 5 divergence hotspots screened on the basis of Pi > 0.02 show great potential for the development of a system of highly informative markers for S. brachyotus.
The taxonomic position and evolutionary relationships of S. brachyotus were revealed through comparisons with 42 Compositae plants based on their complete cp genomes. The 43 Compositae plants were divided into 3 groups. The phylogenetic relationships identified among Sonchus species were consistent with those from previous studies [1,50,64]. James et al. [64] constructed a phylogeny of 13 species of Compositae plants on the basis of the cp genome and revealed that S. oleraceus was closely related to L. sativa (AP007232). Cho et al. [1] used cp genomes to analyze a phylogeny of 32 Compositae plants and revealed that S. acaulis, S. canariensis, and S. webbii were closely related to S. oleraceus (MG878405). Cho et al. [50] utilized cp genomes to analyze a phylogeny of 30 Compositae plants and demonstrated that 2 S. asper and 2 S. oleraceus plants were closely related to S. oleraceus (MG878405). Overall, S. oleraceus was closely related to S. asper. In this study, Sonchus was most closely related to Taraxacum, followed by Lactuca. Within the Sonchus genus, S. arvensis is the closest relative of S. brachyotus, followed by S. oleraceus. Therefore, we hypothesize that S. brachyotus and S. arvensis are similar in physiology. The phylogenetic relationships identified within Sonchus, and between Sonchus and other genera of the Compositae, can facilitate additional studies. The cp genome sequences provide useful genetic information for understanding the evolution of Compositae plants.

Conclusions
In this study, we assembled, annotated, and analyzed the cp genome of S. brachyotus, an important wild plant used for food and medicine. The S. brachyotus cp genome (151,977 bp) was fully characterized and compared with those of related species. We identified IR regions, as well as SSC and LSC regions. The S. brachyotus cp genome included 132 genes, comprising 87 protein-coding genes, 8 rRNA genes, and 37 tRNA genes. A total of 175 microsatellites and 35 pairs of repeat sequences were detected in the cp genome of S. brachyotus. The unique inversion, insertion, and gene loss events detected here may provide informative markers for phylogenetic resolution among different genera in Compositae. Several divergence hotspots (e.g., ycf3, matK, rpl36, ndhF, and ycf1) were also identified. Both the MP and BI analyses strongly support the topology in which S. brachyotus is placed close to the clade containing S. arvensis. The cp genomic resources presented in this study will be useful for further studies on the evolutionary patterns of S. brachyotus and its closely related species.

Data Availability
The data that support the findings of this study are openly available in GenBank of NCBI at https://www.ncbi.nlm.nih.gov/, and the accession numbers are provided in Table S1 in Supplementary Materials.

Disclosure
The funding bodies had no role in the study design, analysis and interpretation of data, or writing of the manuscript.