Plastome Phylogenomic and Biogeographical Study on Thuja (Cupressaceae)

Investigating the biogeographical disjunction of East Asian and North American flora is key to understanding the formation and dynamics of biodiversity in the Northern Hemisphere. The small Cupressaceae genus Thuja, comprising five species, exhibits a typical disjunct distribution in East Asia and North America. Owing to obscure relationships, the biogeographical history of the genus remains controversial. Here, complete plastomes were employed to investigate the plastome evolution, phylogenetic relationships, and biogeographic history of Thuja. All plastomes of Thuja share the same gene content arranged in the same order. The loss of an IR was evident in all Thuja plastomes, and the B-arrangement as previously recognized was detected. Phylogenomic analyses resolved two sister pairs, T. standishii-T. koraiensis and T. occidentalis-T. sutchuenensis, with T. plicata sister to T. occidentalis-T. sutchuenensis. Molecular dating and biogeographic results suggest the diversification of Thuja occurred in the Middle Miocene, and the ancestral area of extant species was located in northern East Asia. Incorporating the fossil record, we inferred that Thuja likely originated from the high-latitude areas of North America in the Paleocene with a second diversification center in northern East Asia. The current geographical distribution of Thuja was likely shaped by dispersal events attributed to the Bering Land Bridge in the Miocene and subsequent vicariance events accompanying climate cooling. The potential effect of extinction may have profound influence on the biogeographical history of Thuja.


Introduction
Understanding the differences and connections of biogeographic distribution between the flora of the Northern Hemisphere, especially the flora of East Asia and North America, has been of great interest to systematists and biogeographers since the last century [1][2][3][4][5][6][7][8]. Multiple studies suggest that the geological activities and climatic oscillation in the Cenozoic, especially in the Late Neogene and Quaternary, have been responsible for plant biogeographic patterns of interconti-nental disjunction [9][10][11][12]. In addition, two intercontinental land bridges, the Bering and North Atlantic Land Bridges, have played vital roles in the formation and dynamic floristic disjunctions due to their connectivity at different time points [4,5,8,10]. Previous studies have shown that in angiosperm lineages, the prevalent pattern is origination in East Asia followed by migration to North America [13][14][15][16], while the opposite case, species originating in North America and then migrating to East Asia, has been reported in some gymnosperms [17][18][19][20]. Given that gymnosperms originated much earlier than angiosperms, they have been proposed as an ideal system for understanding deep biogeographical patterns, in particular those of an intercontinental nature.
Thuja is a small genus of Cupressaceae comprising five species: T. plicata Donn ex D. Don, T. occidentalis L., T. koraiensis Nakai, T. standishii (Gordon) Carrière, and T. sutchuenensis Franch [21,22]. Species of Thuja are discontinuously distributed across East Asia and North America. Species T. plicata and T. occidentalis are distributed in western and eastern North America, respectively, whereas T. koraiensis, T. standishii, and T. sutchuenensis are endemic to East Asia with quite restricted distributions [23,24].
The biogeographic history of Thuja remains controversial mainly due to the ambiguous relationships among species, despite several phylogenetic studies [23][24][25]. Early phylogenies of Thuja, including both the extant and fossil species, inferred that T. sutchuenensis may have arisen from an ancestor similar to the fossil species T. polaris and represents the earliest diverging clade of Thuja [25]. Molecular evidence from nrDNA ITS sequences of Thuja showed that T. standishii and T. sutchuenensis are sister and together they form a clade with T. occidentalis, while T. koraiensis and T. plicata form another clade. Based on the ITS tree, an eastern Asia origin and two dispersals to North America were inferred for Thuja [23]. Later, a phylogenetic study using five cpDNA regions (rpl16, AtpI-rpoC1, trnS-trnfM, trnS-trnG, and trnT-trnF), nrDNA ITS, and two low-copy nuclear genes (LEAFY, 4CL) suggested high topological discordance between the chloroplast and nuclear gene trees, with discordance even occurring between the different nuclear gene trees [24]. In that study, the sister relationship of T. sutchuenensis-T. standishii was supported by both the chloroplast and nuclear genes, whereas the sister relationship of T. koraiensis-T. plicata was supported with 4CL, nrDNA ITS, and combined nuclear gene trees. Biogeographic analysis based on the 4CL tree indicated that the most recent common ancestor (MRCA) of Thuja likely had a wide distribution in East Asia and North America [24]. Additionally, comparisons of extant species and species known only in the fossil record showed that Thuja likely first appeared at high latitudes of North America before the Paleocene and spread to eastern Asia in the Miocene [26]. Owing to the relatively incompatible phylogenetic hypotheses proposed previously, the evolutionary history of Thuja still needs further investigation.
Phylogenomic approaches, which generate large amounts of DNA sequence data throughout the genome, have become increasingly essential for reconstructing entangled phylogenetic relationships among gymnosperms [27][28][29]. However, due to the extremely large genome sizes in gymnosperms [19,30,31], phylogenomic studies based on the nuclear genome still present significant challenges, most notably sequencing expense and annotation. In contrast, plastomes, which exhibit a high copy number per cell and a much smaller size, have been successfully implemented in phylogenetic inference of gymnosperms such as in the Pinaceae [32,33] and Cupressaceae [34,35]. Furthermore, the structural variations in plastomes are phylogenetically informative characters of themselves; for instance, the conifers (which Thuja belongs) are characterized by lacking canonical inverted repeats (IRs) and containing lineage-specific repeated tRNA genes [36][37][38][39][40][41][42]. Given the fundamental role of plastomes in understanding gymnosperm evolution, studies within each family/genus of conifers are insufficient to date.
In the present study, we newly sequenced three plastomes of Thuja. Including two previously reported plastomes [40,43], we aim to (i) understand the plastome evolution of Thuja, (ii) reconcile the phylogenetic relationships within this small genus, and (iii) investigate the biogeographical history of Thuja. Additionally, rich molecular markers will be obtained as genetic resources for future research.

Results
2.1. Plastome Assembly, Structure, and Gene Content. After de novo and reference-guided assembly, we obtained three plastome sequences without gaps. The plastome features of all Thuja species are presented in Table 1. The size of Thuja plastomes ranges from 130,505 bp in T. standishii to 131,118 bp in T. occidentalis. A total of 116 genes are identified including 82 protein-coding genes, 4 ribosomal RNAs, and 30 transfer RNAs ( Figure 1). Among the 116 unique genes, there are 15 genes that contain one intron (seven tRNA genes and eight protein-coding genes) and three protein-coding genes containing two introns (rps12 and ycf3). Two tRNA genes, trnI-CAU and trnQ-UUG, have two copies (Table 2). Similar to other conifers, all species of Thuja are found to lack the inverted repeat (IR) region, thereby differing from the conventional quadripartite structure typical of angiosperm plastomes ( Figure 1). In addition, a 36 kb inversion is detected in all Thuja plastomes, resulting in an isomeric plastome with the same B-arrangement as previously recognized [39]. This inversion segment is flanked by the duplicates of the trnQ-UUG gene (Figures 1 and 2).

Repetitive Sequences.
We characterized SSRs without setting a minimum satellite length constraint, thus obtaining abundant molecular markers. The numbers and types of SSRs are quite similar in all plastomes, with the total number ranging from 700 in T. koraiensis to 723 in T. standishii (Figure 3(a)). The proportion for each type of SSR is shown (Figure 3    Although the application of the DEC+J model has been questioned previously [44], the selection of DEC+J may constitute the evidence in favor of founder-event speciation as a biogeographic process. The DEC+J analysis showed that the distribution area of Thuja is mostly restricted to northern East Asia (approx. 0.46) or less likely western North America (approx. 0.18) and northern East Asia/western North America (approx. 0.10). This may indicate that the MRCA of extant Thuja had a wide distribution in East Asia and North America. In addition, six dispersal and three vicariance events were identified within Thuja ( Figure 6).

Characterizations of Thuja
Plastomes. Prior to this study, only two Thuja species had sequenced plastomes available [40,42]. In the present study, we incorporated three newly sequenced plastomes with the two previously reported ones, which provided the opportunity to illustrate plastome evolution, as well as identify valuable molecular markers. All plastomes of Thuja possess 116 unique genes arranged in the same order, including 82 protein-coding genes, 30 tRNA genes, and 4 rRNA genes, with trnI-CAU and trnQ-UUG having two copies. Previous studies have commonly reported that the plastomes of conifers are usually characterized by the loss of an IR [36][37][38][39][40][41][42], which is different from the typical quadripartite structure shared by most angiosperm plastomes [45]. As expected, the loss of an IR was evident in all Thuja plastomes.
The isomeric plastomes formed by the repeated trnQ-UUG gene (i.e., the trnQ-IR arrangements) have been discovered in the Cephalotaxus (Cephalotaxaceae; [38]) and Cupressoideae species (Cupressaceae; [39,40]). Guo et al. [39] recognized two types of isomeric plastome arrangements, the A-arrangement and B-arrangement, by the comparative and Southern blot analyses in Juniperus. Later, another two new types, C-and D-arrangements, were discovered by Qu et al. [40] in Calocedrus. Both studies suggested that the trnQ-IR may have promoted homologous recombination activity and is responsible for the presence of different isomeric forms. In our study, all Thuja plastomes contain the B-arrangement (Figures 1 and 2), which supports the hypothesis of Qu et al. [40] that the B-arrangement predominates in cupressophyte plastomes. Notably, we found that the relict species of Thujopsis, which is sister to Thuja, exhibits the A-arrangement, indicating the plastome rearrangement may have occurred multiple times during cupressophyte evolution. From all the evidence above, we can infer that the existence of isomeric plastomes might be a diagnostic feature in cupressophytes.

BioMed Research International
Survival Commission (SSC), while was rediscovered in 1999 in Chengkou, Chongqing, in the southwest of China [21,24]. In the Late Pliocene, the fossil record of T. sutchuenensis was discovered in Shanxi Province, in northwest China. The northern Greenland cone-bearing material of T. occidentalis has been dated to the Late Pliocene to Pleistocene [26]. From paleobotanical and our phylogenomic evidence, we hypothesize that the ancestor of T. occidentalis and T. sutchuenensis   Prior to this study, two relatively incompatible standpoints have been proposed to explain the biogeographic process of Thuja. Li and Xiang [23] suggested an eastern Asia origin for Thuja based on ITS sequences. While the multiple gene evidence provided by Peng and Wang [24] indicated reticulate evolution occurring in Thuja, and they inferred that Thuja could have originated from the high-latitude areas of North America, although only the 4CL gene was used. Comparative analyses using fossils suggest that Thuja likely first appeared at high latitudes of North America in or before the Paleocene and arrived in eastern Asia in the Miocene [24,26]. According to the present study, the diversification of Thuja was dated to approximately the Middle Miocene and the ancestral area was located in northern East Asia, indicating a second diversification center of Thuja in northern East Asia. In the Middle Miocene, the Bering Land Bridge and the warm climate may have facilitated long distance dispersals (LDD) from East Asia to North America. Subsequently, cli-mate cooling and drying after the Miocene, acting as a vicariance driver, forced a southward migration of Thuja and restricted species in their current distributions.
Previous studies of gymnosperm radiations have mostly inferred Oligocene-age crown groups [11,18,[48][49][50][51], as is the case in cycads [51], Pinaceae [11], and Cupressaceae [18], indicating relatively recent diversification occurring in gymnosperms. The extinction following ancient origination may contribute to the young ages of most living gymnosperm clades. In conifer lineages, the evidence for widespread extinction and range shrinkage has been extensively reported. For example, Sequoioideae and Taxodioideae had a widespread distribution in the Northern Hemisphere in the Cretaceous and Paleogene, with Sequoioideae also found in Australia, but now are restricted to southern North America and East Asia [52]. Species of Torreya were widely distributed in the Northern Hemisphere during the Cretaceous but are now restricted to East Asia and North America [44,53]  7 BioMed Research International early widespread distribution of Thuja. Climatic oscillations and glaciation in the Quaternary have reportedly eliminated many plant groups from Europe [4,53]; this phenomenon is apparently applicable to the whole Northern Hemisphere flora that has been largely influenced by recent extinction. We speculate that the discrepancy between geographic distributions and phylogenetic relationships of Thuja can be attributed to extinction, a process blurs evolutionary history of species while is difficult to trace. Therefore, in the intercontinental disjunction context, we advocate that the potential effect of extinction should be reevaluated in the East Asia and North America flora, in particular, for the ancient rare gymnosperms.

Taxon
Sampling, Chloroplast DNA Isolation, and High-Throughput Sequencing. Two previously reported plastomes (Thuja plicata and T. standishii) [40,43] were downloaded from National Center for Biotechnology Information (NCBI) database. The fresh leaves of T. sutchuenensis, T. occidentalis, and T. koraiensis were collected from Wuhan Botanical Garden, Chinese Academy of Sciences; Atlantic Botanical Garden, USA; and Changbaishan Mts. National Reserve, Jilin, China, respectively. Total DNA was extracted using a modified CTAB protocol [58]. The purified DNA was used to construct Illumina Nextera XT libraries (Illumina, San Diego, CA, USA) following the manufacturer's instructions. DNA sequencing was performed on an Illumina MiSeq platform with paired-end 300 bp reads using V3 chemistry at Kunming Institute of Botany, China.

Plastome Assembly, Annotation, and Comparative
Analyses. After sequencing, Illumina data were filtered using the NGS QC Toolkit (Patel and Jain 2012) by removing adapter sequences and low-quality reads with a quality value ≤ 20. The remaining high-quality reads were assembled into contigs with a minimum length of 1000 bp using SPAdes v.3.7.1 [59]. The complete plastome of T. standishii (GenBank: KX832627.1) was used as a reference for contig assembling [35]. Raw reads were then mapped against the resulting single contig to ensure no gaps remained using  Geneious v.9.0.2 [60]. The assembled plastomes were annotated using Dual Organellar GenoMe Annotator (DOGMA) [61], with manual correction of gene start and stop codons. The tRNA genes were identified with tRNAscan-SE [62]. Graphical maps of the circular plastomes were visualized with OGDRAW [63].
To estimate locally collinear blocks (LCBs) among the examined plastomes, we performed whole-genome alignment using progressive Mauve implemented in Mauve v2.3.1 [64] with default parameters. Simple sequence repeats (SSRs) were identified using Phobos v.3.3.12 (http://www.rub .de/ecoevo/cm/cm_phobos.htm). The default settings in the perfect search function were used by setting a repeat unit size ranging from one to ten without setting a minimum satellite length constraint. Tandem repeats were identified with Tandem Repeats Finder (TRF) [65] with default parameter settings. The tandem repeat lengths were 20 bp or more with the minimum alignment score and maximum period size set as 50 and 500 (respectively), and the identity of repeats was set to 90%.

Phylogenetic Analyses.
Coding sequence (CDS) of all 82 protein-coding genes was extracted from the five plastomes of Thuja, 27 other Cupressaceae species, and two species of Taxaceae (Taxus baccata and Cephalotaxus sinensis) [52]. Accession number and voucher information for each species are provided (Supplementary Materials: Table S1). The 82 genes from the 34 taxa were concatenated in PhyloSuite v.1.1.14 [66] and aligned using MAFFT v.7.22 [67] under "-auto" strategy and codon alignment mode. Ambiguously aligned fragments of the alignment were removed using Gblocks [68] with the following parameter settings: minimum number of sequences for a conserved/flank position (17/17), maximum number of contiguous nonconserved positions (8), minimum length of a block (10), and allowed gap positions (all).
Both the maximum likelihood (ML) and Bayesian inference (BI) analyses were conducted. ML analysis was performed in RAxML v. 8 [70] with the same model as ML analysis (GTR+G). Two runs were conducted in parallel with four Markov chains (one cold and three heated), with each running for 2,000,000 generations from a random tree and sampled every 200 generations. The average standard deviation of split frequencies (<0.01) was used for checking   [72]. The molecular clock test was used to compare the ML value with and without the molecular clock constraints under the GTR model using MEGA X [73]. The null hypothesis of equal evolutionary rates throughout the tree was rejected (with clock, ln L: -39638711.585; without clock, ln L: -157069.417; P < 0:001). Thus, an uncorrelated lognormal relaxed-clock model with the birth-death process tree prior was implemented. The uncorrelated lognormal model allows uncertainty in the age of calibrations to be represented as prior distributions rather than as strict calibration/fixed points [71]. The Markov Chain Monte Carlo runs were set to 500 million generations with sampling every 5000 generations. Tracer v.1.7.1 [74] was used to assess the effective sample size (ESS > 200) of each parameter. After a burn-in of 25%, the maximum clade credibility (MCC) tree with median branch lengths and 95% highest posterior density (HPD) intervals on nodes was built using TreeAnnotator 2.1.3 [71].
According to the previous comprehensive biogeographic study of Cupressaceae [52], four calibration points were used: (A) the crown age of Cupressaceae, (B) the stem age of Thuja, (C) the split time among Cryptomeria, Glyptostrobus, and Taxodium, and (D) the MRCA of Sequoia-Metasequoia. These calibration points constrained the minimum age to 157.2 Ma, 58.5 Ma, 111 Ma, and 92.8 Ma, respectively. Following the study of Mao et al. [52], we modeled calibrated nodes with a lognormal distribution with a mean of 0.5, standard deviation of 1.5, and an offset (hardbound constraint) that equaled the minimum age of the calibrations.

Ancestral Area Reconstructions.
To infer ancestral distribution ranges of living species of Thuja, we used the R package BioGeoBEARS (http://CRAN.Rproject.org/package= BioGeoBEARS), as implemented in program RASP v.4.0 [75]. A total of 10,000 random trees and one MCC tree generated by BEAST were used as input trees, which included only Thuja and its sister genus Thujopsis. We used the Dispersal-Extirpation-Cladogenesis (DEC) model [76], which allows dispersal, extinction, and cladogenesis as fundamental processes, accommodates differing dispersal probabilities among areas across different time periods, and can integrate branch lengths, divergence times, and geological information. We compared the DEC model with the "+J" suffix (i.e., DEC+J), which allows for founder speciation events. According to current distributions, species of Thuja were assigned to four possible geographic areas: (A) southwest region of China, (B) northern East Asia, (C) western North America, and (D) eastern North America. At most, two areas were allowed for any node in any tree, as each sampled taxon is restricted to only one area. Influenced by extinction, the relict distribution of Thujopsis may not be representative [24]; thus, we labeled it as an ambiguous geographic area (ABCD). An among-area dispersal probability matrix, which was inferred from the connectivity of the Bering Land Bridge [10], was coded to define different dispersal probabilities in four time periods, 0-5, 5-30, 30-45, and 45-65 (Table S12).

Conclusions
In the present study, we sequenced and analyzed complete plastomes of Thuja, providing new insight into plastome evolution, phylogenetic relationships, and evolutionary history. Phylogenomic analyses based on plastome sequences yielded robust relationships within Thuja. Incorporating paleobotanical evidence, we hypothesize a North American origination and a northern East Asia diversification of Thuja. The current geographical distribution of Thuja was likely shaped by dispersal events attributed to the Bering Land Bridge in the Miocene and subsequent vicariance events accompanying climate cooling. We further inferred that the potential effect of extinction has had profound influence on the biogeographical history of Thuja. Our study highlights the utility of plastome-scale datasets in resolving controversial phylogeny and inferring biogeographical history.

Data Availability
The newly sequenced plastomes have been submitted to Gen-Bank; accession numbers are provided in Table S1 (Additional files).

Disclosure
The funding sources for this study had no role in the design of the study, collection of data, data analysis, and interpretation or in writing the manuscript.

Supplementary Materials
There are additional tables in the supplementary materials file. Table S1: the taxa information including GenBank accession numbers used in the present study. Tables S2-S12: the distributions of tandem repeats and characteristics of simple sequence repeats (SSRs) identified in the plastomes of Thuja species. For each table, there is a detailed title included in the supplementary materials. (Supplementary Materials)