Exploring Diversification and Genome Size Evolution in Extant Gymnosperms through Phylogenetic Synthesis

Gymnosperms, comprising cycads, Ginkgo, Gnetales, and conifers, represent one of the major groups of extant seed plants. Yet compared to angiosperms, little is known about the patterns of diversification and genome evolution in gymnosperms. We assembled a phylogenetic supermatrix containing over 4.5 million nucleotides from 739 gymnosperm taxa. Although 93.6% of the cells in the supermatrix are empty, the data reveal many strongly supported nodes that are generally consistent with previous phylogenetic analyses, including weak support for Gnetales sister to Pinaceae. A lineage through time plot suggests elevated rates of diversification within the last 100 million years, and there is evidence of shifts in diversification rates in several clades within cycads and conifers. A likelihood-based analysis of the evolution of genome size in 165 gymnosperms finds evidence for heterogeneous rates of genome size evolution due to an elevated rate in Pinus.


Introduction
Recent advances in sequencing technology offer the possibility of identifying the genetic mechanisms that influence evolutionarily important characters and ultimately drive diversification.Within angiosperms, large-scale phylogenetic analyses have identified complex patterns of diversification (e.g., [1][2][3]), and numerous genomes are at least partially sequenced.Yet the other major clade of seed plants, the gymnosperms, have received far less attention, with few comprehensive studies of diversification and no sequenced genomes.Note that throughout this paper "gymnosperms" specifies only the approximately 1000 extant species within cycads, Ginkgo, Gnetales, and conifers.These comprise the Acrogymnospermae clade described by Cantino et al. [4].
Many gymnosperms have exceptionally large genomes (e.g., [5][6][7]), and this has hindered whole-genome sequencing projects, especially among economically important Pinus species.This large genome size is interesting because one suggested mechanism for rapid increases in genome size, polyploidy, is rare among gymnosperms [8].Recent sequencing efforts have elucidated some of genomic characteristics associated with the large genome size in Pinus.Morse et al. [9] identified a large retrotransposon family in Pinus, that, with other retrotransposon families, accounts for much of the genomic complexity.Similarly, recent sequencing of 10 BAC (bacterial artificial chromosome) clones from Pinus taeda identified many conifer-specific LTR (long terminal repeat) retroelements [10].These studies suggest that the large genome size may be caused by rapid expansion of retrotransposons and may be limited to conifers, Pinaceae, or Pinus.Other studies have quantified patterns of genome size among gymnosperms, especially within Pinus and the other Pinaceae [6,7,[11][12][13][14].These studies have largely focused on finding morphological, biogeographic, or life history correlates of genome size, but the rates and patterns of genome size evolution in gymnosperms are largely unknown.
This study first synthesizes the available phylogenetically informative sequences to build a phylogenetic hypothesis of

Methods and Materials
2.1.Supermatrix Phylogenetic Inference.We constructed a phylogenetic hypothesis of gymnosperms from available, phylogenetically informative sequence data in GenBank that was available on June 30, 2009.We first downloaded from GenBank all core nucleotide sequence data from gymnosperms (Coniferophyta, Cycadophyta, Ginkgophyta, and Gnetophyta).Additionally, we downloaded sequences from the "basal angiosperm" lineages (e.g., Amborella, Nymphaeales, Chloranthaceae, and Austrobaileyales) to represent the angiosperms and a diverse sampling of Moniliformopses taxa (including species from Equisetum, Psilotum, Ophioglossum, Botrychium, Angiopteris, and Adiantum) to use as outgroups.
To identify sets of homologous sequences from the GenBank data, we clustered sequences less than 10,000 bp in length based on results from an all-by-all pairwise BLAST analysis.The all-by-all blastn search was done with blastall using the default parameters [15].Significant BLAST hits had a maximum e-value of 1.0e −10 and at least 50% overlap of both the target and query sequences.A Perl script identified the largest clusters of sequences in which each sequence has a significant BLAST hit against at least one other sequence in the cluster.We only considered clusters containing loci that had been used previously for phylogenetic analyses.This included plastid and mitochondrial loci as well as nuclear 18S rDNA, 26S rDNA, and internal transcribed spacer (ITS) sequences.Among these clusters, those containing sequences from at least 15 taxa were aligned using Muscle [16], and the resulting alignments were manually checked and adjusted.The resulting alignments were edited for inclusion in the supermatrix by removing hybrid taxa and those that lacked a specific epithet and also keeping only a single sequence per species.The final cluster alignments were then concatenated to make a single phylogenetic supermatrix (e.g., [17]).

Phylogenetic and Dating Analysis.
To estimate the optimal topology and molecular branch lengths for the gymnosperms, we performed maximum likelihood (ML) phylogenetic analysis on the full supermatrix alignment using RAxML-VI-HPC version 7.0.4[18].All ML analyses used the general time reversible (GTR) nucleotide substitution model with the default settings for the optimization of individual per-site substitution rates and classification of these rates into rate categories.To assess uncertainty in the topology and branch length estimates, we ran 100 nonparametric bootstrap replicates on the original data set [19].
We transformed the optimal and bootstrap trees to chronograms, ultrametric trees in which the branch lengths represent time, using penalized likelihood [20] implemented in r8s version 1.71 [21].We used a smoothing parameter of 10000, which was chosen based on cross-validation of the fossil constraints.For the r8s analysis, we used the same time constraints on seed plant clades used by Won and Renner [22].The most recent common ancestor of seed plants was constrained to a maximum age of 385 million years ago (mya).The most recent common ancestor of the extant gymnosperms was fixed at 315 mya and Gnetum at 110 mya.The following clades were given minimum age constraints: angisperms: 125 mya, cycads: 270 mya, Cupressaceae: 90 mya, Araucariaceae: 160 mya, Gnetales + Pinaceae: 225 mya, Pinaceae: 90 mya, and Gnetales: 125 mya.

Diversification Analysis.
To examine the general patterns of diversification through time among the extant gymnosperm lineages, we first made lineage through time plots using the R package APE [23].To account for uncertainty in the dating estimates, we plotted each bootstrap tree after it had been transformed into a chronogram and all nongymnosperm taxa were removed.
Since there appears to be much variance in the divergence time estimates among trees, and branch length estimates are often unreliable, especially when estimated from such a sparse, heterogeneous sequence matrix, we used a test for changes in diversification rate that relies on tree shape, not branch lengths.Specifically, we used the whole-tree, topology-based test described by Moore et al. [24] to detect nodes associated with significant shifts in diversification rate based on the Δ 1 statistic.The analyses were performed using the apTreeshape R package [25].We used only the optimal tree estimate and again, pruned all non-gymnosperm taxa from the tree prior to analysis.

Rates of Genome Size Evolution.
We first assembled a set of mean genome size data for all gymnosperms present in the phylogenetic tree (in pg DNA) from the Kew C-value database [26].This includes data from the studies of Murray [6] and Grotkopp et al. [14].When there were multiple estimates available from a single species, we used the mean of the estimates.We tested for shifts in rates of genome size evolution using Brownie v. 2.1.2[27].We used the censored rate test, which tests for differences in rates of evolution of a continuous character (genome size) in one clade versus another clade or paraphyletic group based on a Brownian motion model.We made the following comparisons: conifers + Gnetales versus cycads + Ginkgo, Pinaceae versus other conifers + Gnetales, non-Pinaceae conifers + Gnetales versus cycads + Ginkgo, Pinus versus other Pinaceae, the non-Pinus Pinceae versus the other conifers + Gnetales, and Pinus subgenus Strobus subgenus versus Pinus subgenus Pinus.To account for topological and branch length uncertainty, we performed all hypothesis tests in Brownie on each bootstrap tree and weighted the results across replicates.The penalized likelihood analysis in r8s collapsed some branch lengths to 0, and Brownie does not work with 0 branch lengths in the tree.Thus, prior to the Brownie analysis, all 0 branch lengths were changed to 0.1.

Phylogenetic Data.
The alignment from the complete supermatrix contains sequences from 950 taxa (739 gymnosperms, 108 angiosperms, and 103 nonseed plant outgroups) and is 74,105 characters in length.The 739 gymnosperm taxa include at least one representative from every family as well as from 88 genera.In total, the matrix contains 4,511,144 nucleotides and 93.6% missing data.The number of nucleotides per taxon ranges from 252 to 33,138 (average = 4,749; median = 3,355).

Phylogenetic Inference.
In the 950-taxon trees, 63.3% (601) of the nodes have at least 50% bootstrap support, 41.7% (396) have at least 70% support, 25.8% (245) have at least 90% support, and 9.7% (92) have 100% support.The seed plants have 100% support, and the angiosperms are sister to a clade of all gymnosperms (Figure 1).Within gymnosperms, a clade of Ginkgo + cycads (bootstrap support (BS) = 66%) is sister to a clade consisting of conifers + Gnetales (BS = 96%).Gnetales are sister to Pinaceae within the conifers, although the "Gne-Pine" clade has only 57% support.Within the major groups of gymnosperms (conifers, Gnetales, and cycads), family-level and generic relationships generally are congruent with those inferred in other analyses (Figure 1).Of the 54 gymnosperm genera represented by more than one species in the tree, 47 have at least 50% bootstrap support, 36 have at least 90% bootstrap support, and 26 have 100% bootstrap support.A full version of the bootstrap consensus tree is available as Supplemental Material.

Diversification.
Although the lineage through time plots display much variation among bootstrap replicates (Figure 2), the general trend among the bootstrap trees is similar, with what appears to be high and possibly increasing diversification over the last 100 million years.Still, lineage through time plots are imprecise and difficult to interpret.If this trend of high recent diversification were true, we would expect to find evidence of increased rates of diversification in some relatively young clades.
The Δ 1 statistic indicated a significant shift in the rates of diversification at 10 nodes in the tree.Several are within the cycads.This includes the node dividing Cycas and Epicycas species from the other cycads (P = 0.0474) and its daughter node separating Cycas, Epicycas, and Dioon from the other cycads (P = 0.157).Also, two basal-most nodes of Zamia show significant shifts in diversification rates (P = 0.014 and 0.316).Within conifers, there is a significant shift in diversification at the most recent common ancestor of Podocarpus (P = 0.017).Also, there are significant shifts in diversification at the two basal nodes of Cupressaceae (P = 0.0326 and 0.0366) and within Cupressaceae, at the most recent common ancestor of Callitris, Neocallitropsis, Actinostrobus, Widdringtonia, Fitzroya, Diselma, and Austrocedrus (P = 0.0387).Finally, there is a significant shift in two of the basal nodes of Picea (P = 0.0166, P = 0.0029).All bootstrap trees, with ultrametric branch lengths from r8s, were pruned to include only the gymnosperm taxa.Each line represents a single ML bootstrap tree.The graph shows the pattern of diversification of the gymnosperm taxa in the tree through time, as the tree grew from a single lineage at the root to the current sampling of 739 species.

Genome Size Evolution.
Based on the large size of genomes of Pinus species, we hypothesized that there may be an increase in the rate of genome size evolution (Figure 3).We performed a series of likelihood ratio tests to examine the patterns of rate variation throughout gymnosperms, with a focus on testing for rate variation associated with conifers, Pinaceae, and Pinus (Table 1).In all comparisons in which Pinus (or a group containing Pinus) was compared to another group, the group with Pinus showed significantly elevated rates of genome size evolution (Table 1).We detected no significant shifts in rates of evolution between any groups that did not contain Pinus, and there was no significant difference in rates of evolution between the two Pinus subgenera (Pinus and Strobus; Table 1).

Discussion
The analyses of gymnosperm diversification and genome size evolution demonstrate the dynamic evolutionary processes of the extant gymnosperms, which sharply contrasts with their reputation as ancient, relictual species.The lineage through time plots are consistent with high, and possible growing, rates of diversification within the last 100 million years, concurrent with major radiations of angiosperms (e.g., [1,2,28]) and extant ferns [29].There is evidence of numerous significant shifts in diversification within both cycads and conifers, and there is strong evidence for a recent, large increase in the rate of genome size evolution in Pinus.Although Pinus is a species-rich genus, we find no links between increased rates of diversification and shifts in rates of genome size evolution.
Advances in sequencing technology and computational biology over the past decade enable phylogenetic estimates comprising large sections the plant diversity.This study demonstrates that it is possible to construct credible phylogenetic hypotheses including nearly three quarters of the extant gymnosperm species.Unlike supertree approaches (e.g., [14]), the supermatrix methods easily incorporate branch length estimates and estimates of topological and branch length uncertainty.Still, until there is far more data per taxon, estimates of the gymnosperm phylogeny will continue to improve, and thus, it is important to consider error and uncertainty in phylogenetic estimates when using these trees to infer evolutionary processes.There are other reasons to interpret this gymnosperm tree with caution.For example, both heterogeneity in the patterns of molecular evolution and missing data can lead to erroneous estimates of trees and branch lengths in ML phylogenetic analyses (e.g., [30,31]).Furthermore, our analysis does not attempt to incorporate evolutionary processes, such as incomplete lineage sorting or gene duplication and loss or reticulation, that may cause incongruence between the gene trees and the species phylogeny (e.g., [32]).Although this study used thousands of sequences, it does not incorporate the evolutionary perspective of low-copy nuclear genes.
Still, in many cases, evolutionary or ecological analyses that use phylogenetic trees may be robust to topological and branch length error (e.g., [33]), and the large tree of gymnosperms enables sophisticated and comprehensive tests of evolutionary and ecological hypotheses.We demonstrate this with our diversification analysis, the results of which emphasize numerous, independent shifts in diversification rate throughout gymnosperms and apparently recent, high rates of diversification (Figure 2).Estimates of diversification may be affected by taxonomic sampling and inaccurate branch length estimates.However, we might expect that adding the remaining species, which would likely fit near the tips of the tree, would result in increased estimates of recent diversification.Thus, our analyses suggest the intriguing perspective that the extant gymnosperms are a vibrant, growing clade, and not simply the sole survivors of ancient diversity.Greater sampling and a more robust tree will provide a more complete view of gymnosperm diversification.With better branch length estimates, it will be possible to use more powerful likelihood-based approaches to identify clades with increasing and decreasing diversification rates [34].With more complete taxon sampling, it may be possible to identify characters associated with changing speciation and extinction rates ( [35], but see [36]).One of the great challenges of evolutionary genomics is to identify the mechanisms of genome evolution that drive diversification.Some of the mechanisms that cause changes in genome size, such as whole-genome duplications or activity of retrotransposons, can have implications on diversification rates.Our analysis of the rates of genome size evolution demonstrate that Pinus is unique among gymnosperms.That is, the highly elevated rates of change in genome size appear to be limited to Pinus.However, in gymnosperms, we find no evidence of increases in diversification associated with Pinus, which displays a significantly elevated rate of genome size evolution.Furthermore, we find no obvious evidence for increase in rates of genome size evolution in clades associated with shifts in diversification.While our analysis failed to link genome size and diversification, this comparative approach for identifying shifts in genome size can inform our search for the specific drivers of the increased genomic complexity in Pinus, and this ultimately can help inform strategies for sequencing and assembling the first Pinus genomes.

Table 1 :
Rate estimates from the two rate parameter models from Brownie.* Indicate that the single rate model was rejected based on the Chi-squared P value ( * * P < 0.005; * * * P < 0.0005).Significance was also assessed using AIC.
Figure 3: Ancestral state reconstruction of genome size (in pg DNA) on a chronogram 165 gymnosperm taxa.Different genome sizes are represented by different colors, with the ancestral genome sizes estimated with squared change parsimony.