Positive Selection of Squalene Synthase in Cucurbitaceae Plants

Triterpenoid saponins are secondary metabolites synthesized through isoprenoid pathways in plants. Cucurbitaceae represent an important plant family in which many species contain cucurbitacins as secondary metabolites synthesized through isoprenoid and triterpenoid pathways. Squalene synthase (SQS) is required for the biosynthesis of isoprenoids, but the forces driving the evolution of SQS remain undetermined. In this study, 10 SQS cDNA sequences cloned from 10 species of Cucurbitaceae and 49 sequences of SQS downloaded from GenBank and UniProt databases were analyzed in a phylogenetic framework to identify the evolutionary forces for functional divergence. Through phylogenetic construction and positive selection analysis, we found that SQS sequences are under positive selection. The sites of positive selection map to functional and transmembrane domains. 180L, 189S, 194S, 196S, 265I, 289P, 389P, 390T, 407S, 408A, 410R, and 414N were identified as sites of positive selection that are important during terpenoid synthesis and map to transmembrane domains. 196S and 407S are phosphorylated and influence SQS catalysis and triterpenoid accumulation. These results reveal that positive selection is an important evolutionary force for SQS in plants. This provides new information into the molecular evolution of SQS within the Cucurbitaceae family.

To adapt to environmental stress and development, the accumulation of squalene is regulated at various steps of the biosynthetic pathway [18]. Previously, it was reported that the overexpression of Panax ginseng squalene synthase 1 (PgSS1) results in enhanced biosynthesis of both phytosterols and triterpene saponins in Panax ginseng [19], which determines the content of saponin and other downstream products. Enhancing the activity of PgSS1 increases the levels of phytosterols including ginsenoside, demonstrating that PgSS1 is a key regulatory enzyme not only for phytosterol synthesis but also for triterpene biosynthesis. When SQS is transformed into the callus to produce transgenic plants, SQS activity is up to threefold higher than that of wild-type plants. This suggests that the accumulation of phytosterols and triterpenoids can be enhanced by metabolic engineering [20]. These results reveal that SQS is critical to the triterpenoid biosynthetic pathway and that SQS overexpression increases the biosynthesis of triterpenoids. In the Cucurbitaceae family, cucurbitane-type tetracyclic triterpenoids possess a variety of notable pharmacological activities [21]. To understand the mechanism(s) of cucurbitane biosynthesis, insight into the SQS-mediated triterpenoid biosynthetic pathway in Cucurbitaceae is required. To gain new understanding into the role of SQS during triterpenoid biosynthesis, we isolated and characterized 10 SQS cDNA clones from Cucurbitaceae plants (Gynostemma pentaphyllum, Momordica charantia, Momordica cochinchinensis, Cucumis sativus, Benincasa hispida var. chieh-qua, Luffa acutangula, Cucurbita moschata, Trichosanthes rubriflos, Trichosanthes truncata, and Trichosanthes kirilowii).
The regulation of gene expression within the triterpenoid synthetic pathway has gained a significant interest [22,23]. Two or more copies of SQS in several plants have been identified, including Trichosanthes truncata C.B. Clarke, Cucumis sativus Linn, Panax ginseng [19], and Arabidopsis thaliana [24]. SQS1 of Panax ginseng (PgSS1) overexpression resulted in enhanced biosynthesis of both phytosterols and triterpenoid saponins in Panax ginseng [16]. The ectopic expression of PgSS1, PgSS2, and PgSS3 in the yeast erg9 mutant strain 2C1 lacking SQS activity restored ergosterol prototrophy [19].
Some results demonstrated that SQS is targeted to the ER membrane, and moreover, this location is exclusively dependent on the presence of the predicted C-terminal transmembrane domain [25]. The significance of SQS biological diversity suggests that it is subjected to positive Darwinian selection. Many research groups successfully rely on whole-gene random mutagenesis and recombination approaches for the directed evolution of enzymes, which have improved the properties of enzyme [26]. In the isoprenoid pathway for triterpenoid saponin synthesis, farnesyl pyrophosphate synthase (FPS) and SQS are the ratelimiting enzymes, which are considered to play an important regulatory role in the pathway. And our previous research found that FPS proteins in plants are under positive selection [27]. We want to investigate the force to evolution for SQS enzyme and explore the relationships between positively selected sites and some essential catalytic sites.
In this study, the squalene synthase of Gynostemma pentaphyllum (GpSS) served as a reference sequence for comparison with other plants. We identified two conserved sequences 77-82 (DTVEDD) and 213-217 (DYLED) important for the catalytic activity of isopentenyl pyrophosphate synthase (Trans-IPPS) and its downstream products. We assess how the evolution of SQS regulates its function and if the DTVEDD and DYLED motifs are positively selected. To investigate this, we cloned SQS sequences of 10 species in the Cucurbitaceae family and analyzed nucleotide and amino acid divergence in 59 plant species. We reveal important evolutionary functional sites of SQS through positive selection analysis. The likelihood method with site models, branch models, and branch-site models was used to calculate active amino acid sites and to investigate potential patterns of positive selection of the SQS gene. 2.5. Sequence Alignment. All protein sequences were aligned in MUSCLE [28] using the default parameters (http:// www.ebi.ac.uk/Tools/msa/muscle/), and PAL2NAL [29] (http://www.bork.embl.de/pal2nal/) was used to align amino acids and rearranged CDs. To convert nucleotide sequences to the nexus format, we ran MEGA4.0 [30] which allowed phylogenetic analysis in the MrBayes version 3.1.2 [31,32]. The parameters in the nexus file with PAUP * version 4.0 [33] were modified with the Model test version 3.7 [34] prior to MrBayes calculation. However, recombination events and substitution saturation can cause failure to reconstruct the phylogenetic tree and influence the detection of positive selection. Therefore, prior to performing reconstruction of the tree and positive selection, we must assess the recombination signals and substitution saturation between sequences involved in the alignment of SQS gene sequences. The Genetic Algorithm for Recombination Detection (GARD) approach [35] was applied to screen multiple sequence alignments for evidence of phylogenetic incongruence and to identify the number and location of breakpoints and sequences involved in putative recombination events [36]. We applied DAMBE as a new index to measure substitution saturation in a set of aligned nucleotide sequences [37]. With all parameters modified, MrBayes was performed to reconstruct the phylogenetic tree [32]. According to the tree and aligned DNA sequences (CDs) in the DAMBE, PAML was performed for selected pressure analysis.

Positively Selected Sites and Putative Biological
Significance. Based on the reconstruction of the phylogenetic tree of SQS with MrBayes, we performed positive selection analysis. Selection pressure was assessed based on the phylogeny of SQS gene sequences by comparing nonsynonymous (dN)/synonymous (dS) substitution ratios (ω = dN/dS) with ω = 1, ω < 1, and ω > 1, which indicated neutral evolution, purifying selection, or positive selection, respectively, prompting the formation of distinct subclasses and new functions in different species [38]. To explore selection pressure between SQS gene sequences, we performed strict statistical analysis using the CodeML program in the PAML package version 4 [39], which mainly runs site models, branch models, and branch-site models based on dN/dS(ω) [40]. The likelihood ratio test (LRT) [41] was used to compare fit models for data from two nested models to assess the statistical significance of each pair of nested models. Concrete methods have been previously described [27]. The analysis route of positive selection is shown in Figure 1. We used LRT through the comparison of M0 (one ratio) with M3 (discrete) (Genetic Algorithm for Recombination Detection = 3) to test for variable selection pressure amongst sites and two LRTs to test for sites evolving by positive selection, comparing M1a (nearly neutral) against M2a (positive selection) and M7 (beta) (κ = 10) against M8 (beta and ω) (κ = 10). A significantly higher likelihood of the alterative model compared to the null model suggested positive selection. Generally, all positive selection sites were calculated in the M8, which provided useful information for branch-specific and branch site analysis. The site models did not detect positive selection and influenced only selected sites along a small number of lineages following a duplication event, and so, the branch model and two-ratio test were implemented to select statistically significant "foreground branches" under positive selection. All other branches in the tree were classed as "background" branches. Based on the evolutionary branches screened out using the two-ratio test, branch-site models were applied to further estimate different dN/dS values amongst significant branches detected by the branch model and amongst sites [42]. Two models were computed: a null model (H0), in which the foreground branch differed according to the proportion of sites under neutral selection compared to background (relaxed purifying selection), and an alternative model (H1), in which the foreground branch has a proportion of sites under positive selection. The null and alternative hypotheses were as follows: null hypothesis (branch-site model, with ω2 = 1 fixed): model l = 2, NSsites = 2, fix ω = 1, and ω = 1 and alternative hypothesis (branch-site model, with ω2 estimated): model = 2, NSsites = 2, fix ω = 0, and ω = 1 (or any value > 1). Finally, the Bayes empirical Bayes (BEB) approach [41] was used to calculate posterior probabilities that a site originates from the site class with ω > 1, which was used to identify sites under positive selection or relaxed from purifying selection in the foreground group with significant LRTs.

Characteristics of Protein SQS Sequence and Structural
Analysis. We used GeneDoc tools (http://www.nrbsc.org/ gfx/genedoc) for visualizing, editing, and analyzing multiple sequence alignments of SQS proteins. We used NetPhos 2.0 Servers (http://www.cbs.dtu.dk/services/NetPhos/) to predict phosphorylation sites. PredictProtein [43,44] and TMHMM2.0 [45] were used to predict transmembrane domains and posttranslational modifications. An estimate of the prediction accuracy was provided based on the confidence score of the modeling. Furthermore, the functional area and relevant positive selection sites identified in the evolutionary analysis were built in three-dimensional graphic models and presented in the highlight areas. Finally, the Discovery Studio Visualizer (DSV) [46] designed to validate any predictions of protein-ligand interactions was used to reveal sites under positive selection in the SQS 3D structure. Meanwhile, we combined positively selected sites with proteomic analysis to investigate their potential relationship.

Full-Length cDNA Sequences Cloned from Cucurbitaceae
Plants. Full-length cDNA sequences of SQS in Cucurbitaceae were obtained from Gynostemma pentaphyllum, Momordica charantia, Momordica cochinchinensis, Cucumis sativus, Benincasa hispida var. chieh-qua, Luffa acutangula, Cucurbita moschata, Trichosanthes rubriflos, Trichosanthes truncata, and Trichosanthes kirilowii. ORF finder showed that all cloned sequences had ORFs of 1254 bp encoding a protein 417 amino acids in length. For the prediction of the physical and chemical properties of SQS gene sequences of Cucurbitaceae, we analyzed the theoretical pI and molecular weight (kD) of these sequences (Table 1). We found that the pIs ranged from 7.50 to 8.50 and the molecular weights werẽ 47 kD. All Cucurbitaceae SQS sequences were used for phylogenetic analysis.

Origin of the SQS Sequences during Plant Evolution.
The phylogenetic tree was reconstructed to investigate selection pressure amongst the 59 plant sequences. Through GARD detection, no recombination events occurred within the SQS sequences. Substitution saturation showed that the Iss < Iss c was significantly different, indicating a little saturation between the SQS sequences. It was therefore inferred that the phylogenetic tree reconstruction and positive selection of SQS are not influenced by recombination and substitution saturation. To investigate the phylogenetic relationship amongst SQS sequences, a phylogenetic tree based on codon alignment was produced using the Bayesian method ( Figure 2). Bayesian posterior probability (pp) methods were used to evaluate clades. Phylogenetic analysis showed that SQS sequences consisted of several distinct branch clusters that accompanied species divergence [47], which could be grouped into five lineages (Outgroup, Pteridophyte, Gymnosperms, Monocotyledons, and Dicotyledons) with Botryococcus braunii and Selaginella moellendorffii emerging as outgroups (group A in Figure 2). The relationships displayed in the phylogenetic tree were identical to the taxonomic classifications. Thus, the phylogenetic tree of the SQS sequences of terrestrial plants reflects the genetic relationship amongst different species. SQS sequences emerged in branch points of the terpenoid synthesis pathway, which is responsible for directing carbon flow from the isoprenoid pathway [48]. The reaction products catalyzed by SQS may therefore flow into the triterpenoid metabolic pathway. According to the phylogenic tree, three obvious branches with high pp supportive values (1.00) were evident and included Gymnosperms, Monocotyledons, and Dicotyledons. In the Dicotyledoneae (group E in Figure 2), the first to emerge were Arabidopsis thaliana and Camelina sativa, belonging to Brassicaceae. Malus domestica and Eriobotrya japonica are the typical plants of Rosaceae in the phylogenic tree. The SQS sequences of the four species in Brassicaceae and Rosaceae suggested sharing of a common ancestral SQS gene that originated from monocots. Due to duplication and gene diversification of the SQS sequences, each branch constantly evolved in the Dicotyledoneae, based on the tree lineages. The evolution of SQS sequences in the plants accompanied divergence of the metabolic pathways. For example, the clades amongst Panax ginseng, Panax notoginseng, and Panax quinquefolius were clustered, suggesting that they share similar functions to that of Araliaceae SQS, which encodes enzymes participating in the biosynthesis of dammarane-type ginsenoside. The phylogenic tree of cloned SQS sequences of Gynostemma pentaphyllum, Siraitia grosvenorii, and other SQS sequences of Cucurbitaceae plants occurred at the same branch. Tetracyclic triterpenoids are also the main secondary metabolic substances in Cucurbitaceae. The saponins found in Gynostemma pentaphyllum mainly belong to the dammarane-type tetracyclic triterpenoids. The dammarane-type tetracyclic triterpenoids present in both Araliaceae and Cucurbitaceae may be closely related to triterpenoid synthesis and SQS enzyme activity.

Positively Selected Sites amongst SQS Subfamilies and
Putative Biological Significance. We next analyzed whether positive selection occurred during lineage-specific evolution. We calculated the nonsynonymous/synonymous substitution ratio (dN/dS = ω) in SQS sequences using the PAML package version 4.4. Following the removal of gaps, Gynostemma pentaphyllum squalene synthase (GpSS) was used as the reference sequence amongst the 59 species, and all protein sequences were analyzed using the CodeML program. For site model detection, M0 vs. M3 and M1a vs. M2a showed more than a 95% occurrence of synonymous changes relative to nonsynonymous changes (dN/dS < 1), indicating that they were under a strong purification selection. However, the pairs of models of M7 vs. M8, the LRT (2ΔlnL = 11187 62, df = 2, p ≤ 0 01), showed a ω dN/dS greater than 1 in M8 (dN/dS > 1), indicating an excess of nonsynonymous changes. This result showed that positive selection had driven SQS evolution, which was distinguished from M0 vs. M3 and M1a vs. M2a. We detected 45 positively selected sites (Table 2) with a pp ≥ 99%. When considering positive selection only occurring during specific evolution stages or in specific branches, we used a branch-specific model to detect positive selection. The free ratio model was significantly higher than the oneratio model (2ΔlnL = 256 64, p ≤ 0 01, df = 135), indicating a heterogeneous selection in 6 branches (ω > 1) ( Table 3). Amongst these 6 branches (Ta, Tb, Tc, Td, Te, and Tf), a two-ratio model test was applied to screen out the evolutionary branches. Four branches (Ta, Tb, Te, and Tf) had a ω ≥ 1, indicating a strong positive selection in branches Ta, Tb, Te, and Tf ( Table 4). The branch-site model was used to screen for amino acid sites that underwent positive selection in branches Ta, Tb, Te, and Tf. According to the LRT of the branch-site model, comparisons of BSa1 vs. BSa0-fix (2ΔlnL = 4 18, p < 0 05, df = 1), BSb1 vs. BSb0fix (2ΔlnL = 7 02, p < 0 01, df = 1), BSe1 vs. BSe0-fix (2ΔlnL = 6 64, p < 0 01, df = 1), and BSf1 vs. BSf0-fix (2ΔlnL = 3 3, p > 0 05, df = 1) significantly differed. Interestingly, 13 amino acid sites in the branches displayed a least posterior probability of >0.95 by BEB (13D, 278D, and 196S with the pp > 0 95 in branch Ta, p < 0 01). At the same time, we detected positively selected sites 52A, 157K, 171Y, 180L, 203L, 226M, 289P, and 333G with the pp > 0 981 (p < 0 01) in the branch Tb and amino acid sites at position 37W and 308R with pp > 0 96 (p < 0 01). Interestingly, 196S was detected both in the site model and branch-site model, which was located in the phosphorylation site. All of these sites were considered to undergo positive selection. In our previous studies, MchSS and TrSS were listed as the first reference sequences for the calculation of positive selection amongst SQS sequences. Surprisingly, the distributions of positively selected sites of SQS from Cucurbitaceae were consistent. Moreover, the important functional site 196S was detected in different species of Cucurbitaceae. This indicated a strong positive selection amongst SQS sequences in Cucurbitaceae (Table 5). Further analysis showed that (1) the SQS gene

Structural Characteristics of SQS in Plants.
In addition to the phylogenetic tree and positive selection analysis of SQS sequences, we conducted detailed structural studies from the two-dimensional model containing the sequence alignment of the SQS protein sequences in several medicinal herbs, including GpSS. GpSS was considered the reference sequence in this analysis. The SQS sequences shared a high level of sequence similarity within the coding region. We mapped positively selected sites through assessing sequence alignments of the SQS gene sequences amongst GpSS, SgSS, MchSS, McSS, CsSS, BhSS, LaSS, CmSS, TrSS, TtSS, and TkSS ( Figure 3). As shown in the two-dimensional structure produced by GeneDoc, the positively selected 189S was located in the casein kinase II phosphorylation site on GpSS but at position 189L in the other 10 Cucurbitaceae plants. The positively selected sites 408A and 410R overlapped with the protein kinase C phosphorylation sites. This was 410R on GpSS but 410Q on CmSS, BhSS, TtSS, and TkSS. Moreover, positively selected 196S was a serine phosphorylation site, which was consistent in SQS in Cucurbitaceae plants. Based on these positive signal and functional sites, we inferred that the phosphorylation of these sites plays an important role in SQS activity, potentially influencing the further metabolic flux of squalenes. The positively selected 414N was a glycosylation site in the transmembrane region, which may have been related to SQS function. All these sites require further  lnL: loglikelihood; LRT: likelihood ratio test; ω2: average dN/dS ratio for sites subject to positive selection. p and q: shape parameters for the beta distribution of ω; p0, p1, and p2: proportions of codons subject to purifying selection, neutral evolution, and positive selection, respectively; df : degrees of freedom; 2ΔlnL: twice the loglikelihood difference of the models compared. Selection analysis by site models was performed using CodeML implemented in PAML. Bold values indicate ω > 1 and p < 0 05. Significant tests at 1% cut-off, * * ; posterior probability pp > 99%. verification by site-directed mutagenesis in future studies. At the same time, TMHMM (http://www.cbs.dtu.dk/services/ TMHMM-2.0) and TMPRED (http://www.ch.embnet.org/ software/TMPRED_form.html) were used to predict the transmembrane domains of GpSS, and motif elicitation (MEME) analysis (http://meme-suite.org/) was used to obtain characterized motifs [49]. It is known that SQS is membrane-associated [15]. According to TMHMM and TMPRED predictions, two transmembrane helices are present in GpSS, located at position 285-301 (inside-outside, 1336 score) and position 389-408 (outside-inside, 2549 score) (Figure 4). These transmembrane domains indicated that GpSS is an integral membrane protein. The second transmembrane region (389-408) may have important biological roles in anchoring to GpSS to the ER membrane [24]. When MEME was employed, we identified two significantly conserved motifs (168-183 YCHYVAGLVGLGLSKL and 201-229 MGLFLQKTNIIRDYLEDINEIPKCRMFWP). These were located in the functional domains in GpSS. Through subcellular localization predictions, it was revealed that GpSS has no mitochondrial targeting signal or chloroplast transit peptides. Conserved domain analysis (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) was performed to identify conserved domains [50]. This showed that GpSS belongs to the Trans_IPPS_HH superfamily (38-320 interval, p = 1 38e − 84) with substrate-binding pockets, Mg 2+ -binding sites, active site lid residues, catalytic residues, and an aspartate-rich region. NetPhos 2.0 Server was used to predict SQS phosphorylation sites which may regulate SQS activity ( Figure 5). We identified 9 serine residues (48S, 84S, 187S, 196S * , 198S, 249S, 349S, 358S, and 407S * ), 5 Thr (78T, 83T, 161T, 318T, and 327T), and 5 tyrosine residues (165Y, 236Y, 246Y, 273Y, and 387Y). Of these sites, 196S and 407S were positively selected.

Distribution of Positive Selection Sites on the SQS 3D
Structure. Due to evidence of the positive selection sites within the SQS sequences, the crystal structure of the human SQS (3vj8.1A) was used as a template to predict the 3D structure of GpSS using homology structural modeling on a Swiss model server [51]. Human SQS is a 47 kD membraneassociated enzyme that catalyzes the dimerization of two molecules of FPP in a two-step reaction to form squalene [52]. We used the structure of GpSS as a reference to analyze the relationship between positive selection sites and functional sites. We mapped important sites of positive selection from the branch-site model and site model onto the 3D structure using DSV. Whilst sites of positive selection mainly exist in the transmembrane domains indicating that GpSS undergoes strong constraints within its functional domains, important sites of positive selection (52A, 180L, 189S, 196S, 200S, and 289P) mapped to other areas ( Figure 6). A residue with the SQS catalytic center (180L) was connected with the substrate FPP, in addition to 66V, 184F, and 285F by noncovalent bonds (Figures 6(a) and 6(e)). In Figure 6(b), a positively selected site 52A located near the catalytic sites 51F and 289P impacts on the condensation of two side chains of FPP by noncovalent interaction with 51F. Position 196S was identified as a phosphorylation site in both the site model and branch-site model, whilst 189S lies within a casein kinase II phosphorylation (Figure 6(c)) [52]. Moreover, 196S and 200S lie close to the first transmembrane domain, and the 289P lies within the first transmembrane domain (Figure 6(d)) [52].

Discussion
In this study, we obtained 10 cDNA sequences of SQS genes from 10 plants of the Cucurbitaceae family. To gain new Table 4: Parameter estimation and likelihood ratio tests for the two-ratio model. insight into the role of SQS during the triterpenoid biosynthesis in higher plants, we performed phylogenetic analysis and evaluated sites of positive selection. The analysis revealed that SQS sequences in Cucurbitaceae plants are highly conserved and cluster into subfamilies. SQS sequences from different species of plants often exhibit both characteristic sequences. For example, three SQS genes are differentially expressed in Panax ginseng, all of which are involved in the generation of squalene in Panax ginseng. The enhanced activity of PgSS1 results in increased levels of phytosterols and ginsenosides [16]. PgSS contains three conserved domains ((a) 168~183, (b) 202~224, and (c) 280~298) essential for the two half reactions catalyzed by SQS [19]. The three domains are also present in GpSS and other SQS genes of Cucurbitaceae plants. The amino acid sequences within these SQS domains are identical, and variation exists only in the first two amino acids. This may explain why dammarane-type tetracyclic triterpenoids are present in the Panax genus and some Cucurbitaceae plants, particularly in Gynostemma pentaphyllum. Conserved protein domain analysis through PredictProtein (https://www.predictprotein.org/) revealed that GpSS and all other Cucurbitaceae SQS genes contain an isopentenyl diphosphate synthase-(Trans_IPPS-) conserved catalytic center that consists of a large central cavity formed by antiparallel α-helices with two aspartate-rich regions (DXXXD). The results indicate that typical GpSS from Cucurbitaceae plants possess two transmembrane domains at the C-terminus (285-301 AA and 385-408 AA), which is consistent with the properties of other eukaryote SQS proteins bound to the ER membrane [53]. The aspartate-rich motifs in Cucurbitaceae SQS were located at residues 77 to 82 (DTVEDD) and 213 to 217 (DYLED). These motifs underwent a strong negative selection pressure at the constraints which were highly conserved. The motifs f: ω0 = 0 07521, ω1 = 1 00000, ω2a = 1 00000, ω2b = 1 00000 None p0 = 0 16610, p1 = 0 02926, p2a = 0 68412, p2b = 0 12052 were conserved during evolution to maintain the basic functions of SQS proteins. We found no sites of positive selection located within these highly conserved motifs. Thus, positive selection events primarily occur outside the active center of SQS, consistent with the negative selection of the active center. As the number of SQS genes cloned in our laboratory and collected from the databases is growing, it is feasible to explore both evolutionary relationships and the genetics of the SQS family. In this study, 59 sequences were used for phylogenetic tree reconstruction using Bayesian methods. It was shown that triterpenoid saponins are widely distributed in dicotyledonous plants [3,26,[54][55][56]. Cucurbitaceae resident tetracyclic triterpenoids are valuable active ingredients for self-defense and possess many clinical values when extracted. Thus, studying the molecular traits of the SQS triterpenoid pathway of Cucurbitaceae plants is important. Gene expression analysis revealed that SQS increases terpenoid generation in plants [57][58][59]. However, the mechanism(s) by which SQS regulates this pathway and its speciesspecific biological roles remain poorly understood. Using molecular adaptive evolution and positive selection principles to identify functional sites can provide valuable references to the artificial regulation of triterpenoid synthesis.

Model
In this study, we combined molecular phylogenetic trees with putative biological significance and protein structure analysis to explore the evolution of terrestrial plant SQS. Positive selection results in the retention and spread of advantageous mutations throughout a population and has long been considered synonymous with protein functional shifts [60]. The previous study found that positively selected genes were likely to interact as opposed to interactions with genes under neutral evolution or purifying selection [61].  varies amongst lineages and amino acid sites in SQS genes. From positive selection analysis, the category of ω in the site model did not sufficiently fit the data to a level that describes the variability in selection pressure, whilst branch model analysis showed variable ω ratios amongst clades. The branch-site model was applied to evaluate sites amongst specific clades of the SQS phylogeny. As shown in the space structural model of GpSS (Figure 6), 189S is located within a casein kinase II phosphorylation site, 408A and 410R are located in protein kinase C phosphorylation sites, 414N lies within a glycosylation site, and 389P, 390T, 391L, 394I, 406L, 407S, and 408A are in the second transmembrane domain (389-408) ( Figure 4). Importantly, 196S and 407S are also phosphorylation sites ( Figure 5). As detected in the branch-site model, 289P is located in the first transmembrane domain (285-301). 196S was detected in the site model and branch-site model and considered as a vital site of SQS-mediated triterpenoid synthesis. Given that phosphorylation/dephosphorylation regulates enzyme activity, this site is predicted to mediate GpSS and impact on triterpenoid biosynthesis.
Although the active center of SQS from various species is highly conserved, distinct sites were also reported. LjSS (Lotus japonicas squalene synthase) possessed an unusual Asp (D280) near the active center. Site-directed mutagenesis and in vitro assessment of LjSS demonstrated that D280 can be substituted by Gln and Glu, suggesting that the local structure of plant SQS differs from mammalian SQS in which Gln at the same site cannot be replaced by Glu to retain activity [62]. Sequence alignment demonstrated that D280 is located in all known Cucurbitaceae SQS proteins near the active center. Two sites of positive selection include L178 and K179 for GpSS but are L178 and R179 for other known Cucurbitaceae SQS proteins.
Two transmembrane domains (285-301 and 389-408) are present on the Cucurbitaceae structure. Our analysis indicated that 289P, a site of positive selection, is located in the first transmembrane region. On the second transmembrane region, other sites of positive selection, namely, 389P, 390T, 391L, 394I, 406L, 407S, and 408A, are present. SQS should contain sequences responsible for its localization on the ER membrane [53]. The sites of positive selection on the second transmembrane region would enhance ER targeting and thus promote cellular SQS activity. The majority of the sites mapped to this region. Pathway evolution indicated that genes that regulate similar functional pathways maintain a consistent or associated evolutionary trend [63,64].
It was reported that positive gene selection during species evolution drives diversification during plant secondary metabolism [65]. Many of the genes that undergo positive selection tend to interact [61]. In the previous studies, the positive selection was reported in plant farnesyl pyrophosphate synthase (FPS). We found that the FPS genes of terrestrial plants also undergo positive selective pressure, generating a functional diversity [66]. FPP, produced through the catalytic activity, is the SQS substrate. Both FPS and SQS are key enzymes in triterpenoid biosynthesis. Thus, why both FPS and SQS experience positive selection pressure is worthy of further exploration. Based on FPS and SQS structural analysis, both were found to contain two isopentenyl diphosphate synthase-(Trans_IPPS-) conserved catalytic sites, including SQS head-to-head (Trans_IPPS_HH) [52] and FPS head-to-tail (Trans_IPPS_HT) [67]. These associated structures may determine the flux of metabolites catalyzed by SQS and FPS, in addition to other enzymes associated with triterpenoid biosynthesis. The formation of large complexes may be regulated by the metabolic requirement, in which the previous enzyme acts as a substrate for the next enzyme. This may be more beneficial to the formation of secondary metabolites and structural diversity. Thus, we provide new insights into the functional divergence of SQS genes from the Cucurbitaceae family and implicate new potential roles of SQS positive selection, which now warrants further investigation.   Gymnospermae and Monocotyledonae to Dicotyledoneae;

Conclusions
(2) 45 positively selected sites can be detected in the site model (M8) and 13 positively selected sites in the branch-site model. 196S was detected in both the site and branch-site models, 196S and 407S are present within phosphorylation sites, and 407S is located in the second transmembrane domain. 196S and 407S are considered the most significant sites for plant SQS and may contribute to the regulation of triterpenoid biosynthesis; and (3) positive selection and protein structure analysis were combined to explain Darwinian selection on branches and on sites in all SQS sequences. These changes due to positive selection were found to possess biological significance. This study will provide useful information for further research on the regulation of triterpenoid biosynthesis.

Data Availability
Nine SS sequences of Cucurbitaceae plants, cloned by our laboratory, have uploaded on GenBank (http://www.ncbi .nlm.nih.gov/) and these sequences will released with the publication of the article. The rest SS sequences of cDNA and amino acid were downloaded from the GenBank and the UniProt (http://www.uniprot.org/). All the accession numbers of sequences involved in article were list in the additional files.

Disclosure
The funding body had no influence over the experimental design, data analysis, interpretation, and writing of the manuscript.