Comparative Genomics of Ten Solanaceous Plastomes

Availability of complete plastid genomes of ten solanaceous species, Atropa belladonna, Capsicum annuum, Datura stramonium, Nicotiana sylvestris, Nicotiana tabacum, Nicotiana tomentosiformis, Nicotiana undulata, Solanum bulbocastanum, Solanum lycopersicum, and Solanum tuberosum provided us with an opportunity to conduct their in silico comparative analysis in depth. The size of complete chloroplast genomes and LSC and SSC regions of three species of Solanum is comparatively smaller than that of any other species studied till date (exception: SSC region of A. belladonna). AT content of coding regions was found to be less than noncoding regions. A duplicate copy of trnH gene in C. annuum and two alternative tRNA genes for proline in D. stramonium were observed for the first time in this analysis. Further, homology search revealed the presence of rps19 pseudogene and infA genes in A. belladonna and D. stramonium, a region identical to rps19 pseudogene in C. annum and orthologues of sprA gene in another six species. Among the eighteen intron-containing genes, 3 genes have two introns and 15 genes have one intron. The longest insertion was found in accD gene in C. annuum. Phylogenetic analysis using concatenated protein coding sequences gave two clades, one for Nicotiana species and another for Solanum, Capsicum, Atropa, and Datura.


Introduction
Chloroplasts are essential cellular organelles within plant cells possessing the enzymatic machinery for the process of photosynthesis which provides essential energy to plants. Besides photosynthesis, chloroplasts are also involved in biosynthesis of fatty acids, amino acids, pigments, and vitamins [1,2]. Despite enormous divergence in whole plant form and habitat, chloroplast structure and function have remained remarkably conserved which might be due to intense evolutionary selection pressures associated with the functional requirements of photosynthesis [3][4][5][6][7]. The chloroplast genome is actually a reduced genome derived from a cyanobacterial ancestor that was captured early in the evolution of the eukaryotic cell [8,9]. Among the three genomes of the plant cell, the plastome is the most gene dense with more than 100 genes in a genome of only 120 to 210 kb [10]. In the last two decades, the nucleotide sequences of large number of plastid genomes have been published leading to better understanding of their organization and evolution [2,11,12]. Currently, about 470 eukaryotic chloroplast genomes have been sequenced completely (http://www.ncbi.nlm.nih.gov/ genomes/GenomesHome.cgi?taxid=2759&hopt=html) with the best representation from flowering plants.
Most land plant chloroplast genomes are composed of a single circular chromosome with a quadripartite structure which includes two copies of an inverted repeat (IR) region that separates the large and small single copy regions (LSC and SSC). Genes of chloroplast genomes of higher plants can be divided into three broad categories [13,14]. In the first, there are genetic system genes encoding for rRNAs, tRNAs, ribosomal proteins, and RNA polymerase subunits. The second category is comprised of genes for photosynthesis which encode subunits of the two photosystems, the cytochrome b6f complex and the ATP synthase. Open reading frames (orfs) of unknown function constitute the third category. Besides, there are some other genes coding for different kinds of proteins including infA, matK, clpP, cemA, accD, and ccsA. Although overall chloroplast genome organization is highly conserved among taxa, structural rearrangements due to inversions have been reported in different taxa like Campanulaceae [15], Cyatheaceae [16], Fabaceae [17], 2 Advances in Bioinformatics Funariaceae [18], Geraniaceae [19], Onagraceae [20], and Poaceae [21,22]. Besides structural rearrangements, sequence polymorphisms have also been reported in some cereals [23,24] and Oenothera species [20]. These studies revealed that highly divergent sequences were concentrated in specific regions called "hotspots. " Such sequence polymorphisms have been used to derive phylogenetic relationships among species.
Solanaceae is an important family of dicots comprising more than 3000 species placed within about 90 genera. It is an ethnobotanical family and is extensively utilized by humans and has recently become a model of comparative and evolutionary genomics research. Few efforts have been made to study the variations in chloroplast genomes of Solanaceae family by using in silico tools. Most of these attempts have been concentrated on comparison of newly sequenced chloroplast genome with the available complete chloroplast genomes from some members of this family [25][26][27][28][29]. The availability of complete nucleotide sequences of plastid genomes of ten solanaceous species, Atropa belladonna (NC 004561.1; [30]), Capsicum annuum (NC 018552.1; [29]), Datura stramonium (NC 018117.1; Li et al. (unpublished)), Nicotiana sylvestris (NC 007500.1; [28]), Nicotiana tabacum (NC 001879.2; [31]), Nicotiana tomentosiformis (NC 007602.1; [28]), Nicotiana undulata (NC 016068.1; [32]), Solanum bulbocastanum (NC 007943.1; [26]), Solanum lycopersicum (NC 007898.2; [27]), and Solanum tuberosum (NC 008096.2; [33]), provided us with an opportunity to conduct their in silico comparative analysis in depth. Hence, the present study is an attempt to compare the genome organization, structure, and coding capacity of chloroplast genomes of ten solanaceous species. The study focuses on length mutations, intron-containing genes, grouping of genes in different identity classes based on pairwise comparison of individual genes, and InDel analysis of divergent genes.

Sequence Analysis.
Whole chloroplast genome sequences as well as individual gene and protein sequences of ten solanaceous species were obtained from "Organelle Genome Resources" section of NCBI in Genbank as well as in Fasta format. Sequence regions corresponding to various genomic features including genes, exons, introns, and cds were specifically extracted from the Genbank files using Extractfeat, Extractseq, and Featcopy programs from Jemboss package. AT percentage for different genomic regions was calculated using Wordcount and Union programs from Jemboss package. Pairwise comparison of gene sequences was done by using NCBI BLAST program and multiple sequence alignment of nucleotide as well as protein sequences was done by using ClustalW. Alignments of protein sequences for some of the genes were manually edited in correspondence to InDels observed in alignments of their nucleotide sequences.

Phylogenetic Analysis of Concatenated Protein-Coding
Genes. 75 protein-coding genes of plastomes of ten solanaceous species and two outgroup species (Daucus carota and Coffea arabica) were selected for phylogenetic analysis from the total of 79 classified protein-coding genes excluding accD, rpl20, ycf1, and ycf15. Ycf15 was excluded due to its absence on the plastome of both outgroup species chosen while the other three were not included in the phylogenetic analysis due to their high levels of variation. Multiple sequence alignment of each gene was obtained using ClustalW (https://www.ebi.ac.uk/Tools/msa/clustalw2/). These alignments were then concatenated using standalone BIOEDIT version 7.25 (http://www.mbio.ncsu.edu/bioedit/ bioedit.html) and maximum likelihood phylogenetic tree with 500 bootstrap iterations was constructed using PhyMLv3.0 (http://www.atgc-montpellier.fr/phyml/). A graphical view of tree was generated using Archaeopteryx 0.988 SR (https://sites.google.com/site/cmzmasek/home/ software/archaeopteryx).

Comparison of Properties of Chloroplast Genomes.
Comparison of the properties of plastid genomes of ten solanaceous species with respect to their genome size (size of complete plastid genome and LSC, SSC, and IR regions); percent coding regions, introns, and intergenic regions; AT content of overall plastid genomes as well as coding and noncoding regions is presented in Table 1. The total plastid genome size ranged from 155296 bp (S. tuberosum) to 156781 bp (C. annuum). The large size of plastome of C. annuum can be attributed to large LSC region as compared to other species. On the contrary, size of SSC region in C. annuum was the least as compared to other species. The largest size of IR region was in A. belladonna. Among four Nicotiana species studied, N. sylvestris and N. tabacum were almost identical to each other with respect to size of complete genome (difference of only 2 bps) or LSC, SSC, or IR regions compared with plastome of any other species studied. However the percent coding region was slightly more for N. sylvestris (61.49%) than in N. tabacum (61.12%). The size of complete chloroplast genome and LSC and SSC regions of three species of Solanum is comparatively smaller than that of any other species studied except for A. belladonna where size of SSC region was the smallest (18008 bp). However the size of IR region of Solanum species is larger as compared to Nicotiana species. Coding region percentage was found to be higher in Nicotiana species as compared to all other species with maximum for N. undulata (63.12%) and minimum for S. tuberosum (58.45%). Maximum of 12.8% of the plastome was shown to be introns for S. bulbocastanum whereas minimum intron percentage (11.62%) was observed for D. stramonium. Maximum percentage (29.19%) of intergenic region was observed in D. stramonium and minimum (24.19%) was observed in N. undulata. The AT content of noncoding regions was found to be higher as compared to coding regions for all the ten species studied. Similarly, protein-coding regions have shown higher content of AT base pairs as compared to RNA coding genes which can be explained by the requirement of more GC base pairs for proper folding of highly structured ribosomal RNAs and tRNAs [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]. Comparison of AT content in LSC, SSC, and IR regions reveals that AT content was the highest in SSC

Split Genes.
A total of eighteen split genes have been reported. The sizes of exons and introns for these genes in all the solanaceous species studied are summarized in Table 2. The rps12 gene is divided such that its 5 end exon is located in the LSC region whereas second and third exons are located in the IR region. Maturation of RNA transcript requires a trans-splicing mechanism between exon 1 and exon 2 [34,37]. Among the eighteen intron-containing genes, ycf3, clpP, and rps12 contained two introns whereas the other 15 genes contain only one intron. As per Kim and Lee [34] trnL-UAA gene intron belongs to the self-splicing group I intron whereas all other introns belong to group II. Generally, the size of exons was shown to be conserved and variability was observed in the intron regions; however, ndhB was found to be highly conserved for both exons and introns. identity in comparison were considered as highly conserved and the genes showing less than 95% identity at least once in the comparison were considered as highly divergent. These highly divergent genes were further explored at nucleotide as well as at protein level to probe the variations in detail. A total of 11 highly divergent genes were found whereas the number of highly conserved genes varied from 26 (for species pair: N. tomentosiformis and S. lycopersicum) to 107 (for species pair: N. sylvestris and N. tabacum). Most of the tRNA genes were found to be highly conserved. Genes accD, cemA, clpP, ndhA, rpl32, rpl36, rps16, sprA, trnA-UGC, trnL-UAA, and ycf1 were found to be highly diverged. Tables 3 and 4 describe the summary of InDels observed in nucleotide and amino acid sequences, respectively. Partial multiple sequence alignment of 9 genes and 5 proteins is shown in Figures 1 and 2, respectively. The longest insertion of 141 bp was observed in accD gene sequence of C. annuum. Since genes clpP, ndhA, rps16, and trnL-UAA contained introns, it was important to examine whether these InDels were present in exon or intron region. It was found that all the InDels reported in ndhA and trnL-UAA were present in introns whereas, in case of clpP, InDel 24 was located in exon of the gene. Similarly, the first and last InDels of gene rps16 were present in exons of the gene. Keeping in view the observations in number and length of InDels in nucleotide and protein sequences of genes under consideration, the variation for individual genes is discussed below.
(2) clpP. In clpP gene InDels were found both in intron and in exon regions. Two major consequences were observed in the InDels in the exon regions. An insertion of 6 bp in S. bulbocastanum and S. tuberosum and 30 bp in S. lycopersicum at 3 end (exon 3) of the clpP gene resulted in shifting of stop codon by 6, 6, and 30 bp downstream in respective species compared to other species of Solanaceae family, increasing the length of the coding sequence and the protein product (Figures 1 and 2). An interesting feature was observed as InDel 1 in protein sequence corresponding to insertion of a repeat of "I" amino acids in D. stramonium making exon 3 region longer by 6 bp. This region however corresponds to the end of intron 2 in clpP gene in all other species. Since D. stramonium chloroplast genome has been sequenced recently, this observation needs to be experimentally validated.
(3) ndhA. All the InDels found in ndhA were present in introns while the protein-coding regions (exons) were highly conserved. This indicates high diversifying selection on intronic region of this gene. Out of the total 14 InDels most of the InDels were observed with respect to C. annuum (InDels 1, 2, 5, 7, and 10). InDel 10 was observed to be shared by C. annuum and S. lycopersicum in full and by C. annuum and A. belladonna in part.
(4) rpl32. In rpl32 insertion of 1 bp in D. stramonium and 3 bp in C. annuum was found in the 3 region of gene while a deletion of 4 bp was observed in D. stramonium.
The insertion of 3 bp in C. annuum only altered the length of the protein by making it longer by 1 amino acid. However, the small insertion of 1 bp in D. stramonium proved to be a frameshift mutation resulting in three changes in the amino acid sequence near the C-terminus. Moreover, deletion of 4 bp at the 3 end resulted in premature termination of protein synthesis. The frameshift mutation and the 3 end deletion finally reduced the gene product length by 1 amino acid. As the C-terminal of amino acid chain is well conserved in all the other species, the effect of above mentioned variations needs to be validated experimentally.
(5) rps16. In rps16 also InDels were observed in introns as well as exons. Five of the major insertions in the intron regions were species specific. Insertion of 38 bp (InDel 1), 9 bp (InDel 2), 5 bp (InDel 7), 4 bp (InDel 8), and 6 bp (InDel 9) was observed in A. belladonna, S. lycopersicum, D. stramonium, and C. annuum. A deletion of 5 bp was observed in all the three Solanum species and C. annuum. A deletion of 9 bp was observed in all Nicotiana species resulting in an amino acid change (P to S) and shortening of protein by three amino acids in the C-terminal region. Similar deletion has also been observed by Kahlau et al. [27] and was suggested to be functionally neutral.
(6) sprA. sprA gene has been reported as stable noncoding RNA of unknown function. This gene has been suggested to influence 16S rRNA maturation [38,39]. In many species this gene seems to be present as remnant and shows large variations in its 5 region. The largest deletion of 109 bp was observed in C. annuum. The rest of this gene appears to be more conserved with a deletion towards the 3 end in all Nicotiana species and A. belladonna. The manner in which this gene functions and the consequences of the above mentioned variations are yet to be investigated experimentally.
(7) trnA-UGC. In this particular gene a long deletion of 102 bp was observed in all Nicotiana species. Interestingly, this deletion was further extended to 130 bp in both directions in A. belladonna. These deletions were found in the intron region and so are unlikely to have any negative impact on gene product function.
(8) trnL-UAA. The trend of variation in trnL-UAA was similar to that in ndhA as all InDels were observed in introns. The longest species specific deletion (InDel 3) was observed in C. annuum whereas short insertion of four nucleotides, a repeat of "T, " was observed specifically in D. stramonium. Another insertion of 6 bp was observed in two Nicotiana species, that is, N. sylvestris and N. tabacum.

Phylogenetic Analysis of Solanaceous Plastomes.
Evolutionary relationships between diverse plant species have been analyzed using several plastome markers including matK and rbcL (genes) or trnH-psbA and trnL-trnF (intergenic regions) due to sequence conservation among plant taxa blended with suitable variation [40,41]. However, determination of phylogeny based on single gene sequences may be inaccurate [42]. Availability of complete chloroplast sequences for many species has made it possible to use many individual genes or concatenated gene sequences to deduce phylogenetic relationships among taxa [42][43][44].
Efforts have been made to carry out phylogenetic analysis of solanaceous species using complete plastome sequences by Moore et al. [44] and Jansen et al. [45]. Evolutionary positions of Capsicum and Datura in Solanaceae have been determined using a single or a few plastid genes [46,47]. Recently, concatenated protein-coding gene sequences from completely sequenced plastomes were used to obtain reasonable phylogenetic relationships for solanaceous species [29]. In the present investigation we also used a similar approach to analyze the phylogenetic relationship for ten solanaceous species with completely sequenced plastomes. Individual multiple sequence alignments were concatenated for maximum likelihood phylogenetic tree generation. As depicted in Figure 3, taxa were divided into two clades with 100% bootstrap value of 500. The first clade consisted of four Nicotiana species while the species in Solanum, Capsicum, Atropa, and Datura were included in the second clade. These results are in line with previous phylogenetic analyses using concatenated protein-coding gene sequences as well as phylogenetic analyses using plastid ndhF and trnL-F sequences [29,47]. However, in an analysis of 13 orfs of solanaceous plastomes, a different arrangement was shown in which Atropa was shown to be separated from Solanum and was grouped together with Nicotiana [25].

Conclusions
The analyses of complete plastid genomes of ten solanaceous species revealed overall similarity in terms of the gene content and organization. The sizes of LSC, SSC, and IR regions were found to be somewhat conserved among species but a significant variation was found between genera. Most of the coding regions were well conserved. However, many of the features in few genes were observed to be typical of a particular genus and even species, which can be used as molecular markers to investigate genetic diversity and evolution. These typical variations can be further utilized to develop more sophisticated DNA barcoding based techniques. Ten solanaceous species are divided into two clades on the basis of Phylogenetic analysis using concatenated alignment of gene sequences from coding regions of plastomes. The first clade consisted of four Nicotiana species and the second clade consisted of species of Solanum, Capsicum, Atropa, and Datura.