Complex Analyses of Short Inverted Repeats in All Sequenced Chloroplast DNAs

Chloroplasts are key organelles in the management of oxygen in algae and plants and are therefore crucial for all living beings that consume oxygen. Chloroplasts typically contain a circular DNA molecule with nucleus-independent replication and heredity. Using “palindrome analyser” we performed complete analyses of short inverted repeats (S-IRs) in all chloroplast DNAs (cpDNAs) available from the NCBI genome database. Our results provide basic parameters of cpDNAs including comparative information on localization, frequency, and differences in S-IR presence. In a total of 2,565 cpDNA sequences available, the average frequency of S-IRs in cpDNA genomes is 45 S-IRs/per kbp, significantly higher than that found in mitochondrial DNA sequences. The frequency of S-IRs in cpDNAs generally decreased with S-IR length, but not for S-IRs 15, 22, 24, or 27 bp long, which are significantly more abundant than S-IRs with other lengths. These results point to the importance of specific S-IRs in cpDNA genomes. Moreover, comparison by Levenshtein distance of S-IR similarities showed that a limited number of S-IR sequences are shared in the majority of cpDNAs. S-IRs are not located randomly in cpDNAs, but are length-dependently enriched in specific locations, including the repeat region, stem, introns, and tRNA regions. The highest enrichment was found for 12 bp and longer S-IRs in the stem-loop region followed by 12 bp and longer S-IRs located before the repeat region. On the other hand, S-IRs are relatively rare in rRNA sequences and around introns. These data show nonrandom and conserved arrangements of S-IRs in chloroplast genomes.


Introduction
Inverted repeat sequences (IRs) play an important regulation role in genomic DNA [1]. It has been demonstrated that short IRs (S-IRs) are essential for DNA replication in both prokaryotes and eukaryotes [2]. Depending on their length, sequence, and crowding conditions, S-IRs can form a local DNA structure called cruciform, which is a target of numerous DNAbinding proteins with important regulatory functions [3]. Among these proteins are cruciform resolving proteins like E. coli enzyme RuvA [4,5] and proteins important for human diseases including the 14-3-3 [6] and p53 protein families [7,8]. Similarly, high mobility group-(HMG-) like cruciform binding proteins have also been purified from plants [9] and it was demonstrated that IRs in chloroplast DNA (cpDNA) genomes play a role in evolution and various regulatory processes [10,11]. Both chloroplasts and mitochondria retain their own genomes and transcription machinery. S-IRs in mitochondrial DNA (mtDNA) are located nonrandomly with high abundance in the replication origin, D-loop, and stemloop sequences [12]. While mitochondria are present in all eukaryotes, chloroplasts are exclusive for algae and plants. Therefore, it is thought that mitochondria were engulfed before the origin of chloroplasts by a common ancestor. This is also supported by the fact that mtDNA is usually shorter than cpDNA.
Chloroplasts are semiautonomous organelles; their origin dates back to over 1,000 million years ago, when an ancient cyanobacterium was engulfed by a eukaryotic cell (primary endosymbiotic event), which subsequently gave rise to glaucophytes, red algae, green algae, and plants [13] and probably most recently to the filose amoeba Paulinella 2 BioMed Research International chromatophore [14]. Subsequent secondary, serial secondary, and tertiary endosymbiotic processes have been important during the evolution of modern photosynthetic eukaryotes [15]. In land plants, cpDNA sequences are highly conserved and the genome can be divided into four different regions: large single copy (LSC) and small single copy (SSC), which are separated from each other by two large internal repeat regions. A cpDNA genome typically contains between 120 and 130 genes, mainly involved in photosynthesis, transcription, and translation. The cpDNA genome is usually between 107 kbp (Cathaya argyrophylla, Pinaceae family) and 218 kbp (Pelargonium, Geraniaceae family) in size. However, smaller cpDNA genomes are also common; for example, Apicomplexa genomes are around 30 kbp. On the other hand, the chromophore genome of Paulinella chromatophora is larger than some complete bacterial genomes, with 1 022 kbp due to the unique recent engulfing event. Chloroplast genome size is independent of nuclear genome size [16].
The evolution of cpDNA genes is slower than that of nuclear genes [17], but faster than that of mtDNA genes [18]. Many of today's algae and almost all land plants carry two identical copies of a large coding sequence called the large IR A and IR B regions (varying from 20 kbp to 36 kbp), whose sequences may differ remarkably in individual species. In algae, the whole large IR has been lost multiple times during evolution [19]. In the parasitic and often nonphotosynthetic plants of Orobanchaceae family, the independent losses of one large IR region occur; Conopholis americana has the smallest cpDNA genome of land plants (45 kb) [20]. Loss of the large IR was also observed in some legumes (Papilionoideae subfamily, for example) [21]. Recently, it was found that, compared to single copy genes, synonymous substitution rates are on average 3.7-fold slower in chloroplast large IR genes (mainly coding for ribosomal, transfer RNAs, and ribosomal proteins). This may be due to the duplicative nature of the large IR, which reduces the substitution rate within these regions [22]. cpDNA rearrangements are more frequent when a large IR region is lost, suggesting that large IR regions are important for maintaining the conserved arrangement of cpDNA sequences [23]. In our analyses, we focused only on "short" IRs (6-60 bp) which are able to form cruciform structures. We analyzed these S-IRs in all sequenced cpDNA genomes to determine frequencies, localization, and similarities/differences.

2.1.
Analyses of cpDNA Sequences. The set of 2,566 complete plastid cpDNA sequences were downloaded from the genome database of the National Center for Biotechnology Information (NCBI). We used the computational core of our DNA analyser software [24] modified to read NCBI identifiers of sequences from a text file and to download them. The parameters of analyses were size of S-IR 6 to 60 bp and spacer size 0 to 10 bp, and maximally one mismatch was allowed. Analysis produced a separate list of S-IRs found in each of the 2,566 cpDNA sequences and an overall report for each selected group. Overall results for each species group contained a list of species with the size of its cpDNA sequence and the number of S-IRs found in that sequence. Our software also counted S-IRs grouped by their individual size (6-60 bp individually and the sum of S-IRs longer than 8 bp, 10 bp, and 12 bp).

Analysis of S-IRs around Annotated NCBI Features.
We downloaded the genome feature tables from the NCBI database along with the cpDNA sequences. We performed analysis of S-IR occurrence inside and around (before and after) recorded features. Features were grouped by their name stated in the feature table file. From this analysis we obtained a file with feature names and numbers of S-IRs found inside and around features for each group of species analyzed. Search for S-IRs took place in predefined feature neighborhoods (we used ±100 bp; this figure is important for calculation of S-IR frequency in feature neighborhood) and inside feature boundaries. We calculated the amount of all S-IRs and those longer than 8, 10, and 12 bp in regions before, inside, and after features. The categorization of an S-IR according to its overlap with a feature or feature neighborhood is demonstrated by the example shown in Supplementary Figure S1. Further processing was performed in Microsoft Excel.

Analyses of Similarities.
Similarity among S-IRs was performed for those with abundant presence in the cpDNA genomes by Levenshtein algorithm, which counts distance between two strings according to the number of deletions, insertions, or substitutions required to transform source string into target string [25].

Phylogenetic Tree Construction.
Exact taxid IDs of all analyzed groups (obtained from Taxonomy Browser via NCBI Taxonomy Database [26]) were downloaded to phyloT: a tree generator (http://phylot.biobyte.de) and a phylogenetic tree was constructed using function "Visualize in iTOL" in Interactive Tree of Life environment [27]. The resulting tree is shown in Supplementary Figure 2.

Statistical Analysis.
Cluster dendrogram of S-IR frequency data (Supplementary Table S1) was made in R v. 3. 4. 3 (R Core Team, 2016) using the hclust function with the cluster method "ward.D2". The resulting cluster dendrogram is shown in Supplementary Figure S2. Principal component analysis (PCA) interactive plots were made in R with ggplot2 [28] and plotly [29]. The R code is available in Supplementary Code S1. Frequency of S-IRs (categorized by length) in individual species groups was used as input data, so one PCA plot was constructed for each species group to display intragroup variability.

Results and Discussion
3.1. General Statistics for cpDNA. cpDNAs are stored in the genome database in the three taxonomy groups (Protists, Plants, and others) and four subgroups (Apicomplexans,  (107). Due to discrepancies in the number of sequenced cpDNA genomes in diverse groups (for example, the phylogenetically important group Euglenozoa has only 9 sequenced cpDNAs whereas Rosids and Asterids in the Pentapalae group [30] each have more than 400), we divided downloaded sequences into 21 phylogenetically related groups with a minimum of eight members in each group (see Supplementary Figure 2). This division allows us to observe detailed trends in S-IR frequency evolution across the Plant kingdom (from evolutionary oldest Bryophyta to Polypodiopsida, Acrogymnospermae, Basal Magnoliophyta, Magnoliidae, Alismatales, Dioscoreales, Liliales, Asparagales, Commelinids, Early-Diverging Eudicotyledons, Santalales, Saxifragales, Caryophyllales, to Asterids and Rosids). In total, we have analyzed 2,565 plastid genomes and, in addition, the chromatophore genome of amoeba Paulinella chromatophora. The length of cpDNA sequences (Table 1) varies from 11,348 bp (Pilostyles aethiopica, endoparasitic land plant which preserved only 17 chloroplast genes of the usual number of 116 cpDNA genes [16]) to 610,063 bp in Bulboplastis apyrenoidosa algae from the Rhodophyta division. Recently, a plastid DNA was described for red algae Corynoplastis japonica as the largest and most intron-rich plastid genome (1.13 Mbp) [31]. However, the majority of cpDNA genomes varied in the close interval between 120 and 150 kbp. The cpDNA genome of Paulinella chromatophore, which has a plastid genome from a different engulfment event estimated only 60 million years ago, is 1,021,616 bp [14], longer than the whole genomes of individual prokaryotic organisms. Aside from this unique organism, the longest cpDNAs are typical for Rhodophyta, Chlorophyta, and Zygnemophyceae, and the shortest for unicellular organisms in the phylum Euglenozoa ( Figure 1). Length variability is generally correlated with evolutionary age. The largest variability is observed in the group Chlorophyta and Rhodophyta while the sizes of cpDNA in group Liliales and other land plants are relatively constant. The variability of length in higher plants is limited compared to lower plants (bryophyta, algae, etc.) ( Figure 1). Contrary to the diverse sizes of cpDNA in ancient phylogenetic groups, the size of the cpDNA in phylogenetically newer groups is more limited with only a few exceptions. For example, 75% of all 522 sequenced cpDNA genomes in clade Rosids are in the size interval of 159,441 to 160,886 bp, differing by less than 1%.  Supplementary Table  S2.

Analyses of Short Inverted
Comparing S-IRs in individual organisms and subgroups shows a general decrease in frequency with increasing S-IR length, except for S-IRs 15, 22, 24, or 27 bp long, which are present more often than expected by approximation from neighboring values (Table 1). Similar selective abundance according to length has also been observed for mtDNAs, but only for S-IR lengths 24 and 30 bp. Both 24 and 30 bp long S-IRs are also more frequent than expected in cpDNA sequences, but there is very strong relative abundance also for 15, 22, and 27 bp S-IRs. We investigated if the S-IRs abundantly present in cpDNA genomes are similar using analyses of similarity by Levenshtein distance (Supplementary Data 1). We used two different conditions for the     Table 2. The most common longest S-IRs varied from 17 bp (in Commelinids) to 32 bp (in Bryophyta). The most common longest S-IR in vascular plants (Tracheophyta) (24 bp) was found in Asparagales. We also found 1,105 S-IRs longer than 30 bp in cpDNA genomes (range 31-100 bp), but they constituted less than 0.01% of the total number of 17,326,953 S-IRs identified.

Comparison of S-IR Frequencies according to Sequence
Annotations. The NCBI genome database contains annotations for cpDNA sequences. The best described are gene (343,857), CDS (226,783), tRNA (91,586), exon (36,345), rRNA (18,719), and intron (11,028). Numbers of annotations at the time of analysis are given in Supplementary Table S3. To compare S-IR frequencies at different locations we used the most commonly described location "gene" as a standard for comparison with other locations. There are significant differences in S-IR frequency in diverse segments of cpDNAs. The largest relative increase of S-IR frequency is for stemloop sequences (12.8-times higher) followed by regulatory (3.2x), intron (2.4x), and misc sequences (2.7x) (Figure 3). The presence of the "large internal repeat region" is typical for higher plants. The difference between S-IR frequency in the large internal repeat regions and gene regions is only 1.7-fold, whereas S-IRs are 4.6-fold higher in the sequences surrounding these features and the frequency of 12 bp S-IRs and longer is 7.6 times higher before the repeat region compared to gene regions. Another interesting finding is the presence of S-IRs in introns; while S-IRs are abundant in introns, S-IR frequencies are significantly reduced in exon sequences that neighbor introns (both 100 bp before and after introns). We observed significantly reduced S-IR frequencies in and around rRNAs compared to gene locations, especially for 12 bp and longer S-IRs. The opposite effect is seen for tRNA sequences. Raw data are in Supplementary Tables S4  and S5.
Based on the data from S-IR analyses we produced a cluster dendrogram of individual groups (Supplementary Figure 2). The real phylogenetic tree is fairly consistent with this cluster dendrogram. Very interesting is the proximity of Euglenozoa and Bryophyta clades. However, this may be due the fact that both groups contain only a small number of accessible sequenced genomes (only 9 and 8 cpDNA genomes are available for Euglenozoa and Bryophyta, respectively).
Based (ii) In Asterids, the most divergent S-IR frequencies in cpDNA include two species of the family Orobanchaceae (Phelipanche ramosa and Phelipanche purpurea), which are nonphotosynthetic flowering plants. Their S-IR frequencies are unusually high in comparison to the rest of the Asterids.
(iii) In Caryophyllales, the flytrap Dionaea muscipula has an unusually low S-IR frequency in cpDNA.
(iv) In Early-Diverging Eudicotyledons, Gymnospermium microrrhynchum has the highest S-IR frequency, especially for S-IRs longer than 12 bp.
(v) In Commelinids, Carex neurocarpa has the highest S-IR frequency in cpDNA. In Asparagales, Cypripedium formosanum (endemic orchid of Taiwan) has a very high S-IR frequency in cpDNA.
(vi) In Liliales, all three PCA clusters are very well distinguished from each other. In Dioscoreales, Burmannia oblonga has the highest S-IR frequency in cpDNA.
(vii) In Alismatales, the aquatic plant Najas flexilis has a low S-IR frequency in cpDNA in comparison to the rest of the group. The plastid genome of Najas flexilis presents some anomalous modifications (reduced size of a small single copy region, eleven ndh gene losses) [21].    (xiv) In Rhodophyta, the strangest pattern of S-IR frequency in cpDNA was found in Cyanidioschyzon merolae, a small unicellular haploid red alga adapted to hot spring environment with high sulfur acidic content. Cyanidioschyzon merolae has only one chloroplast and mitochondria. After detailed analysis we have found that cpDNA of Cyanidioschyzon merolae has an extremely low content of S-IRs in comparison to the rest of the Rhodophyta group.
(xv) In Stramenopiles, Aureococcus anophagefferens (unicellular alga with only one chloroplast and mitochondrion) and Aureoumbra lagunensis have the lowest S-IR frequency in cpDNA in comparison with the rest of the group. In Aureococcus anophagefferens and Aureoumbra lagunensis the large IR regions were lost [22].
(xvi) In Euglenozoa, Monomorphina aenigmatica has the highest S-IR frequency and the plastid genome of this species is AT rich (70.6%) and contains 53 intron insertion sites, of which 41 were found to be shared with other euglenids [23].
DNA cruciforms are formed by S-IRs and with their important roles in replication, transcription and DNA stability it is not surprising that S-IRs are also present in cpDNA genomes. Analyses of mtDNA genomes revealed that S-IR sequences are abundant and nonrandomly distributed in the mitochondrial genomes of all living organisms, with particular abundance in regulatory regions such as replication origin and D-loop region [12]. Here, we analyzed all available cpDNA genomes for the presence of S-IRs capable of forming cruciform structures. Our results show that the typical maximal S-IR length in cpDNA varied from 17 bp (Commelinids) to 32 bp (Bryophyta). Interestingly, substantial numbers of longer S-IRs are detected in some cpDNAs. While the mean frequency of S-IR was ∼45 IRs/kbp, the frequency was remarkably higher in some plastid genomes. In 13 plastid genomes, the frequency was higher than 100 S-IR/kbp, most of these genomes were from Apicomplexa, but also red algae Choreocolax polysiphoniae, which has reduced gene regions [32]. These organisms typically contain an AT-rich relict plastid which functions in fatty acid biosynthesis, heme biosynthesis, iron-sulfur cluster synthesis, and isoprenoid biosynthesis, but they are deficient in photosynthetic abilities [33]. Long IRs (about 20-36 kbp long) in cpDNA can help stabilize the rest of the chloroplast genome and prevent gene loss-and-gain rearrangements [34]. Interestingly, these longer IRs are surrounded by short S-IRs. While the difference between S-IR frequency in the repeat regions and gene regions is relatively small, the frequency of 12 bp and longer S-IRs in the surrounding sequences is 7.6 times higher before the repeat region. It is likely that S-IRs in cpDNA also form cruciform structures in vivo and these are targets for binding of specific proteins (our hypothesis is based on cruciform functions and preferential protein binding to cruciforms in human and other model organisms). Some information is available to support this concept in plants; ribosomal protein S16 (RPS16) binds to cruciform DNA and is targeted to the chloroplast, indicating the possibility of RPS16-cruciform cpDNA interactions and regulation [35]. Moreover, S-IRs were found to be highly enriched in stem-loop sequences, which have an important regulatory function in cpDNAs.

Conclusions
Chloroplasts, as basic organelles for algae and plants, are fundamental for life due to their oxygen management. In this paper, we analyzed all 2,566 sequenced cpDNA by "palindrome analyser". We described the basic parameters of cpDNA including the frequency and localization of S-IRs able to form cruciform structures. Interestingly, the frequency of S-IR does not decrease for S-IRs 15, 22, 24, or 27 bp long. These results point to the importance of specific S-IRs in cpDNA genomes. Moreover, comparison by Levenshtein distance of S-IR similarities showed that a limited number of sequences are shared in the majority of cpDNA S-IRs. S-IRs are not located randomly, but are lengthdependently enriched in specific locations, including the repeat region, stem, introns, and tRNA regions of cpDNA genomes. The highest enrichment was found for 12 bp and longer S-IRs in the stem-loop region, followed by 12 bp and longer S-IRs located before the repeat region. These data showing nonrandom and conserved arrangements of S-IRs in chloroplast genomes indicate the potential importance of S-IRs in basic biological processes within chloroplasts.

Data Availability
All data are freely accessible in the paper and in supporting materials.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper. versity of Ostrava; and by the SYMBIT project Reg. no. CZ.02.1.01/0.0/0.0/15 003/0000477 financed from the ERDF.