Chloroplasts are key organelles in the management of oxygen in algae and plants and are therefore crucial for all living beings that consume oxygen. Chloroplasts typically contain a circular DNA molecule with nucleus-independent replication and heredity. Using “palindrome analyser” we performed complete analyses of short inverted repeats (S-IRs) in all chloroplast DNAs (cpDNAs) available from the NCBI genome database. Our results provide basic parameters of cpDNAs including comparative information on localization, frequency, and differences in S-IR presence. In a total of 2,565 cpDNA sequences available, the average frequency of S-IRs in cpDNA genomes is 45 S-IRs per kbp, significantly higher than that found in mitochondrial DNA sequences. The frequency of S-IRs in cpDNAs generally decreased with S-IR length, but not for S-IRs 15, 22, 24, or 27 bp long, which are significantly more abundant than S-IRs of other lengths. These results point to the importance of specific S-IRs in cpDNA genomes. Moreover, comparison of S-IR similarities by Levenshtein distance showed that a limited number of S-IR sequences are shared by the majority of cpDNAs. S-IRs are not located randomly in cpDNAs, but are length-dependently enriched in specific locations, including the repeat region, stem, introns, and tRNA regions. The highest enrichment was found for 12 bp and longer S-IRs in the stem-loop region, followed by 12 bp and longer S-IRs located before the repeat region. On the other hand, S-IRs are relatively rare in rRNA sequences and around introns. These data show nonrandom and conserved arrangements of S-IRs in chloroplast genomes.
Inverted repeat sequences (IRs) play an important regulatory role in genomic DNA [
Chloroplasts are semiautonomous organelles; their origin dates back to over 1,000 million years ago, when an ancient cyanobacterium was engulfed by a eukaryotic cell (primary endosymbiotic event), which subsequently gave rise to glaucophytes, red algae, green algae, and plants [
The evolution of cpDNA genes is slower than that of nuclear genes [
A set of 2,566 complete cpDNA sequences was downloaded from the genome database of the National Center for Biotechnology Information (NCBI). We used the computational core of our DNA analyser software [
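To make the object of the search concrete, the following is a minimal sketch of how perfect inverted repeats can be located in a sequence: an arm, an optional spacer, and then the reverse complement of the arm. It is not the algorithm of the DNA analyser software. The function names, the `arm_len` and `max_spacer` parameters, their default values, and the restriction to perfect matches are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the search used in the study.
# arm_len, max_spacer and perfect matching are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq, arm_len=6, max_spacer=10):
    """Return (start, arm_len, spacer_len) for each perfect inverted repeat.

    An inverted repeat here is an arm of arm_len bases followed, after
    0..max_spacer spacer bases, by the reverse complement of that arm.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_len + 1):
        left = seq[start:start + arm_len]
        if set(left) - set("ACGT"):
            continue  # skip arms containing ambiguous bases such as N
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            right_start = start + arm_len + spacer
            right = seq[right_start:right_start + arm_len]
            if len(right) < arm_len:
                break  # ran off the end of the sequence
            if right == target:
                hits.append((start, arm_len, spacer))
    return hits


# GAATTC is its own reverse complement: arm GAA, no spacer, arm TTC.
print(find_inverted_repeats("AAGAATTCGG", arm_len=3, max_spacer=0))
```

The sketch scans one fixed arm length per call; covering a range of S-IR lengths would mean calling it once per length.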
We downloaded the genome feature tables from the NCBI database along with the cpDNA sequences and analyzed S-IR occurrence inside and around (before and after) the recorded features. Features were grouped by the name stated in the feature table file. This analysis produced a file listing feature names and the numbers of S-IRs found inside and around the features for each group of species analyzed. S-IRs were searched for inside feature boundaries and in predefined feature neighborhoods (we used ±100 bp; this value matters for the calculation of S-IR frequency in a feature neighborhood). We counted all S-IRs, and those of 8, 10, and 12 bp and longer, in the regions before, inside, and after features. The categorization of an S-IR according to its overlap with a feature or feature neighborhood is demonstrated by the example shown in Supplementary Figure
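The categorization rule can be sketched as a small function. This is one possible reading of the rule, not the authors' code: the function name, the half-open coordinate convention, the strand-agnostic meaning of "before" and "after" (lower and higher coordinates), and the handling of an S-IR that covers an entire feature are assumptions.

```python
# Illustrative sketch; coordinates are half-open [start, end) and strand
# is ignored, both of which are assumptions made for this example.
def classify_sir(ir_start, ir_end, feat_start, feat_end, flank=100):
    """Classify an S-IR relative to one annotated feature.

    Returns "inside", "before", "after", or None when the S-IR is not
    fully contained in the feature plus its flanking neighborhood.
    """
    if feat_start <= ir_start and ir_end <= feat_end:
        return "inside"  # fully overlaps the feature
    if ir_start < feat_start - flank or ir_end > feat_end + flank:
        return None  # sticks out of the neighborhood window
    if ir_start < feat_start and ir_end <= feat_end:
        return "before"  # in the upstream flank, possibly touching the feature
    if ir_start >= feat_start and ir_end > feat_end:
        return "after"  # in the downstream flank, possibly touching the feature
    return None  # S-IR covers the whole feature; left unclassified here


print(classify_sir(995, 1007, 1000, 2000))  # partial overlap on the left side
```

With a 100 bp flank, the frequency in a neighborhood is then the count of "before" (or "after") S-IRs divided by 0.1 kbp per feature, which is why the flank size matters for the frequency calculation.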
Similarity among S-IRs that are abundant in the cpDNA genomes was assessed with the Levenshtein algorithm, which measures the distance between two strings as the minimum number of deletions, insertions, or substitutions required to transform the source string into the target string [
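For reference, the Levenshtein distance can be computed with a standard dynamic-programming routine. The sketch below is a generic textbook implementation, not the code used in the study; the function name and the example sequences are illustrative.

```python
def levenshtein(source, target):
    """Minimum number of single-character deletions, insertions, or
    substitutions needed to turn source into target."""
    if len(source) < len(target):
        source, target = target, source  # keep the shorter string as the row
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# One substitution (T -> A) and one deletion (trailing T): distance 2.
print(levenshtein("ACGTACGT", "ACGAACG"))
```

Two S-IRs would count as similar under a cutoff such as a distance of 2 or less, the value mentioned for Supplementary Data 1.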
Exact taxid IDs of all analyzed groups (obtained from Taxonomy Browser via NCBI Taxonomy Database [
Cluster dendrogram of S-IR frequency data (Supplementary Table
cpDNAs are stored in the genome database in three taxonomic groups (Protists, Plants, and others) and four subgroups (Apicomplexans, Green Algae, Plants, and others). However, the vast majority of sequences belong to the Plants subgroup (2,278), compared to Apicomplexans (36) and Green Algae (107). Due to discrepancies in the number of sequenced cpDNA genomes in diverse groups (for example, the phylogenetically important group Euglenozoa has only 9 sequenced cpDNAs whereas Rosids and Asterids in the Pentapetalae group [
Numbers and frequencies of S-IRs according to size.
IR size | Amount in dataset | IR frequency per 1000 bp | IR size | Amount in dataset | IR frequency per 1000 bp | IR size | Amount in dataset | IR frequency per 1000 bp |
---|---|---|---|---|---|---|---|---|
6 | 10,351,040 | 26.899 | 15 | 13,370 | 0.035 | 24 | 1,619 | 0.004 |
7 | 4,157,127 | 10.803 | 16 | 6,641 | 0.017 | 25 | 1,005 | 0.003 |
8 | 1,656,101 | 4.304 | 17 | 5,505 | 0.014 | 26 | 760 | 0.002 |
9 | 637,184 | 1.656 | 18 | 3,595 | 0.009 | 27 | 1,231 | 0.003 |
10 | 264,249 | 0.687 | 19 | 2,783 | 0.007 | 28 | 450 | 0.001 |
11 | 113,649 | 0.295 | 20 | 2,676 | 0.007 | 29 | 350 | 0.001 |
12 | 50,229 | 0.131 | 21 | 2,108 | 0.005 | 30 | 302 | 0.001 |
13 | 27,833 | 0.072 | 22 | 3,090 | 0.008 | >30 | 1,105 | 0.004 |
14 | 13,935 | 0.036 | 23 | 1,577 | 0.004 | | | |
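The counts in the table above can be screened for lengths that are more abundant than their neighbors suggest. The sketch below compares each count with the geometric mean of the two adjacent lengths, which amounts to a log-linear interpolation. The interpolation rule, the function name, and the 1.25 ratio threshold are assumptions chosen for illustration, not the statistical test used in the study.

```python
from math import sqrt

# S-IR counts by length (6-30 bp), copied from the table above.
SIR_COUNTS = {
    6: 10_351_040, 7: 4_157_127, 8: 1_656_101, 9: 637_184, 10: 264_249,
    11: 113_649, 12: 50_229, 13: 27_833, 14: 13_935, 15: 13_370,
    16: 6_641, 17: 5_505, 18: 3_595, 19: 2_783, 20: 2_676,
    21: 2_108, 22: 3_090, 23: 1_577, 24: 1_619, 25: 1_005,
    26: 760, 27: 1_231, 28: 450, 29: 350, 30: 302,
}


def unexpectedly_abundant(counts, ratio_threshold=1.25):
    """Return lengths whose count exceeds the geometric mean of the two
    neighboring lengths by more than ratio_threshold (an arbitrary cutoff)."""
    flagged = []
    for length in sorted(counts):
        if length - 1 not in counts or length + 1 not in counts:
            continue  # endpoints have no two-sided interpolation
        expected = sqrt(counts[length - 1] * counts[length + 1])
        if counts[length] / expected > ratio_threshold:
            flagged.append(length)
    return flagged


print(unexpectedly_abundant(SIR_COUNTS))  # [15, 22, 24, 27]
```

With this cutoff the screen flags 15, 22, 24, and 27 bp. Lengths 17 and 20 also sit slightly above the interpolated value (ratios of about 1.1), so the outcome depends on the threshold chosen.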
Variability of length of cpDNAs. Box plots show sequence length interquartile ranges for different species groups. The whiskers represent the minimum and maximum values.
The total number of nucleotides in the 2,566 plastid genomes analyzed is 384,975,139 bp and we found 17,326,953 S-IRs. The average frequency is 47 (17-81) S-IR/kbp for green algae and 34 (29-59) S-IR/kbp for land plants. The differences between organisms are significant; 50% of cpDNAs have a frequency of 40 to 45 S-IR/kbp, but S-IR frequencies range from 26 S-IR/kbp in unicellular green algae of the order Mamiellales
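As a quick arithmetic check, the pooled frequency over the whole dataset follows directly from the two totals given above. The helper name is illustrative; note that this pooled value is not the same quantity as a per-genome or per-group average.

```python
TOTAL_BP = 384_975_139    # total nucleotides in the analyzed plastid genomes
TOTAL_SIRS = 17_326_953   # total S-IRs found


def sirs_per_kbp(n_sirs, n_bp):
    """S-IR frequency expressed per 1000 bp."""
    return n_sirs / (n_bp / 1000)


print(round(sirs_per_kbp(TOTAL_SIRS, TOTAL_BP), 1))  # 45.0
```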
Frequency of S-IRs in cpDNAs for subgroups and numbers of cpDNAs. The box plot shows the interquartile ranges of S-IR frequencies per 1000 bp in different species groups. Whiskers represent the minimum and maximum values.
Comparing S-IRs in individual organisms and subgroups shows a general decrease in frequency with increasing S-IR length, except for S-IRs 15, 22, 24, or 27 bp long, which are present more often than expected by approximation from neighboring values (Table
The detailed results of S-IR frequencies for all groups are summarized in Table
cpDNA sizes and S-IR frequencies and lengths.
Group name | Number of seq. | Median size [bp] | Shortest sequence | Longest sequence | IR/kbp (range) | Longest S-IR for 50% of seq. [bp] |
---|---|---|---|---|---|---|
Euglenozoa | 9 | 91,616 | Monomorphina aenigmatica (74,746 bp) | Euglena gracilis (143,171 bp) | 68 (56–79) | 18 |
Stramenopiles | 37 | 122,660 | Aureococcus anophagefferens (89,599 bp) | Cylindrotheca closterium (165,809 bp) | 57 (43–69) | 25 |
Rhodophyta | 60 | 171,284 | Cyanidioschyzon merolae (149,987 bp) | Bulboplastis apyrenoidosa (610,063 bp) | 59 (34–83) | 19 |
Chlorophyta | 90 | 157,916 | Ostreococcus tauri (71,666 bp) | Floydiella terrestris (521,168 bp) | 61 (27–102) | 27 |
Zygnemophyceae | 11 | 142,017 | Spirogyra maxima (129,954 bp) | Cosmarium botrytis (207,850 bp) | 51 (32–64) | 24 |
Bryophyta | 8 | 123,868 | Syntrichia ruralis (122,630 bp) | Takakia lepidozioides (149,016 bp) | 67 (44–78) | 32 |
Polypodiopsida | 49 | 151,126 | Diplazium unilobum (127,840 bp) | Lygodium japonicum (157,260 bp) | 38 (34–52) | 18 |
Acrogymnospermae | 85 | 127,659 | Cathaya argyrophylla (107,122 bp) | Macrozamia mountperriensis (166,341 bp) | 44 (38–50) | 23 |
Basal Magnoliophyta | 13 | 159,881 | Schisandra chinensis (146,859 bp) | Trithuria inconspicua (165,389 bp) | 40 (38–42) | 18 |
Magnoliidae | 41 | 159,443 | Cassytha filiformis (114,622 bp) | Piper kadsura (161,486 bp) | 40 (39–43) | 18 |
Alismatales | 14 | 163,856 | Zostera marina (143,877 bp) | Wolffiella ryophyte (169,337 bp) | 46 (40–50) | 22 |
Dioscoreales | 10 | 154,205 | Burmannia oblonga (39,386 bp) | Tacca leontopetaloides (162,477 bp) | 47 (42–63) | 22 |
Liliales | 41 | 152,677 | Amana wanzhensis (150,576 bp) | Heloniopsis tubiflora (158,229 bp) | 45 (42–46) | 18 |
Asparagales | 125 | 153,953 | Oberonia japonica (142,996 bp) | Cypripedium formosanum (178,131 bp) | 44 (42–64) | 24 |
Commelinids | 290 | 139,171 | Aegilops cylindrica (113,490 bp) | Carex neurocarpa (181,397 bp) | 41 (38–52) | 17 |
Early-Diverging Eudicotyledons | 49 | 157,817 | Kingdonia uniflora (147,378 bp) | Berberis koreana (166,758 bp) | 42 (39–45) | 19 |
Santalales | 9 | 128,744 | Schoepfia jasminodora (118,743 bp) | Erythropalum scandens (156,154 bp) | 45 (41–48) | 18 |
Saxifragales | 10 | 152,692 | Phedimus takesimensis (147,048 bp) | Liquidambar formosana (160,410 bp) | 41 (40–43) | 20 |
Caryophyllales | 32 | 151,686 | Carnegiea gigantea (113,064 bp) | Drosera rotundifolia (192,912 bp) | 45 (40–47) | 19 |
Asterids | 398 | 153,377 | Monotropa hypopitys (35,336 bp) | Adenophora divaricata (176,331 bp) | 43 (38–61) | 19 |
Rosids | 522 | 159,441 | Cytinus hypocistis (19,400 bp) | Pelargonium transvaalense (242,575 bp) | 45 (35–75) | 20 |
| | 662 | 155,196 | Pilostyles aethiopica (11,348 bp) | Pleodorina starrii (269,857 bp) | 46 (28–192) | 20 |
The NCBI genome database contains annotations for cpDNA sequences. The best described are gene (343,857), CDS (226,783), tRNA (91,586), exon (36,345), rRNA (18,719), and intron (11,028). Numbers of annotations at the time of analysis are given in Supplementary Table
Differences in S-IR frequency by DNA locus. The chart shows S-IR frequencies per 1000 bp between “gene” annotation and other annotated locations from the NCBI database. We analyzed frequencies of all S-IRs (all) and of S-IRs with lengths 8 bp and longer (8+), 10 bp and longer (10+), and 12 bp and longer (12+) within annotated locations (inside) and before (100 bp) and after (100 bp) annotated locations.
Based on the data from S-IR analyses we produced a cluster dendrogram of individual groups (Supplementary Figure
Based on PCA analysis (Supplementary Data In Rosids, the cpDNA of the holoparasitic plant In Asterids, the most divergent S-IR frequencies in cpDNA include two species of the family Orobanchaceae ( In Caryophyllales, the flytrap In Early-Diverging Eudicotyledons, In Commelinids, In Liliales, all three PCA clusters are very well distinguished from each other. In Dioscoreales, In Alismatales, the aquatic plant In Magnoliidae, In Basal Magnoliophyta, In acrogymnospermae, all three PCA clusters are very well distinguished from each other (Supplementary Data In Bryophyta, two species ( In Zygnemophyceae, In Chlorophyta, four species of family Ulvaceae ( In Rhodophyta, the strangest pattern of S-IR frequency in cpDNA was found in In Stramenopiles, In Euglenozoa,
DNA cruciforms are formed by S-IRs, and given their important roles in replication, transcription, and DNA stability, it is not surprising that S-IRs are also present in cpDNA genomes. Analyses of mtDNA genomes revealed that S-IR sequences are abundant and nonrandomly distributed in the mitochondrial genomes of all living organisms, with particular abundance in regulatory regions such as the replication origin and the D-loop region [
Chloroplasts, as basic organelles of algae and plants, are fundamental for life due to their oxygen management. In this paper, we analyzed all 2,566 sequenced cpDNAs with “palindrome analyser”. We described the basic parameters of cpDNAs, including the frequency and localization of S-IRs able to form cruciform structures. Interestingly, the frequency of S-IRs does not decrease for S-IRs 15, 22, 24, or 27 bp long. These results point to the importance of specific S-IRs in cpDNA genomes. Moreover, comparison of S-IR similarities by Levenshtein distance showed that a limited number of sequences are shared by the majority of cpDNA S-IRs. S-IRs are not located randomly, but are length-dependently enriched in specific locations, including the repeat region, stem, introns, and tRNA regions of cpDNA genomes. The highest enrichment was found for 12 bp and longer S-IRs in the stem-loop region, followed by 12 bp and longer S-IRs located before the repeat region. These data, which show nonrandom and conserved arrangements of S-IRs in chloroplast genomes, indicate the potential importance of S-IRs in basic biological processes within chloroplasts.
All data are freely accessible in the paper and in supporting materials.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The authors would like to thank Dr. Philip J. Coates for proofreading and editing the manuscript. This work was supported by the Grant Agency of the Czech Republic
Supplementary Figure S1: neighborhood of an annotated feature. Example of possible S-IR occurrence around features and its classification: (a) an S-IR overlapping only partially with a feature is considered to be in the near neighborhood; (b) an S-IR overlapping fully with a feature is considered to be inside; (c) an S-IR is not considered to be in the near neighborhood because it does not fully overlap with either a feature or its neighborhood.
Supplementary Figure 2: phylogenetic tree of all inspected organisms with a chloroplast genome, made using iTOL. Subgroups are highlighted by different colors. From left counterclockwise: Rosids (red, 522 species); Asterids (blue, 398 species); Caryophyllales (dark green, 32 species); Saxifragales (yellow, 10 species); Santalales (purple, 9 species); Early-Diverging Eudicotyledons (green, 49 species); Commelinids (red, 290 species); Asparagales (blue, 125 species); Liliales (yellow, 41 species); Dioscoreales (purple, 10 species); Alismatales (dark green, 14 species); Magnoliidae (orange, 41 species); Basal Magnoliophyta (green, 13 species); Acrogymnospermae (red, 85 species); Polypodiopsida (green, 49 species); Bryophyta (orange, 8 species); Zygnemophyceae (red, 11 species); Chlorophyta (purple, 90 species); Rhodophyta (green, 60 species); Stramenopiles (orange, 37 species); Euglenozoa (blue, 9 species).
Supplementary Code S1: method for construction of interactive PCA plots from S-IR data in R (version 3.4.0). The Excel input referred to in this analysis consisted of the values from the even S-IR length columns of Supplementary Table S1.
Supplementary Table S1: incidence of S-IRs. This table gives the ratio of presence of S-IRs by their length. Values were calculated by the following formula: (number of sequences containing at least one S-IR of a given length in a subgroup) / (total number of sequences in the subgroup). For example, the Alismatales subgroup contains 14 sequences in total, 9 of which have S-IRs of length 24, and thus 9 / 14 = 0.64.
Supplementary Table S2: statistical evaluation of results. This table contains statistical data about groups of sequences. The row denoted usual longest S-IR contains the length of S-IR that is present in roughly half of the sequences of that group; see Table S3.
Supplementary Table S3: feature amounts and length. This table shows the numbers of annotated features in all downloaded sequences and their lengths.
Supplementary Data 1: analysis of S-IR similarity for lengths 15, 22, 24, and 27. The first four sheets group identical S-IRs from all cpDNA sequences. The last four sheets group S-IRs that are similar, based on a Levenshtein distance of 2 or less. The upper part of each sheet contains the 20 most abundant S-IRs identified by rank. The lower parts of the sheets contain the incidence of given S-IRs by rank in different groups of species.
Supplementary Data 2.html: PCA plots. Interactive PCA plots intuitively represent differences in cpDNA S-IR frequencies of all main groups and intradifferences of cpDNA S-IR frequencies between organisms of each subgroup. Organisms with the most distinct patterns of S-IR frequencies in their cpDNA are always more distant from the middle of the plot.
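The incidence ratio defined for Supplementary Table S1 can be sketched as follows. The function name and the input layout (one set of observed S-IR lengths per sequence) are assumptions, and the toy data are made up solely to reproduce the 9 / 14 arithmetic of the Alismatales example.

```python
def incidence(sir_lengths_per_sequence, length):
    """Fraction of sequences in a subgroup that contain at least one
    S-IR of the given length."""
    total = len(sir_lengths_per_sequence)
    with_length = sum(1 for lengths in sir_lengths_per_sequence
                      if length in lengths)
    return with_length / total


# Made-up subgroup: 14 sequences, 9 of which contain a 24 bp S-IR.
toy_subgroup = [{6, 7, 24}] * 9 + [{6, 7}] * 5
print(round(incidence(toy_subgroup, 24), 2))  # 0.64
```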