Comparison Analysis Based on Complete Chloroplast Genomes and Insights into Plastid Phylogenomic of Four Iris Species

Iris species, commonly known as rainbow flowers because of their attractive flowers, are extensively grown in landscape gardens. A few species, including I. domestica (synonym Belamcanda chinensis) and I. tectorum, are known for their medicinal properties. However, research on the genomes and evolutionary relationships of Iris species is scarce. In the current study, the complete chloroplast (CP) genomes of I. tectorum, I. dichotoma, I. japonica, and I. domestica were sequenced and compared to support species identification and clarify their relationships. The CP genomes of the four Iris species were circular quadripartite with similar lengths, GC contents, and codon usages. A total of 113 specific genes were annotated, including the ycf1 pseudogene in all species and the rps19 pseudogene in I. japonica alone. All the species had mononucleotide (A/T) simple sequence repeats (SSRs) and long forward and palindromic repeats in their genomes. A comparison of the CP genomes based on mVISTA and nucleotide diversity (Pi) identified three highly variable regions (ndhF-rpl32, rps15-ycf1, and rpl16). Phylogenetic analysis based on the complete CP genomes indicated that I. tectorum is sister to I. japonica and that the I. domestica sequences cluster into one branch within subgenus Pardanthopsis that is sister to I. dichotoma. These findings confirm the feasibility of superbarcodes (complete CP genomes) for Iris species authentication and could serve as a resource for further research on Iris phylogeny.


Introduction
Iris (L.) is a genus of flowering plants comprising 300 species of the Iridaceae family classified into six subgenera (subg.) [1,2]. These species, commonly called rainbow flowers, are found in the temperate regions of the northern hemisphere and are widely used in landscape gardens because of their beautiful and colorful flowers [3]. Most Iris species can adapt to dry environments, such as deserts, semideserts, or rocky habitats, and a few live in mesic and wetland areas [4]. Iris species are also used as medicinal plants. Several pharmacological studies have shown that the rhizome extracts of Iris species have anticancer, anti-inflammatory, and α-glucosidase inhibitory effects and can reduce human infarct volume [5][6][7]. A few species are used to treat throat-swelling diseases [8]. The dried rhizomes of I. tectorum and I. domestica, referred to as "Chuan She Gan" and "She Gan," respectively, are used in traditional Chinese medicine, but "She Gan" is often adulterated with the dried rhizomes of I. dichotoma and I. japonica. Therefore, reliable identification of these four species is needed for clinical safety.
Iris species are characterized by fan-shaped leaves, three colorful outer perianth segments, three inner perianth segments, three petaloid stigmas with a bifid crest, and underground tuberous organs [9]. However, these species have similar leaf shapes, flower shapes, and rhizome morphological characteristics. Therefore, identification based on morphological features alone is complicated, especially during the nonflowering period. The development of I. domestica and I. dichotoma hybrids has also made species identification challenging owing to the similarities between the hybrids and female parents [10]. Molecular phylogeny combined with palynology suggested that I. tectorum is distantly related to I. japonica [11], which is inconsistent with classical taxonomy, in which the two species are closely related. I. tectorum is a species of section (sect.) Lophiris of subg. Limniris. Sect. Lophiris contains 13 species distributed in Eastern Asia; Dykes included this rank in sect. Evansia [12], but this rank was later amended by Lawrence to subsection Evansia [13], by Rodionenko to subg. Crossiris [14], and finally by Mathew to sect. Lophiris of subg. Limniris [2]. Molecular phylogeny placed I. domestica in subg.
The current study sequenced the complete CP genomes of I. tectorum, I. dichotoma, I. japonica, and I. domestica. The study's major objectives were to (1) characterize the complete CP genome structure and functional genes, (2) analyze the codon usage, (3) identify the SSRs and long repeats, and (4) compare the whole CP genomes of Iris species to screen highly variable regions. The genomes were further used to uncover the phylogenetic relationships among Iris species. The findings will lay a foundation for classifying the species and elucidating the phylogeny of Iridaceae.
2.2. DNA Extraction and Sequencing. Total DNA was extracted from the leaf samples by using the DNeasy Plant Mini Kit (Qiagen Co., Hilden, Germany). DNA quality was assessed by 1% agarose gel electrophoresis. The libraries (average insert size, 350 bp) were generated from total DNA and sequenced on an Illumina NovaSeq 6000 system.

Genome Structure and Codon Usage Analyses.
Furthermore, MEGA X [50] was used to examine the GC content of the genome. CodonW version 1.4.2 was used to calculate codon usage as the relative synonymous codon usage (RSCU) value, interpreted as follows: RSCU = 1 indicates no preference in codon usage, RSCU > 1 indicates that the codon is used more frequently than expected, and RSCU < 1 indicates that the codon is used less frequently than expected [51, 52].
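The RSCU calculation can be illustrated with a minimal Python sketch. It is not the CodonW implementation: the function name `rscu`, the toy sequence, and the use of the standard genetic code's amino acid assignments are assumptions made for illustration. For each amino acid, the observed count of a codon is divided by the count expected if all synonymous codons were used equally, so values above 1 mark codons used more often than the synonymous average.

```python
from collections import Counter, defaultdict

# Amino acid assignments of the standard genetic code, bases ordered T, C, A, G.
# Assumption: the source does not state which translation table CodonW used.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds):
    """Return {codon: RSCU} for every amino acid observed in an in-frame CDS."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by the amino acid (or stop) they encode.
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                          # amino acid absent from this CDS
        expected = total / len(family)        # equal use of all synonyms
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


# Hypothetical toy input: four alanine codons, three of them GCT.
toy = rscu("GCTGCTGCTGCA")
print(toy["GCT"], toy["GCA"], toy["GCC"])
```

In the toy input, GCT is used three times as often as the synonymous average, GCA exactly as often, and GCC and GCG not at all.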

SSR and Long Repeat Sequence Analyses.
The SSRs were identified using the MIcroSAtellite identification tool (MISA) version 2.1 [53,54], with the parameters described by Cui et al. [55]. In addition, the forward (F), palindromic (P), reverse (R), and complement (C) types of long repeat sequences of different sizes in the CP genomes were searched using REPuter version 3.0 [56], with a minimum repeat size of 30 bp and a Hamming distance of 3.
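To make the SSR search concrete, the following Python sketch finds perfect SSRs with a regular expression. It is a simplified stand-in for the SSR tool, not its actual algorithm: the names `find_ssrs`, `is_primitive`, and `MIN_REPEATS` are hypothetical, and the minimum repeat counts are illustrative values only, because the thresholds used in this study follow Cui et al. [55] and are not restated here. Compound SSRs and the long-repeat search are not covered.

```python
import re

# Illustrative minimum repeat counts for motif lengths of 1-6 bp.
# Assumption: these are NOT the study's settings, which follow Cui et al. [55].
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_count in min_repeats.items():
        # A motif of unit_len bases followed by at least min_count - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_count - 1))
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_primitive(motif):
                count = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, count))
                pos = match.end()
            else:
                # e.g. "AA" is really the mononucleotide "A"; step past it.
                pos = match.start() + 1
    return sorted(hits)


# Hypothetical toy sequence with one A(12) and one AT(6) repeat.
toy = "GC" + "A" * 12 + "CGCG" + "AT" * 6 + "G"
print(find_ssrs(toy))
```

The primitive-motif check keeps a run such as AAAAAAAAAAAA from being reported a second time as a dinucleotide or trinucleotide repeat.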
2.6. Comparative Genome Analysis. The CP genomes from I. tectorum, I. dichotoma, I. japonica, and I. domestica were aligned using the mVISTA program [57]. The sequences of the shared genes in the four Iris species and the complete CP genomes were further aligned using MAFFT version 7 [58]. Nucleotide diversity (Pi) was calculated using DnaSP version 6 [59] to identify the divergence hotspot regions among the four species.
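As an illustration of the Pi statistic, the sketch below computes nucleotide diversity for one aligned region as the average number of pairwise differences per site. It is a minimal example rather than the DnaSP procedure: the function name `nucleotide_diversity` and the toy alignment are hypothetical, and skipping every column that contains a gap or ambiguous base is an assumption about how such sites are handled.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site (Pi) for aligned sequences."""
    if len(aligned) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must have equal aligned length")

    seqs = [s.upper() for s in aligned]
    # Assumption: use only columns where every sequence has an unambiguous base.
    columns = [i for i in range(length) if all(s[i] in "ACGT" for s in seqs)]
    if not columns:
        return 0.0

    pairs = list(combinations(seqs, 2))
    differences = sum(
        sum(a[i] != b[i] for i in columns) for a, b in pairs
    )
    return differences / (len(pairs) * len(columns))


# Hypothetical alignment of one region across four sequences.
region = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "ACG-ACGA"]
print(round(nucleotide_diversity(region), 5))
```

Applying such a function separately to each shared gene and each IGS alignment gives per-region values that can be ranked to locate divergence hotspots.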
2.7. Phylogenetic Analysis. Twenty-two CP genomes of Iris species were downloaded from NCBI to construct a phylogenetic tree using the maximum likelihood (ML) method in IQ-TREE version 2 with 1000 bootstrap replicates. Sisyrinchium angustifolium (NC_056184) was used as the outgroup (Table S5). The optimal nucleotide substitution model, TVM+F+R3, determined by ModelFinder [60] in IQ-TREE [61], was used for the ML analysis.

Results and Discussion
3.1. CP Genomes of Four Iris Species. Generally, sequences are chosen for molecular taxonomy, and fast (slow) molecular changes correspond to recent (old) evolution time [62]. The structure and components of the genome contribute to the nucleotide substitution rate [63,64]. The whole CP genome is suitable for species identification and for resolving relationships because of its moderate rate of molecular change [65]. The current study sequenced and analyzed the CP genomes of the four Iris species to support their authentication and clarify their relationships. Illumina NovaSeq 6000 system sequencing generated 8. (Table 1, Table S1) and were distributed unevenly across the four parts. The GC content, illustrated in dark gray in Figure 1, was the highest in the IR region (42.97%-43.05%). This finding is probably due to the rRNA genes (rrn4.5, rrn5, rrn16, and rrn23) with less duplicated AT nucleotides [66,67]. The LSC (35.97%-36.16%) and SSC (31.40%-31.49%) regions followed IR in terms of GC content; therefore, IR is highly conserved. Moreover, the protein-coding regions (CDS) had lengths of 78,507-79,059 bp and GC contents of 38.02%-38.15% (Table 1). The AT content at the third codon position (69.36%-69.73%) was higher than that at the second (61.75%-61.81%) and first positions (54.42%-54.48%, Table 1). These characteristics of CP genomes are different from those of nuclear and mitochondrial genomes. Moreover, these CP genome characteristics are consistent with earlier reports on I. tectorum [42], I. dichotoma [26], and I. domestica [26,45]. Thus, the sequencing conducted in the current study has enriched the CP genome data of Iris species and could serve as an essential source for species identification and phylogeny.
A total of 113 specific genes were annotated in each CP genome, including 79 CDS genes, 30 tRNA genes, and 4 rRNA genes (Table 2). The pseudogene ycf1 was found in all these species, whereas the pseudogene rps19 was found only in I. japonica. In these species, 19 genes (18 in I. japonica), including 7 (6 in I. japonica) CDS genes, 8 tRNA genes, and 4 rRNA genes, were repeated twice in IRs. Moreover, 15 genes, including 9 CDS and 6 tRNA genes, contained 1 intron, whereas 3 genes contained 2 introns (Table 2). The CDS lengths of I. tectorum, I. japonica, I. dichotoma, and I. domestica were 78,957, 78,507, 79,050, and 79,059 bp, respectively, and accounted for 51.52%, 51.50%, 51.45%, and 51.43% of the genome, respectively. In I. tectorum, the rRNAs were 9,050 bp long (5.91%), and the tRNAs were 2,878 bp long (1.88%). The lengths and proportions of rRNAs and tRNAs in I. japonica, I. dichotoma, and I. domestica are shown in Table S2. In addition, the noncoding regions, including introns, intergenic spacers (IGSs), and pseudogenes, constituted 40.69%, 40.67%, 40.79%, and 40.81% of the CP genomes of I. tectorum, I. japonica, I. dichotoma, and I. domestica, respectively (Tables 1 and 2 and Table S2). These observations revealed the similarities in genomic features among these four species, indicating a close relationship. (Table S3). Thus, the preferential codon usage patterns were similar among these four species, which was probably due to the codon usage bias toward A/T. These similarities in codon choice also reflect the close relationship among the four species. The observed codon pattern is consistent with the CP genomes of Amomum [68], Panax [69], Dipterygium and Cleome [70], and various other species [71][72][73].
3.3. SSR and Long Repeat Sequences. CP SSRs have been used as molecular markers in species authentication, population genetics, and phylogeny analysis owing to their high substitution rates [74][75][76].
A total of 59, 42, 58, and 56 SSRs were detected in the CP genomes of I. tectorum, I. japonica, I. dichotoma, and I. domestica, respectively (Table 3, Figure 3). The mononucleotide repeats of I. tectorum and I. japonica had no C/G type. All four species had one AACTT/AAGTT pentanucleotide repeat. Additionally, an AAAAT/ATTTT pentanucleotide repeat was present in I. tectorum and I. domestica, whereas none was seen in I. japonica and I. dichotoma. Moreover, I. tectorum, I. dichotoma, and I. domestica each had one specific pentanucleotide repeat (AAAAC/GTTTT, ACTAT/AGTAT, and AATAT/ATATT, respectively). The hexanucleotide repeat (AACAAG/CTTGTT) was found in all species except I. tectorum (Table 3). The analysis showed that A/T repeats made up nearly all mononucleotide SSRs, accounting for 100.0% in I. tectorum and I. japonica, 97.1% in I. dichotoma, and 97.0% in I. domestica. Moreover, A or T was the most frequent base in the SSRs, which is similar to the base preference observed in the CP genomes of Symplocos [77], Achnatherum [78], and other species [79,80]. These previous studies all compared closely related taxa. Therefore, the SSRs identified in this study might help resolve the relationships among closely related Iris species.
Long repeat sequences (F, P, R, and C types) are sequences ≥30 bp long and are generally located in the IGSs and introns; these repeat sequences are responsible for CP genome rearrangement and genetic diversity in populations and are used as sources to uncover phylogenetic relationships [81,82]. The current study analyzed the number of long repeats within Iris species (Figure 4). A total of 38, 34, 43, and 67 long repeats were identified in I. tectorum, I. japonica, I. dichotoma, and I. domestica, respectively. Most of the long repeats were F and P types, accounting for 97.37% in I. tectorum, 100.00% in I. japonica, 88.37% in I. dichotoma, and 77.61% in I. domestica. The 30-39 bp long F and P types were the majority in the Iris species: >50% for I. tectorum, I. japonica, and I. domestica and 44% for I. dichotoma. Moreover, the repeats with ≥70 bp were all F and P types. None of the species had a C repeat, and I. japonica had no R repeat. In addition, I. tectorum, I. dichotoma, and I. domestica had 1, 5, and 15 R types, respectively. The distribution of repeats in the Iris species was similar to that of Camellia [83], Saraca [84], and various other species [85][86][87]. These repeats, one source of variation in the CP genome, can be used to elucidate the phylogenetic relationships of Iris species.
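The reported F and P shares follow directly from the repeat counts given in this paragraph, because no C repeats were found and the only other type is R. The short Python check below reproduces the arithmetic; the names `LONG_REPEATS` and `forward_palindromic_share` are hypothetical, and the counts are copied from the text.

```python
# (total long repeats, reverse-type repeats) per species, as reported in the text.
LONG_REPEATS = {
    "I. tectorum": (38, 1),
    "I. japonica": (34, 0),
    "I. dichotoma": (43, 5),
    "I. domestica": (67, 15),
}


def forward_palindromic_share(total, reverse, complement=0):
    """Percentage of long repeats that are forward or palindromic."""
    return 100.0 * (total - reverse - complement) / total


shares = {
    species: round(forward_palindromic_share(total, reverse), 2)
    for species, (total, reverse) in LONG_REPEATS.items()
}
for species, share in shares.items():
    print(f"{species}: {share:.2f}%")
```

For example, I. tectorum has 37 F and P repeats out of 38, which gives 97.37%.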

Inverted Repeat Expansion and Contraction.
The comparison of boundaries in the CP genomes from I. tectorum, I. japonica, I. dichotoma, and I. domestica revealed highly conserved LSC/IR/SSC junction regions in the four species; however, variations were detected in the rps19, ndhF, and ycf1 genes (Figure 5). The rps19 gene was located 45, 34, and 45 bp away from the LSC/IRb boundary in I. tectorum, I. dichotoma, and I. domestica, respectively. In I. japonica, the rps19 gene extended 72 bp into the IRb region, creating the rps19 pseudogene in the IRa region. The ndhF gene crossed the SSC/IRb boundary in all species. Moreover, the ycf1 gene spanned the SSC/IRa boundary, resulting in a pseudogene in the IRb region that was 895 bp long in I. tectorum, 892 bp in I. japonica, and 893 bp in I. domestica and I. dichotoma. These observations suggest that the incomplete duplications at the boundaries probably caused the duplicated copies of the rps19 gene in I. japonica and the ycf1 gene in all four Iris species to lose their coding potential; these expansions in IR boundaries are consistent with those in Passiflora [88], Lagerstroemia [89], and various other species [90,91]. Interspecific variation arising from IR expansion will help distinguish closely related Iris species. [92][93][94]. The IR regions were more conserved, whereas the LSC and SSC regions were more divergent. Furthermore, the average Pi values [95,96] were calculated separately for the shared genes and IGSs to compare the DNA polymorphisms and identify the highly variable regions (Figure 7). The average Pi value of the gene regions was 0.00733 (Figure 7(a)), and that of the IGSs was 0.01629 (Figure 7(b)). Pi values were higher in the LSC and SSC regions than in the IR regions, similar to other plants, such as Handroanthus [97], Speirantha [98], and Combretaceae [99].
Consistent with earlier reports on other species, 13 mutational hotspots and highly divergent loci were identified in the SSC and LSC regions (Pi > 0.03 for IGS and Pi > 0.015 for gene regions), which are helpful for species authentication. The most divergent loci were trnG-UCC-trnR-UCU (Pi = 0.10078) and rpl16 (Pi = 0.0178) in the IGS and gene regions, respectively. Finally, the combination of the mVISTA plots (divergent regions indicated in white) and the Pi values screened two IGSs, ndhF-rpl32 (Figure 7(b), 11) and rps15-ycf1 (Figure 7(b), 13), and the rpl16 gene (Figure 7(a), 4). These regions, with large white areas in the mVISTA plots and high Pi values, will serve as potential DNA barcodes for Iris species authentication.
3.6. Phylogenetic Analysis. CP genomes have been used to determine evolutionary relationships [100][101][102][103][104]. In the present study, a ML tree was constructed using 27 whole CP genome sequences to determine the evolutionary relationships of I. tectorum, I. japonica, I. dichotoma, and I. domestica with S. angustifolium as the outgroup (Figure 8). The phylogenetic analysis revealed the relationships between I. tectorum and I. japonica and between I. domestica and I. dichotoma. Subg. Limniris was divided into two clades: I (sect. Limniris) and IV (sect. Lophiris). Here, sect. Limniris showed a sister relationship with three clades, comprising subg. Pardanthopsis (clade II), subg. Iris (clade III), and sect. Lophiris (clade IV), including I. tectorum and I. japonica. These three monophyletic clades (clades I, II, and IV) were highly supported (bootstrap 100%). Moreover, subg. Pardanthopsis was a sister to subg. Iris, including I. gatesii of sect. Oncocyclus (bootstrap value of 100%); I. domestica and I. dichotoma in clade II were closely related sister species. Additionally, I. domestica (OK448491, B. chinensis) was clustered with the other three I. domestica sequences. This finding was consistent with the findings of Goldblatt and Mabberley [18], Mavrodiev et al. [105], and Wilson [28] who indicated that B. chinensis is a synonym of I. domestica. In addition, two I. dichotoma sequences (previous and present) were clustered into a branch, similar to the two sequences of I. tectorum. These results mutually corroborated the accuracy of the sequences. Notably, the four species were separated into distinct groups. Thus, for the first time, the present study deduced the relationship among the four Iris species based on complete CP genomes following the ML method. These results are consistent with the molecular phylogeny by Wilson [28], Guo and Wilson [11], Kang et al. [26], and Xiao et al. [106] based on different plastid fragments. 
Thus, the phylogenetic analysis indicates that CP genomes can be used to verify the subdivisions of Iris species, especially at the subgenus and section ranks.
The ML tree based on common protein-coding sequences (Figure S4) was similar to that based on the complete CP genomes (Figure 8), except for two branches, i.e., the branch of I. pseudacorus, I. setosa, I. laevigata, and I. ensata and the branch of I. domestica and I. dichotoma. In detail, I. ensata was the most primitive taxon among the four species in both trees, but I. pseudacorus, I. setosa, and I. laevigata showed different relationships in the two trees. Meanwhile, I. domestica could be distinguished from I. dichotoma in the tree based on the complete chloroplast genomes, but the tree based on common protein-coding sequences could not differentiate I. domestica from I. dichotoma. The complete chloroplast genome has been commonly used as a superbarcode for species identification in studies such as those on Dipterygium and Cleome [70] and Zantedeschia [91]. In the present study, the result of species authentication based on complete CP genomes among four medicinal Iris species also proved the efficacy of superbarcoding. Complete CP genomes were more efficient than common protein-coding sequences for Iris species identification, probably because the intergenic regions of the complete chloroplast genome contain more variable regions [98,104].

Conclusions
The present research sequenced and analyzed the complete CP genomes of four Iris species, namely, I. tectorum, I. dichotoma, I. japonica, and I. domestica. CP genome sizes, GC contents, codon usages, SSRs, and long repeats were examined, and the genome conservation and differences among the four Iris species were compared. Furthermore, comparing these species' genomes with other Iridaceae species revealed a few variable regions; however, the use of these markers in DNA barcoding needs to be tested. The study also generated an ML phylogenetic tree that depicted the evolutionary relationship of Iris species and confirmed that B. chinensis is a synonym of I. domestica; however, the whole CP genomes of the 13 taxa of sect. Lophiris need to be included in one robust phylogenetic analysis. The study's findings confirm that CP genomes are a worthy genetic resource for identifying Iridaceae species and analyzing their phylogeny.

CP: Chloroplast
CDS: Protein-coding genes
SSR: Simple sequence repeat
Pi: Nucleotide diversity
subg: Subgenera
sect: Section
SSC: Small single copy
LSC: Large single copy
IR: Inverted repeat
NCBI: National Center for Biotechnology Information
RSCU: Relative synonymous codon usage
ML: Maximum likelihood
IGS: Intergenic spacers.

Data Availability
The data supporting the study's findings are publicly available in NCBI under the accession numbers MW201731, OK448491, OK448492, and OK448493. The associated data are available in Sequence Read Archive (SRA) under the BioSample, BioProject, and SRA numbers of Iris tectorum (SAMN17169715, PRJNA688136, and SRR13311445), Iris domestica (SAMN25087045, PRJNA798580, and SRR17692213), Iris dichotoma (SAMN25087046, PRJNA798580, and SRR17692212), and Iris japonica (SAMN25087047, PRJNA798580, and SRR17692211). The sequence data are available from https://dataview.ncbi.nlm.nih.gov/object/SRR13311445, https://dataview.ncbi.nlm.nih.gov/object/SRR17692213, https://dataview.ncbi.nlm.nih.gov/object/SRR17692212, and https://dataview.ncbi.nlm.nih.gov/object/SRR17692211. The accession numbers of the other sequences used in the present study are shown in Table S5; these sequences were retrieved from NCBI.
Figure S1: CP genome map of Iris japonica. Figure S2: CP genome map of Iris dichotoma. Figure S3: CP genome map of Iris domestica. Figure S4: ML tree constructed based on common protein-coding genes of 26 Iris species and S. angustifolium (outgroup). Bootstrap support value is shown at each node. Table S1: gene content and gene order in the chloroplast genomes of four Iris species.