Complete Chloroplast Genome Sequence of Justicia flava: Genome Comparative Analysis and Phylogenetic Relationships among Acanthaceae

The complete chloroplast genome of J. flava, an endangered medicinal plant in Saudi Arabia, was sequenced and compared with cp genome of three Acanthaceae species to characterize the cp genome, identify SSRs, and also detect variation among the cp genomes of the sampled Acanthaceae. NOVOPlasty was used to assemble the complete chloroplast genome from the whole genome data. The cp genome of J. flava was 150, 888bp in length with GC content of 38.2%, and has a quadripartite structure; the genome harbors one pair of inverted repeat (IRa and IRb 25, 500bp each) separated by large single copy (LSC, 82, 995 bp) and small single copy (SSC, 16, 893 bp). There are 132 genes in the genome, which includes 80 protein coding genes, 30 tRNA, and 4 rRNA; 113 are unique while the remaining 19 are duplicated in IR regions. The repeat analysis indicates that the genome contained all types of repeats with palindromic occurring more frequently; the analysis also identified total number of 98 simple sequence repeats (SSR) of which majority are mononucleotides A/T and are found in the intergenic spacer. The comparative analysis with other cp genomes sampled indicated that the inverted repeat regions are conserved than the single copy regions and the noncoding regions show high rate of variation than the coding region. All the genomes have ndhF and ycf1 genes in the border junction of IRb and SSC. Sequence divergence analysis of the protein coding genes showed that seven genes (petB, atpF, psaI, rpl32, rpl16, ycf1, and clpP) are under positive selection. The phylogenetic analysis revealed that Justiceae is sister to Ruellieae. This study reported the first cp genome of the largest genus in Acanthaceae and provided resources for studying genetic diversity of J. flava as well as resolving phylogenetic relationships within the core Acanthaceae.


Introduction
Justicia L. is one of the largest and the most taxonomically complex genus in Acanthaceae belonging to the tribe Justicieae consisting of ca. 600 species [1][2][3] typified by J. hyssopifolia (Sp Pl. 15 1753; Gen. Pl. ed. 5, 10 1754; nonsensu Nees 1832, 1847; nec sunsu Kuntze, 1891) distributed in the tropical and subtropical regions of the world, extending into warm temperate zones in Europe, Asia, and North America [4]. The genus is the type specimen of the tribe Justiceae and is characterized by a number of characters which includes corolla having 2 libbeds (bilobed upper lip and trilobed lower lip) and 2 bithecous stamens, one of the two named thecae above the other and the other one contains stur at the base [1,3]. Several authors [5][6][7] include small segregate genera in the genus while author [1] in his own view reported a new classification adding more taxa and defined the genus as having 16 sections based on some floral parts and seeds. Recent molecular studies reported that Justicia s. l (Graham 1988) is paraphyletic [8,9]. Lindau classified Justicia and several closely related species under a tribe that is characterized by androecium having 2 stamens and Knotchenpollen. There is a problem with this system of classification as reported by some student of Acanthaceae 2 BioMed Research International because some species in the tribe do not have this character while several taxa with the characters were classified in other tribes. However, there is still a problem in the status of the tribe Justiceae; [6] divided Ruellioideae into Justicieae, Ruellieae, and Lepidagathideae, whereas authors [10] in their study on phylogenetic relationship among Acanthaceae using trnL-trnF reported Ruellia and Justicia as separate lineages. Recently, [11] classified the member of the Justicieae as subtribe under the tribe Ruelloideae.
Justicia flava is among the endangered species in Saudi Arabia; the plant is widely used in traditional medicine in treating various ailments such as cough, paralysis, fever epilepsy, convulsion, and spasm and skin infection disorder. The roots are also reported to be used in treating diarrhea and dysentery [12,13]. As reported by [14] the plant has antimicrobial and antioxidant activity as well as wound healing activity. Despite the endangered nature and uses in traditional medicine of the species, the complete chloroplast genome of the species was not sequenced until this study.
Comparison of complete chloroplast genome provides very informative information for reconstruction of phylogeny and resolving evolutionary relationships issues at various taxonomic levels [15][16][17][18][19]. This is as a result of the conservative nature of the chloroplast genome [20]; this conservative nature is because the plastome evolves about half the rate of other genomes like the nuclear [21,22]. However, rearrangements in the sequence of chloroplast genome were reported by various plant chloroplast genome studies [19,[23][24][25]. These rearrangements occur as a result of contractions, expansions, and inversions in the single copy regions (large single copy and small single copy) and the inverted repeats [26,27]. The rearrangements of the genes and inversion in the chloroplast genome are reported to be useful in phylogenetic analyses to solve taxonomic problems at various taxonomic levels because they do not occur often and estimation of their homology and inversion event polarity is simple [22,[28][29][30][31]. With the importance of complete chloroplast genome in resolving phylogenetic relationship issues and the large number of genera and species in Acanthaceae only complete chloroplast genome of four genera has been so far reported (Andrographis paniculata (Burm.f.) Nees In this study, we reported the characteristics of the complete chloroplast genome of Justicia flava, the first cp genome in the largest genus of Acanthaceae, and compared the genomes of four Acanthaceae species to understand the variation among the cp genomes, report the simple sequence repeats to provide the tools for genetic diversity and identification of the species, and lastly, resolve the status of Justiceae.

Plant Material and DNA Extraction.
Plant material (vegetative and floral part) was collected through field survey of J. flava in Taif, Saudi Arabia, and identified based on the herbarium specimens and morphological features seen in relevant literatures, the voucher specimen was deposited in the herbarium of King Abdulaziz University, Jeddah, Saudi Arabia. Leaves were collected from the specimen for genomic DNA extraction. The genomic DNA was extracted using Plant Genomic DNA Kit according to manufacturer's protocol.

Library
Construction, Sequencing, and Assembly. A total amount of 1.0 g DNA was used as input material for the DNA sample preparations. Sequencing libraries were generated using NEBNext5 DNA Library Prep Kit following manufacturer's recommendations and indices were added to each sample. The genomic DNA is randomly fragmented to a size of 350bp by shearing, then DNA fragments were end polished, A-tailed, and ligated with the NEBNext adapter for Illumina sequencing and further PCR enriched by P5 and indexed P7 oligos. The PCR products were purified (AMPure XP system) and resulting libraries were analyzed for size distribution by Agilent 2100 Bioanalyzer and quantified using real-time PCR. The qualified libraries are fed into Illumina sequencers after pooling according to its effective concentration and expected data volume. The raw reads were filtered to get the clean reads (5 Gb) using PRINSEQ lite v0.20.4 [32] and were subjected to de novo assembly using NOVOPlasty2.7.2 [33] with kmer (K-mer= 31) to assemble the complete chloroplast genome from the whole genome sequence. ndhAx1-ndhAx2 intergenic spacer from Justicia flava (KY632456) was used as seed and the plastome sequence of Ruellia breedlovei (KP300014.1) was used as the reference. Finally, one contig containing the complete chloroplast genome sequence was generated.

Gene Annotation.
The programme DOGMA (Dual Organellar GenoMe Annotator, University of Texas at Austin, Austin, TX, USA) [34] was used to annotate the genes in the assembled chloroplast genome. The positions of start and stop codon were adjusted manually. trnAscan-SE server (http://lowelab.ucsc.edu/tRNAscan-SE/) [35] was used to verified the tRNA genes and finally, the plastome genome circular map was drawn using OGDRAW (Organellar Genome DRAW) [36]. The sequence of the chloroplast genome of J. Flava was deposited in the GenBank database with accession number (MK548577,).

Sequence
Analysis. MEGA 6.0 was used to analyze the relative synonymous codon usage values (RSCU), base composition, and codon usage. Possible RNA editing sites present in the protein coding genes of J. flava cp genome were determined using PREP suite [37] with 0.8 as the cutoff value.

Repeat Analysis in J. flava Chloroplast Genome.
Simple sequence repeats (SSRs) were identified in the J. flava chloroplast genome and genome of other three species of Acanthaceae using the online software MIcroSAtellite (MISA) [38] with the following parameters: eight, five, four, and three repeats units for mononucleotides, dinucleotides, trinucleotides, and tetra-, penta-, hexanucleotides SSR motifs, respectively. For analysis of long repeats (palindromic, forward, reverse, and complement) the program REPuter (https://bibiserv.cebitec.uni-bielefeld.de/reputer) [39] with default parameters was used to identify the size and location of the repeats in J. flava chloroplast genome and genome of other 3 species of Acanthaceae.
2.6. Genome Comparison. The program mVISTA [40] was used to compare the chloroplast genome of J. flava with the cp genomes of E. longzhouensis, R. breedlovei, and S. cusia using the annotation of J. flava as reference in the Shuffle-LAGAN mode [41]. The border region between the large single copy (LSC) and inverted repeat (IR) and small single copy SSC and inverted repeat (IR) junction were compared among the four species of Acanthaceae.

Characterization of Substitution
Rate. DNAsp v5.10.01 [42] was used to analyze synonymous (dS) and nonsynonymous (dN) substitution rate and dN/dS ratio to detect the genes that are under selection pressure; the chloroplast genome of J. flava was compared with the cp genome of E. longzhouensis, R. breedlovei, and S. cusia. Individual protein coding genes were aligned separately in Genious V 8.1.3; the aligned sequences were then translated into protein sequence.

Phylogenetic
Analysis. The complete chloroplast genome of three Acanthaceae and five species from the order Lamiales were downloaded from Genbank. The downloaded sequences were aligned with sequenced cp genome of J. flava using MAFFT v.7 [43]. The data were analyzed using Maximum Parsimony (PAUP version 4.0b10) [44] using heuristic searches with 1000 replicates of random taxon addition, tree bisection-reconnection branch swapping, MulTrees on, saving a maximum of 100 trees each replicate. Missing characters were treated as gaps. Support was assessed using 1000 replicates of nonparametric bootstrap analysis. Bayesian analysis was carried out using MrBayes version 3.2.6 [45]. jModelTest version 3.7 [46] was used to select the suitable model.

Results and Discussion
3.1. Characteristics of J. flava Chloroplast Genome. The complete chloroplast genome of J. flava is a circular molecule and has quadripartite structure; the genome was found to have 150, 888bp in length. The genome is divided into four regions, namely, large single copy (LSC), small single copy (SSC), and two inverted repeats (IRa and IRb). The coding region is 98, 671bp in length and constitutes 65.39% of the genome; the rest of 52, 217 bp is the intergenic spacer region including intron (34.60%). The LSC and SSC regions possessed 82, 995bp and 16, 893bp, respectively, the inverted repeats IRa and IRb have 25, 500bp and are separated by SSC region ( Figure 1 and Table 1). The organization and structure of the J. flava cp genome are similar to other sequenced Acanthaceae cp genomes [47,48]. The percentage of occurrence of AT and GC content in the genome showed that the LSC, SSC, IRA, and IRB regions possessed 63.8% and 36.2%, 67.8% and 32.2%, 56.7% and 43.4%, and 56.7% and 3.4, respectively. The whole chloroplast genome has AT content of 61.8% and GC content of 38.2%; this is similar to cp genome of Strobilanthes cusia [49]. The GC content in the inverted repeat is found to be higher than the single copy regions both the LSC and SSC.
The result of the genes annotation in the chloroplast genome of J. flava revealed a total of 132 genes among which 113 are unique; the remaining 19 are duplicated in the inverted region. The genome harbored 80 protein coding genes, 30 tRNA genes, and 4 rRNA genes ( Figure 1 and Table 2). The numbers and orientation of the genes in the cp genome are the same as other cp genomes of Acanthaceae [47,48]. The inverted repeat region contained eight protein coding genes, seven tRNA, and four rRNA while in the single copy region, the LSC contained 61 protein coding genes and 22 tRNA genes; the rest of 12 protein coding genes and 1 tRNA are located within the SSC region. Almost all the protein coding genes start with the ATG codon that code for methionine whereas some of the genes start with codon like ATC, GTG and ACG; this is common in most flowering plant (angiosperms) chloroplast genome [49][50][51].
The J. flava chloroplast genome is found to contain intron in some of the protein coding and tRNA genes, like other chloroplast genomes of angiosperms [49,50]. There are 14 genes that contain intron out of the 113 different genes (Table 3); among the 14 genes 8 are protein coding genes while the remaining six are tRNA genes (Table 3). Four genes that have the intron, namely, rpl2, ndhB, trnI-GAU, and trnA-UGC are located in the IR region while the remaining 12 are located in the LSC region. Only two genes ycf3 and clpP have two introns, the other 12 have only one intron, and this is also seen in S. cusia [49]. The tRNA, trnK-UUU has the longest intron of 2460 bp (Table 3); this is as a result of position of the matK gene in the intron.
The codon usage bias in the plastome was computed using the protein coding genes and tRNA genes nucleotide sequences 89, 377bp. The relative synonymous codon usage of each codon in the genome is presented in (Table 4); the result revealed that all the genes are encoded by 29, 790 codons. Codons, coding for the amino acids Leucine, are the most frequent codons 3,329 (11.7%) ( Figure 2), similar to that of Ailanthus altissima [52], whereas codons coding for Trp are the least 570 (1.91%) in the genome. G-and C-ending are found to be more frequent than their counterpart A and T; this is not the case in other plastomes sequences [53][54][55]. The result of the analysis (Table 4) showed that codon usage bias is low in the chloroplast genome of J. flava. The RSCU values of 29 codons were >1 and all of them have A/T ending while for 30 codons were <1 and are all of G/C ending. Only two amino acids Tryptophan and Methionine have RSCU value of 1; therefore they are the only amino acids with no codon bias.
The RNA editing site in the J. flava chloroplast genomes was predicted using the program PREP suite; all the analysis was done using the first codon position of the first nucleotide. The result (Table 5) shows that the majority of the conversion in the codon positions is from the amino acid Serine to Leucine (Table 5). In all, the programme revealed 61 editing sites in the genome; the editing sites are distributed among 18 protein coding genes. As reported in previous researches [56][57][58] the ndhB gene has the highest number of editing  )  cp genome  31  19  31  19  150888  LSC  32  19  31  18  82995  SSC  34  17  34  15  16893  IRA  28  23  28  21  25500  IRB  28  21  28  23  25500  1st Position  31  20  30  19  50296  2nd Position  31  19  31  19  50296  3rd Position  31  20  30 19 50296       9 reverse repeats, and only one complement repeat ( Table 6). Majority of the repeats size is between 20 and 29bp (46.93%), followed by 10-19bp (40.81%), whereas 30-39bp and 40-49bp are the least with 8.16% and 4.08%, respectively. In total, there are 49 repeats in the chloroplast genome of J. flava.
In the first location the intergenic spacer harbored 65.30% of the repeats; this has also been reported in cp genome of Fagopyrum dibotrys [59]. tRNA contained 8 repeats (16.32%); the remaining 9 repeats (18.36%) are located in the protein coding genes specifically atpB, psaB, ndhC, ycf1, and ycf2. Within the protein coding genes ycf2 contained 1 reverse and 2 palindromic and forward repeats.
We compared the frequency of repeats among four Acanthaceae cp genomes and found that all the types of repeats (palindromic, forward, reverse, and complement) are present in all the genomes (Figure 3). S. cusia has the highest frequency of palindromic repeats (23) while J. flava has the lowest with (16). R. breedlovei and S. cusia have the same number of forward repeats 15 each, and number of reverse repeats is the same in the genome of J. flava and S. cusia (Figure 3). Complement repeats are found to be the less type of repeat in the genome with E. longzhouensis and J. flava having 1 and the other two species having 3 each.

Simple Sequence Repeats (SSRs).
There are short repeats of nucleotide series (1-6 bp) that are dispensed all through genome called microsatellites (SSRs). This short repeat in plastid genome is passed from a single parent. As a result, they are used as molecular indicators in developmental studies such as genetic heterogeneity and also contribute in recognition of species [60][61][62]. The sums of 98 microsatellites were found in plastid genome of J. flava in this study ( Table 7). Majority of SSRs in the cp genome are mononucleotide (83.67%) of which most are poly T and A (Figure 4). Poly T (polythymine) constituted 58.53% whereas poly A (polyadenine) 40.24%; this is consistent in previous studies [63,64]. Only a single poly C (polycytosine) 1.21% is present in the genome whereas 2 poly G (polyguanine) is found in the genome. Among the dinucleotide only AT/AT is found in the genome. Reflecting series complementary, three trinucleotide AAG/CTT, ATC/ATG, and AAT/ATT, five tetra AAAC/GTTT, AAAG/CTTT, AAAT/ATTT, AATC/ATTG, and AATT/AATT, and two penta AATAC/ATTGT and AATTC/AATTG were discovered in the genome while no hexanucleotide repeat is present (Figure 4). The intergenic spacer region harbored most of the microsatellite (62.24%)  than the coding region (33.67%) ( Figure 5). Most but not all the repeats (70.40%) were detected in the LSC region and the SSC region incorporates the least number of repeats in the genome.

/ G A T / A T A A G / C T T A A T / A T T A T C / A T G A A A C / G T T T A A A G / C T T T A A A T / A T T T A A T C / A T T G A A T T / A A T T A A T A C / A T T G T A A T T C / A A T T G
The frequency of SSR among the cp genome of the four species was also compared ( Figure 6); the comparison showed that mononucleotide occurs more frequently across all the genomes. E. longzhouensis had the highest number of mononucleotides and pentanucleotides with 115 and 9, respectively, but had the lowest number of tetranucleotides with 2. Pentanucleotide is not present in cp genome of R. breedlovei while possessing hexanucleotide which is not present in the remaining 3 species.

Comparative Analysis of J. flava Chloroplast to Other
Acanthaceae Genomes. The complete chloroplast genome of J. flava was compared with three chloroplast genomes of Acanthaceae available in the Genbank, namely, R. breedlovei, S. cusia, and E. longzhouensis. To examine the degree of DNA sequence divergence among the species of Acanthaceae chloroplast genome, the programme mVISTA was used to align the sequences using annotation of Justicia flava as reference. The result of the alignment showed that the genomes highly conserved with some degree of divergence. The inverted repeat regions are more conserved than the single copy regions, the large single copy and small single copy; on the other hand the protein coding genes were found to be more conserved than the noncoding region, particularly the intergeneric spacer. The nonprotein coding regions that showed high rate of divergence across the genome are trnH-GUG-psbA, rps16-trnQ, trnC-petN, accD-psaI, clpP intron, trnL-trnF, rps15-ycf1, rps12-trnV, and trnL-trnA among others ( Figure 7). For the protein coding genes the following genes showed a little sequence variation among the genomes atpE, atpF, rbcL, petA, psbL, petB, and ycf2. The chloroplast genome of angiosperms is reported to be conserved in terms of structure and size [65]; despite the conserved nature there is slightly variation in size and the boundaries of inverted repeats and single copy regions due to evolutionary events such as expansion and contraction in the genome [66,67]. The comparisons between IR-LCS and IR-SSC boundaries in the four cp genome of Acanthaceae (Justicia flava, Echinocactus longzhouensis, Ruellia breedlovei, and Strobilanthes cusia) are shown in (Figure 8). The result showed that there is slightly variation among the compared cp genomes ( Figure 8); four genes, namely, rps19, ndhF, ycf1, and trnH, were located in the junction of inverted repeats and single copy region of J. flava and E. longzhouensis genome with slightly variation in number of base pairs in the borders (Figure 8).Two genes, ndhF and ycf1, are found in the IRb/SSC border among all the four genomes. The IRb/LCS border of R. breedlovei is unique by having ycf2 gene with 390 bp in LSC and 6800 bp in IRb whereas Strobilanthes cusia genome also has unique structural variation by having trnH in IRa and IRb. The ndhF was found to have 70 bp, 59 bp, and 44 bp in the IRb region in J. flava, E. longzhouensis, R. breedlovei, and S. cusia, respectively, whereas the trnH of J. flava and E. longzhouensis starts at exactly IRa/LSC border while the tRNA is 76 bp away the border in R. breedlovei genome.

Divergence of Protein
Coding Gene Sequence. The rates of synonymous (dS) and nonsynonymous (dN) substitution and dN/dS ratio were calculated to detect the selective pressure among the 78 protein coding genes in the cp genome of four Acanthaceae species. The results showed that the dN/dS ration is less than 1 in most of the paired genes except petB, psaI, and ycf1 of J. flava vs. S. cusia having 1.52, 1.24, and 1.12, respectively ( Figure 9). Two genes are also found to be greater than 1 in J. flava vs. R. breedlovei atpF, clpP, psaI, and rpl32. petB, rpl16, and psaI genes of J. flava vs. E. longzhouensis are greater than 1 as well. This indicates that most of the genes were under negative selection; only few undergo positive selection. The synonymous (dS) values in all the genes range from 0.02 to 0.44 ( Figure 9). The genes, atpH, petG, petN, psaC, psaJ, psbE, psbF, psbI, psbT, and rps12, showed no nonsynonymous change occurs in the cp genome of the four species of Acanthaceae.

Phylogenetic
Analysis. The phylogenetic relationship within the four species of Acanthaceae was reconstructed using the complete chloroplast genome. The tree from the Bayesian Inference and Maximum Parsimony is congruent with strong support in all the nodes PP, 1.00, and MP, 100. All the species of Acanthaceae sampled clustered in one clade with strong support (Figure 10), as reported by [68]. The result showed that the tribe Justicieae is sister to Ruellieae; this relationship has also been reported by [69] and they should be regarded as independent tribe not as Justiceae under the tribe Ruellieae as proposed by [11] Data Availability (1) The complete chloroplast genome sequence can be found in Genbank with accession no. MK548577 after publishing the article. (2) The data used to support the findings of this study are available from the corresponding author upon request.    accD  atpA  atpB  atpF  atpE  atpH  atpI  ccSA  cemA  clpP  infA  matK  ndhA  ndhB  ndhC  ndhD  ndhE  ndhF  ndhG  ndhH  ndhI  ndhJ  ndhK  petA  petB  petD  petG  petL  petN  psaA  psaB  psaC  psaI  psaJ  psbA  psbB  psbC  psbD  psbE  psbF  psbH  psbI  psbJ  psbK  psbL  psbM  psbN  psbT  psbZ  rpl2  rbcL  rpl14  rpl16  rpl20  rpl22  rpl23  rpl32  rpl33  rpl36  rpoA  rpoB  rpoC1  rpoC2  rps2  rps3  rps4  rps7  rps8  rps11  rps12  rps14  rps15  rps16  rps18  rps19  ycf1  ycf2  ycf3