The Whole Genome Assembly and Comparative Genomic Research of Thellungiella parvula (Extremophile Crucifer) Mitochondrion

The complete nucleotide sequences of the mitochondrial (mt) genome of an extremophile species Thellungiella parvula (T. parvula) have been determined with the lengths of 255,773 bp. T. parvula mt genome is a circular sequence and contains 32 protein-coding genes, 19 tRNA genes, and three ribosomal RNA genes with a 11.5% coding sequence. The base composition of 27.5% A, 27.5% T, 22.7% C, and 22.3% G in descending order shows a slight bias of 55% AT. Fifty-three repeats were identified in the mitochondrial genome of T. parvula, including 24 direct repeats, 28 tandem repeats (TRs), and one palindromic repeat. Furthermore, a total of 199 perfect microsatellites have been mined with a high A/T content (83.1%) through simple sequence repeat (SSR) analysis and they were distributed unevenly within this mitochondrial genome. We also analyzed other plant mitochondrial genomes' evolution in general, providing clues for the understanding of the evolution of organelles genomes in plants. Comparing with other Brassicaceae species, T. parvula is related to Arabidopsis thaliana whose characters of low temperature resistance have been well documented. This study will provide important genetic tools for other Brassicaceae species research and improve yields of economically important plants.


Introduction
Plant mitochondrial genome contains unique features, including large-scale sequence, frequent recombination, and merging foreign DNA [1]. Mitochondria are significant to life and play an important role in the cell. Mitochondria tackle oxidative phosphorylation very efficiently which is related to energy releasing in the electron transport and each molecule of glucose oxidized can produce about 30 molecules of ATP; however, only two molecules of ATP are produced through nuclear-controlled glycolysis. Toxic reactive oxygen species, a byproduct produced by mitochondria through oxidative phosphorylation increasing with age, can damage proteins, DNA, lipids, and their production [2]. As a typically active organelle with regularly undergoing fission and fusion, mitochondria have been thought to play an important part in mitochondrial outer membrane permeabilization process [3]. Another important feature of mitochondrial genome is its maternal inheritance, which shows its difference with nuclear DNA [2]. In plants, nuclear genes carry out relative copies of molecules of recombination-derived subgenomic DNA from mitochondria [4]. Additionally, a large set of mitochondria ribosomal proteins moves rapidly to the nucleus in certain angiosperm groups [5]. Although the evolution of plant mitochondrial genome sequence is slow, its structural evolution is swift [6]. The reported Arabidopsis thaliana mitochondrial genome sequence gives a more comprehensive and detailed view of internal duplication and foreign DNA uptake has 2 International Journal of Genomics influenced the size, structure, and evolutionary potential. Evolutionary surprises about plant mitochondrial genomes will continue to be brought out. However, within the high rate of mutation, only a very small part of plant mitochondrial genomes can be assembled.
The Brassicaceae family contains many species that are able to grow well in extreme conditions. Their stress tolerance mechanisms have been found in these species which are able to survive in multiple abiotic stresses environments including freezing temperature and high salinity. The abiotic stresses are disadvantageous to improve the yield of plants, so in order to improve productivity of economically important plants such as winter rape (Brassica napus), mustard (Brassica juncea), cabbage (Brassica oleracea), and radish (Raphanus sativus), it is crucial to study how crucifers withstand natural stress. T. parvula is a species in the Brassicaceae (subclade Eutremeae), and its whole genome has been reported (GenBank ID: AFAN00000000.1) and is similar to Arabidopsis thaliana. So far, Arabidopsis thaliana's low-temperature-resistant ability has been well documented, which is also another advantage to study native crucifers [7]. Thellungiella parvula mitochondrion plays significant roles in ATP synthesis, metabolism, and stress response. Because of its efficient mobilization of resources in poor or degraded soils, T. parvula perform better not only under salt conditions, but also in cold and freezing circumstance. T. parvula can live in the environment of 5 ∘ C, which means its temperature-sensitive processes such as seed germination and the production of pollen and seeds were resistant to cold temperatures. Moreover, the land plant shows dramatically different flowering phenotypes while shifting to cold environment [7]. It is expected that the research of the T. parvula mitochondrial complete genome sequence would serve as a reference and a useful genetic resource to the future research.

T. parvula Mitochondrial Genome Content and
Organization. Thellungiella parvula mitochondrial genome was assembled from the whole genome sequence using Roche-454 Sequencing technologies. The circular mitochondrial genome of T. parvula is 255,773 bp in size (GenBank accession number KT988071), with the overall base composition of 27.5% A, 27.5% T, 22.7% C, and 22.3% G in descending order and A + T content of 55.0% (Table 1). The overall GC content of the T. parvula mitogenome is 45%, which is similar to Arabidopsis thaliana, the first published mitochondrial genome sequence of the Brassicaceae family, originally from NCBI (NC 001284).
Sequencing information such as uniquely aligned reads and average read depth of gene was shown in Table 6. Among all the 25 contigs, the longest is 73000 bp in length; its length value and N50 value are relatively large, suggesting that these sequencing data will be accurate and useful to assemble. Also, the encoding length of the largest contig is 51610 bp which accounts for 71%. Mapping these contigs to the completed mitochondrial genome sequence results of Arabidopsis thaliana showed that 20 of the contigs were assembled only once, while the other 5 contigs were linked twice or three times. Some certain sequences also have multiple copies in the mitogenome by analyzing these sequence results. Contig 1 was used at the beginning and we linked other contigs sequentially according to the connecting map ( Figure 1). Finally, the T. parvula single master circle of mitochondrial genome covering the great mass of the contigs was generated with a complete length of 255773 bp. Using web-based tool Public MITOFY Analysis [5] and tRNA scan-SE, 54 genes were identified, including 3 ribosomal RNA (rRNA) genes (5S rRNA, 26S rRNA, and rrnS), 19 transfer RNA (tRNA) genes, and 32 protein-coding genes (PCGs) ( Table 2) (Figure 1). In addition, the positions of these genes in the T. parvula mitochondrial genome are shown in Table 1. Among genes of known function, trnE-TTC has two copies. By comparison with other Arabidopsis thaliana, the mitochondrial genome of T. parvula has the same types of function genes as crucifers.
The percentage of genes of known function is 11.5% (Table 7), 32 protein-coding genes in this mitogenome contain a total of 29,118 bp, and 9,706 codons are coded by nucleotide. Analyzing all the 32 protein-coding genes, the nad5 is the longest (2009 bp) and the shortest is the nad3 (119 bp). In these codons, 157 (1.62%) encode cysteine and 1071 (11.06%) encode leucine, which are the least and the most prevalent amino acids separately. AUU and UGC are used frequently to encode leucine and cysteine, respectively. Furthermore, nine genes have introns: four genes (rpl2, ccmFn1, ccmFc, and cox2) have a single intron, two genes (nad4, nad2) have three introns, and three genes (nad1, nad5, and nad7) have four introns. The complete NADH dehydrogenase subunit 2 (nad2) in T. parvula mitochondrial genome is encoded by five exons. These exons are located in two distant locations of the mitochondrial genome. One genomic region encodes exons a and b; the other encodes exons c, d, and e. As a result, the trans-splicing reactions are required to link exons b and c [8,9].
What is more, the gene ccmFN in T. parvula mitochondrial genome is divided into two reading frames. The    tRNA Trp , tRNA Gln , and tRNA Asn ) and seven protein-coding genes (ccmC, Nad6, CcmFn-2, Cox2, ccmB, Atp1, and Rps7) are encoded on the H-strand; others are encoded on the L-strand (Table 2) [10]. Three rRNA genes (rrn5, rrn26, and rrnS) are highly conserved between the two mitochondrial genomes of Brassicaceae.

Repeat Structure and Sequence Analysis.
A repeat is defined as a pair of sequences which has over 90% similarity [1]. Overall, 53 repeats are identified in the mitochondrial genome of T. parvula, including 28 tandem repeats (TRs), 24 direct repeats, and one palindromic repeat. An amount of 28 TRs is recognized in T. parvula, which are fewer than A. thaliana (48 TRs) ( Table 3). Among these repeats, the size of these tandem repeats is 6-69 bp. There are four long TRs (>100 bp) (with copy numbers of 51, 51, 69, and 69 separately) located in the intergenic spacer regions of nad3-trnK-TTT, nad5-nad1, cox1-1-trnN-cp, and cox1-1-trnN-cp of the T. parvula mitochondrial genome (Table 3). Tandem repeated sequences are allocated unevenly; repeatrich sequences are widely concentrated near the central and the end regions.
We used Blastn to align the T. parvula mitochondrial genome to locate and identify disperse repeats which has a minimal length of 100 bp and more than 90% identity between the two repeat copies, including forward repeats, inverted repeats, and palindromic repeats. As a result, the repeats in the Thellungiella parvula mitochondrial genome include sixteen forward repeats and eight inverted repeats (disperse repeats). We paid more attention to the 24 large repeats because they are related to reversible genomic structural changes. The large repeat of T. parvula mitochondrial genome is 10,084 bp in size, accounting for 3.9% of the whole mitogenome. Sixteen forward repeats and eight inverted repeats were among 106 and 453 bp (Figure 3). Most of these  repeats are located in the no-protein-coding area; however, two inverted repeats are in the protein-coding regions and rRNA gene, respectively. These two genes in mitogenome that participated in energy metabolism probably have important effects on T. parvula. What is more, the length of palindromic repeat is 32 bp, which is distributed in the intergenic spacer 6 International Journal of Genomics regions of ccmFc and cox2. The disperse repeats research will offer significant information for the future population phylogeny study.
Simple sequence repeats (SSRs), also known as microsatellites, are usually 1-6 bp in length per unit and are widely distributed over the mitochondrial genome [11]. With a higher degree of polymorphism, codominant microsatellites are used as molecular marker assisting breeding [12], population genetics [13], genetic linkage map construction, and gene mapping [14]. A total of 199 simple sequences repeats were detected in the T. parvula mitochondrial genome, of which 4 are pentanucleotides, 6 are trinucleotides, and 189 are others. These SSRs have a high A/T content (83.1%) and the 3 dinucleotides are composed of AT ( Table 4). The higher A/T content in SSRs of mitochondrial genome also contributes to a slight bias of 55% AT in the T. parvula mitogenome. Moreover, it is clear that SSRs distributed in the proteincoding areas are much fewer than noncoding regions, which account for 9.06% and 90.94%, respectively. Additionally, 32.01% of SSRs exist in the introns and 58.93% of SSRs exist in the intergenic spacer regions. However, SSRs account for only 0.1% of the whole protein-coding areas, which may further indicate an uneven distribution of SRRs within the mitochondrial genome of T. parvula.

Comparison with Other Brassicaceae Species. T. parvula
completed mitochondrial genome sequence provides a unique view of mitochondrial genome structure, organization, and gene complement. Comparing with other Brassicaceae species, we conclude that the T. parvula has a similar mitochondrial sequence organization to that of most species, such as Arabidopsis thaliana (GenBank ID: Y08501.2) which is unquestionably a kind of stress-sensitive crucifers, so it is meaningful to compare its mitogenome with that of Arabidopsis thaliana. From the web-based tool Public MITOFY Analysis, the T. parvula mitochondrial genome has 5 ATPase and 9 NADH which their numbers are equal to A. thaliana mitochondrial genome, but ribosomal proteins are slightly higher in T. parvula [4] than that in A. thaliana [3,5]. Also, it may reflect the different environment pressures and habitats between the species adapted. Ribosomal proteins S1 (rps1), rps10, and rps11 have been completely lost in both species of the mitochondrial genome, which may indicate that genes transfer to nuclear compartment. In general, the base content of the A. thaliana mitochondrial genome is as follows: A (27.9%), T (27.3%), C (22.6%), and G (22.2%), which demonstrate a slight bias of A + T (55.2%) rich feature similar to that in T. parvula (Table 5). Except for mttB and nad1 genes employing the ACG start codon, most of mitochondrial protein-coding genes of A. thaliana initiate with ATG; the usage of the stop one is rather complete or incomplete [15]. Four types of stop codons are used by the coding genes: TAA (atp4, atp6, ccmB, ccmC, ccmFc, cox1,  nad1, nad2, nad3, nad4L, nad9, rpl5, rpl6, rps4, rps7, and  rps14), TAG (atp1, matR, sdh4, nad7, and rpl2), TGA (atp8,  atp9, ccmFn, cob, cox2, cox3, nad4, nad6, and rps12), and incomplete stop codons T-for rps2, rps19. Approximately, a total of 1098 and 646 tandem repeats were found in the A. thaliana and T. parvula genomes, respectively. In our initial analysis, this halophyte, with a genome about 30% smaller than that of A. thaliana, shows striking difference in gene complement. The differences are partly due to tandem duplications in A. thaliana of single copy genes in T. parvula and some genes amplification which may have known or assumed functions in stress defense responses [16].

Phylogenetic Analysis.
A phylogenetic context was based on amino acid sequences of the 15 protein-encoding genes (atp1, atp9, ccmB, cob, cox1, cox3, nad1, nad3, nad4, nad4L,  nad6, nad7, nad9, rps3, and rps4) (Figure 4). We got these twenty complete genome sequences from the GenBank of NCBI Organelle Genome Resources database (http://www.ncbi.nlm.nih.gov/genome/organelle/). Among these species, 20 are angiosperm species and one is a gymnosperm plant (Cycas taitungensis) that was set as an outgroup. Twenty angiosperm plants are divided into two major clades: monocotyledons and dicotyledons plants. Bootstrap ( Figure 5) analysis shows that there are 12 out of 18 nodes with bootstrap values >90%, and 8 of these have a bootstrap value of 100%. Those strong lineages conclude their sister relationship to Caricaceae and Brassicaceae (100% bootstrap values). In the NJ tree, T. parvula is near to Arabidopsis thaliana; although the bootstrap value is only 82%, the results are consistent with previous molecular studies [16]. Through the phylogenetic analyzing, the mitochondrial genome of T. parvula is evolutionarily close to that of Brassica napus, Brassica Rapa subsp. campestris, Brassica juncea, Raphanus sativus, and Brassica carinata. Then, the mitogenomes of Carica papaya and Zea perennis belong to Caricaceae and Gramineae, separately.
In the plants evolution, the numbers of protein-coding genes in mitochondrial genomes tend to decrease [17]. As a representative of gymnosperms, the mitochondrial genome of Cycas taitungensis has conserved almost all genes inherited from the seed plants, which was mitochondrial genome progenitor [18]. In contrast, Silene latifolia has lost all of the ribosome coding genes except for rpl5 and other genes like the sdh [19]. Succinate dehydrogenase genes are usually missing. Almost all the monocotyledons and dicotyledons plants have lost rps11 gene. This loss in the mitochondrial genome probably occurred in the early stage of the divergence process between monocotyledons and dicotyledons. The proteincoding gene like mttB that encodes a transport membrane protein in T. parvula, Brassica napus, and Brassica carinata is an exceptional losing one. More specially, contrasting with Cycas taitungensis, other twenty plants like Thellungiella parvula, Brassica carinata, Brassica napus, and Zea perennis have lost rpl10, sdh3, and sdh4 genes, which can be inferred as rpl10, sdh3, and sdh4 gradually developed into pseudogenes during the evolution of angiosperms. The loss of rps2 in sixteen dicotyledons plants (Brassica carinata, Brassica juncea,     absent in clades of angiosperm mitochondrial genomes, which can be considered unnecessary to some extent [19]. Some chloroplast tRNA genes through random intercellular transfers have been inserted to mitochondrial genomes [20]. Many evolutionarily original species still maintained ancient transfers to some degree; meanwhile, new transfers exist in individual genera or species. These chloroplast genome genes trnH, trnM, and trnS (GGA) are regularly found in the mitochondrial genomes of C. taitungensis and the nineteen species mitogenomes ( Figure 4: A phylogenetic context based on amino acid sequences of the fifteen protein-encoding genes (atp1, atp9, ccmB, cob, cox1, cox3, nad1, nad3, nad4, nad4L, nad6, nad7, nad9, rps3, and rps4) of twenty-one species (Brassica napus, Brassica rapa subsp. campestris, Brassica juncea, Raphanus sativus, Brassica carinata, Thellungiella parvula, Arabidopsis thaliana, Carica papaya, Vigna radiata, Citrullus lanatus, Cucumis sativus, Cucurbita pepo, Nicotiana tabacum, Silene latifolia, Beta vulgaris subsp. vulgaris, Beta vulgaris subsp. maritima, Oryza sativa Indica, Sorghum bicolor, Tripsacum dactyloides, Zea perennis, and Cycas taitungensis). These conserved genes were aligned with ClustalW and the phylogenetic tree was constructed using the Neighbor-Joining method in MEGA 6. in the phylogenetic tree (Figure 4), many mitochondrial-like tRNA genes may be lost in various plants, but the trnE, trnP, and trnR genes exist stably in mitochondrial genomes of these seed plants. Additionally, according to the analyzing of mitogenome sequence and phylogenetic context (Figure 4), seed plants' mitochondrial genomes are inserted into trnH but disappeared in the mitogenome of Tripsacum dactyloides.
Comparing with the typical gymnosperm Cycas taitungensis, which has the largest numbers of mitochondrial-like genes, the T. parvula mitochondrial genome has lost 10 genes, including seven protein-coding genes (rps1, rps10, rps11, rps13, rps14, rps19, and sdh3) and three tRNA genes (tRNA-Phe, tRNA-Arg, and tRNA-Val) ( Table 2). Similarly, these 10 genes, which are lost in T. parvula, cannot be found in its related species Arabidopsis thaliana. As a result, these genes are not essential in gene expression or they can be duplicated by other genes [21]. However, other lost mitochondrial genes may be considered as some functional losses [22]. Furthermore, this may imply that gene loss will be regarded as the mitogenome evolutionary compaction in seed plants [23].

Discussion
Plants mitochondrial genome sequences are increased rapidly with the availability of sequencing technology [24]. Crops, a group that includes putative keystone species [25], are chosen as an obvious target to satisfy the most urgent desire to study and improve agronomically important species [26]. These plants are expanding potential comparisons at the biochemical and physiological level in particular environments with other existing genomic and genetic models [27]. As a result, T. parvula is such a plant. On one hand, it resembles the prototypical model Arabidopsis thaliana in development and evolution [28]. On the other hand, it is a plant with halophytic characters and exceptionally high mobilization of extreme environment, including salinity, cold, freezing temperatures, and the ability to grow in degraded soils [29]. With the T. parvula mitochondrial genome sequence presented in this study and also with reference to whole genome sequence of T. parvula [16], we will enlarge the exploration of genome sequence that favored extremophile adaptations in further researches.
The reported Brassicaceae species including T. parvula and Arabidopsis thaliana are 255773 bp and 366924 bp in size, respectively, both of them with a little smaller mitochondrial genome among higher plants [6]. As expected, genes contents of these mitochondrial genomes and the completed sequences are highly conserved between T. parvula and A. thaliana. Large repeats numbers (>1 kb) are less in Brassicaceae (Arabidopsis and Thellungiella) than in other species [30]. The repeat sequence detected in the T. parvula mitochondrial genome could be used as molecular marker and in population genetics studies, genetic linkage map construction, and gene mapping. Phylogenetic analysis also proved the two clades of Angiospermae and the isolation of Magnoliopsida with Liliopsida. Moreover, it proved the place of T. parvula in crucifer. So far, it has also been reported that the mitochondrial genomes of these crucifers do not contain much nuclear sequence plastid genomes, which can be one of the reasons why Brassicaceae species mitochondrial genomes are small. It seems that pathways and functions related to stress could be different in evolutionarily stress-tolerant plants ofBrassicaceae [31]. As a result, the mitogenome of T. parvula will be a useful tool in exploring extreme condition tolerance and mechanisms of adaptive evolution. Additionally, within these differences, we expect to learn the unique lifestyle of T. parvula and its adaptation to essential ecological niche.
The accomplishment of mitochondrial genome sequencing of both model plant and important crops in succession enlarged the higher plant mitochondrial genome database. However, there are still untypical plant mitogenomes. For instance, Citrullus melo L. has a large mitochondrial genome [32]. The mitogenome of Geraniaceae has the higher rate of nucleic acid replacement compared with homologous species [33]. These untypical mitochondrial genomes can also provide clues to the higher plants mitochondrial genome evolution mechanism. How the unknown mitochondrial genome sequences have emerged remains unsolved. We can carry out a research using mutants of crucifers related to genome distribution and the integrity of genome structure on the problems above in the future.

Plant Material and DNA Extraction.
The original data are from GenBank of NCBI Genome Resources database (http:// www.ncbi.nlm.nih.gov/genome?term=Thellungiella+parvula% 5Borgn%5D&cmd=DetailsSearch). In their research, 10-dayold seedlings of T. parvula isolated total DNA are applied. The seeds were obtained from a single plant. At the collection site, the soil bulk density was 1.225 g/cm 3 with 32.4% salts by weight [16]. By using the Nucleon Phytopure Genomic DNA Extraction kit (GE Healthcare), genomic DNA is prepared [16].

DNA Library Preparation and 454
Sequencing. The original data was constructed by both shotgun and paired-end libraries. To construct the two shotgun libraries, genomic DNA was nebulized to fragments of 500-800 bp randomly in size. Additional DNA was used to construct paired-end libraries with size spans of 3 kb (three libraries), 8 kb (two libraries), and 20 kb (two libraries). 454 Genome Sequencer FLX-Titanium was used to construct, amplify, and sequence all libraries according to the manufacturer's kits and protocols (454 Life Sciences) [16]. (2,155,540 Reads). We expected to produce a gap-free, scaffold-level mitochondrial genome of the T. parvula through this research. Using Roche-454 FLX Titanium sequencing which has an advantage of read length, 2,155,540 reads with a total length of 6.2 G were obtained. The original sequence reads were a mixture of DNA with nucleus and other organelles. In order to assemble the mitogenome, we researched 359 plants mitochondrial genomes sequences from NCBI Organelle Genome Resources and found the common features; then, we used Blastn to isolate mitochondrion reads from the whole genome reads based on these completed reference mitogenomes. The sequences were assembled using Newbler 3.0 (454-Roche) with default parameters. During this process, we found 3801 contigs. Contigs with long length and high reads depth were separated with shorter ones [34]. The longest was 73000 bp in length. Contigs with reads depth between 50x and 100x may contain essential mitochondrial genes, so we filtered them out. To visualize the contigs connections, we used perl scripts and Newbler 3.0 generated file "454AllContigGraph.txt". At the same time, according to reads depths of the contigs, false links and some wrong forks were removed manually [34]. Referring to the basis of the connecting map for the normal mitochondrion genome, we connected contigs to develop a circular mitochondrial genome and mapped them to Arabidopsis thaliana mitochondrial genome. After connecting 25 contigs, there were still some gaps. Contigs' gaps were filled up with a method like Ma et al. 's [35] study: first, we mapped mitochondrion reads onto both ends (3-60 bp) of the assembled contigs and then extended the contigs by joining the reads, which were partly overlapped (≥95% identical, ≤ 1 − 30) with the contigs. Finally, a 255,773 bp nucleotide sequence was finished.

Genome Annotation.
Genes were located using the webbased tool Public MITOFY Analysis [5], coupled with synonymous and nonsynonymous SNPs. tRNAscan-SE was used to identify the transfer RNA genes with default settings [36]. GC content was analyzed by perl script. The OGDraw version 1.1 was used to draw the circular mitochondrial genome map [37] (http://ogdraw.mpimp-golm.mpg.de/cgi-bin/ogdraw.pl).

Repeat
Analysis. T. parvula mitochondrial genome tandem repeats were analyzed using Tandem Repeats Finder program [38] with default settings. Blastn alignments were used to identify and locate disperse repeats including the forward and palindromic repeats with a minimal length of 100 bp and over 90% identity between the two copies. For the simple sequence repeats (SSR), we used MISA [39] with the size of one to six nucleotides and the thresholds of eight, four, four, three, three, and three, respectively. Additionally, all kinds of repeats that were detected from the above programs were confirmed manually to simplify the results. , Zea perennis (NC 008331), and Cycas taitungensis (NC 010303)) has been inferred using the Neighbor-Joining method. Fifteen genes (atp1, atp9, ccmB, cob, cox1, cox3, nad1, nad3, nad4, nad4L, nad6, nad7, nad9, rps3, and rps4) were extracted by local perl scripts among these twentyone plants. The evolutionary analyses including aligning multiple sequence by ClustalW, algorithm of Neighbor-Joining maximum likelihood (ML), Neighbor-Joining (NJ), and maximum parsimony (MP) used for tree building were conducted in MEGA 6.0 [19]. Numbers designate bootstrap (BP) index values (%) were calculated using 1000 replicates to estimate the support of the data for each internal branch of the phylogeny.