Comparative Analysis and EST Mining Reveals High Degree of Conservation among Five Brassicaceae Species

Brassicaceae is an important family of the plant kingdom which includes several plants of major economic importance. The Brassica spp. and Arabidopsis share much-conserved colinearity between their genomes which can be exploited for the genomic research in Brassicaceae crops. In this study, 131,286 ESTs of five Brassicaceae species were assembled into unigene contigs and compared with Arabidopsis gene indices. Almost all the unigenes of Brassicaceae species showed high similarities with Arabidopsis genes except those of B. napus, where 90% of unigenes were found similar. A total of 9,699 SSRs were identified in the unigenes. PCR primers were designed based on this information and amplified across species for validation. Functional annotation of unigenes showed that the majority of the genes are present in metabolism and energy functional classes. It is expected that comparative genome analysis between Arabidopsis and related crop species will expedite research in the more complex Brassica genomes. This would be helpful for genomics as well as evolutionary studies, and DNA markers developed can be used for mapping, tagging, and cloning of important genes in Brassicaceae.


Introduction
Brassicaceae species, which include agronomically important crops such as oilseeds, broccoli, cabbage, black mustard, and other leafy vegetables, are cultivated in most parts of the world. The genus Brassica is evolutionarily closely related to the model crucifer Arabidopsis thaliana; both are members of the family Brassicaceae and are reported to have diverged 14−20 million years ago [1]. The major centers of diversity of the Brassicaceae family are southwestern and central Asia and the Mediterranean region, whereas the arctic, western North America, and the mountains of South America are secondary centers of diversity [2]. The genus Brassica is a monophyletic group within the Brassicaceae. It includes the cultivated oilseed species Brassica juncea, B. napus, and B. rapa and the vegetable B. oleracea, which are also very closely related to A. thaliana. The genomes of the three diploid Brassica species, that is, B. rapa, B. nigra, and B. oleracea, have been designated as A, B, and C, respectively, whereas the genomes of the amphidiploids B. juncea and B. napus have been designated as AB and AC, respectively [3][4][5].
Comparative genomics is a powerful tool for genome analysis and annotation. There are two basic objectives for comparative genomics. First, to understand the detailed process of evolution at the gross level (the origin of the major classes of organism) and at a local level (what makes related species unique) [6]. Second, to translate DNA sequence data into proteins of known functions. The rationale here is that DNA sequences encoding important cellular functions are more likely to be conserved between species than sequences encoding dispensable functions or noncoding sequences.
The biology of Arabidopsis and Brassica is very similar. However, because of the polyploid nature of Brassicaceae species, their genomes are more complex than that of A. thaliana. A. thaliana serves as a model for comparative microsynteny studies with Brassica species because of its small genome (with less repetitive DNA), short generation time, and well-established genetic and genomic resources [7]. A pattern of chromosomal colinearity has been identified between Arabidopsis and Brassica plants [7]. Since Brassica and Arabidopsis belong to the same family, the level of synteny between them may provide a good opportunity to study how genetic and morphological variation has developed during the evolution of the genome, including the endurance of certain genetic structures in Arabidopsis and related Brassica species [7]. Hence, comparative genome analysis may lead to a better understanding of closely related plant species.
ESTs are considered important genomic resources for mining DNA markers based on simple sequence repeats (SSRs). SSRs are present and distributed in the genomes of all eukaryotes. Because of their abundance and specificity, SSRs are considered important DNA markers for genetic mapping and population studies. The important features of SSR markers, coupled with their ease of detection, have made them useful molecular markers in different crops [8]. Therefore, detection of SSRs in the unigenes and ESTs of Brassicaceae species may help in designing a new set of DNA markers and may provide more insight into the evolution of these species. Once validated, these markers can be used by breeders in different Brassica improvement programmes.
The analysis of GC content among unigenes and ESTs gives an important indication of gene and genome composition. The GC content of a sequence gives a fair indication of the melting temperature (Tm) and stability of the DNA molecule. A positive correlation has been reported between higher GC content and the thermostability, bendability, and ability of the DNA structure to undergo the B−Z transition, whereas a negative correlation has been reported between curvature and high GC content of the DNA molecule. GC-rich DNA constitutes gene-rich, actively transcribed genomic regions and is hence regarded as functional or expressed DNA [9]. The GC content of the sequences surrounding a gene is also considered the best predictor of the rate of substitution during evolution [10]. However, such analysis is lacking for the different Brassica species.
In this study, gene indices were constructed and a comparative analysis of five Brassicaceae species, namely, B. juncea, B. napus, B. oleracea, B. rapa, and R. sativus, is reported for the first time. These gene indices constitute a total of 131,286 nonredundant sequences, which were used to assess sequence conservation among Brassicaceae on a genomic scale, to mine SSRs, to determine the frequency and type of repeat elements, and to estimate GC content. DNA markers were designed and validated across Brassica species using PCR. Using a computational method, we identified sequence and functional similarity of Brassicaceae transcripts to those of Arabidopsis, suggesting that a portion of these transcripts have a high degree of conservation with the Arabidopsis genome. These analyses provide insight into the overall sequence conservation between Arabidopsis and Brassicaceae and within Brassicaceae.

[...] were downloaded. The available ESTs of these species were clustered into gene indices that represent a nonredundant set of transcripts or unigenes. Batch files of EST sequences for these species were downloaded in FASTA format. The sequences were clustered using the SeqMan programme of the DNASTAR software (http://www.dnastar.com/) to eliminate redundancies and generate unigene sequences. The clustering parameters in DNASTAR were optimized using sample data created by taking random sequences of known genes. The optimized parameters were found to cluster ESTs efficiently into the expected clusters and did not produce false joins among the ESTs.

Analysis of GC Content and SSR.
The GC content of all five Brassicaceae species was calculated using formulae in an Excel sheet. For each unigene sequence, we counted the numbers of G and C separately, summed the two quantities, divided the sum by the total number of bases in that sequence, and expressed the result as a percentage.
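The calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the spreadsheet formulae used in the study; the function name `gc_percent` is a hypothetical choice.

```python
def gc_percent(sequence):
    """Return the GC content of a nucleotide sequence as a percentage.

    Follows the steps in the text: count G and C separately, sum them,
    divide by the total number of bases, and convert to a percentage.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    g_count = seq.count("G")
    c_count = seq.count("C")
    return 100.0 * (g_count + c_count) / len(seq)


if __name__ == "__main__":
    # Toy sequence for illustration only, not data from the study.
    print(gc_percent("ATGCGCTA"))
```

Note that ambiguous bases such as N are counted in the denominator here, which matches "total number of bases" in the text but is an assumption about how such characters were handled.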
The unigene sequences were used to identify SSRs using the MISA software (http://pgrc.ipk-gatersleben.de/misa/). Six classes of SSRs, that is, mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats, were targeted for identification using this tool. The default setting used in the program for the minimum number of repeats was 10 for mononucleotides, 6 for dinucleotides, and 5 for tri-, tetra-, penta-, and hexanucleotides. In addition, the program also identifies complex repeats. Batch files of the target species were exported to the local database on a Sun server using FTP and were run through MISA by passing the sequence file as input to the program at the command prompt. The output files were transferred to a desktop using FTP and opened in Excel sheets for visualizing the results. Four classes of mononucleotide SSRs were defined based on repeat length, that is, 15 or less bp, 16−30, 31−45, and 46 or more bp repeats. The classes chosen for dinucleotide repeats were 5−10 bp repeats, 11−16 bp repeats, and 17 or more bp repeats, while those for trinucleotide repeats were 5−10 and 11−16 bp. Results on repeat types, number of repeats, and frequency across all species were tabulated, and significant results and observations were depicted in the form of different figures.
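The minimum-repeat thresholds listed above can be illustrated with a small regular-expression search. The sketch below is a simplified stand-in, not MISA itself: it finds only perfect repeats, ignores compound (complex) repeats, and the names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def find_ssrs(sequence):
    """Return (motif_length, motif, repeat_count, start) for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A unit of k bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units such as "AA" that are really mononucleotide repeats.
            if is_primitive(unit):
                hits.append((k, unit, len(match.group(0)) // k, match.start()))
    return hits


if __name__ == "__main__":
    # Toy sequence for illustration only.
    print(find_ssrs("TT" + "AG" * 6 + "C"))
```

The primitive-unit check keeps a run of A's from being reported a second time as a dinucleotide repeat of "AA"; how MISA resolves such overlaps internally is not described in the text.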

Functional Annotation of Unigenes.
The unigene sequences of the five Brassicaceae species were matched against the Arabidopsis gene sequence database on a local BLAST server using BLASTN (with advanced options: -G5, -E1, -q1, -r1, -v1, and -b1). The results were extracted using in-house Perl scripts and tabulated in an Excel sheet. The Arabidopsis unigene set was used as a reference, and the sequences of each of the five crops were split into batches of 200 each for comparisons. A bit score cutoff of 100 was applied to filter significant matches. These filtered hits were then searched against the nr database using BLASTX (http://blast.ncbi.nlm.nih.gov/Blast.cgi) for annotation. The annotated genes were classified into 28 different functional categories based on their homology to known proteins.

[...] at 94 °C for 5 min followed by 30 cycles of denaturation at 94 °C for 1 min, primer annealing at 55−60 °C for 1 min, and primer extension at 72 °C for 1 min. This was followed by a final extension step at 72 °C for 10 min and storage at 4.0 °C. The amplified products were resolved on a 3% agarose gel in 1x TBE buffer, run at 120 V for 2 to 3 h depending on the size of the expected PCR product, and visualized by ethidium bromide staining using a gel documentation system. The size of the amplicon generated by each SSR marker was determined against a 100 bp DNA ladder.

[...] (Table 3). The analysis based on EST-derived unigenes in these five Brassicaceae species revealed that the majority of the gene indices show very little sequence variation compared to Arabidopsis gene indices and are conserved across the Brassicaceae family.

Analysis of GC Content of Brassicaceae Unigenes.
We analyzed the GC content (the percentage of guanine and cytosine bases) of all the unigenes, and the results were tabulated in class intervals of 5% over the range 10%−95% GC content. The distribution of GC content among the unigenes of the five Brassicaceae species is given in Figure 1. The average GC content of all the species was between 50% and 55%, and the distribution was symmetrical except for B. napus, which showed a skewed distribution ranging from 30% to 95%. The GC content of R. sativus unigenes was quite variable (Figure 1).
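The tabulation into 5% class intervals can be sketched as follows. The text does not say which interval a boundary value (for example exactly 50%) falls into, so the half-open convention used here, with the top interval closed at 95%, is an assumption, and `gc_class` and `gc_histogram` are hypothetical names.

```python
from collections import Counter


def gc_class(percent, low=10, high=95, width=5):
    """Map a GC percentage to its (start, end) class interval.

    Intervals are [start, start + width); the last one also includes `high`.
    Values outside [low, high] return None.
    """
    if percent < low or percent > high:
        return None
    start = low + int((percent - low) // width) * width
    start = min(start, high - width)  # keep exactly `high` in the top class
    return (start, start + width)


def gc_histogram(percents):
    """Count how many GC values fall in each class interval."""
    classes = (gc_class(p) for p in percents)
    return Counter(c for c in classes if c is not None)


if __name__ == "__main__":
    # Toy values for illustration only, not data from the study.
    for interval, count in sorted(gc_histogram([52.3, 54.9, 47.0]).items()):
        print(interval, count)
```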

Distribution of Repeat Length Classes in Unigenes.
We found that in all five Brassicaceae species explored in the present study, most of the unigenes contained a single SSR stretch from which potential unique markers can be derived. The frequency of unigenes containing a single SSR ranged from 60% (B. rapa) to 92% (R. sativus). The average frequency of unigenes containing multiple SSRs across all five species was 25%. The maximum number of unigenes containing a single SSR was found in B. rapa, followed by B. juncea and B. oleracea (Table 4). The SSR frequency observed was not uniform among these species (χ² = 456.2, df = 4). The relative abundance of mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats in all five Brassicaceae species was determined by calculating their frequencies in the unigenes. Mononucleotide repeats were predominant in all five species studied in the present investigation. The frequency of mononucleotide repeats varied from 60% in B. rapa to 92% in R. sativus. The second most dominant class was dinucleotide repeats in all species except B. juncea, in which trinucleotide repeats took second position. In the rest of the species, the highest percentage was obtained for mononucleotide repeats, followed by di-, tri-, and tetranucleotide repeats. A small deviation was observed for penta- and hexanucleotide repeats, where the frequency of hexanucleotide repeats was greater than that of pentanucleotide repeats.
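A test statistic of the kind reported above (df = 4 for five species) can be computed with a Pearson chi-square test on a 2 x k table of unigenes with and without SSRs. The text does not state which table the reported value was computed from, so this layout is an assumption, and `chi_square_homogeneity` is a hypothetical name.

```python
def chi_square_homogeneity(with_ssr, without_ssr):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    with_ssr[i] and without_ssr[i] are counts of unigenes in species i that
    do and do not contain an SSR. All row and column totals must be nonzero.
    """
    if len(with_ssr) != len(without_ssr):
        raise ValueError("both rows must have one count per species")
    col_totals = [a + b for a, b in zip(with_ssr, without_ssr)]
    row_totals = [sum(with_ssr), sum(without_ssr)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip((with_ssr, without_ssr), row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = len(with_ssr) - 1  # (2 - 1) * (k - 1)
    return chi2, df


if __name__ == "__main__":
    # Placeholder counts for illustration only, not data from the study.
    print(chi_square_homogeneity([10, 20], [20, 10]))
```

With five species the function returns df = 4, which agrees with the degrees of freedom quoted in the text.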

Frequencies of Different SSR Repeat Types.
The relative frequencies of SSRs were calculated for the five species. The frequency estimates shown are based on the total number of SSRs observed in all unigenes that have either single or multiple SSRs. A/T repeats were the predominant mononucleotides in all five species. The results indicated that A/T SSRs represent more than 50% of the total SSRs in all five species, whereas the frequency of C/G repeats was 19.14% in B. oleracea, 4.55% in B. juncea, and 4.17% in R. sativus (Figure 2(a)). Among dinucleotide SSRs, the AG/GA/CT/TC group was the dominant class in all of the species analyzed in this investigation. It ranged from 4.2% to 18.7% of the total SSRs explored. These repeats were most frequent in B. rapa, followed by B. juncea, B. oleracea, B. napus, and R. sativus. The average frequencies of AT/TA and AC/CA/TG/GT were almost the same (0.61% and 0.67%, respectively) among the five species (Figure 2(b)).
An assay of the frequencies of trinucleotide repeats among total SSRs showed the predominance of the AAG/AGA/GAA/CTT/TTC/TCT repeat class in 4 out of 5 species. Overall, trinucleotide repeats made up 22.73% of SSRs in B. juncea, 18.48% in B. rapa, 12.85% in B. oleracea, 6.62% in B. napus, and 4.17% in R. sativus (Figure 2(c)). In R. sativus, only the ATG/TGA/GAT/CAT/ATC/TCA repeat class was found, which is the second most dominant class of repeats in B. juncea. The AGG/GGA/GAG/CCT/CTC/TCC repeat was the second most dominant class in B. napus, B. oleracea, and B. rapa.
There are 33 possible unique classes of tetranucleotide repeats across genomes [11,12], but only a small number of tetranucleotide repeats were observed among the 5 Brassicaceae species in the present study. As the numbers are too low for frequency evaluation, all of the observed tetranucleotide repeats were assayed in order to identify the most recurrent tetranucleotide SSRs across these Brassicaceae species. The top 15 tetranucleotide repeats obtained in the 5 Brassicaceae species were AAAC, AAAG, ATGA, CCAA, CTTT, GAAC, TACA, GAAA, AGAA, TTGT, TCAA, TTTG, AATC, CAAA, and GAAG. The AAAC and AGAA repeats were the most abundant tetranucleotide SSRs.

Frequencies of Different SSRs Repeat Length Classes.
The majority of mononucleotide SSRs fell in the 16−30 repeat class followed by the 15 or less repeat class, except in B. juncea and B. oleracea, where the 15 or less repeat class was more abundant than the 16−30 repeat class (Figure 3(a)). In B. rapa, the 15 or less and 16−30 repeat classes shared a nearly equal distribution of the SSRs. SSRs with 46 or more repeats were less frequent in all species. The distribution of dinucleotide SSRs showed that in most species they fell in the 5−10 repeat class, followed by the 11−16 repeat class (Figure 3(b)). However, in R. sativus, SSRs were detected in the 17 or more repeat class. With respect to the distribution of trinucleotide SSRs into repeat length classes, the 5−10 repeat class was predominant in all the species analyzed (Figure 3(c)). Thus, the distribution of SSRs clearly showed the predominance of mononucleotide SSRs containing 16−30 repeats and of di- and trinucleotide SSRs containing 5−10 repeats.

Discussion
Crops belonging to the Brassicaceae family are closely related to Arabidopsis thaliana. Since the whole genome sequence of A. thaliana has been decoded and is in the public domain [13], it can be used effectively in comparative genome analysis with the genomic sequences of Brassica species to understand biological processes and to manipulate different traits. In the present investigation, a comprehensive and detailed analysis of Brassicaceae unigenes was made and compared with the A. thaliana gene indices. Our analysis showed that Brassica and Arabidopsis genes share a high percentage of sequence identity and hence can be used in various functional genomic studies in Brassicaceae. Analysis of GC content showed that the unigenes of B. juncea, a tetraploid species, have a higher GC content than those of another tetraploid species, B. napus. The GC content of B. napus unigenes was even lower than that of the diploid species B. oleracea and B. rapa [14]. It has also been reported that GC content may vary even in phylogenetically related species like onion and rice [14]. In other studies, the mean GC content of coding regions was higher in angiosperms compared to the dicots [15]. However, such conclusions cannot be drawn from the present investigation, since we used all the unigene sequences and did not distinguish between coding and noncoding regions. A gradient in GC content along the direction of transcription has been observed in Gramineae genes [16]. That exhaustive analysis showed that the 5′-ends of Gramineae genes had 25% higher GC content than their 3′-ends. Similarly, microsynteny analysis between Oryza sativa ssp. japonica and O. sativa ssp. indica showed a higher average GC content in japonica genes than in indica genes [17].

The frequencies of different classes and types of SSRs were calculated in the unigenes of the five Brassicaceae species. Simple sequence repeats are found to be abundant and consistently distributed in plant genomes.
It has also been reported that SSRs occur as frequently as once in about 6 kb in plant genomes [18]. SSRs are more common in the vicinity of genes than in other regions of the genome [19]. However, among the five Brassicaceae crops studied in the present investigation, 62.45% of the unigenes of B. napus contained SSRs.
Theoretically, the probability of finding mononucleotide repeats in a genome is highest, followed by dinucleotide repeats, then trinucleotide repeats, and then tetra-, penta-, and hexanucleotide repeats [20]. This trend in the distribution of repeats was also found in the present study for B. napus, B. oleracea, B. rapa, and R. sativus. However, trinucleotide repeats were the second most abundant class in B. juncea. The frequency of hexanucleotide repeats found in B. napus, B. oleracea, and B. rapa was higher than that of pentanucleotide repeats. The general trend showed that mononucleotides were the most abundant repeats in all five species, followed by di- and trinucleotide repeats.
The available SSR motif combinations can be grouped into unique classes based on DNA base complementarity. For mononucleotides, although A, T, C, and G are possible, A and T can be grouped into one category, since an A repeat on one strand is the same as a T repeat on the opposite strand, and a poly C on one strand is the same as a poly G on the opposite strand, resulting in two unique classes of mononucleotides, A/T and C/G [11]. Similarly, in our study, all dinucleotides can be grouped into four unique classes: (i) AT/TA; (ii) AG/GA/CT/TC; (iii) AC/CA/TG/GT; and (iv) GC/CG. Thus, the number of unique classes possible for mono-, di-, tri-, and tetranucleotide repeats is 2, 4, 10, and 33, respectively [11,12]. A major role of repeat elements has been attributed to gene duplication and amplification for generating new alleles in a population. Whole genome analyses of rice and Arabidopsis have yielded very interesting observations. In the whole rice genome, a total of 18,828 classes of di-, tri-, and tetranucleotide SSRs representing 47 distinct motif families have been annotated [21]. It has been reported that 51 hypervariable SSRs per Mb of the rice genome are available. These SSRs have also been used as DNA markers for specific regions of the genome; they amplify well with PCR and are polymorphic among different genotypes, and thus are of immense use in genetic analysis [21]. A comprehensive analysis of the presence of SSRs in the Arabidopsis genome has been performed [22,23]. It has been reported that the majority (80%) of all SSRs found in the Arabidopsis genome were mono-, di-, tri-, tetra-, and pentanucleotides [23]. In our analysis, the maximum proportion of trinucleotides (22.73%) was obtained in B. juncea compared to the other 4 species studied. In the Arabidopsis genome, SSRs in general are more favored in the upstream regions of genes, and trinucleotide repeats were the most common repeats found in the coding regions [22].
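The class counts of 2, 4, 10, and 33 quoted above can be reproduced by enumeration. The sketch below assumes, as the grouping in the text implies, that two motifs belong to the same class when one is a cyclic rotation of the other or of its reverse complement, and that motifs which are repetitions of a shorter unit (such as AA or ATAT) are excluded. The function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the motif as read on the opposite strand."""
    return motif.translate(COMPLEMENT)[::-1]


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def canonical(motif):
    """Smallest string among all rotations of the motif and of its
    reverse complement; serves as the label of the motif's class."""
    strands = (motif, reverse_complement(motif))
    return min(s[i:] + s[:i] for s in strands for i in range(len(s)))


def unique_classes(k):
    """Set of class labels for all primitive motifs of length k."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return {canonical(m) for m in motifs if is_primitive(m)}


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, len(unique_classes(k)))
```

Under these assumptions, `canonical("TC")` and `canonical("CT")` both give `"AG"`, which corresponds to the AG/GA/CT/TC class listed in the text.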
Comparative genomics has advanced the discovery and understanding of orthologues, but it has also brought to light many fast-evolving "orphan" genes of unknown function and evolutionary history. In Brassica species, comparative analysis provides an opportunity to study rapid genome changes associated with polyploidy level in this largest plant family. Brassica genome analysis might also provide new insights into the organization of plant genomes and the size and shape of plants. To accomplish this task, the complete sequence of Brassica's close relative, Arabidopsis thaliana, would be an important genomic resource.
The abundance of unigenes with cellular roles in Brassicaceae species was estimated by classifying the BLASTX matches with similarity to known proteins into 26 functional categories. The proportion of transcripts involved in metabolism and energy was 24.1% (between 20% and 34% among Brassicaceae species). Although such an analysis has not previously been performed in Brassica species, assembled EST sequences in sugarcane showed 23.8% of transcripts to be involved in various metabolism and energy processes such as bioenergetics, secondary metabolism, lipid metabolism, amino acid metabolism, DNA metabolism, nucleotide metabolism, and N, S, and P metabolism [24]. Of the unigenes, 22% showed similarity to genes involved in storage protein, cell cycle and DNA processes, transcription factor, protein synthesis, protein fold/modification/destination, structural/catalytic protein, protein activity regulation, and nuclear protein functions in different organisms. A similar analysis performed in wild Arachis stenosperma found that ∼22% of ESTs were involved in the same functions [25]. Most of the unigenes analyzed in our study are still hypothetical or unknown and hence could be used in functional analysis studies, which may lead to the discovery of some unique genes in Brassicaceae crops.
PCR-based markers designed from various genomic sequences can be used for various molecular and genetic studies after validation of the quality and robustness of their amplification. Earlier reports suggest that a portion of genomic SSRs developed in the past have produced faint bands or stuttering [26,27]. However, in the present study, all the genomic SSRs produced clear and high-intensity bands. SSRs derived from genes have produced a high proportion of high-quality markers with strong bands and distinct alleles in most reports [28,29]. The quality of genotyping data obtained from EST-SSRs is highly dependent on the quality and robustness of amplification patterns. Varshney et al. [30] reported that markers derived from conserved regions of the genome are expected to show greater cross-transferability between species and genera. The unigene-derived SSR markers have a unique identity and position in the transcribed region of the genome. With the availability of huge unigene databases, large numbers of SSRs can be easily identified. The markers developed in the present study would be an important resource for Brassica breeders. These markers would be useful for generating comparative genetic and physical maps, studying genetic diversity, marker-assisted selection, and even positional cloning of useful genes in Brassica and other related species.

Conclusions
Our comparative analysis of Brassicaceae crops with A. thaliana confirmed a high level of nucleotide sequence conservation. Thus, a genome-scale comparison of Arabidopsis with Brassica at the sequence level provides an excellent opportunity to find agriculturally important genes and to clone and use them in breeding programmes. The average GC content of the Brassicaceae species was between 50% and 55%. The mining of SSRs showed the highest percentage of mononucleotide repeats, followed by di-, tri-, and tetranucleotide repeats, in all of the species except B. juncea. A/T repeats were the prevalent mononucleotides, accounting for more than 50% in all 5 species. The predominant class of dinucleotide repeats in all the species was AG/GA/CT/TC, with the maximum in B. rapa. The distribution of SSRs showed an abundance of mononucleotide SSRs containing 16−30 repeats and of di- and trinucleotide SSRs containing 5−10 repeats. Out of the 28 functional categories, the dominant functional category of unigenes was metabolism and energy, followed by structural/catalytic protein. Comparative genomics can facilitate the study of the evolution of sequences and functions of orthologous genes and help in understanding diversification and adaptation. Such comparative studies have contributed to the analysis of complicated quantitative traits and to comparisons of the organization of the chromosomes of Brassica. It is expected that comparative genome analysis between Arabidopsis and related crop species will expedite research in the more complex Brassica genomes. The markers developed in the present study would be an important resource for Brassica breeders. These markers would be useful for generating comparative genetic and physical maps, studying genetic diversity, marker-assisted selection, and even positional cloning of useful genes in Brassica and other related species.