Purifying Selection Bias against Microsatellites in Gene Rich Segmental Duplications in the Rice Genome

Little data is available on microsatellite dynamics in the duplicated regions of the rice genome, even though efforts have been made in the past to align the genome sequences of its two subspecies. Based on the publicly available coordinates of duplicated sequences in the indica genome, we identified microsatellites in these regions. CCG and GAAAA repeats occurred most frequently. In all, 259 microsatellites could be identified in the duplicated sequences using the criteria of a minimum of 90% alignability spread over a minimum of 1 Kb of sequence. More than 25% of the repeats in duplicated regions occurred in genic sequences. Only 45 (17%) of these 259 microsatellites were found conserved in the duplicated paralogues. Among these repeats, 40% maintained both sequence and length conservation. The effect of the mutability of nearby regions could also be clearly seen in microsatellite regions. The overall purpose of this study was to investigate whether microsatellites follow an independent course of evolutionary dynamics subsequent to events like genome reshuffling, which simply drive these elements to different locations in the genome. To the best of our knowledge, this is the first comprehensive analysis of microsatellite conservation in the duplicated regions of any genome.


Introduction
Microsatellites represent a class of tandem DNA repeats with repeat units 1 to 6 bp long. These sequences occur in almost all organisms and frequently constitute the hypervariable regions of the genome. No specific functions have been assigned to most microsatellites to date. However, in at least some cases, microsatellite alleles provide a protective or adaptive advantage to the host [1]. In many cases, the occurrence of different alleles has been found to be associated with different phenotypes [2]. Microsatellites are not expected to be conserved over long evolutionary periods either, as argued by Buschiazzo and Gemmell [3]. Nevertheless, models of microsatellite mutational dynamics have been developed based on comparison of orthologous microsatellite loci in related taxa [4-7]. Whether these models also describe microsatellites at paralogous loci created by segmental changes within a genome remains to be investigated.
The availability of whole-genome sequences for rice (Oryza sativa L.) allows analysis of noncoding DNA within the segmentally duplicated regions, in addition to gene order, tandemly arranged genes (TAGs), and gene functions. Taken together, the reports on mapping of duplicated regions in the rice genome [8-10] show that these studies focused primarily on the analysis of genes in these regions. The strategy commonly used involved making blocks of genes and mapping them elsewhere in the genome. In a way, the noncoding DNA, particularly the repetitive DNA, has been ignored because methods suitable for this kind of mapping were not employed. Nevertheless, to understand the complete mechanism of speciation and genome evolution, the characterization of conserved noncoding DNA is equally important [11]. No information is available to date on the fate of microsatellites in newly duplicated locations. Signatures of ancient duplications, in terms of the sequence similarity of genes and their genomic order on chromosomes in rice, are widely available, as mapped by Yu et al. [10]. Using the same information as a reference, we have attempted to outline the dynamics of microsatellite DNA within the segmentally duplicated regions of the rice genome to shed light on the patterns of conservation and divergence of these sequences. The overall objective of this study was to investigate whether microsatellites participate in genome reshuffling or are simply carried along. We were also interested to know whether, after duplication, the paralogous microsatellites (which we call "microsatellite twins") follow independent dynamics, as the two sites are now different, or similar dynamics, as the neighbouring environment is still essentially the same. The latter point is important for understanding whether microsatellite hypermutability is random or directional.

Sequence Resources.
The whole-genome sequence of Oryza sativa subspecies indica was downloaded from http://rise.genomics.org.cn/rice/index2.jsp (BGI release 2003-08-01) in FASTA format. Based on the coordinates of duplicated sequences provided by Yu et al. [10], the sequences of duplicated regions were retrieved from the whole-genome sequence in a text editor and used as plain text files. The first set of sequences described by Yu et al. [10] is referred to here as group I sequences, and their paralogous duplicated sequences are designated group II sequences. These sequences were split into 2.0 Mb bins for further analysis.

Analysis of Duplicated Sequences.
RepeatMasker (http://www.repeatmasker.org/) with the WU-blast [12] search engine was used with default sensitivity and rice as the "DNA source" for mining of microsatellite repeats, and the sequences were subsequently aligned using the glocal algorithm [13] in the Vista Genome Browser (http://pipeline.lbl.gov/cgi-bin/gateway2) [14], following the method described earlier by Roorkiwal et al. [7]. A simple sequence with a repeat motif length of 1-6 bp spanning a minimal length of 20 bp was considered a microsatellite. Genes were predicted using MolQuest ver. 1.6.2 (Softberry; http://www.molquest.com/). Following analysis of the aligned map, segmental duplications were identified by the criteria of similarity >90% and length ≥1 Kb [15] and analysed for microsatellites and the coordinates of the predicted genes.
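The working definition of a microsatellite given above (motif of 1-6 bp, minimal span of 20 bp) can be illustrated with a minimal sketch. This is not the RepeatMasker procedure used in the study: it detects perfect tandem repeats only, whereas RepeatMasker also reports imperfect repeats, so counts from the two approaches would differ. The function name `find_microsatellites` and its parameters are hypothetical.

```python
import re


def find_microsatellites(seq, min_span=20, max_motif=6):
    """Return (start, end, motif, copies) for perfect tandem repeats.

    Illustrative assumption: only perfect repeats are reported, and a hit
    must contain enough whole motif copies to span at least `min_span` bp.
    Coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        min_copies = -(-min_span // k)  # ceiling division
        # One captured motif followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "CCGCCG"), since the shorter motif already reports them.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence: a (CCG)8 tract and a (GAAAA)4 tract with short flanks.
    demo = "TTGAC" + "CCG" * 8 + "ATGCT" + "GAAAA" * 4 + "TT"
    for start, end, motif, copies in find_microsatellites(demo):
        print(f"{start}-{end}\t({motif}){copies}\t{end - start} bp")
```

Because the phase of the reported motif depends on where the leftmost match begins, permutations such as CCG, CGC, and GCC would need to be pooled afterwards when tallying motif classes.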

Statistical Analysis.
The data generated by mining of duplicated sequences and associated microsatellites were subjected to statistical analysis using the χ² test and a correlation test. The expected values were derived from published reports [5,7,10].

Results and Discussion
Microsatellites constitute nearly 1% of eukaryotic genomes, though in some organisms like Plasmodium they may be overrepresented [16]. Their biological significance to the host genomes has been a topic of debate in recent years. Moreover, what little is known about their mutational dynamics [17,18] is derived primarily from the limited genome-wide studies in model organisms [4,5,7]. Comprehensive surveys of microsatellite conservation across species and within duplicated sequences of the same genome are, therefore, required to expand our understanding of their genomic significance. In the following sections, we present some points emerging from our study that support our view that, at least in part, such conservation and maintenance of microsatellites in segmentally duplicated sequences are visible in the rice genome.

Alignability of Duplicated Regions.
Evidence exists for genome duplications in rice that occurred between 53 and 94 mya, sometime prior to the divergence of the cereal genomes [9,10]. Further, a segmental duplication event between chromosomes 11 and 12 that occurred around 5 mya is also well documented [19], in addition to numerous other individual gene duplications [1,9]. In total, the duplicated sequences in rice span 295 Mb, representing nearly two-thirds of the entire genome and including 47% of the genic regions [10]. It is believed that duplication events are followed by several genomic changes, including loss of gene functions and, in certain cases, loss of entire genes [9].
Based on the data presented earlier by Yu et al. [10], we delimited the total duplicated regions as 141 Mb of group I sequences and 154 Mb of group II sequences. However, the actual traceable duplicated segments meeting the criteria of >90% similarity and a minimum length of 1 Kb [15] covered merely 3.8 Mb of the genome. The first and second groups of sequences spanned 1.89 and 1.90 Mb of the genome, respectively. Thus, the actual portion of the rice genome studied here came out to be merely 1% (∼3.79 Mb). The largest duplicated span was observed on chromosome 2 (∼0.34 Mb) and the smallest on chromosome 7, spanning slightly less than 0.1 Mb (Table 1). Their distribution was clearly nonrandom (P(χ²) < 0.001). Further, no correlation was observed between the size of duplicated segments and the length of chromosomes. The average bin length was highest on chromosome 5 and lowest on chromosome 6.
The size of the aligned pair and the alignment scores between two segments are generally inversely related to their divergence time. However, in the present case, such a relationship was not observed, as the most recent pair of duplicated sequences, on chromosomes 11 and 12 [19], was not the longest one (Table 1). Nevertheless, the mean similarity between the duplicated bins on chromosomes 11 and 12 (Figure 1) was slightly higher at 94%, compared to a mean similarity of 93.5% between the duplicated bins of chromosomes 2 and 4.

Microsatellite Abundance in Duplicated Regions.
We earlier reported 45,782 microsatellites in 374.5 Mb of the rice genome [7] using the same criteria and tools as in the present study. Accordingly, 1% of the genome should [...] (Figure 2). When the frequencies of specific microsatellite motifs in duplicated regions were plotted against the expected values, based on previous studies [5,7], the frequencies of most microsatellites were found to be much lower (P(χ²) < 0.001), except for motifs like AAT, AGC, and CCG, for which observed values corresponded to expected values. Clearly, there is a certain level of purifying selection against microsatellites in these duplicated regions of the rice genome. CCG repeats (and direct and reverse complementary permutations thereof) were the most abundant in either set of sequences, consistent with earlier reports [5,7]. GAAAA repeats (and their permutations), known to be the most abundant pentanucleotide repeats in the rice genome [5], were the second most abundant and the least mutable repeats (Table 2) among the duplicated sequences. Other repeats like A, AT, and so forth, otherwise abundant in the rice genome, were not preferentially distributed in duplicated regions (Figure 3). The relative abundance of each repeat motif was fairly comparable between the two sets of sequences. As expected, the majority of the microsatellites occurred in intergenic sequences (Table 2), and the fewest in exonic sequences. Consistent with previous findings [7], CCG repeats most frequently occurred in exonic sequences. As suggested earlier by some researchers [17,18], intrinsic factors specific to the host genome and to the microsatellites themselves, such as repeat length, repeat sequence, and neighboring genomic sequences, are responsible for the differential occurrence and conservation of microsatellites. Importantly, while the duplicated sequences have shown a higher frequency of genes, they have shown a particular bias against microsatellites (Figure 2).
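The comparison of observed and expected abundance can be made concrete with a rough calculation from the figures quoted in the text. The sketch below is not the authors' motif-wise test. It assumes a uniform genome-wide microsatellite density, assumes that the 259 microsatellites in the aligned duplicated segments are a subset of the 45,782 reported for the whole genome, and uses a simple two-cell goodness-of-fit statistic (inside versus outside the duplicated segments). All function and constant names are hypothetical.

```python
# Figures quoted in the text.
GENOME_MB = 374.5                # genome length surveyed in the earlier report [7]
GENOME_MICROSATELLITES = 45_782  # microsatellites reported there
DUPLICATED_MB = 1.89 + 1.90      # group I + group II aligned segments
OBSERVED = 259                   # microsatellites found in those segments

# Standard chi-square critical value for 1 degree of freedom at alpha = 0.001.
CHI2_CRIT_DF1_P001 = 10.828


def expected_count(region_mb, genome_mb=GENOME_MB,
                   genome_count=GENOME_MICROSATELLITES):
    """Expected microsatellite count if density were uniform across the genome."""
    return genome_count * region_mb / genome_mb


def chi_square_two_cells(observed_in, expected_in, total):
    """Goodness-of-fit statistic over two cells: inside vs outside the region."""
    observed = (observed_in, total - observed_in)
    expected = (expected_in, total - expected_in)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    exp = expected_count(DUPLICATED_MB)
    chi2 = chi_square_two_cells(OBSERVED, exp, GENOME_MICROSATELLITES)
    print(f"expected ~{exp:.0f}, observed {OBSERVED}, ratio {OBSERVED / exp:.2f}")
    print(f"chi-square = {chi2:.1f}; exceeds {CHI2_CRIT_DF1_P001}: "
          f"{chi2 > CHI2_CRIT_DF1_P001}")
```

Under these assumptions, the uniform-density expectation is roughly 460 microsatellites against the 259 observed, which is in line with the deficit described in the text.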
International Journal of Evolutionary Biology
Table caption (fragment): ... [10], and the length of the fragments that we found aligning with >90% similarity for a minimum length of 1 Kb.

Microsatellite Conservation within the Duplicated Sequences.
Out of the 259 microsatellites in the duplicated sequences, only 45 (17%) were found conserved in the paralogous sequences. Considering the mutability of microsatellites per locus per generation in rice, as described by Grover et al. [5], a microsatellite 20 bp in length may be lost entirely in around 2 million years, provided all the mutations are unidirectional and shorten the microsatellite. Thus, conservation of 17% of microsatellites in duplicated regions, with the average age of duplication around 56 mya, is especially significant, as only 1% of the entire duplication blocks is identifiable today (discussed above). Interestingly, 42% of these repeats have their length conserved, which is significantly lower than the global average in rice observed earlier [7] but clearly indicates that these alleles have been fixed in duplicated segments, most probably due to the vitality of their spatial occurrence [18]. Differences in the lengths of at least two paralogous microsatellites (with the CCG motif) falling in exonic sequences on duplicated blocks on chromosomes 11 and 12 indicate the relative advantage of repeatability and hypermutability of microsatellites in genes, as has been suggested earlier as well [1,3,20-22]. It was also interesting to note that at some genomic positions a single microsatellite repeat corresponded to two microsatellite repeats with the same motif (Table 3). This is possible due to recurring splitting and expansion events at microsatellite loci [18]. Of all the paralogous microsatellites observed, 40% maintained both sequence and length characteristics. The majority of these microsatellites were located on duplicated segments of chromosomes 1 and 5.
Figure 3: Abundance of microsatellite motifs in duplicated regions of the rice genome. Motif shares: (CCG)n 51%, (GAAAA)n 26%, (CGA)n 6%, (TAA)n 5%, (TCG)n 2%, others 10%.
It is quite possible that these loci have been fixed. However, we do not rule out the possibility that one or both of the sequences have undergone a number of mutations in a purely stochastic manner and eventually arrived at the same length, and are now seen as conserved alleles. Of these two possibilities, it is the first that generates more interest, as microsatellites associated with important regions in the genome will display lower variability during genetic drift and selective sweeps [18,23]. Consequently, less activity will be observed at a microsatellite locus that lies next to a genomic region adapted to a given environment [24]. Therefore, we do not rule out the possibility that the microsatellites that show sequence as well as length conservation represent important "evolutionary chronometers" [25] and might have been tightly linked to genomic regions of significance [18]. Microsatellites located in mutationally constrained regions are expected to be maintained passively. Highly conserved microsatellites are often associated with other conserved genomic elements and show a stronger negative relationship with single nucleotide polymorphism (SNP) density [26]. Interestingly, in five instances, a particular microsatellite motif has given way to another motif at precisely the same site (Table 3). Grover and Sharma [18] explained such events by calling them "metamorphosis" at microsatellite sites. Apparently, in three of the five cases, the new microsatellites originally appeared by a single-site substitution and later expanded, possibly by "polymerase slippage," to mature into fully grown microsatellites. Evidently, both the abundance and the conservation of microsatellites had a heterogeneous pattern across the rice chromosomes. However, the distribution of sequence motifs across the chromosomes and across the blocks and segments of duplications remained more or less the same.
Conserved microsatellites within the duplicated regions of the genome are desirable candidates for studying the overall significance of microsatellite conservation in different genomes.

Microsatellites versus Genes in Segmentally Duplicated Regions.
Out of 259 microsatellites, only 68 (26.25%) were found to be associated with genes. Of these genic microsatellites, 17 (25%) were present in exonic regions and the remaining 51 (75%) were located in intronic regions. Interestingly, 18 of the repeats and their counterparts were located in different genomic contexts. For example, while one locus was located in an intergenic region, its paralogue occurred in a genic region. Such a spatial distribution can arise from homologous recombination [27] or from other minor genomic rearrangements due to retrotransposition, local genomic reorganization, and reshuffling. Thus, such microsatellites can be considered "genomic fossils," which can help in retracing the evolutionary events in the genome.
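The proportions quoted above follow directly from the reported counts, as the short tally below shows. The intergenic count is not stated in the text; it is derived here on the assumption that every microsatellite not associated with a gene is intergenic. The function name `genic_breakdown` is hypothetical.

```python
# Counts reported in the text.
TOTAL = 259
GENIC = 68
EXONIC = 17
INTRONIC = 51


def percent(part, whole):
    """Percentage rounded to two decimal places."""
    return round(100 * part / whole, 2)


def genic_breakdown():
    """Recompute the quoted shares from the raw counts."""
    # Assumption: microsatellites not associated with genes are intergenic.
    intergenic = TOTAL - GENIC
    return {
        "genic_pct_of_total": percent(GENIC, TOTAL),
        "exonic_pct_of_genic": percent(EXONIC, GENIC),
        "intronic_pct_of_genic": percent(INTRONIC, GENIC),
        "intergenic_count": intergenic,
        "intergenic_pct_of_total": percent(intergenic, TOTAL),
    }


if __name__ == "__main__":
    for name, value in genic_breakdown().items():
        print(f"{name}: {value}")
```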