SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation

Microsatellites or SSRs (simple sequence repeats) are ubiquitous short tandem duplications occurring in eukaryotic organisms. These sequences are among the best marker technologies applied in plant genetics and breeding. The abundant genomic, BAC, and EST sequences available in databases make it possible to survey the presence and location of SSR loci. Additional information concerning primer sequences is also sought by plant geneticists and breeders. In this paper, we describe a utility that integrates SSR searches, frequency of occurrence of motifs and arrangements, primer design, and PCR simulation against other databases. This simulation allows global alignments and identity and homology searches to be performed between different amplified sequences, that is, amplicons. In order to validate the tool functions, SSR discovery searches were performed in a database containing 28 469 nonredundant rice cDNA sequences.


INTRODUCTION
Microsatellites or SSRs (simple sequence repeats) are sequences in which one or a few bases are tandemly repeated a varying number of times [1]. Variations in SSR regions originate mostly from errors during the replication process, frequently DNA polymerase slippage, which generates insertions or deletions of base pairs and results, respectively, in larger or smaller regions [2,3]. SSR assessments in the human genome have shown that many diseases are caused by mutations in these sequences [4]. SSRs can be found in different regions of genes, that is, coding sequences, untranslated sequences (5′-UTR and 3′-UTR), and introns, where expansions and/or contractions can lead to gain or loss of gene function [5]. There is also evidence that the genomic distribution of SSRs is related to chromatin organization, recombination, and DNA repair. SSRs are found throughout the genome, in both protein-coding and noncoding regions. Genome fractions as low as 0.85% (Arabidopsis thaliana), 0.37% (Zea mays), 0.21% (Caenorhabditis elegans), 0.30% (Saccharomyces cerevisiae) and as high as 3.0% (Homo sapiens) and 3.21% (Fugu rubripes) have been found. Some bias for defined genomic locations has also been reported [6,7]. This class of markers is broadly applied in genetics and plant breeding, due to its reproducibility, multiallelic and codominant nature, and genomic abundance. Its use for integrating genetic maps, physical mapping, and anchoring gives geneticists and plant breeders a pathway to link genotype and phenotype variations [8].
The protocols for isolating SSR loci for a new species have traditionally been very labor-intensive. Currently, with the accumulation of biological data originating from whole genome sequencing initiatives, the use of bioinformatics tools helps to maximize the identification of these sequences and, consequently, the number of markers generated [9].
The first in silico studies of SSRs were developed using the FASTA [10] and BLAST [11] packages. Later, more specific algorithms, such as SPUTNIK [12], REPEATMASKER [13], TRF (Tandem Repeats Finder) [14], TROLL [15], MISA [16] and SSRIT (Simple Sequence Repeat Tool) [17], were developed [9]. SSR detection is generally followed by the use of another program for primer design, with primers anchored on the flanking sequences. Also, in some applications, a third step using e-PCR [18] is added, with the goal of verifying primer redundancy. The sequential use of several programs is often called a pipeline. Building such a pipeline can be a very difficult task for research groups not familiar with programming tools.
In the present work, a computing tool with an interface for Windows users was developed, called SSR Locator. The application integrates the following functions: (i) detection and characterization of SSRs and minisatellite motifs between 1 and 10 base pairs; (ii) primer design for each locus found; (iii) simulation of PCR (polymerase chain reaction), amplifying fragments with different primer pairs from a given set of fasta files; (iv) global alignment between amplicons generated by the same primer pair; and (v) estimation of global alignment scores and identities between amplicons, generating information on primer specificity and redundancy. The described tool is publicly available at the site http://www.ufpel.edu.br/∼lmaia.faem.

Algorithms
The algorithms used for the searches, alignment, and homology estimates are described separately.

SSR search
The algorithm used for perfect and imperfect micro-/minisatellite searches was written in Perl and consists of the generation of a matrix that combines A (adenine), T (thymine), C (cytosine), and G (guanine) in all possible composite arrangements between 1 and 10 nucleotides. The script reads fasta files and searches for all possible arrangements in each database sequence.
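The search described above can be illustrated with a minimal Python sketch. It is not the Perl code of SSR Locator: the function names and the minimum-repeat thresholds are hypothetical, and a regular-expression back-reference is used in place of an explicit matrix of arrangements (which, for motifs of 1 to 10 nucleotides, would hold 4 + 4^2 + ... + 4^10 = 1 398 100 strings). Motifs that are themselves repeats of a shorter unit (e.g., AA scored as a 2-mer) are discarded, which is an assumption about how such cases should be handled.

```python
import re
from itertools import product


def all_motifs(k):
    """Enumerate every arrangement of A, C, G, T of length k (4**k strings)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def is_primitive(motif):
    """True if the motif is not itself a tandem repeat of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq, min_repeats):
    """Return sorted (start, motif, repeat_count) tuples for tandem repeats.

    min_repeats maps motif length -> minimum number of repeats; the values
    passed in are user choices, not thresholds taken from the paper.
    """
    seq = seq.upper()
    hits = []
    for k, n_min in sorted(min_repeats.items()):
        # One motif of length k followed by at least n_min - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n_min - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)
```

For example, `find_ssrs("GATC" + "AG" * 8 + "TTCA" + "CGG" * 6 + "TGCA", {2: 6, 3: 5})` reports one AG locus with eight repeats and one CGG locus with six repeats.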
Several instructions in the algorithm used in SSRLocator resemble those from MISA [16] and SSRIT [17]. However, additional instructions have been inserted in SSRLocator's code. Instead of allowing the overlap of a few nucleotides when two SSRs are adjacent to each other and one of them is shorter than the minimum size for a given class as found in MISA and SSRIT, a module written in Delphi language records the data and eliminates such overlaps.
The SSR Locator software contains windows for the selection and configuration of SSR and minisatellite types (mono- to 10-mers) and of a minimum number of repeats for each of the selected types. The algorithm calls a perfect repeat when a locus has its adjacent loci at an upstream or downstream distance greater than 100 bp.
The algorithm calls an imperfect repeat when the same motif is present on both sides of a fragment containing up to 5 base pairs. The algorithm identifies a composite locus when two or more adjacent loci were found at distances between 6 and 100 bp [16].
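The distance rules stated above can be summarized in a short sketch. The function name is hypothetical, and the case of two different motifs separated by 5 bp or less is not covered by the rules as stated, so it is returned as "unspecified" here rather than guessed.

```python
def classify_adjacent(gap_bp, same_motif):
    """Classify the relation between two neighbouring repeat loci.

    gap_bp: number of bases separating the two loci.
    same_motif: True if both loci repeat the same motif.
    """
    if gap_bp > 100:
        return "perfect"      # neighbours are more than 100 bp away
    if gap_bp >= 6:
        return "composite"    # adjacent loci 6 to 100 bp apart
    if same_motif:
        return "imperfect"    # same motif interrupted by up to 5 bp
    return "unspecified"      # not covered by the rules as stated
```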
In order to validate the efficiency of SSRLocator in finding SSRs and minisatellites, the same database was analyzed with MISA and SSRIT, using the same parameters for minimum number of repeats.

Primer design
An algorithm written in Delphi performs calls to Primer3 [19], which executes the primer design. These results are fed to a module that performs Virtual-PCRs and assigns to each SSR locus an individual identification, forward and reverse primer sequences, and a sequence fragment corresponding to the region flanked by the primers (original amplicon). A window allows the selection of Primer3 parameters, such as range of primer and amplicon sizes, as well as optimum primer size, ranges of melting temperature (TM) (minimum, maximum, and optimum) and GC content (minimum and optimum). For primer searches, the software automatically leaves a distance of five base pairs from the SSR at both (5′ and 3′) flanking sites. In this study, the following parameters were used: amplicon size between 100 and 280 bp; minimum, optimum, and maximum annealing temperature (TM) of 45, 50, and 55°C, respectively; and minimum, optimum, and maximum primer size of 15, 20, and 25 bp, respectively.
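As an illustration of the parameters listed above, the sketch below builds a Boulder-IO input record of the kind read by primer3_core. The way SSR Locator itself passes data to Primer3 is not described in the text, so this is an assumption: the function name is hypothetical, the tag names follow Primer3 version 2.x (older releases used different names), and the SSR coordinates are taken to be 0-based.

```python
def primer3_record(seq_id, template, ssr_start, ssr_len, flank=5):
    """Build a Boulder-IO record asking Primer3 for primers around an SSR.

    ssr_start is a 0-based offset into template (assumption); the target
    region is the SSR plus `flank` bases on each side, clipped to the
    sequence ends.
    """
    target_start = max(0, ssr_start - flank)
    target_end = min(len(template), ssr_start + ssr_len + flank)
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("PRIMER_FIRST_BASE_INDEX", 0),
        ("SEQUENCE_TARGET", "%d,%d" % (target_start, target_end - target_start)),
        # Values used in this study.
        ("PRIMER_PRODUCT_SIZE_RANGE", "100-280"),
        ("PRIMER_MIN_SIZE", 15),
        ("PRIMER_OPT_SIZE", 20),
        ("PRIMER_MAX_SIZE", 25),
        ("PRIMER_MIN_TM", 45.0),
        ("PRIMER_OPT_TM", 50.0),
        ("PRIMER_MAX_TM", 55.0),
    ]
    # Each tag is written as TAG=VALUE; a line holding only "=" ends the record.
    return "\n".join("%s=%s" % kv for kv in tags) + "\n=\n"
```

The returned string could be written to a file and supplied to primer3_core on standard input; parsing of the output is omitted here.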

Virtual-PCR
The module used to simulate a PCR reaction was written in Delphi. The algorithm consists of reading the file generated by the previous module (SSR locus, forward and reverse primers, and original amplicon), followed by a search for sequences containing primer annealing sites. When annealing sites are found for the two primers, the flanked region and the primer sequences are copied to a new variable called "paralog amplicon."
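A minimal sketch of this step is given below, written in Python rather than Delphi. It assumes exact matching of the primers, considers only the strand as given, and takes the first annealing site of each primer; whether SSR Locator tolerates mismatches or reports several sites per sequence is not stated in the text. All function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def virtual_pcr(forward, reverse, sequences):
    """Return {seq_id: amplicon} for sequences carrying both annealing sites.

    sequences maps identifiers to DNA strings. The amplicon runs from the
    start of the forward primer to the end of the reverse primer site.
    """
    fwd = forward.upper()
    rev_site = reverse_complement(reverse)  # reverse primer as seen on the given strand
    amplicons = {}
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        start = seq.find(fwd)
        if start < 0:
            continue
        end = seq.find(rev_site, start + len(fwd))
        if end < 0:
            continue
        amplicons[seq_id] = seq[start:end + len(rev_site)]
    return amplicons
```

A primer pair that yields an amplicon in any sequence other than its original locus would be flagged as redundant.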

Global alignment
For the global alignment between paralog and original amplicon sequences and score calculations (match, mismatch, gaps), a routine was written in Delphi language using the algorithms of Needleman and Wunsch (1970) [20] and Smith and Waterman (1981) [21]. Also, in the same module, amplicon identities were calculated according to Waterman (1994) [22] and Vingron and Waterman (1994) [23].
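The global alignment and identity calculation can be sketched as follows. This is a plain Needleman-Wunsch implementation with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -1) and the definition of identity as matches divided by alignment length are assumptions made for illustration, since the values used in SSR Locator are not given here. The Smith and Waterman local variant cited above is not implemented.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, percent identity)."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns.
    i, j = n, m
    matches = length = 0
    while i > 0 or j > 0:
        length += 1
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    identity = 100.0 * matches / length if length else 100.0
    return score[n][m], identity
```

With these settings, two identical amplicons give an identity of 100, and a single substitution or indel in a four-base alignment gives 75.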

Implementation
The strategy of creating a two-language hybrid program was adopted because of: (i) the higher speed achieved by handling large text files with Perl as compared to Delphi, and (ii) the better suitability of Perl for generating the combinatorial strings to be located. The Perl module was converted into an executable file, making it unnecessary to install Perl libraries during program installation. The graphical interface, which integrates input and output windows into the Windows operating system, was built using the Turbo Delphi suite, where a menu system executes calls to each of the previously described modules.

Sequences for analysis
A total of 28 469 nonredundant full-length cDNA sequences of rice (Oryza sativa ssp. japonica cv. Nipponbare), sequenced by The Rice Full-Length cDNA Consortium and mapped on the databases derived from the sequencing of the japonica (japonica draft genome, BAC/PAC clones-IRGSP) and indica (indica draft genome) subspecies [24], were used for the analyses. These sequences are deposited in NCBI as two groups, the first comprising accessions from AK058203 to AK074028, and the second comprising accessions from AK98843 to AK111488. All these sequences can also be found in KOME (Knowledge-based Oryza Molecular Biological Encyclopedia). A flow chart representing the different steps performed by the software is shown in Figure 1.

Program validation
A total of 3907 micro- and minisatellites were detected by SSRLocator in the 28 469 analyzed cDNA sequences. The same database searched with MISA and SSRIT presented 3913 and 3917 loci, respectively. The counts of mono-, 4-mer, 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer repeats were identical for the three programs. In the case of 2-mer repeats, 594 elements were detected by SSRLocator and 596 elements were detected by MISA and SSRIT. 3-mer repeats were scored differently by SSRLocator (1990) and by the other two algorithms (1994). For 5-mer repeats, SSRLocator and MISA found the same number of repeats (426), while SSRIT found a different value (430).
Loci were detected in 3765 fl-cDNA sequences. In 3632 of these sequences only a single micro-/minisatellite locus was detected, accounting for 92.96% of the 3907 loci. In 125 sequences, two loci were detected, in seven sequences three loci, and only one sequence had four loci, adding up to 3907 occurrences. Among the types analyzed, SSRs (mono- to 6-mer repeats) and minisatellites (7- to 10-mer repeats) comprised 96.98% and 4.12% of detected loci, respectively.

Occurrence patterns for different SSR and minisatellite types and motifs

Monomers, 2-mers, 3-mers, and 4-mers
In Table 2, the contents and percentage values for different micro-/minisatellite motifs are shown. For monomer, 2-mer and 3-mer repeats, all possible arrangements are shown, while for 4-mer to 10-mer repeats, only the ten most frequent motifs are shown. The A/T monomer repeats were found in 125 loci, with 111 (88.80%) and 14 (11.20%) loci formed by A and T nucleotides, respectively. The C/G motifs were found in 13 loci, with ten (76.92%) and three (23.08%) loci formed by C and G, respectively. A/T-containing SSRs were predominant and comprised 90.58% of monomer loci. In the overall distribution, the monomers represent 3.53% of the 3907 detected loci. Motifs AG/CT and GA/TC were the most frequent 2-mers, adding up to 84.52% of 2-mer SSRs and representing 6.89% and 5.96%, respectively, of all 3907 detected occurrences. The motifs CT, GA, and TC were the most abundant, adding up to 172, 143, and 90 loci, respectively. In maize, barley, rice, sorghum, and wheat ESTs, the motif AG was described as the most frequent [6,16,28,29,31,32]. However, in some studies, the most frequent motif was GA [30,33]. Repeats composed of guanine and cytosine were the most abundant among trimers, with occurrences of 18.44%, 17.89%, and 10.60%, respectively, for the motifs CCG/CGG, CGC/GCG, and GCC/GGC, adding up to 23.9% of the overall frequencies of micro-/minisatellites in the analysis. The motifs CGC, CCG, and CGG were the most frequent, comprising 218, 197, and 170 loci, respectively. Many reports indicate the 3-mer CCG as the most frequent in maize, barley, wheat, sorghum and rye [6,16,28,32], sugarcane [27] and rice [29,31].

Remaining repeats
Among 5-mers, 188 different arrangements were detected and the most frequent were CTCCT, CTCTC, and CCTCC with 17, 17, and 12 occurrences, respectively. In the analysis of CDS regions, the ACCCG motif was the most frequent in Arabidopsis, AAAAG in S. cerevisiae and C. elegans, and AAAAC in different primates [38]. Also, the motifs AAAAT, AAAAC, and AAAAG were described as the most frequent in eukaryotes [39]. In rice, the motifs AGAGG and AGGGG were the most abundant [31]. Repeats of type 6-mer were detected in 230 different arrangements, of which CGCCTC and TCGCCG were the most frequent, occurring in 12 and 10 loci, respectively. Other studies have shown higher frequencies for the motifs AAGATG and AAAAAT in Arabidopsis [35], AAAAAG in citrus [36], AACACG in S. cerevisiae, ACCAGG in C. elegans and CCCCGG in primates [38]. For all remaining repeats (minisatellites), the occurrences are widely distributed, with low percentage values for each arrangement. For 7-mer, 8-mer, 9-mer, and 10-mer repeats, the total numbers of occurrences were 57, 5, 23, and 5, respectively.

Primer design and PCR simulation
The design of primers for the 3907 detected micro-/minisatellites resulted in 3329 primer pairs, covering 85.20% of loci. Running the "Virtual PCR" generated a total of 4610 amplicons. A module in SSRLocator checks for primer redundancy. A total of 2397 primer pairs amplified only the fragment from their original locus (specific amplicons) and 932 pairs amplified one or more regions besides the original locus. Of these, 692 pairs amplified two fragments, one from the original site and a second from another region (paralogous). In this case, 692 specific amplicons plus 692 redundant amplicons were detected. A total of 143, 90, 2, and 5 primer pairs generated three (two redundancies), four (three redundancies), five (four redundancies), and six (five redundancies) fragments, respectively. Overall, the 932 primer pairs with more than one anchoring region produced 932 specific amplicons and 1281 redundant amplicons, adding up to 2213 fragments.
To investigate the ability of these primers to amplify genomic sequences, an extra experiment was performed against the whole rice genomic sequence available at NCBI. The different groups of redundant and nonredundant primer sets, that is, those amplifying one, two, three, or more times in the cDNA database, were tested against the genomic sequence. Of the 2397 nonredundant primer pairs, only 924 amplified a locus in the genomic sequence. This difference was expected because of difficulties in amplifying genomic regions: if a primer anneals to a boundary region between two exons in the cDNA, the presence of an intron makes this annealing site unavailable in the genomic sequence. It is interesting to note that of the 924 primer pairs that amplified, 914 (99%) amplified only one locus in the genomic sequence, in agreement with the cDNA results. When the primer sets that amplified two different cDNAs were run against the genomic sequence, only 294/692 (42.5%) amplified, with 14.5% amplifying two different loci. Only one primer set amplified more than two loci. These results indicate that SSR Locator performance was consistent between the two databases regarding the nonredundant loci, that is, the loci that could be amplified in both databases kept their nonredundant status. The changes observed for the redundant loci can be attributed to many causes, including redundancy in the cDNA database, but also to biological reasons due to primer positioning.
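The amplicon counts reported above can be cross-checked with a few lines of arithmetic. The sketch below only reuses the figures given in the text; the dictionary layout and function name are illustrative.

```python
# Number of primer pairs, keyed by how many fragments each pair amplified
# in the cDNA database (values as reported in the text).
pairs_by_fragments = {1: 2397, 2: 692, 3: 143, 4: 90, 5: 2, 6: 5}


def amplicon_summary(pairs):
    """Derive totals from a {fragments_per_pair: number_of_pairs} table."""
    return {
        "primer_pairs": sum(pairs.values()),
        "redundant_pairs": sum(n for k, n in pairs.items() if k > 1),
        # Every pair yields one specific amplicon; the rest are redundant.
        "redundant_amplicons": sum((k - 1) * n for k, n in pairs.items()),
        "total_amplicons": sum(k * n for k, n in pairs.items()),
    }
```

Calling `amplicon_summary(pairs_by_fragments)` reproduces the 3329 primer pairs, 932 redundant pairs, 1281 redundant amplicons, and 4610 amplicons in total.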

Identity between specific and redundant amplicons
Results of a global alignment between amplicons from original and redundant sites are shown in Table 3. Among the 1281 redundant amplifications, 787 (61.44%) resulted in a perfect alignment between both loci (identity equal to 100%). For redundant amplicons with identity levels of 96-99% and 90-95%, 452 (35.28%) and 8 (0.62%) loci were found, respectively. Alignments with identity levels below 90% were found in only 2.65% of cases. The fact that such a high percentage of redundant loci show high identity is probably a consequence of the genome fraction chosen, that is, expressed sequences. This fraction is under tight selection pressure and should not accumulate variations such as substitutions or indels at a high rate. As expected, comparisons to the whole genome generated a great deal of polymorphism, due to the inclusion of intronic regions in the alignments (data not shown).
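The identity classes and their shares can be reproduced as sketched below. The class boundaries for non-integer identities (for instance, whether 95.5% belongs to the 90-95% class) are an assumption, and the count for the class below 90% is not given in the text; it is derived here as the remainder of the 1281 redundant amplicons.

```python
def identity_class(identity):
    """Assign a percent identity to one of the four classes used in the text."""
    if identity >= 100:
        return "100%"
    if identity >= 96:
        return "96-99%"
    if identity >= 90:
        return "90-95%"
    return "<90%"


def class_shares(counts, total):
    """Return {class: percentage of total}, adding the remainder as '<90%'."""
    counts = dict(counts)
    counts["<90%"] = total - sum(counts.values())  # derived, not reported
    return {name: 100.0 * n / total for name, n in counts.items()}


# Counts reported for the redundant amplicons.
reported = {"100%": 787, "96-99%": 452, "90-95%": 8}
shares = class_shares(reported, 1281)
```

Rounded to two decimals, `shares` agrees with the percentages quoted above, including the 2.65% of alignments with identity below 90%.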

CONCLUSIONS
The software SSRLocator was successfully implemented, adding steps for (1) SSR discovery, (2) primer design, and (3) PCR simulation between the primers obtained from original sequences and other fasta files. Also, the software produces reports for frequency of occurrence, nucleotide arrangement, primer lists with all standard information needed for PCR and global alignments. From the PCR simulation, it was possible to point out which primer pairs were nonredundant, suggesting that these primers are more appropriate for mapping purposes. In this case, however, wet lab experiments should be performed to confirm the advantage of nonredundant over redundant primers for mapping. It is possible that the results for micro-/minisatellite frequencies (loci/Mb) obtained in this study diverge from the results found in the literature. This can be explained by the different databases used (redundant ESTs, nonredundant ESTs and/or fl-cDNA), different algorithm configurations and minimum requirements set for counting motifs. Another explanation for some contrasting results is the fact that only "Class I" repeats were analyzed in our study.
The results showed that 932 (27.99%) primer pairs presented amplifications in more than one gene sequence. This could be mostly due to the fact that primer pairs derived from a specific gene (cDNA) anchored in similar sites in other duplicated genes, since 5607/28 469 (19.70%) genes were described as paralogs in the annotation of the database used [24]. Gene duplication, along with polyploidy and transposon amplification, is among the major driving forces in genome evolution [40]. It is therefore not surprising that so many loci show redundancy. A second possibility is that some primers were generated from protein domain regions within the analyzed cDNAs. These domains could be found in protein families with many genome copies, resulting in the observed redundancies. A validation of the redundancy results obtained with cDNA was performed through a virtual-PCR against the whole rice genome sequence. Of the nonredundant primer pairs that generated an amplicon in the genome, ca. 99% remained nonredundant.
Finally, this tool can be used successfully for data mining strategies to find SSR primers in genomic or expressed sequences (ESTs/cDNAs). Also, this software can be a tool for microsatellite discovery in databanks of related species, anchoring primers in ortholog or paralog regions contained between databases from two different species.