Systemic Identification of Hevea brasiliensis EST-SSR Markers and Primer Screening

This study aimed to systematically identify Simple Sequence Repeat (SSR) loci in Hevea brasiliensis expressed sequence tag (EST) data, to validate them preliminarily, and to provide evidence for the further development of SSR molecular markers. Characterizing the general SSR features of Hevea EST splicing sequences and developing SSR primers lay the basis for diversity analysis and variety identification of Hevea tree resources. In total, 1134 SSR loci were identified in the EST splicing sequences, distributed across 840 Unigenes. The occurrence rate of SSR loci was 23.9%, and the average distance between EST-SSRs was 2.59 kb. Mononucleotide repeats were the major repeat type, accounting for 38.89%, followed by dinucleotide repeats (36.95%) and trinucleotide repeats (18.17%); the other motifs together accounted for only 5.99%. The predominant mononucleotide, dinucleotide, and trinucleotide repeat motifs were A/T, AG/CT, and AAG/CTT, respectively. A total of 739 pairs of primers were designed for the 1134 SSR loci. PCR amplification was performed on Hevea Reyan5-11, Reyan87-6-47, and PR107, and 180 pairs of primers that amplified polymorphic bands were selected.


Introduction
Hevea brasiliensis, also known as the Pará rubber tree, belongs to the genus Hevea in the family Euphorbiaceae. It originated in the Amazon River basin in Brazil and is now grown in more than 40 countries and regions in Asia, Africa, Oceania, and Latin America. China is the fifth largest rubber producer in the world, after Indonesia, Thailand, Malaysia, and India. Owing to climatic limitations, Hevea brasiliensis in China is planted in Hainan, Guangdong, Guangxi, Yunnan, and Taiwan, among which Hainan is the main planting area.
Simple Sequence Repeats (SSRs), also referred to as microsatellite DNA or short tandem repeat sequences with motifs of 1 to 6 nucleotides, are widely distributed in the coding and noncoding regions of animal and plant genes. SSR markers are codominant, reproducible, easy to use, and widely distributed, and they show higher polymorphism than other markers [1,2]. According to their origin, SSRs can be divided into genomic SSRs and expressed sequence tag-Simple Sequence Repeats (EST-SSRs). Developing traditional genomic SSR markers is time consuming, with a low positive clone rate and a low success rate [3-5]. As a result, SSR marker development has been greatly limited, and analysis and screening usually rely on information from closely related species. With advances in sequencing technology, EST data are accumulating, which enriches the source of EST-SSR markers. EST-SSRs not only share the advantages of genomic SSR markers, such as high polymorphism, codominance, and good repeatability, but also transfer well between species [6-14].
In recent years, the development of SSR markers from EST data has been reported not only in fruit trees such as citrus [15], peach [16], pear [17,18], kiwi fruit [19], walnut [20], apricot [21], and lychee [22,23], but also in species of the Euphorbiaceae family such as cassava [24,25], castor [26], Jatropha [27], and Vernicia [28,29]. SSR markers have been widely used in variety identification and improvement, resource analysis, genetic map construction, functional gene discovery, and so forth. So far, Hua et al. [30], An et al. [31], Li et al. [32], and Feng et al. [33] have reported Hevea EST-SSR markers, but the ESTs they used were deposited before 2009. EST data have grown rapidly since then, and more than 20,000 EST sequences from cDNA libraries have been added in the last two years. Redeveloping EST-SSR markers from the expanded data is therefore important.
In this study, we designed SSR locus-specific primers from the 38815 Hevea EST sequences deposited in the NCBI database before June 1, 2012, tested the synthesized primers for polymorphism, and developed Hevea EST-SSR molecular markers. The study aimed to provide evidence for the further development of SSR molecular markers and to lay a foundation for diversity analysis and variety identification of Hevea tree resources.

Plant Material and EST Retrieval.
Fresh leaf samples were collected from Reyan5-11, Reyan87-6-47, and PR107 of the cultivated rubber tree species (Hevea brasiliensis) growing at the Rubber Cultivation Research Institute, Chinese Academy of Tropical Agricultural Sciences (Danzhou). Leaf genomic DNA was extracted following the one-step method from Bioteke Co., Ltd.
EST sequences were obtained via the ENTREZ search tool of the EST database at the NCBI (http://www.ncbi.nlm.nih.gov). A total of 38,815 ESTs available on June 1, 2012, were obtained for this study.
The Seqclean software (https://sourceforge.net/projects/seqclean/) was used to remove polyA/polyT tails, clip low-quality ends (ends rich in undetermined bases), and trash sequences that were too short (shorter than 100 bp) or that appeared to be mostly low-complexity sequence or contamination (vectors, adapters, mitochondrial, ribosomal, or bacterial sequence, or sequence from species other than the target organism).
The CD-HIT program (http://weizhongli-lab.org/cd-hit/) was used to cluster the ESTs, and redundant ESTs were then removed.

Data Mining.
After pretreatment, the MISA software (http://pgrc.ipk-gatersleben.de/misa/) was used to search for SSRs in the rubber ESTs. The search criteria were at least 10 repeats for mononucleotide motifs, at least 6 for dinucleotide motifs, at least 5 for trinucleotide motifs, and at least 4 for tetranucleotide to hexanucleotide motifs; compound SSRs interrupted by no more than 100 bases were also selected. Dinucleotide repeats such as AT/TA and CT/GA were treated as the same type of repeat motif.

Development Process of Hevea EST-SSR Marker.
The development process of the SSR markers is shown in Figure 1. The 38815 EST sequences of Hevea brasiliensis were downloaded in FASTA format from the NCBI dbEST database (up to June 1, 2012). After cleaning, 38079 sequences were retained and 736 were trashed; of the 736 trashed sequences, 644 were flagged as short, 79 as shortq, and 13 as dust. The sequences were clustered with CD-HIT software and redundant sequences were removed, leaving 27865 sequences, a redundancy rate of 28.21%. Splicing of the 27865 sequences with CAP3 software produced 3519 splicing sequences and 7532 single sequences. SSR loci were searched in the 3519 splicing sequences using MISA software, and 1134 loci were obtained. Using primer3 software, 739 pairs of primers were designed for the 1134 SSR loci.
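The sequence counts in this workflow can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative, and the choice of the 38815 raw sequences as the denominator of the redundancy rate is an inference from the reported 28.21%, not something the text states.

```python
# Counts reported in the text for each step of the workflow.
raw_ests = 38815
trashed = {"short": 644, "shortq": 79, "dust": 13}  # trash categories reported above
nonredundant = 27865                                # sequences left after clustering

clean_ests = raw_ests - sum(trashed.values())

# Assumption: redundancy is expressed relative to the raw download.
redundancy_pct = round(100 * (raw_ests - nonredundant) / raw_ests, 2)

print(sum(trashed.values()))  # 736
print(clean_ests)             # 38079
print(redundancy_pct)         # 28.21
```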

The Occurrence Rate and Distribution of Hevea SSR Loci.
Using MISA software, SSR loci were searched in the 3519 splicing sequences assembled from the EST sequences, which had a total length of 2942162 bp. Of the 3519 splicing sequences, 840 contained SSR loci. The occurrence rate of SSR was 23.9% (the ratio of the number of SSR-containing Unigenes to the total number of Unigenes), as shown in Table 1.
The total length of the 840 SSR-containing sequences was 798917 bp. Analysis of their distribution features showed that 620 Unigenes (73.81%) contained a single SSR locus, 220 (26.19%) contained two or more SSR loci, and 136 contained compound SSR loci, giving 1134 SSRs in total. The distribution rate (the ratio of the number of SSRs to the total number of Unigenes) was 32.2%. The average distance between SSRs was 2.59 kb (the ratio of the total length of the Unigenes to the number of SSRs); that is, one EST-SSR occurred every 2.59 kb. The occurrence rate and distribution rate of SSR loci in Hevea were thus quite high.
As shown in Table 2, statistical analysis of the repeat motifs of all SSR loci showed that the repeat units of the Hevea EST-SSRs were 1 to 6 nucleotides long and that the different SSR types occurred at different rates. Mononucleotide repeats were the most numerous, 441 in total (38.89%), followed by dinucleotide repeats, 419 in total (36.95%), and trinucleotide repeats, 206 in total (18.17%). Together, mono-, di-, and trinucleotide repeats accounted for 94.01%, and the remaining repeat motifs accounted for 5.99%: 33 tetranucleotide (2.91%), 9 pentanucleotide (0.79%), and 26 hexanucleotide (2.29%) motifs. These results indicate that mononucleotide repeats were the major repeat type among Hevea SSR loci and that, among repeats with longer motifs, dinucleotide and trinucleotide repeats were the most common. The number of SSRs showed a declining trend as the motif length increased.
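The three summary statistics defined above, and the motif-type percentages, follow directly from the reported counts. A minimal worked example, with illustrative variable names and no data beyond the numbers given in the text:

```python
unigenes_total = 3519        # splicing sequences searched
unigenes_with_ssr = 840      # sequences containing at least one SSR locus
ssr_total = 1134             # SSR loci found
total_length_bp = 2942162    # total length of the 3519 sequences

# Occurrence rate: SSR-containing Unigenes / all Unigenes.
occurrence_pct = round(100 * unigenes_with_ssr / unigenes_total, 1)
# Distribution rate: SSR loci / all Unigenes.
distribution_pct = round(100 * ssr_total / unigenes_total, 1)
# Average distance: total Unigene length / SSR loci, in kb.
avg_distance_kb = round(total_length_bp / ssr_total / 1000, 2)

motif_counts = {"mono": 441, "di": 419, "tri": 206,
                "tetra": 33, "penta": 9, "hexa": 26}
motif_pct = {k: round(100 * v / ssr_total, 2) for k, v in motif_counts.items()}

print(occurrence_pct, distribution_pct, avg_distance_kb)  # 23.9 32.2 2.59
print(motif_pct)
```

The six motif counts sum to the 1134 loci, so the percentages are taken over all SSRs rather than over SSR-containing sequences.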

The Repeat Motif Type and Times of Repetition
The different types of Hevea EST-SSR repeats comprised various motifs. In total, 55 types of repeat motifs were detected in the SSR loci: 2 mononucleotide, 3 dinucleotide, 10 trinucleotide, 12 tetranucleotide, 5 pentanucleotide, and 23 hexanucleotide repeat motifs. As shown in Table 3, the number of repetitions of SSR motifs in the Hevea splicing sequences ranged from 4 to 78 (those with more than 15 repetitions are not listed in the table). A total of 834 SSR motifs showed 4 to 15 repetitions, accounting for 73.54%, and 300 SSR motifs had more than 15 repetitions, accounting for 26.46%. The most frequent number of repetitions was 10, found in 154 SSRs (13.58%).

Table 3 (partial): Frequency of repeats by motif. Columns: number of repeats (4 to 15, >15), total, and percentage of all SSRs.
A/T: 10 repeats, 119; 11 repeats, 66; 12 repeats, 45; 13 repeats, 28; 14 repeats, 23; 15 repeats, 18; >15 repeats, 137; total, 436 (38.45%).

The SSR motifs with more than 15 repetitions (not listed in the table) were mononucleotide, dinucleotide, and trinucleotide. The maximum numbers of repetitions were 78 for A/T, 19 for C/G, 18 for AC/GT, 50 for AG/CT, 39 for AT/AT, 17 for AAG/CTT, and 16 for AAT/ATT.

The Repetition Number of Times of Each Type of SSR Repeat Motifs.
Judging from the types of SSR motifs and their occurrence rates (Table 3, Figure 2), there were 2 types of mononucleotide repeats, (A/T) and (C/G), primarily (A/T), which accounted for 98.9% of the mononucleotide repeat motifs. The number of (A/T) repetitions ranged from 10 to more than 15, among which 11 was the most common, followed by 12. For (C/G), 10 repetitions occurred 4 times, and more than 15 repetitions occurred only once.
There were 3 types of dinucleotide repeats (Table 3, Figure 3), (AC/GT), (AG/CT), and (AT/AT), of which (AG/CT) was the most frequent, accounting for 83.3%, followed by (AT/AT), accounting for 14.6%. The number of (AG/CT) repetitions ranged from 6 to more than 15, among which 10 was the most common, followed by 11.
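Labels such as AG/CT arise because motifs that differ only by rotation or by strand are counted as one type. The sketch below shows one common way to compute such a class label; the function `motif_class` is hypothetical, and the rule it applies (smallest rotation of the motif paired with the smallest rotation of its reverse complement) is an assumption about how the grouping works, not a description of the software used in the study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(motif):
    """All cyclic rotations of a motif, e.g. 'GA' -> ['GA', 'AG']."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]


def motif_class(motif):
    """Strand- and rotation-independent label, e.g. 'GA' -> 'AG/CT'."""
    motif = motif.upper()
    forward = min(rotations(motif))
    reverse = min(rotations(reverse_complement(motif)))
    first, second = sorted((forward, reverse))
    return f"{first}/{second}"


for m in ("A", "G", "GA", "TC", "TA", "GT", "TTC", "ATT"):
    print(m, motif_class(m))
```

Under this rule CT and GA fall into the same AG/CT class and AT and TA into AT/AT, in line with the grouping stated in the search criteria.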

Distribution Characteristics of SSR Loci Length and

Processing Results Using PRIMER3.0 Software.
PRIMER3.0 software was used to design primers for the 840 sequences. Among the 1134 SSR loci in Hevea brasiliensis, 998 were qualified for primer design and 136 were unqualified. Primers were successfully designed for 739 of the 998 qualified loci, a success rate of 74.05%, and could not be designed for the remaining 259 loci. These 259 loci corresponded to 258 sequences. Of these, 203 sequences each had one SSR locus, and primer design failed because the locus was located at the start or the end of the sequence; 48 sequences each had two SSR loci, and primers were not obtained for one of the two loci on each sequence; the Contig1648 splicing sequence had 2 SSR loci, and primers could not be designed for either locus; and 6 splicing sequences each had three SSR loci, and primers were not obtained for one of the three loci on each sequence.
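The primer-design figures above can be cross-checked by simple bookkeeping. This sketch uses only the reported counts; the variable names and the grouping of failed loci into (number of sequences, failed loci per sequence) pairs are illustrative.

```python
total_loci = 1134
unqualified = 136
designed = 739

qualified = total_loci - unqualified          # loci eligible for primer design
failed = qualified - designed                 # eligible loci without primers
success_pct = round(100 * designed / qualified, 2)

# (number of sequences, loci per sequence for which no primer was obtained)
failed_groups = [
    (203, 1),  # one locus, located at the start or end of the sequence
    (48, 1),   # two loci, one of them without primers
    (1, 2),    # Contig1648: both loci without primers
    (6, 1),    # three loci, one of them without primers
]
failed_sequences = sum(n for n, _ in failed_groups)
failed_loci = sum(n * k for n, k in failed_groups)

print(qualified, failed, success_pct)  # 998 259 74.05
print(failed_sequences, failed_loci)   # 258 259
```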

Preliminary Screening of SSR Primers.
The 739 pairs of synthesized primers were used for amplification and screening with genomic DNA of Hevea brasiliensis Reyan5-11, PR107, and Reyan87-6-47 as the templates. Electrophoresis showed that 180 pairs of primers were polymorphic, accounting for 24.36%; 386 pairs produced clear but nonpolymorphic bands, accounting for 52.23%; and 173 pairs produced no amplified band.
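The screening outcome can be summarized as shares of the 739 primer pairs. The short example below recomputes the two reported percentages and, by the same arithmetic, the share of pairs with no amplified band, which the text gives only as a count; the dictionary keys are illustrative labels.

```python
# Primer pairs in each screening outcome, as reported above.
outcomes = {
    "polymorphic": 180,
    "clear_nonpolymorphic": 386,
    "no_band": 173,
}
total_pairs = sum(outcomes.values())
outcome_pct = {k: round(100 * v / total_pairs, 2) for k, v in outcomes.items()}

print(total_pairs)  # 739
print(outcome_pct)  # {'polymorphic': 24.36, 'clear_nonpolymorphic': 52.23, 'no_band': 23.41}
```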

Occurrence Rate of SSR Loci.
With the rapid development of research in plant functional genomics, the ESTs in public databases are growing exponentially, and searching EST sequences for SSR loci is becoming a focus for developing new EST-SSR markers. In this study, we found 840 sequences containing 1,134 SSR loci among the 38,815 EST sequences of Hevea brasiliensis retrieved from the public database; the occurrence and distribution rates were 23.9% and 32.2%, respectively, and one EST-SSR was present every 2.59 kb on average. The occurrence rate was higher than that (9.1%) found in 11809 ESTs of Hevea brasiliensis deposited before May 10, 2007, as reported by Hu et al. [30], and that (11.42%) reported in Hevea brasiliensis by An et al. [31]. The average distance between SSRs was shorter than the 3.93 kb reported by An et al. [31] and the 3.96 kb reported by Li et al. [32] but longer than the 2.25 kb reported by Feng et al. [33] and the 281.39 bp reported by Li et al. [34]. These differences are related to the search criteria, data size, and analysis tools used.
The occurrence rate of SSR was higher than that in oil palm [35]. The occurrence and distribution rates of EST-SSRs differ greatly between plants and even within the same species, which may be related to differences in genome constitution, transcriptome sequencing methods, quantity of data, and microsatellite search criteria [41].

SSR Repetition Types.
The distribution of SSR repeat types differs between plants. In this study, SSRs with repeat units of 1 to 6 nucleotides were found. Mononucleotide repeats were the most abundant, accounting for 38.89%, and mainly involved (A/T) (98.9% of the mononucleotide repeat motifs). Although the polyT and polyA sequences at the 5′ and 3′ ends were removed during pretreatment of the original sequences, the A/T type still made up 98.9% of the mononucleotide SSRs, which suggests that some of the A/T repeats are false positives. Among the repeats with longer motifs, dinucleotide and trinucleotide repeats predominated, which is consistent with the results of Hu et al. [30], An et al. [31], Li et al. [32], and Feng et al. [33]. Dinucleotide and trinucleotide repeats accounted for 36.95% and 18.17% of the total SSRs, respectively; among them, (AG/CT) and (AAG/CTT) were in the majority, accounting for 83.3% and 37.9% of the dinucleotide and trinucleotide repeat motifs, respectively.
In a study of castor, Zhou et al. [26] found that the mononucleotide repeat type had the highest occurrence rate (37.51%) among 1-6 bp repeat motifs, followed by trinucleotide repeat type (34.63%) and dinucleotide repeat type (25.61%).
Through investigation of EST sequences of Vernicia fordii, Jia et al. [28] and Xu et al. [29] found that dinucleotide and trinucleotide repeats were common. Among dinucleotide repeats, AG/CT accounted for the largest proportion, and among trinucleotide repeats, AAG/CTT accounted for the largest proportion.
Among the nucleotide repeat types, A/T was the most abundant repeat motif, which may be related to base-pairing energy. Because less energy is required to break an A-T bond than a G-C bond, A/T-rich sequences vary more easily. The prevalence of A/T in SSRs suggests that A/T motif-rich SSR types may be more common in plants [42,43].

SSR Repeat Lengths.
The usability of SSR molecular markers depends largely on their polymorphism, which is mainly influenced by the length of the SSR. It is generally considered that variation in SSR length mainly results from strand slippage and mispairing during DNA replication and repair, or from unequal sister chromatid exchanges (SCEs) during mitosis or meiosis. In this study, the SSR length in Hevea brasiliensis varied from 10 bp to 100 bp, with a total of 24859 bp and an average of 21.92 bp. In a study of variation in SSR length in Eucalyptus, Li et al. [44] found that the length of the SSR repeat motif was negatively correlated with both the variation rate of EST-SSR length and the abundance of SSR loci, with SSR lengths ranging from 16 to 64 bp and averaging 18.5 bp.

Primer Amplification.
In this study, 173 of the 739 pairs of primers failed to amplify effectively. Possible reasons are as follows. First, the sequence flanked by the primer pair may contain large introns, so that no product is displayed on electrophoresis. Second, one or both primers of a pair may happen to span a splice site. Third, the EST fragment may differ from the corresponding genomic sequence. Fourth, the annealing temperature may be too high or too low.

Conclusions
In the Hevea brasiliensis EST splicing sequences, the occurrence rate of SSR loci is 23.9%, and one EST-SSR is present every 2.59 kb on average. The SSR repeat type is primarily mononucleotide, accounting for 38.89%; dinucleotide and trinucleotide repeat motifs account for 36.95% and 18.17%, respectively, and the other motifs account for only 5.99%. A/T, AG/CT, and AAG/CTT are the predominant mononucleotide, dinucleotide, and trinucleotide repeat motifs, respectively. A total of 739 pairs of primers were successfully designed for the 1134 SSR loci. Through PCR amplification of Reyan5-11, Reyan87-6-47, and PR107 of Hevea brasiliensis, 180 pairs of primers that show polymorphic bands were selected. ESTs are an effective source for developing SSR markers, and developing SSR markers from ESTs in Hevea brasiliensis is feasible.