Development of New Candidate Gene and EST-Based Molecular Markers for Gossypium Species

New sources of molecular markers accelerate efforts to improve cotton fiber traits and aid in developing high-density integrated genetic maps. We developed new markers based on candidate genes and G. arboreum EST sequences and used them for polymorphism detection followed by genetic and physical mapping. Nineteen gene-based markers were surveyed for polymorphism in 26 Gossypium species. Cluster analysis generated a phylogenetic tree with four major sub-clusters for 23 species, while three species branched out individually. The CAP method enhanced the rate of polymorphism of candidate gene-based markers between G. hirsutum and G. barbadense. Two hundred A-genome-based SSR markers were designed after data mining of G. arboreum EST sequences (Mississippi Gossypium arboreum EST-SSR: MGAES). Over 70% of the MGAES markers successfully produced amplicons, and 65 of them demonstrated polymorphism between the G. hirsutum and G. barbadense parents of the RIL population and formed 14 linkage groups. Chromosomal localization of both candidate gene-based and MGAES markers was assisted by euploid and hypoaneuploid CS-B analysis. The gene-based and MGAES markers were highly informative because they were designed from candidate genes and the fiber transcriptome, and they have the potential to be integrated into existing cotton genetic and physical maps.


Introduction
Molecular markers provide valuable information for assessing genetic variability, generating linkage maps, understanding genome organization, and deciphering quantitative trait loci (QTLs). An initial effort to map the cotton genome using an F2 population utilized 705 restriction fragment length polymorphism (RFLP) probes that were polymorphic between G. hirsutum and G. barbadense and generated 41 linkage groups spanning 4675 cM [1]. Genetic variation at the molecular level in cotton has been characterized using isozyme/allozyme markers [2], RFLPs [1,3,4], AFLPs [5,6], and microsatellites [7,8] in G. hirsutum and its related species. A comprehensive comparative genetic map with 2584 loci at ∼1.72 cM intervals for tetraploid (AtDt) cotton and with 662 loci at ∼1.96 cM intervals for the diploid (D) genome was constructed using RFLPs, genomic SSRs, and sequence tagged sites (STS) as probes [9].
Advances in technology have facilitated the sequencing of complete transcriptomes and genomes that are accessible through public domain databases. The increasing number of expressed sequence tags (ESTs) for cotton has facilitated the identification of simple sequence repeat (SSR) regions in the ESTs through data mining techniques. EST-SSR markers reveal putative functional genes and aid in map-based cloning of important genes [10,11]. In cotton, several EST-SSRs have been recently mapped [12-15]. Cotton fiber genes were mapped with EST-derived SSR loci using a recombinant inbred line (RIL) population derived from an interspecific cross between G. hirsutum and G. barbadense [16]. Alternative mapping approaches such as whole-genome radiation hybrid (WGRH) mapping and fluorescent in situ hybridization (FISH) mapping have been utilized to generate an integrated cotton genome map [17]. Linkage groups or markers are assigned to chromosomes by use of aneuploid chromosome substitution (F1 stocks) lines of G. barbadense in G. hirsutum, euploid chromosome substitution lines of G. barbadense in G. hirsutum, and aneuploid chromosome substitution (F1 stocks) lines of G. tomentosum in G. hirsutum [18]. Euploid chromosome substitution stocks were developed by inbreeding the hemizygous monosomic and monotelodisomic substitution stocks through backcrossing up to the BC5 generation [19]. Using these available substitution lines, various molecular markers, linkage groups, and QTLs for agronomic and fiber quality traits were physically mapped to different cotton chromosomes [1,9,20].
Current integrated genetic maps in cotton have mainly utilized RFLP, AFLP, genomic SSR, STS, and EST-SSR markers. However, increasing the marker density with functionally expressed genes would make the linkage maps more valuable for crop improvement programs. New sources of molecular markers such as cleaved amplified fragment polymorphism (CAP), EST-SSR, and single nucleotide polymorphism (SNP) markers expand the current, limited repertoire of molecular markers. In this study, our objective was to understand the genetic diversity and phylogeny of the cultivated tetraploid and wild diploid cotton species, including the Cotton Marker Database (CMD) panel, through the evaluation of candidate gene, CAP, and EST-SSR marker technologies, and to use these markers in the construction of integrated genetic and physical maps in cotton.
Sequence information for candidate genes with known function in related species helps in designing primers that amplify and differentiate between species. CAP markers are extensively used in human and animal sciences but have not been well exploited for mapping in plant sciences [21]. CAP is an effective technology that uses PCR and restriction digestion to reveal polymorphism at the nucleotide level without knowledge of the sequence of a marker. Using the abundant sequence information available for G. arboreum (diploid) fiber ESTs in GenBank, we designed over 700 nonredundant primer pairs based on the identification of SSR regions, and tested 200 primer pairs using the two major cultivated tetraploid species, G. hirsutum (TM-1) and G. barbadense, the parents of the RIL population used in this study. Polymorphic markers were used for genetic mapping using the RIL population, and were further physically localized through the use of monosomic, monotelosomic aneuploid, and euploid chromosome substitution lines of G. barbadense in a G. hirsutum genetic background.

DNA Extraction.
Young leaf tissues were lyophilized and the DNA was extracted using the Yu lab method at USDA-ARS, College Station, TX [23]. DNA quality was evaluated using 0.7% agarose gel electrophoresis at 40 V for 3 hours. Genomic DNA was also quantified using a TKO 100 fluorometer and diluted to a working concentration of 50 ng/µL for use in polymerase chain reaction (PCR).
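The dilution to a 50 ng/µL working stock follows from C1·V1 = C2·V2. The following is a minimal sketch of that arithmetic; the 400 ng/µL stock reading and the 100 µL final volume are hypothetical values chosen for illustration and are not taken from this study.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul=50.0, final_ul=100.0):
    """Return (stock_ul, diluent_ul) for a C1*V1 = C2*V2 dilution."""
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is more dilute than the target concentration")
    stock_ul = target_ng_per_ul * final_ul / stock_ng_per_ul
    return stock_ul, final_ul - stock_ul


# Hypothetical fluorometer reading of 400 ng/uL (not a value from the study)
stock_ul, diluent_ul = dilution_volumes(400.0)
print(f"{stock_ul:.1f} uL stock + {diluent_ul:.1f} uL diluent")
```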

Gene-Based Markers. G. arboreum EST sequences in GenBank were compared with the non-redundant (nr) protein database using the BLASTX program to derive putative gene function. ESTs that had significant homology with functional genes in Arabidopsis, Oryza, and other species were selected for polymorphism screening. Forty-seven primer pairs were synthesized (Sigma Genosys, The Woodlands, TX) based on these candidate genes that have functional significance in cotton (see Supplementary File-1 in the supplementary material available online at doi:10.1155/2011/894598). Primers were evaluated for amplification using PCR at two annealing conditions (Tm = 50 °C and a 60-55 °C touchdown). Amplified products were surveyed for polymorphism using 6% polyacrylamide gel electrophoresis (PAGE) and scored in binary fashion for each fragment size. The data were used to calculate the Polymorphism Information Content (PIC) value. Cluster analysis of the binary data was conducted with the nearest neighborhood joining method to generate a phylogenetic tree and assess the evolutionary relationships [24] among the five tetraploid and 21 diploid species. PCR products of the markers that were monomorphic between G. hirsutum and G. barbadense were digested with the RsaI, MspI, HhaI, and HaeIII restriction endonucleases and surveyed for polymorphism using PAGE to detect CAP markers. Fragment-based and CAP-based markers were subsequently tested for chromosomal localization in aneuploid and euploid chromosome substitution lines.
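The principle behind CAP detection can be illustrated with an in-silico digest: two amplicons of identical length become distinguishable when a nucleotide difference creates or destroys a restriction site. The sketch below uses the standard recognition sequences of the four enzymes named above; the two amplicon sequences are hypothetical and serve only to show the idea, and the code is not the laboratory procedure used in this study.

```python
import re

# Recognition sequence and cut position within the site for each enzyme
ENZYMES = {
    "RsaI": ("GTAC", 2),    # GT^AC
    "MspI": ("CCGG", 1),    # C^CGG
    "HhaI": ("GCGC", 3),    # GCG^C
    "HaeIII": ("GGCC", 2),  # GG^CC
}


def digest(seq, enzyme):
    """Return fragment lengths after complete digestion of a linear amplicon."""
    site, offset = ENZYMES[enzyme]
    # Lookahead so that every occurrence of the site is found
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def cap_polymorphic(seq_a, seq_b, enzyme):
    """True if the two amplicons give different fragment-size patterns."""
    return sorted(digest(seq_a, enzyme)) != sorted(digest(seq_b, enzyme))


# Hypothetical amplicons of equal length that differ by one base in an RsaI site
tm1 = "AAAAAGTACTTTTTTTTTT"
pima = "AAAAAGTGCTTTTTTTTTT"
print(digest(tm1, "RsaI"), digest(pima, "RsaI"))
print(cap_polymorphic(tm1, pima, "RsaI"))
```

On a gel, the undigested amplicons would appear monomorphic, whereas the RsaI digest separates them; an enzyme with no site in either amplicon adds no information.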

EST-SSR-Based Markers. G. arboreum ESTs (38,893) were collected from GenBank and searched for the presence of SSR sequences, followed by masking. The masked ESTs were clustered using StackPack v2.1 (Electric Genetics, Reston, VA) software to reduce redundancy. The nonredundant (NR) sequences that contained an SSR motif were selected for further analysis as described by Kantety et al. [25]. A subset of 725 NR-ESTs, mainly expressed in fibers, was identified as having a repeat length of 18 or more. From this subset of SSR-containing NR-ESTs, we designed 200 primer pairs for further analysis. They were designated as Mississippi Gossypium arboreum EST-SSRs (MGAES). The primers were designed from the sequences flanking the SSR region, with an estimated product size of ∼200-300 base pairs, using Primer3 software [26] and were synthesized at Sigma-Genosys (Sigma-Aldrich, Saint Louis, MO). The primer sequences, EST sources, and their putative functions are summarized in Supplementary File-2. The MGAES primers were checked against all the SSR marker primer sequences available at the Cotton Microsatellite Database (CMD, http://www.cottonmarker.org/) for redundancy and sequence homology using a BLAST search. These MGAES and gene-based primer sequences will be submitted to CMD for use by the cotton research community. MGAES primers were first used to amplify the RIL parents, G. hirsutum and G. barbadense, at annealing temperatures of 50 °C and 55 °C, and the products were surveyed for fragment length polymorphism using 6% PAGE. Polymorphic markers were then used to genotype the RIL population for construction of genetic linkage groups. The amplified markers were also used for physical mapping onto chromosomes and chromosome arms.
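As an illustration of the SSR search step, the following sketch scans a sequence for perfect tandem repeats. It is not the pipeline of Kantety et al. [25]; it assumes that "repeat length of 18 or more" means a repeat tract of at least 18 bases and that motifs of 1 to 6 bases are of interest, and the example sequence is hypothetical.

```python
import re
from math import ceil


def find_ssrs(seq, min_len=18, max_motif=6):
    """Return sorted (start, motif, repeat_count) for perfect SSR tracts of >= min_len bases."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        min_repeats = ceil(min_len / k)
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "AGAG")
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Hypothetical EST fragment carrying an (AG)10 tract
est = "GATTACA" + "AG" * 10 + "CCGTTA"
print(find_ssrs(est))
```

The sequences flanking each reported tract would then be passed to a primer design tool such as Primer3.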

Data Analysis.
Polymorphic data were scored as binary values (1 for presence of a fragment, 0 for absence) and used for the calculation of the PIC value [27]. Binary data were also used to generate a phylogram using cluster analysis with SAS v9.1 (SAS Inc, Cary, NC). Similarly, binary data for polymorphic fragments from the RIL population were used to create linkage groups using MapManager QTX software [28]. Recombination frequencies were converted into linkage distances using the Haldane function [29]. A maximum linkage distance below 50 cM between any two markers and an LOD (logarithm of odds) score of 4 or above were required to qualify as a linkage group.
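For reference, the Haldane mapping function converts a recombination fraction r into a map distance d = -50 ln(1 - 2r) in centimorgans. The sketch below implements the function and its inverse; it illustrates the formula only and does not reproduce MapManager QTX's internal estimation of r from RIL data.

```python
import math


def haldane_cm(r):
    """Map distance in centimorgans for a recombination fraction 0 <= r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * r)


def haldane_r(d_cm):
    """Inverse Haldane function: recombination fraction for a distance in cM."""
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))


print(round(haldane_cm(0.10), 2))  # distance for 10% recombination
print(round(haldane_r(50.0), 3))   # recombination fraction at the 50 cM cutoff
```

Under this function, the 50 cM cutoff corresponds to a recombination fraction of about 0.316.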

Results and Discussion
3.1. Candidate Gene-Based Markers. The thrust of this effort was to expand the very limited base of markers that are utilized in characterizing genetic variability in Gossypium.
Comparative genetic approaches have proven successful for characterizing genomes and mapping important traits based on sequence information from related plant species [30]. Candidate genes in G. arboreum EST sequences were identified by comparison with distant species using BLASTX to reveal gene function, and such ESTs were used to design primers. Of the 47 gene-based markers screened across the 32 genotypes from 26 diverse Gossypium species tested in this study, nineteen (40%) were successfully amplified. PAGE fragment analysis of the amplified products identified 13 markers (68%) that were polymorphic among these cotton species. Binary fragment data for these markers were used to calculate the PIC value for each marker, which ranged from 0.794 to 0.998 (Table 2). Though these gene-based markers were highly polymorphic across multiple cotton species, the fragment polymorphism rates detected using direct amplicons were very low between the two cultivated tetraploids, G. hirsutum and G. barbadense. Only one polymorphic marker was identified between G. hirsutum TM-1 (CMD-1) and G. barbadense PIMA 3-79 (CMD-2). This limitation led us to digest the large PCR fragments with restriction enzymes in order to survey for polymorphism at a higher resolution. CAP markers have been used in marker-assisted selection and in mapping genetic loci of interest [21,31]. The large amplicon sizes of many gene-based markers provided an opportunity to employ the CAP technique to detect nucleotide-level polymorphism between TM-1 and PIMA 3-79. To increase the number of potential restriction sites in these amplicons, we used the RsaI, MspI, HhaI, and HaeIII enzymes, which recognize four-base motifs. The CAP technique identified eleven polymorphic markers (58%) of the 19 tested on TM-1 and PIMA 3-79, suggesting that CAP technology is a useful resource for identifying genetic variation.
One fragment-based and five CAP-based markers were localized to a cotton chromosome or chromosome arm using the euploid CS-B lines. Our results suggest that CAP-based marker technology is a robust approach for detecting variation in closely related species and provides an alternative to cost-intensive SNP-based approaches. Euploid CS-B lines were annotated on the basis of the chromosome pair substituted for the complete chromosomes or chromosome arms of G. hirsutum monosomics or monotelodisomics [19,32]. If a marker that was polymorphic between G. hirsutum and G. barbadense showed the G. barbadense fragment pattern in a euploid CS-B line, then that marker was concluded to be localized to the particular substituted chromosome or arm. In this manner, both dominant and recessive alleles were physically mapped using euploid CS-B lines. The one amplicon length polymorphism and five CAP-based markers were localized to seven chromosomes or arms using the euploid CS-B lines (Table 2). As these markers were based on homologous gene sequences, there is a possibility of multiple copies in the tetraploid cotton genomes. Thus, this study adds a new set of gene-based markers with their specific chromosomal locations and helps in assessing the evolutionary relationships among the 26 Gossypium species.
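The euploid localization rule described above can be sketched as a simple comparison of fragment patterns. In the example below, the fragment sizes and all line names other than CS-B11Sh are hypothetical and used only for illustration.

```python
def localize_euploid(csb_patterns, barbadense_pattern):
    """Return the CS-B lines whose fragment pattern matches the G. barbadense parent."""
    return sorted(
        line for line, pattern in csb_patterns.items()
        if pattern == barbadense_pattern
    )


# Hypothetical fragment sizes (bp) for one marker
tm1_pattern = (210,)   # G. hirsutum TM-1
pima_pattern = (198,)  # G. barbadense PIMA 3-79

# Hypothetical scores: only the line carrying the substituted segment shows
# the G. barbadense pattern; the other lines retain the TM-1 pattern.
csb_scores = {
    "CS-B11Sh": (198,),
    "CS-B05Sh": (210,),
    "CS-B25": (210,),
}
print(localize_euploid(csb_scores, pima_pattern))
```

A marker matching the G. barbadense pattern in more than one line would point to multiple loci, consistent with the possibility of duplicated copies noted above.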

Phylogenetic Analysis. The genus Gossypium includes five tetraploid species with the AD1-AD5 genomes and approximately forty-five diploid species from genome groups A-G and K [33]. Of these, all five tetraploid species and twenty-one diploid species belonging to the A(1), B(1), C(4), D(11), E(3), and G(1) genomes were included in this study. Relationships among these cotton genome groups were studied earlier using polymorphisms in the chloroplast genome [34], ribosomal genes [35], and Adh genes [36]. These studies showed close relationships among species of the same geographical origin [37], besides explaining the origin of the New World tetraploids from the Old World diploids [38].
In this study, a total of 76 fragments from nineteen candidate gene-based markers were observed across the diverse panel of 32 cotton genotypes. Binary data derived from the fragment analysis were used to generate a phylogenetic tree showing the evolutionary relationships among the 26 cotton species by cluster analysis using the maximum likelihood method (Figure 1). Cluster analysis resulted in a dendrogram comprising four major clusters grouping 23 species, whereas G. bickii (G1), G. pulchellum (C3), and G. australe (C8) branched out individually. As the dendrogram was derived using the maximum likelihood method, it provides the evolutionary relationships based on the combined genetic lineages of the candidate genes used in this study. Many genotypes from the tetraploid species formed two clades, while the diploid species formed the remaining clades. Species belonging to the same genome grouped together to form subclusters in the dendrogram, as was evident in the grouping of the tetraploids as well as the diploid species. Though the number of genes used in this study is not an exhaustive data set, the evolutionary relationships among the 26 species were mostly congruent with earlier studies [37]. The dendrogram also supported the theory that G. darwinii (AD5), another tetraploid species endemic to the Galapagos Islands, is closely related to G. barbadense (AD2) [39,40].
EST-SSRs as an informative resource for genetic mapping [7,12,14,25,41]. Despite the earlier efforts by Han et al. [15] and Park et al. [16] to characterize the cotton genome using EST-SSRs, the molecular variation in the coding regions of many fiber-expressed genes has not yet been fully utilized to assist marker-assisted selection of important fiber traits. Two hundred primer pairs designed specifically from fiber-related ESTs were used for polymorphism detection and mapping in this study.
Except for 27 individual primer sequences, the MGAES markers were new and nonredundant based on a BLAST homology search against the earlier studies of Han et al. [15] and Park et al. [41]. Though these EST-SSRs were derived from a diploid progenitor (genome A1−2) of the tetraploid species (AtDt), 147 markers (74%) out of 200 primer pairs were successfully amplified in the tetraploid cultivars G. hirsutum TM-1 and G. barbadense PIMA 3-79, suggesting that considerable homology exists between tetraploid cotton and its diploid ancestral species. Sixty-five MGAES markers (44%) were polymorphic between TM-1 and PIMA 3-79, indicating the merit of these markers given their high rate of polymorphism compared to earlier studies [15]. Though these G. arboreum EST-SSR markers were highly polymorphic between G. hirsutum and G. barbadense, we observed a very low polymorphism rate within each species. The high polymorphism rate between the species could also be attributed to amplification under stringent PCR conditions to avoid nonspecific amplification. The high levels of polymorphism detected using G. arboreum-based EST-SSRs suggest the potential for cross-species transferability of these markers among diploid and tetraploid species [42]. Many of these MGAES markers were derived from fiber-expressed ESTs, making them more valuable in breeding programs and in marker-assisted selection for fiber-associated traits. Polymorphism detected for each fragment in 186 RILs was scored initially in ternary fashion and then converted into binary fashion by treating the heterozygous alleles as missing values. The MapManager program was used for constructing linkage groups. Fourteen linkage groups were generated spanning ∼399 cM, with a minimum of two markers per linkage group and an LOD score threshold of 4. These linkage groups and polymorphic markers can be incorporated into existing genetic maps to generate an integrated genetic map for cotton.
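The ternary-to-binary conversion can be sketched as follows. The letter codes ("A" for the G. hirsutum allele, "B" for the G. barbadense allele, "H" for heterozygous) and the example scores are assumptions made for illustration. The second function returns only the raw proportion of lines in which two markers differ; it is not the per-meiosis recombination frequency that mapping software such as MapManager estimates for RIL populations.

```python
def to_binary(ternary_scores):
    """Map 'A' -> 1 and 'B' -> 0; heterozygous 'H' (or any other code) -> None (missing)."""
    coding = {"A": 1, "B": 0}
    return [coding.get(score) for score in ternary_scores]


def recombinant_fraction(marker1, marker2):
    """Proportion of lines with differing genotypes, ignoring lines with missing data."""
    pairs = [(a, b) for a, b in zip(marker1, marker2)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no lines scored for both markers")
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical scores for two markers across eight RILs
m1 = to_binary("AABBAHBA")
m2 = to_binary("AABBBABA")
print(m1)
print(recombinant_fraction(m1, m2))
```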

Chromosomal Localization of MGAES Markers.
Physical mapping of the polymorphic markers was facilitated using aneuploid (Figure 2) and euploid chromosome substitution lines (Figure 3). Sixteen markers were localized to different chromosomes using euploid CS-B lines, while 14 markers were localized using aneuploid CS-B lines. The absence of a polymorphic locus in a specific aneuploid (BC0F1) accession localizes a dominant marker to the corresponding chromosome or chromosome arm. Results derived from the aneuploid and euploid CS-B lines served as a cross-reference to each other. For example, the MGAES-64 marker was localized using the H11 and Te11Lo aneuploid CS-B accessions, which are deficient for chromosome 11 and its long arm, respectively; in both accessions the G. hirsutum fragment or locus was missing, confirming the localization of the marker to chromosome 11 and its long arm (Figure 2). Using euploid chromosome substitution lines, the same MGAES-64 marker was localized to the CS-B11Sh accession, where a pair of chromosomes from G. barbadense was substituted for the long-arm deficient ditelosomic lines of G. hirsutum; the polymorphic fragment corresponding to G. barbadense was observed only in CS-B11Sh, explaining its localization to the chromosome 11 long arm (Figure 3). Polymorphic markers, linkage group information, and euploid and aneuploid CS-B chromosome localizations are shown in Table 3. However, we observed incongruence in localizing some markers to a single chromosome using euploid and aneuploid CS-B analysis. This needs to be further investigated, as it might be a result of duplicated loci or genome reorganization in some lines.
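The deficiency-based rule and the cross-reference between the two sets of lines can be sketched as below. The accession names H11, Te11Lo, and CS-B11Sh come from the MGAES-64 example; the H25 entry is hypothetical, and extracting the chromosome number from the accession name is an assumption about the naming convention made for illustration.

```python
import re


def localize_aneuploid(fragment_present):
    """Return accessions in which the G. hirsutum fragment is missing."""
    return sorted(acc for acc, present in fragment_present.items() if not present)


def chromosome_number(accession):
    """Extract the first run of digits from an accession name, e.g. 'Te11Lo' -> 11."""
    match = re.search(r"\d+", accession)
    return int(match.group(0)) if match else None


def congruent(aneuploid_hits, euploid_hits):
    """True if both analyses point to one and the same chromosome."""
    aneuploid_chr = {chromosome_number(acc) for acc in aneuploid_hits}
    euploid_chr = {chromosome_number(acc) for acc in euploid_hits}
    return aneuploid_chr == euploid_chr and len(aneuploid_chr) == 1


# Presence (True) or absence (False) of the G. hirsutum fragment; H25 is hypothetical
scores = {"H11": False, "Te11Lo": False, "H25": True}
hits = localize_aneuploid(scores)
print(hits)
print(congruent(hits, ["CS-B11Sh"]))
```

Markers for which such a check fails are the candidates for the duplicated loci or genome reorganization mentioned above.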
Our research demonstrated the use of gene-based markers for analyzing the genetic diversity among cultivated and wild cotton species. CAP markers proved more useful than direct fragment analysis for detecting polymorphism in fragments that were monomorphic between closely related species. MGAES markers derived from G. arboreum ESTs were highly polymorphic and informative for developing genetic maps and other applications. Incorporating the linkage groups and polymorphic markers into existing genetic maps will help in developing integrated cotton genetic maps to assist cotton breeders.