Recent Advances in Cotton Genomics

Genome research promises to promote continued and enhanced plant genetic improvement. Cotton is one of the world's leading crops and a model system for studies of many biological processes, and cotton genomics research has advanced rapidly in the past few years. This article presents a comprehensive review of recent advances in cotton genomics research. The reviewed areas include DNA markers, genetic maps, mapped genes and QTLs, ESTs, microarrays, gene expression profiling, BAC and BIBAC libraries, physical mapping, genome sequencing, and applications of genomic tools in cotton breeding. Analysis of the current status of each of these research areas suggests that physical mapping, QTL fine mapping, genome sequencing, nonfiber and nonovule EST development, gene expression profiling, and association studies between gene expression and fiber trait performance should be emphasized now and in the near future to accelerate the use of genomics research achievements for enhancing cotton genetic improvement.


INTRODUCTION
Cottons (Gossypium spp.) belong to the genus Gossypium of the family Malvaceae. Gossypium consists of 45-50 species, with 40-45 being diploids (2n = 26) and 5 being allotetraploids (2n = 52). The species are grouped into eight genome groups, designated A through G and K, on the basis of chromosome pairing affinities [1]. At the tetraploid level, there are five species, designated (AD) 1 through (AD) 5 for their genome constitutions. Phylogenetic analyses clustered the diploid species of Gossypium into two major lineages, including the 13 D-genome species lineage and the 30∼32 A-, B-, E-, F-, C-, G-, and K-genome species lineage, and the polyploid species into one lineage, that is, the 5 AD-genome species lineage (Figure 1; [2]).
Of the Gossypium species, four are cultivated in agriculture, including two allotetraploids (G. hirsutum and G. barbadense) and two diploids (G. herbaceum and G. arboreum). Gossypium hirsutum, also known as Upland Cotton, Long Staple Cotton, or Mexican Cotton, produces over 90% of the world's cotton; G. barbadense, also known as Sea Island Cotton, Extra Long Staple Cotton, American Pima, or Egyptian Cotton, contributes 8% of the world's cotton; and G. herbaceum, also known as Levant Cotton, and G. arboreum, also known as Tree Cotton, together provide 2% of the world's cotton.
Cottons are not only one of the world's leading textile fiber and oilseed crops, but also a crop of significance for fossil energy and bioenergy production. Although cottons are native to the tropics and subtropics, including the Americas, Africa, and Asia, they are cultivated in nearly 100 countries. India, China, USA, and Pakistan are the top four cotton growing countries, accounting for approximately 2/3 of the world's cotton (http://www.ers.usda.gov/Briefing/Cotton/trade.htm). According to the Food and Agriculture Organization (FAO) of the United Nations (http://www.fao.org), the cotton planting area reached about 35 million hectares and world cotton production reached a record of about 23 million metric tons in 2004/2005. Cotton products include fibers and seeds that have a variety of uses. Cotton fibers sustain one of the world's largest industries, the textile industry, for wearing apparel, home furnishings, and medical supplies, whereas cottonseeds are widely used for food oil, animal feeds, and industrial materials (such as soap). Cottonseed oil has ranked fifth in production and consumption volume among all vegetable oils over the past decades, accounting for 8% of the world's vegetable oil consumption. Worldwide, the business stimulated by cotton amounts to hundreds of billions of dollars. In the USA alone, for instance, the annual cotton business revenue exceeds $120 billion (Agricultural Statistics Board 1999; National Cotton Council of America, http://www.cotton.org/news/releases/2003/cotton-trade.cfm). Moreover, nearly a billion barrels of petroleum are used worldwide every year to produce synthetic fibers. Further improvement of cotton fiber yield and quality would replace or significantly reduce the consumption of fossil oil for synthetic fiber production, so that the oil saved could be used for energy production. Finally, cottonseed oil, the main by-product of cotton fiber production, could potentially be used as biofuel.
In addition to their economic importance, cottons are an excellent model system for several important biological studies, including plant genome size evolution, plant polyploidization, and single-celled biological processes. The genomes of angiosperm plants vary over 1,000-fold in size, ranging from 100 to >100,000 Mb/1C (haploid) [6]. It has long been recognized that polyploidy is a common, prominent, ongoing, and dynamic process of genome organization, function diversification, and evolution in angiosperms [7]. The genomes of most angiosperms are thought to have incurred one or more polyploidization events during evolution [8]. Studies have demonstrated that genome doubling has also been significant in the evolutionary history of all vertebrates and of many other eukaryotes [9][10][11][12]. It is estimated that about 70% of the flowering plant species are polyploids. For instance, many of the world's leading field, forage, horticultural, and environmental crops are polyploid species, such as cotton, wheat, soybean, potatoes, canola, sugarcane, Brassica, oats, peanut, tobacco, rose, coffee, and banana. Therefore, studies of both genome size evolution and polyploidization have long attracted the interest of scientists in different disciplines. Nevertheless, much remains to be learned. Examples include the impacts of polyploidization on genome size, genome organization, gene duplication and function, and gene family evolution; the role of transposable elements in structural and regulatory gene evolution and gene functions; and the mechanisms and functional significance of rapid genome changes.
Cottons have several advantages over other polyploid complexes for plant genome size and polyploidization studies. First, the genome sizes of 37 of the 45∼50 Gossypium species, including all eight genome groups and the polyploid species, have been determined and shown to vary widely ([3]; Figure 1). At the diploid level, the genome sizes vary by about threefold, ranging from 885 Mb/1C in the D-genome species to 2,572 Mb/1C in the K-genome species. Among the lineages, the genome sizes vary most in the A+F+B+E+C+G+K lineage, ranging from 1,311 to 2,778 Mb/1C with a difference of 1,467 Mb (110.2%); second in the D-genome lineage, ranging from 841 to 934 Mb/1C with a difference of 93 Mb (10.5%); and least in the polyploid lineage, ranging from 2,347 to 2,489 Mb/1C with a difference of 142 Mb (5.9%). Variations were also observed within a species. For instance, within G. hirsutum, the variation (n = 5) was from 2,347 to 2,489 Mb/1C, differing by 142 Mb (5.9%), while within G. arboreum, the variation (n = 5) was from 1,677 to 1,746 Mb/1C, differing by 69 Mb (4.0%).
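The percentages quoted with these ranges can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that each percentage is the difference divided by the midpoint of the range, a convention that is not stated in the text. Under that assumption it reproduces the values given for the D-genome lineage, the polyploid lineage, and G. arboreum. The 110.2% figure for the A+F+B+E+C+G+K lineage is left out because this convention does not reproduce it and the text does not state its base.

```python
# Genome-size ranges (Mb/1C) quoted in the text: (smallest, largest).
ranges = {
    "D-genome lineage": (841, 934),
    "polyploid lineage": (2347, 2489),
    "G. arboreum (n = 5)": (1677, 1746),
}


def spread(low, high):
    """Return (absolute difference in Mb, difference as % of the range midpoint).

    Using the midpoint as the base is an assumption made for this sketch.
    """
    diff = high - low
    midpoint = (low + high) / 2
    return diff, 100 * diff / midpoint


results = {name: spread(low, high) for name, (low, high) in ranges.items()}

for name, (diff, pct) in results.items():
    print(f"{name}: {diff} Mb ({pct:.1f}%)")
```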
Second, the evolutionary history of the allotetraploid species of Gossypium has been established (Figure 1), especially for the two cultivated AD-genome cottons, G. hirsutum and G. barbadense, and their closely related diploid progenitors, G. herbaceum (A1), G. arboreum (A2), G. raimondii (D5), and G. gossypioides (D6). The A-genome species are African-Asian in origin, whereas the D-genome species are endemic to the New World subtropics, primarily Mexico. Following the transoceanic dispersal of an A-genome taxon to the New World, hybridization between the immigrant A-genome taxon and a local D-genome taxon led to the origin and evolution of the New World allopolyploids (AD-genome) [13,14]. Subsequent to the polyploidization event, the allopolyploids radiated into three sublineages [15], which include the world's commercially most important species, G. hirsutum and G. barbadense. Studies showed that the A subgenome of the AD-genome-cultivated cottons is most closely related to the genome of the extant diploid G. herbaceum (A1) [16]; the D subgenome of the AD-genome-cultivated cottons is most closely related to the genome of the extant diploid G. raimondii (D5) or G. gossypioides (D6) [13]; and the cytoplasm of the AD-genome-cultivated cottons is most closely related to that of the extant diploids G. herbaceum (A1) and G. arboreum (A2) [14,17]. Sequence analysis and the paleontological record suggest that the A-genome and the D-genome groups diverged from a common ancestor 5-10 million years ago, and that the two diverged diploid genomes became reunited in a common nucleus to form the polyploid cottons, via allopolyploidization, in the mid-Pleistocene, or 1-2 million years ago [14,15,18,19].
Finally, as in the wheat polyploid complex, cottons have a long history of research at the cytological level. A wealth of cytogenetic stocks has been developed, including artificially synthesized AD-genome polyploids between the A-genome and D-genome diploid species [20] as well as individual chromosome addition and substitution lines [21]. These cytogenetic stocks are unique and valuable not only for cotton genetics research, but also for deciphering the ramifications of polyploidization on genome organization, function, and evolution.
Cotton fiber is an excellent single-celled model system for studies of many single-celled biological processes, particularly cell expansion and cellulose biosynthesis. Cotton fibers are unicellular, unbranched, simple trichomes that differentiate from the protoderm of developing seeds. There are probably over one-half million quasi-synchronously elongating fibers in each boll or ovary. Although all plant cells extend to some degree during development and differentiation, cotton fibers can reach up to 5.0 cm in length in some genotypes, being among the longest cells. Therefore, they offer a unique opportunity to study cell expansion at the single cell level. Cellulose is a major component of the cell walls of all higher plants, constituting perhaps the largest component of plant biomass, with an estimated annual world production of 100 million metric tons. The fiber cell wall of cottons consists of >90% cellulose. Therefore, cotton fiber cells have long been used as a model system to study cellulose biosynthesis [22] that is the basis for biomass-based bioenergy production.

ADVANCES IN COTTON GENOMICS RESEARCH
Genome research has been demonstrated to be promising for continued and enhanced crop plant genetic improvement. Therefore, efforts have been made in cotton genome research, especially development of genomic resources and tools for basic and applied genetics, genomics, and breeding research. These resources and tools include different types of DNA markers such as restriction fragment length polymorphism (RFLP), randomly amplified polymorphic DNA (RAPD), amplified fragment length polymorphism (AFLP), resistance gene analogs (RGA), sequence-related amplified polymorphism (SRAP), simple sequence repeat (SSR) or microsatellites, DNA marker-based genetic linkage maps, QTLs and genes for the traits important to agriculture, expressed sequence tags (ESTs), arrayed large-insert bacterial artificial chromosome (BAC) and plant-transformation-competent binary BAC (BIBAC) libraries, and genome-wide, cDNA-, or unigene EST-based microarrays. Efforts are also being made to develop the genome-wide, BAC/BIBAC-based integrated physical and genetic maps, and sequence the genomes of the key cotton species. However, compared with other major crops, such as rice, maize, and soybean, the genome research of cottons is far behind, mainly due to the limited funds allocated to the species. Summarized below are the major advances achieved recently in cotton genomics research.

DNA markers and molecular linkage maps
Genetic maps constructed in the Gossypium species and the types of markers used are listed in Table 1. As in most plant species, the early application of DNA markers in cotton genomic research was in the form of RFLPs. It is, therefore, not surprising that the first molecular linkage map of the Gossypium species was constructed from an interspecific G. hirsutum × G. barbadense F2 population based on RFLPs [23]. The map contained 705 loci that were assembled into 41 linkage groups and spanned 4,675 cM. This map was later extended by Rong et al. [24] to comprise 2,584 loci at 1.74-cM intervals, covering all 13 homeologous chromosomes of the allotetraploid cottons and representing the most complete genetic map of Gossypium to date. Many of the DNA probes of the map were also mapped in crosses of the D-genome diploid species G. trilobum × G. raimondii [24] and the A-genome diploid species G. arboreum × G. herbaceum [16]. Detailed comparative analysis of gene orders between the tetraploid AD-subgenomes and the maps of the A and D diploid genomes has revealed intriguing insights into the organization, transmission, and evolution of the Gossypium genomes.
Because RFLPs are labor-intensive and require large amounts of DNA, tedious blot hybridization and autoradiographic methods, polymerase chain reaction (PCR)-based DNA marker methods have come into vogue. Several types of PCR-based DNA markers have been utilized in cotton genome research. Methods, such as RAPD, AFLP, RGA, and SRAP, offer an excellent opportunity to scan enormous numbers of DNA loci rapidly, often targeting the DNA elements that are rapidly-evolving and therefore, are more likely to contain loci differing among genotypes. Kohel et al. [25] constructed a genetic map based on a population derived from an interspecific cross between Texas Marker-1 (TM-1) (G. hirsutum) and 3-79 (G. barbadense) in which a total of 355 DNA markers (216 RFLPs and 139 RAPDs) were assembled into 50 linkage groups, covering 4,766 cM. Brubaker and Brown [26] presented the first AFLP genetic linkage map for the Gossypium G-genome that was constructed from an interspecific G. nelsonii × G. australe population. The AFLP genetic linkage maps were used to identify G-genome chromosome-specific molecular markers, which, in turn, were used to track the fidelity and frequency of G. australe chromosome transmission in a G. hirsutum × G. australe hexaploid bridging family.
Advent of SSR or microsatellite markers has brought a new, user-friendly, and highly polymorphic class of genetic markers for cotton. The latter feature is especially useful to the cultivated Upland cotton due to its low intraspecific polymorphism. SSRs are PCR-based markers, usually codominant, well dispersed throughout the genome, easily shared between labs via flanking primer sequences, and well portable from one population to another [84]. Reddy et al. [85] suggested that the total pool of SSRs present in the cotton genome is sufficiently abundant to satisfy the requirements of extensive genome mapping and marker-assisted selection (MAS). Liu et al. [86] reported the assignment of SSRs to cotton chromosomes by making use of aneuploid stocks. SSRs have been widely employed in genetic diversity analyses of cotton [87][88][89][90] and several genetic linkage maps based mostly on SSRs have now been developed [37][38][39][40][41].
The development of a large number of ESTs (see below) provides a good source of PCR-based primers for targeting SSRs [92,95,96]. Taliercio et al. [97] sequenced ESTs representing a variety of tissues and treatments and identified SSRs among the ESTs. Their results indicated that these SSRs could potentially be used to map the genes represented by the ESTs. Guo et al. [98] examined the transferability of 207 G. arboreum-derived EST-SSR primer pairs among 25 different diploid accessions from 23 species representing 7 Gossypium genomes. Their results demonstrated that the transferability of EST-SSR markers among these diploid species could assist the introgression of genes into cultivated cotton species, especially by molecular tagging of the important genes existing in these diploid species. Guo et al. [40] also developed 2,218 EST-SSRs, with 1,554 from G. raimondii-derived ESTs and 754 from G. hirsutum-derived ESTs. After these new EST-SSRs were integrated into the genetic map constructed by Han et al. [39], the resulting SSR-based genetic map consists of 1,790 loci in 26 linkage groups and covers 3,425.8 cM with an average distance between markers of 1.91 cM. This SSR-based high-density map contains 71.96% functional marker loci, of which 87.11% are EST-SSR loci.
DNA sequences derived from clone end sequencing of BAC libraries provide yet another resource for SSR marker development. In addition to their use as genetic markers, SSRs developed from BAC-end sequences make it possible to efficiently integrate the genetic and physical maps of cotton. Frelichowski et al. [36] developed 1,316 PCR primer pairs to flank SSR motif sequences from 2,603 new BAC-end genomic sequences developed from G. hirsutum Acala "Maxxa." An interspecific recombinant inbred population was used to map 433 marker loci in 46 linkage groups with a total genetic distance of 2,126.3 cM and an average distance between loci of 4.9 cM, which covered approximately 45% of the cotton genome.
To overcome the paucity of any particular type of DNA marker, genetic maps were developed by incorporating different classes of markers. For example, Lacape et al. [28] constructed a combined RFLP-SSR-AFLP map based on an interspecific G. hirsutum × G. barbadense backcross population of 75 BC1 plants. The map consists of 888 loci ordered into 37 linkage groups and spanning 4,400 cM. This map was updated, mostly with new SSR markers, to contain 1,160 loci that spanned 5,519 cM with an average distance between loci of 4.8 cM [29]. Mei et al. [27] developed a genetic map using an interspecific G. hirsutum × G. barbadense F2 population that contained 392 genetic loci, including AFLPs, SSRs, and RFLPs, mapped into 42 linkage groups that spanned 3,287 cM, thus covering approximately 70% of the cotton genome. Lin et al. [33] constructed a linkage map of tetraploid cotton using SRAPs, SSRs, and RAPDs to screen an interspecific G. hirsutum × G. barbadense F2 population. A total of 566 loci were assembled into 41 linkage groups that covered 5,141.8 cM with a mean interlocus space of 9.08 cM. He et al. [34] constructed a more detailed cotton map with this same F2 population [33] using SSRs, SRAPs, RAPDs, and retrotransposon-microsatellite amplified polymorphisms (REMAPs). A total of 1,029 loci were mapped to 26 linkage groups that extended for 5,472.3 cM with an average distance between loci of 5.32 cM. The linkage groups of the genetic maps have been assigned to their corresponding chromosomes by using the available cotton aneuploid stocks [21,23] and fluorescent in situ hybridization using mapped genetic marker-containing BACs as probes [99].
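The map densities quoted in this section can be compared on a common basis with a short calculation. The sketch below is illustrative only and assumes that the average distance between loci is simply the total map length divided by the number of loci. Strictly, the number of intervals is the number of loci minus the number of linkage groups, but the simpler ratio is the one that reproduces the reported values to within rounding. The dictionary layout and function name are hypothetical.

```python
# (number of loci, map length in cM, average spacing in cM as reported in the text)
maps = {
    "Guo et al. [40]": (1790, 3425.8, 1.91),
    "Frelichowski et al. [36]": (433, 2126.3, 4.9),
    "Lacape et al. [29]": (1160, 5519.0, 4.8),
    "Lin et al. [33]": (566, 5141.8, 9.08),
    "He et al. [34]": (1029, 5472.3, 5.32),
}


def mean_spacing(loci, length_cm):
    """Average distance between loci, assuming spacing = map length / loci."""
    return length_cm / loci


computed = {name: mean_spacing(loci, length) for name, (loci, length, _) in maps.items()}

# List the maps from densest to sparsest.
for name in sorted(computed, key=computed.get):
    reported = maps[name][2]
    print(f"{name}: computed {computed[name]:.2f} cM, reported {reported} cM")
```

Sorting by the computed spacing places the SSR-based map of Guo et al. [40] first as the densest of the five maps compared here.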

Gene and QTL mapping
Although molecular linkage maps have contributed greatly to our understanding of the evolution and organization of the cotton genomes, a primary purpose of the map construction is to provide a common point of reference for locating the genes affecting qualitative and quantitative traits. DNA markers that are associated with genes conferring important agronomic traits that are costly or laborious to measure will provide a less costly and yet more dependable means of selection for identifying desirable progenies in breeding programs.

Mapping qualitative traits
Qualitative, or simply inherited Mendelian, traits differ in kind rather than in degree; they are typically controlled by single genes, and their phenotypic variation falls into discrete classes in segregating progenies. Over 200 qualitative traits have been identified in either the diploid (G. arboreum and G. herbaceum) or tetraploid (mostly G. hirsutum and G. barbadense) species [1]. Examples of such traits include leaf shape, pollen color, leaf color, lint color, pubescence, bract shape, and so on. Because many qualitative traits are morphological mutants that have arisen through spontaneous mutation or irradiation, or reflect natural variation between species in interspecific hybrids, they have little utility in crop improvement. Consequently, there has been little effort to map qualitative traits onto the molecular genetic map. Qualitative traits that have been mapped using molecular markers were recently summarized in [105]. Many of these traits were mapped not as the main objective but as a tool for aligning the various linkage groups to chromosomes assigned by the classical map. Noteworthy exceptions are those related to the agricultural productivity and quality of cotton, which can be broadly grouped into four categories: genes for leaf shape, fiber development, resistance to diseases and insect pests, and fertility restoration [105].

Mapping quantitative traits
Quantitative traits differ in degree rather than in kind; they are typically considered to result from interactions of multiple loci, tend to exhibit continuous variation in a segregating population, and are readily affected by environmental variation. With the increased availability of DNA markers for use in cotton genetic map construction in the last ten years, activities in identifying and locating quantitative trait loci (QTLs) have blossomed. QTLs identified in cotton include those for yield and yield components, fiber quality, plant architecture, resistance to diseases such as bacterial blight and Verticillium wilt, resistance to pests like root-knot nematode, and flowering date. A list of QTLs mapped in cotton is presented in Table 2.
Several noteworthy findings have come out of QTL mapping in cotton. First, in tetraploid cottons, although the D-subgenome was derived from an ancestor that does not produce spinnable fibers, many QTLs influencing fiber quality traits were detected on the D-subgenome [106]. For example, Jiang et al. [45] pointed out that D-subgenome QTLs may partly explain the fact that domestication and breeding of tetraploid cottons have resulted in fiber of higher quality than that achieved by parallel improvement of the A-genome diploid cottons, which produce spinnable fibers. The merger of the A- and D-genomes in tetraploid cottons, where each genome has a different evolutionary history, may have offered unique avenues for phenotypic response to selection. Second, numerous studies have shown that QTLs occur in genetic clusters in the cotton genome [27,46,55,56,76,106]. Ulloa et al. [56] suggested the possible existence of highly recombined regions in the cotton genome with abundant putative genes. QTL clusters might exert their multiple functions to compensate for a numerical deficiency, expanding their roles in cotton growth and development [76]. Finally, the position and effect of QTLs for fiber quality are not comparable across the different populations and environments evaluated [60,106]. This suggests that QTL studies conducted thus far have detected only a small number of loci for fiber growth and development and that additional QTLs remain to be discovered [58,59]. Furthermore, because quantitative traits are readily affected by environmental variation, mapping of these traits needs to be pursued in multiple environments, including years and locations.

ESTs
Cloning and sequencing of expressed sequence tags (ESTs) by a single sequencing pass from one or both ends of cDNA clones have been widely used to discover and characterize genes rapidly in a large-scale and high-throughput manner. As has been done in many other plant and animal species of biological and/or economic importance, significant efforts have been made to generate ESTs in cottons [102][103][104]. These ESTs were collectively generated from 32 cDNA libraries constructed from mRNA isolated from 18 genotypes of three species, G. hirsutum, G. arboreum, and G. raimondii, by one-pass sequencing of cDNA clones from one (3′ or 5′ end) or both ends. They were generated from 12 different organs, including developing fibers, seedlings, buds, bolls, ovules, roots, hypocotyls, immature embryos, leaves, stems, and cotyledons. Some of the ESTs were generated from plants growing under biotic or abiotic stress conditions such as drought, chilling, and pathogens. By analyzing approximately 185,000 ESTs from both fibers/ovules (124,299 ESTs) and nonfiber/ovule tissues (60,899 ESTs) of G. hirsutum, G. arboreum, and G. raimondii, Udall et al. [102] obtained 51,107 unigenes. A few months later, Yang et al. [103] analyzed their 32,789 ESTs generated from −3 to +3-dpa fibers of Upland cotton cv. TM-1, along with 211,397 cotton ESTs downloaded from GenBank (as of April 2006), resulting in 55,673 unigenes and updating The Institute for Genomic Research Cotton Gene Index version 6 (CGI6) into CGI7 (http://www.tigr.org). The unigene EST number may provide a reasonable estimate of the number of expressed genes in the cotton genomes. Of the unigene set, those derived from fibers or fiber-bearing ovules suggest the number of genes potentially involved in fiber development and the genetic complexity of fiber traits.
A predominant feature of the cotton EST set is the strong bias of its tissue sources toward fibers or fiber-bearing ovules over other organs. Of the 247,979 ESTs listed in Table 5, 187,080 (75.4%) were from developing fibers or fiber-bearing ovules, while only 60,899 (24.6%) were from nonfiber and nonovule organs. Within each of the two categories, fiber/fiber-bearing ovules and nonfiber/nonovule organs, there is also a significant bias in the number of ESTs. Cotton fiber development is classified into four clearly characterized, but overlapping, stages: fiber initiation (−3 to 5 dpa), elongation (5-25 dpa), secondary cell wall deposition (15-45 dpa), and maturation/dehydration (45-70 dpa) (see Figure 3). All of the 187,080 fiber ESTs were generated from fibers or fiber-bearing ovules collected from the first three stages, with 43.6% from the initiation stage, 46.5% from the elongation stage, and 5.7% from the secondary cell wall deposition stage. It is apparent that the number of fiber ESTs from the secondary cell wall deposition stage is much smaller than that from either the initiation or the elongation stage. Although the initiation and elongation stages are of significance for the number of fibers per seed and fiber length, the secondary cell wall deposition stage is crucial to fiber strength. Of the 60,899 nonfiber/nonovule ESTs, 66.0% were from seedlings, 14.2% from stems, and 2.4% from roots.
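The tissue bias described above follows directly from the EST counts. The sketch below is illustrative only: it recomputes the fiber and nonfiber shares from the counts given in the text and converts the quoted stage percentages into approximate EST counts. The variable names are hypothetical, and the per-stage counts are estimates derived from rounded percentages rather than figures reported in the text.

```python
# EST counts taken from the text.
fiber_ests = 187_080      # developing fibers or fiber-bearing ovules
nonfiber_ests = 60_899    # nonfiber and nonovule organs
total_ests = fiber_ests + nonfiber_ests

fiber_share = 100 * fiber_ests / total_ests
nonfiber_share = 100 * nonfiber_ests / total_ests
print(f"total ESTs: {total_ests}")
print(f"fiber: {fiber_share:.1f}%  nonfiber: {nonfiber_share:.1f}%")

# Percentages of the fiber ESTs by developmental stage, as quoted in the text.
stage_percent = {
    "initiation": 43.6,
    "elongation": 46.5,
    "secondary cell wall deposition": 5.7,
}

# Approximate counts per stage (estimates only, because the inputs are rounded).
stage_counts = {stage: round(fiber_ests * pct / 100) for stage, pct in stage_percent.items()}
for stage, count in stage_counts.items():
    print(f"{stage}: about {count} ESTs")
```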
The cotton ESTs have been used in several ways, including development of genome-wide cotton microarrays (see below), mining of SSRs (see above), and study of polyploidization. The development of significant numbers of ESTs from the cultivated tetraploid cotton, G. hirsutum [(AADD)1], and its closely related diploid species, G. arboreum (A1A1) and G. raimondii (D5D5) (see Table 5), made it possible to compare the transcriptomes among the three species. Udall et al. [102] comparatively analyzed 31,424, 68,732, and 69,853 ESTs derived from G. arboreum, G. raimondii, and G. hirsutum, respectively. Although the comparison was significantly affected by the tissue sources and developmental status, they identified the putative homoeologs among the four genomes, A, D, A1, and D5. This information is useful for understanding how the cotton genomes function and evolve during the courses of speciation, domestication, plant breeding, and polyploidization.

Microarray
Microarray technology is widely used in many aspects of genomics research, including gene discovery, gene expression profiling, mutation assay, high-throughput genetic mapping, gene expression mapping (eQTL mapping), and comparative genome analysis. It involves robotically printing tens of thousands of cDNA amplicons or gene-specific long (70-mer) oligonucleotides as array elements on a chemically coated glass slide, followed by hybridizing the array with one or more fluorescent-labeled cDNA or cRNA targets derived from mRNA isolated from particular tissues, organs, or cells. It therefore allows simultaneous monitoring of the expression/activities of all genes on the array in a single hybridization experiment. To facilitate cotton genomics research, microarrays have been developed from the cotton ESTs (Table 5) in several laboratories worldwide.
The first batch of cotton microarrays was fabricated from 70-mer oligos designed from the 7-10 dpa fiber nonredundant (NR) or unigene ESTs of G. arboreum.
Figure 3: Cotton fiber development and corresponding morphogenesis stages (according to [138,139]). The initiation stage is characterized by the enlargement and protrusion of epidermal cells from the ovular surface; during the elongation stage the cells expand in polar directions at a rate of >2 mm/day; during the secondary cell wall deposition stage cellulose is synthesized rapidly until the fibers contain ∼90% cellulose; and at the maturation stage minerals accumulate in the fibers and the fibers dehydrate.
Using the microarrays, Arpat et al. [100] compared the expression of genes between 10-dpa fibers at the elongation or primary cell wall synthesis stage and 24-dpa fibers at the secondary cell wall deposition stage (see Figure 3). The expression of fiber genes was found to change dynamically from elongation or primary cell wall to secondary cell wall biogenesis, with 2,553 of the fiber genes being significantly downregulated and 81 being significantly upregulated. This result suggests that the expression of fiber genes is stage-specific or cell expansion-associated. Annotation of the genes upregulated in secondary cell wall synthesis relative to primary cell wall biogenesis showed that most of the genes fell into three major functional categories: energy/metabolism; cell structure, organization, and biogenesis; and cytoskeleton. This finding is consistent with the massive cellulose synthesis and cell wall biogenesis that occur during this stage. The fiber gene microarrays have been updated recently by incorporating nearly 10,000 gene elements designed from the fiber and ovary ESTs of the tetraploid cultivated cotton, G. hirsutum (Table 5; T.A. Wilkins, pers. communication). Each slide of the current fiber microarrays consists of four duplicated arrays, with 22,406 60-mer oligo elements per array and a duplicate of each element (see, e.g., Figure 4). The new version of the fiber gene arrays covers 100% of the fiber ESTs of diploid cotton and 65% of the fiber ESTs of the tetraploid cultivated cotton, G. hirsutum, that are available in GenBank, thus representing the most comprehensive coverage of the cotton fiber genes. The elements are printed on a slide in a randomized manner instead of the conventional ordered manner.
The fabrication of four duplicated arrays per slide and randomized printing design have significantly minimized the systematic problems that are frequently encountered in the conventional array design (one array per slide and ordered printing), thus further enhancing the reproducibility and accuracy of the microarray analysis results.
Recently, several additional batches of EST- or cDNA-based microarrays with different formats and elements have been reported in cotton [102,104,140]. Shi et al. [104] reported the fabrication of microarrays from unigene ESTs derived from 5-10 dpa ovules of the Upland cotton cv. Xuzhou 142, using the amplicons of the EST clones as the array elements. The microarrays each consist of 11,962 uniEST elements. Using the microarrays, Shi et al. [104] compared the wild-type Xuzhou 142 with its fuzzless-lintless (fl) mutant using RNAs isolated from ovules at 0, 3, 5, 10, 15, and 20 dpa. It was found that ethylene biosynthesis is one of the most significantly upregulated biochemical pathways during fiber elongation. Similarly, Wu et al. [140] fabricated a set of microarrays from amplicons of 10,410 cDNA clones derived from −3 to 0-dpa ovules of the Upland cotton cv. DP16 (see Table 5, Wu & Dennis). The arrays were analyzed with RNAs isolated from 0-dpa whole ovules, outer integument, and inner integument/nucellus of five lintless mutant lines against the wild-type DP16. Of the 10,410 gene elements on the array, 60 to 243 were found to be significantly differentially expressed between each pair of the wild type and mutant when the array was hybridized with the RNAs isolated from the 0-dpa whole ovules. Of these differentially expressed genes, 70.6% were upregulated and 29.4% downregulated in the fiber mutant, suggesting that the mutation caused not only gene downregulation, but also gene upregulation. However, when the whole ovule was dissected into three layers, the outer integument, inner integument, and nucellus, of which the outer integument bears the epidermal cells that develop into cotton fibers, and the outer integument was analyzed against the inner integument and the nucellus, the number of genes downregulated in the mutants was reduced to 13.
These include an Myb transcription factor, a putative homeodomain protein, a cyclin D gene, and some fiber-expressed structural and metabolic genes, suggesting that these genes may be involved in the process of fiber initiation.
In summary, three batches of EST- or cDNA-based cotton microarrays were fabricated from fiber genes of either cultivated tetraploid cotton, G. hirsutum [104,140], or cultivated diploid cotton, G. arboreum [100]. Using the microarrays, the expression of the fiber genes was profiled and comparatively analyzed at the fiber initiation stage [140], elongation stage [100,104], and secondary cell wall deposition stage [100]. However, the expression of other cotton genes, such as those from nonfiber and nonovary tissues, remains to be profiled. To fill this gap, another two batches of long oligo-based microarrays have been developed. The first batch contains approximately 21,000 gene elements per array (http://cotton.agtec.uga.edu/CottonFiber/pages/mcriarray/Array.aspx). These genes were from 52 cDNA libraries constructed from a variety of tissues and organs under a range of conditions, including drought stress and pathogen challenges, and represent the tetraploid (G. hirsutum) and its diploid relatives (G. arboreum and G. raimondii). Of the 21,000 genes, approximately one-fourth were from fiber and three-fourths were from nonfiber and nonovary tissues (J. A. Udall, pers. communication). The second batch contains 38,716 gene elements per array. Of the gene elements, 22,409 were designed from fiber ESTs and 16,307 from nonfiber ESTs (T. A. Wilkins, pers. communication). These versions of cotton microarrays should provide new tools for comprehensive functional and comparative genomics research of cottons.

Physical mapping
Whole-genome, BAC- and/or BIBAC-based, integrated physical/genetic maps have played a central role in genomics research of humans, plants, animals, and microbes [110,123,127]. This is because they provide central platforms for many areas, if not all, of modern genomics research, including large-scale transcript or gene mapping, region-targeted marker development for fine mapping and MAS of genes and QTLs, map-based gene/QTL cloning, local- and whole-genome comparative analysis, genome sequencing, and functional analysis of DNA sequences and component networks. Therefore, whole-genome, BAC/BIBAC-based, integrated physical/genetic maps have been developed for a number of plant and animal species. In plants, whole-genome BAC physical maps have been developed for several species, including Arabidopsis [114,118], indica rice [121], japonica rice [117], soybean [124], and maize [141]. However, whole-genome physical mapping of cottons has only recently been initiated, in several laboratories. One is the laboratory of H.-B. Zhang, Texas A&M University, College Station (Texas, USA). This laboratory is developing a whole-genome BAC/BIBAC physical map of the Upland cotton cv. TM-1 by using the latest physical mapping technology [123,126]. The project is a collaborative effort among the laboratories of H.-B. Zhang, R. J. Kohel, USDA/ARS, College Station (Texas, USA) (who provided part of the funding for the project), and D. M. Stelly, Texas A&M University (Texas, USA). Nearly 120,000 (∼7.3x) BIBACs and BACs selected from the TM-1 BIBAC and BAC libraries (see Table 3) have been fingerprinted and a draft BAC/BIBAC contig map has been constructed. The draft physical map consists of 5,088 contigs collectively spanning approximately 2,300 Mb of the 2,400 Mb Upland genome (unpublished). Currently, additional clones are being analyzed to reach about 10x genome coverage. 
Furthermore, because Upland cotton is an allotetraploid, which complicates physical map construction, several approaches are being used to sort the map contigs according to their subgenome of origin. The laboratory of A. H. Paterson, University of Georgia (Athens, Georgia) is also working toward development of a whole-genome BAC-based physical map of the diploid species, G. raimondii (A. H. Paterson, pers. communication). Given the importance of physical maps for modern genome research, development of a robust integrated physical/genetic map is expected to greatly promote advanced genomics research of cottons and related species (also see below).
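The clone-coverage figures quoted above can be related to one another with simple arithmetic. The following is a minimal sketch, not part of the mapping project itself: it takes the approximate numbers given in the text (about 120,000 clones, ∼7.3x coverage, a 2,400 Mb genome, 2,300 Mb spanned by contigs) and derives the mean insert size they imply, which is an inferred value and is not stated in the text. The function names are hypothetical.

```python
import math

# Approximate figures quoted in the text above.
GENOME_MB = 2400            # Upland cotton genome size
CLONES_FINGERPRINTED = 120_000
COVERAGE_X = 7.3            # genome equivalents fingerprinted
CONTIG_SPAN_MB = 2300       # total span of the draft contigs


def mean_insert_kb(n_clones, coverage_x, genome_mb):
    """Mean insert size (kb) implied by clone count and genome coverage."""
    return coverage_x * genome_mb * 1000 / n_clones


def clones_for_coverage(target_x, insert_kb, genome_mb):
    """Number of clones needed to reach a target coverage."""
    return math.ceil(target_x * genome_mb * 1000 / insert_kb)


# Inferred, not reported: average insert size of the fingerprinted clones.
insert_kb = mean_insert_kb(CLONES_FINGERPRINTED, COVERAGE_X, GENOME_MB)

# Fraction of the genome spanned by the draft contig map.
span_fraction = CONTIG_SPAN_MB / GENOME_MB

# Total clones needed for about 10x coverage, assuming the same mean insert.
total_for_10x = clones_for_coverage(10, insert_kb, GENOME_MB)

print(f"implied mean insert: {insert_kb:.0f} kb")
print(f"genome spanned by contigs: {span_fraction:.1%}")
print(f"clones for 10x coverage: {total_for_10x}")
```

Because the clone count and coverage are both rounded in the text, the derived values should be read as order-of-magnitude checks only.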

Genome sequencing
Sequence maps represent the finest physical maps of genomes [108]. They provide not only the physical positions of, and distances between, genes and other components constituting a genome [142], but also their sequences and the putative functions inferred from the sequences. Therefore, development of a complete genome sequence map of a species will significantly promote genomics research of the species in a variety of aspects. For this reason, the whole genomes of several plant and animal species have been sequenced. In plants, the genomes of two model species, Arabidopsis [130] and rice [132], have been completely sequenced and the genomes of several other species, including Medicago truncatula (http://www.medicago.org), Lotus japonicus (http://www.kazusa.or.jp/lotus), tomato (http://www.sgn.cornell.edu/about/tomato_sequencing.pl), maize (http://www.maizegenome.org), and soybean (http://genome.purdue.edu/isgc/Tsukuba07/ISGC_report_Apr2007.htm), are currently being sequenced.
However, only a limited amount of genomic sequence is available for cotton and related species in GenBank. A major source of the genomic sequences of Gossypium species is Hawkins et al. [143]. To understand the mechanisms underlying genome size variation and evolution of Gossypium species, Hawkins et al. [143] constructed whole-genome shotgun libraries for G. raimondii (D5D5), G. herbaceum (A1A1), G. exiguum (KK), and Gossypioides kirkii, the species used as the outgroup for phylogenetic analysis of the Gossypium species, with each species library containing 1,920-10,368 clones. From each of the four shotgun libraries, 1,464-6,747 clones were sequenced, together covering a total length of 11.4 Mb. Annotation of these clone sequences and estimation of the copy number of each type of sequence suggested that differential lineage-specific amplification of transposable elements is responsible for genome size variation in the Gossypium species. Moreover, G. raimondii has recently been selected by the DOE Joint Genome Institute, U.S. Department of Energy, to be sequenced for genomic study of cotton and related species (http://www.jgi.doe.gov/sequencing/cspseqplans2007.html). In the first phase of the sequencing project, a whole-genome shotgun library covering about 1x of the G. raimondii genome will be sequenced. Although this coverage is far below the clone coverage (>6x) needed to assemble the sequence map of the genome, it will provide the first glimpse into the cotton genome and useful information for efficiently sequencing the entire genomes of this and other key cotton species.
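The gap between 1x and >6x coverage can be illustrated with the standard Lander-Waterman idealization, under which reads land uniformly at random and the expected fraction of the genome covered by at least one read at coverage c is 1 − e^(−c). This is a textbook approximation introduced here for illustration only; it is not the model used by the sequencing project, and it ignores repeats, cloning bias, and the overlap required for assembly. The function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage_x):
    """Expected fraction of a genome covered by at least one read,
    assuming uniformly random read placement (Lander-Waterman idealization)."""
    return 1.0 - math.exp(-coverage_x)


# Compare the first-phase 1x coverage with deeper coverage levels.
for c in (1, 2, 4, 6, 8):
    frac = expected_fraction_covered(c)
    print(f"{c}x coverage: about {frac:.2%} of the genome sampled")
```

Under these assumptions, 1x coverage leaves roughly a third of the genome unsampled, whereas at 6x less than 1% remains unsampled, which is consistent with the text's point that 1x gives a first glimpse rather than an assembled sequence map.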

APPLICATIONS OF GENOMIC TOOLS IN COTTON GENETIC IMPROVEMENT
One of the major goals of genome research is to use the genomic tools developed to promote or assist continued crop genetic improvement. In cottons, the development of genomic resources and tools has made it possible to address many significant scientific questions that could not be addressed before. These include, but are not limited to, construction of genome-wide genetic maps (Table 1), identification and mapping of genes and loci controlling traits underlying qualitative and quantitative inheritance (Table 2), determination of mechanisms of cotton genome evolution, and identification and determination of genes that are involved in cotton fiber initiation, elongation, and secondary cell wall biogenesis. The genomic resources and tools could be used to promote or facilitate cotton genetic improvement in numerous ways. Marker-assisted selection (MAS) is likely one of the most important and practical applications at present and in the near future. MAS technology could offer many potential benefits to a breeding program. For instance, DNA markers linked to a gene of interest could be used in early generations of the breeding cycle to improve the efficiency of selection. This approach has a particular advantage when screening for phenotypes in which selection is expensive or difficult to perform, as is the case with recessive or multiple genes, seasonal or geographical considerations, and late expression of the phenotype [144]. However, application of MAS in cotton breeding programs is still in its infancy, as the major effort of cotton genome research in the past has been on the development of genomic resources and tools for the eventual goal of enhanced cotton genetic improvement.
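The value of a linked marker for MAS depends on how often recombination separates it from the target gene. As a rough illustration, the sketch below converts a marker-gene map distance into a recombination fraction using Haldane's mapping function, r = 0.5(1 − e^(−2d/100)) with d in cM, and reports the chance that a single meiosis keeps the marker allele and the target allele together. Haldane's function assumes no crossover interference and is used here as an assumption; the distances are illustrative and are not taken from any study cited in this review. The function names are hypothetical.

```python
import math


def haldane_recombination_fraction(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def marker_gene_cosegregation(distance_cm):
    """Probability that marker and target alleles stay together in one meiosis."""
    return 1.0 - haldane_recombination_fraction(distance_cm)


# Illustrative marker-gene distances (cM), not values from the cited studies.
for d in (1, 5, 10, 20):
    r = haldane_recombination_fraction(d)
    p = marker_gene_cosegregation(d)
    print(f"{d:>2} cM: r = {r:.4f}, co-segregation per meiosis = {p:.2%}")
```

Under this idealization, a marker about 1 cM from the gene is separated from it in roughly 1% of meioses, whereas a marker 20 cM away is separated in about 16%, which is why tightly linked markers are preferred for selection.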

Fiber quality
Zhang et al. [53] used a G. anomalum introgression line 7235 with good fiber quality properties to identify molecular markers linked to fiber-strength QTLs. A major QTL, QTLFS1, was detected at the Nanjing and Hainan field locations (China) and College Station, Texas (USA). This QTL was associated with eight markers and explained more than 30% of the phenotypic variation. QTLFS1 was first thought to map to chromosome 10; however, further study showed that this QTL is located on LGD03 [67]. Guo et al. [145] showed that the specific SCAR4311920 marker could be applied to large-scale screening for the presence or absence of this major fiber strength QTL in breeding populations. The DNA markers tightly linked to this QTL could be useful for developing commercial cultivars with enhanced fiber strength properties [67]. Wang et al. [76] identified a stable fiber length QTL, qFL-D2-1, simultaneously in four environments in Xiangzamian 2. The high degree of stability suggests this QTL might be particularly valuable for use in MAS programs. Chee et al. [59] dissected the molecular basis of genetic variation governing 15 parameters that reflect fiber length by applying a detailed RFLP map to 3,662 BC3F2 plants from 24 independently derived BC3 families utilizing G. barbadense as the donor parent. The discovery of many QTLs unique to each trait indicates that maximum genetic gain will require breeding efforts that target each trait. Lacape et al. [64] performed QTL analysis of 11 fiber properties in BC1, BC2, and BC2S1 backcross generations derived from the cross between G. hirsutum "Guazuncho 2" and G. barbadense "VH8." They detected 15, 12, 21, and 16 QTLs for length, strength, fineness, and color, respectively, in one or more populations. The results showed that favorable alleles came from the G. barbadense parent for the majority of QTLs, and that colocalization of QTLs for different traits was more frequent than isolated positioning. 
Taking these QTL-rich chromosomal regions into consideration, they identified 19 regions on 15 different chromosomes as target regions for the marker-assisted introgression strategy. The availability of DNA markers linked to G. barbadense QTLs promises to assist breeders in transferring and maintaining valuable traits from exotic sources during cultivar development.

Cytoplasmic male sterility
In cotton, cytoplasmic male sterility conditioned by the D8 alloplasm (CMS-D8) is independently restored to fertility by its specific D8 restorer (D8R) and by the D2 restorer (D2R) that was developed for the D2 cytoplasmic male sterile alloplasm (CMS-D2). Zhang and Stewart [146] concluded that the two restorer loci are nonallelic, but are tightly linked with an average genetic distance of 0.93 cM. The D2 restorer gene is redesignated as Rf1, and Rf2 is assigned to the D8 restorer gene. The identification of molecular markers closely linked to restorer genes for cytoplasmic male sterility could facilitate the development of parental lines for hybrid cotton. Guo et al. [147] found that one RAPD marker fragment, designated OPV-15(300), was closely linked with the fertility-restoring gene Rf1. Zhang and Stewart [148] identified RAPD markers linked to the restorer gene and, furthermore, converted the three RAPD markers into reliable and genome-specific sequence tagged site (STS) markers. Liu et al. [51] determined that the Rf1 locus is located on the long arm of chromosome 4. Two RAPD and three SSR markers were identified as closely linked to the Rf1 gene. These markers are restorer-specific and should be useful in MAS for developing restorer parental lines. Yin et al. [54] further constructed a high-resolution genetic map of Rf1 containing 13 markers within a genetic distance of 0.9 cM. They constructed a physical map for the Rf1 locus and delimited the possible location of the Rf1 gene to a minimum of two BAC clones, designated 081-05K and 052-01N, spanning an interval of approximately 100 kb. Work to isolate the Rf1 gene in cotton is now in progress.

Resistance to diseases and insect pests
Breeding for disease resistance is of great importance in cotton breeding programs. To facilitate analysis, cloning, and manipulation of the genes conferring resistance to different pathogens, including bacteria, fungi, viruses, and nematodes, He et al. [149] isolated and characterized the family of nucleotide-binding site-leucine-rich repeat (NBS-LRR)-encoding genes or resistance gene analogues (RGAs) in the Upland cotton cv. Auburn 634 genome. Genetic mapping of a sample (21 genes) of the RGAs indicated that the gene family resides on a limited number of the cotton AD-genome chromosomes, with those from a single subfamily tending to cluster on the cotton genetic map and more RGAs in the A subgenome than in the D subgenome. Of the 16 RGAs mapped, two happened to be comapped with the cotton bacterial blight resistance QTLs previously mapped by Wright et al. [42]. Since nearly 80% of the genes (>40 genes) cloned to date that confer resistance to bacteria, fungi, viruses, and nematodes belong to the NBS-LRR gene family, the cotton RGAs of the NBS-LRR family provide valuable tools for cloning, characterization, and manipulation of genes for resistance to different pathogens and pests in cottons.
Root-knot nematodes (RKN), Meloidogyne incognita, can cause severe yield loss in cotton. Wang et al. [71] identified one SSR marker, CIR316, on linkage group A03 tightly linked to a major RKN resistance gene (rkn1) in the resistant cultivar G. hirsutum "Acala NemX." In a companion study, bulked segregant analysis (BSA) combined with AFLP was used to identify additional molecular markers linked to rkn1 [72]. An AFLP marker linked to rkn1, designated GHACC1, was converted to a cleaved amplified polymorphic sequence (CAPS) marker. These two markers have potential for utilization in MAS. Shen et al. [68] identified RFLP markers on chromosome 7 and chromosome 11 showing significant association with RKN resistance from the Auburn 634 source, a different source of resistant germplasm than Acala NemX. The association was further confirmed by detection of a minor and a major dominant QTL on chromosomes 7 and 11, respectively, using SSR markers. Ynturi et al. [73] identified two SSR markers which together accounted for 31% of the variation in galling index. The marker BNL 3661 maps to the short arm of chromosome 14, while BNL 1231 maps to the long arm of chromosome 11. The association of two different chromosomes with RKN resistance suggests that at least two genes are involved in resistance to RKN.
Bacterial blight, caused by the pathogen Xanthomonas campestris pv. malvacearum (Xcm), is another economically important disease in cotton. Wright et al. [42] and Rungis et al. [43] both used mapped RFLP markers to investigate the chromosomal location of genes conferring resistance to the bacterial blight pathogen. The mapping data suggest that the resistance locus segregates with a marker on chromosome 14 known to be linked to the broad-spectrum B12 resistance gene originally from African cotton cultivars. AFLPs and SSRs were also used to search for novel markers linked to the Xcm resistance locus to facilitate introgression of this trait into G. barbadense through MAS.

CONCLUDING REMARKS
A significant number of genomic resources and tools have become available in cottons, although cotton genomics research lags far behind that of other major crops such as rice, maize, wheat, and soybean. These resources and tools have allowed identification and mapping of many genes and QTLs of importance to cotton fiber quality, fiber yield, and biotic and abiotic stresses, and have helped address several significant questions in plant biology in general and in cotton in particular. Nevertheless, much effort is still needed to further develop the resources and tools and to make them readily usable in applications, so that they can be used fully and effectively in cotton genetic improvement and biology research. In particular, the following areas of cotton genomics research should be emphasized.
(i) Development of whole-genome BAC/BIBAC-based, integrated physical maps of cottons. No robust, whole-genome, BAC/BIBAC-based, integrated physical/genetic map has yet been developed for cottons. The maps should be developed for at least two species of Gossypium. One is the Upland cotton, which produces >90% of the world's cotton, whereas the other is G. raimondii, the species having the smallest genome among all Gossypium species and thus likely the highest density of genes. This research is emphasized because it has been proven in model and other species, including Arabidopsis, rice, Drosophila, human, mouse, and chicken, that whole-genome integrated physical/genetic maps provide powerful platforms and "freeways" for many, if not all, areas of modern genetics and genomics research ([110,123,127]; see above). These include not only genome sequencing (see below), but also development of closely linked, user-friendly DNA markers for any region or locus of the genome, fine QTL and gene mapping (see below), map-based gene/QTL cloning, and high-throughput and high-resolution transcript (unigene EST) mapping [150]. Development of the integrated physical maps will allow rapid and efficient integration of all existing genetic maps, mapped genes and QTLs, BAC and BIBAC resources, and cotton unigene ESTs, and will increase the efficiency and reduce the cost of research in all areas manifold.
(ii) QTL fine mapping. Many genes and QTLs that are important to cotton fiber yield, fiber quality, and biotic and abiotic stresses have been genetically mapped, but two problems are apparent. The first is that almost all of the QTLs were mapped using F2, BC1, or other early segregating generations in a single or a limited number of environments (Table 2). Since quantitative traits are readily subject to environmental variation, mapping results obtained from early generations in a single or a limited number of environments vary from experiment to experiment [59,60,106]. The other problem is that the genetic distances between DNA markers and most of the QTLs are too large for the markers to be used for MAS. Therefore, it is important to fine map the QTLs using large, advanced-generation or homozygous populations, such as RILs and DHs, in multiple environments, with closely linked DNA markers, for which integrated physical maps could be exploited. In addition to accurate mapping of the QTLs and development of DNA markers that are well suited (closely linked and user-friendly) for MAS, fine mapping is also an essential step toward the final isolation of the QTL genes by map-based cloning [134]. The isolated genes are not only sources for molecular breeding via genetic transformation, but also the most desirable basis for marker development for MAS because there is no recombination between a gene and its derived marker.
(iii) Sequencing of one or more key cotton genomes. Although it is costly with current sequencing technology, whole-genome sequencing is the most efficient method to discover and decode all cotton genes, and it provides the most desirable and finest integrated physical and genetic map of the cotton genome. Comparative genomics studies have demonstrated that gene content and order are highly conserved among the genomes of Gossypium species even though they differ significantly in genome size [16,24]. Based on this result, G. raimondii is an excellent choice to be sequenced because it has the smallest genome among all Gossypium species, although it is not cultivated. If an integrated physical map is available for the major cultivated cotton, G. hirsutum, which has a genome three-fold larger than that of G. raimondii, the sequence information of G. raimondii could be readily transferred to the cultivated cotton by using the BAC end sequences of its integrated physical map as anchors.
(iv) ESTs from nonfiber and nonovary tissues and fibers at the secondary cell wall deposition stage. As shown above, the number of cotton ESTs available in GenBank has increased significantly in the past few years; however, the distribution of the ESTs among tissue sources is extremely biased. The numbers of ESTs from both nonfiber/nonovary tissues and fibers at the secondary cell wall deposition stage (15-45 dpa), particularly after 20 dpa, are especially small. The former set of expressed genes, despite not contributing directly to fiber yield and quality, is nevertheless important to both, whereas the latter set of expressed genes undoubtedly contributes directly to fiber strength.
(v) Profiling and identification of genes involved in individual biological processes and conditions with emphasis on those involved in fiber development. Development and availability of cDNA- or unigene EST-based microarrays have provided unprecedented opportunities for research in molecular biology, functional genomics, and evolutionary genomics; however, cotton research in these areas is very limited. Identifying and characterizing genes that are involved in fiber development, plant growth and development, and responses to biotic and abiotic stresses will greatly facilitate our understanding of the molecular basis of these processes in cottons, and thus enhance breeders' ability to improve cotton genetically.
(vi) Translating the gene activities or expressions at different tissues and stages into fiber yield and fiber quality, thus assisting in cotton breeding. The genes that are involved in fiber initiation [104,140], elongation [100,104], and secondary cell wall deposition [100] have been identified from several genotypes of cottons, but it is not known what the up- or downregulation, or active expression, of fiber genes at a given developmental stage and organ means for final fiber yield and/or quality. For instance, does the active expression of a gene in fiber at the elongation stage indicate longer fibers? Studies in this regard are essential if gene expression data are to be used in cotton germplasm analysis and breeding.