Unlimited Thirst for Genome Sequencing, Data Interpretation, and Database Usage in Genomic Era: The Road towards Fast-Track Crop Plant Improvement

The number of sequenced crop genomes and associated genomic resources is growing rapidly with the advent of inexpensive next-generation sequencing methods. Databases have become an integral part of all aspects of scientific research, including basic and applied plant and animal sciences. Their importance keeps increasing as the volume of datasets from direct and indirect genomics, as well as other omics approaches, has expanded in recent years. The databases and associated web portals provide, at a minimum, a uniform set of tools and automated analyses across a wide range of crop plant genomes. This paper reviews some basic terms and considerations in the utilization of crop plant databases in the advancing genomic era. The utilization of databases for variation analysis, together with other comparative genomics tools and data interpretation platforms, is described. The major focus of this review is to provide knowledge on platforms and databases for genome-based investigations of agriculturally important crop plants. The utilization of these databases in applied crop improvement programs is still far from being widely achieved; unless this changes, the end of sequencing is not far away.


Introduction
The recently developed high-throughput methods for analyzing the structure and function of genes are collectively referred to as "genomics." Comprehensive information of this kind is currently available for only a few plants but is rapidly becoming available for most higher plants and several underutilized crop species. Public access to this information will allow biological selections to be exploited and will have a direct impact on the application of genomics to the improvement of economically important plants. Both obtaining the sequences of major plants and providing access to all sequence information for further applications are of prime importance. Therefore, the global biological community should have open-access databases for all plant genomes sequenced so far.
Plant databases are facilities or long-lived records that are systematically updated with the massive amounts of data generated as research outcomes across the whole field of plant biology, ensuring maximal accessibility and visibility for researchers in different fields of interest. These databases help researchers draw conclusions and formulate new hypotheses to address basic questions. Internet-accessible information has become an integral part of most scientific enterprises, including the plant sciences. It now seems impossible to conceive of significant future progress without the internet, the databases, and the many similar resources that the internet makes openly available. This is particularly true as information flows from genomics and other high-throughput technologies to all aspects of crop plant sciences. The ultimate goal of plant genomics is to improve our ability to identify genotypes with optimal agronomic traits in order to improve yield, a necessity given the increasing world population [1].

Omics Research on Crop Plants: Present Status
"Omics" refers to the collective technologies, made available in recent years, that are used to explore the roles, relationships, and actions of the various types of molecules that make up the cells of a living organism. The "omics" technologies include genomics (the study of genes and their function), proteomics (the study of proteins), metabolomics (the study of molecules involved in cellular metabolism), transcriptomics (the study of mRNA), glycomics (the study of cellular carbohydrates), and lipomics (the study of cellular lipids). These technologies provide the tools needed to look at differences in DNA, RNA, proteins, and other cellular molecules between species and among individuals of the same or different species. A combinatorial approach using multiple omics platforms and integrating their outcomes is now an effective strategy for clarifying the molecular systems integral to improving crop plant productivity (Figure 1). Recent progress in plant genomics and the utilization of genetic resources has allowed the discovery and isolation of important genes and the analysis of their functions in regulating yield and stress tolerance [2]. Technological advances in omics research, integrating animal and plant science, have become essential resources for investigating gene function in association with phenotypic changes. These advances include the development of high-throughput methods for profiling the expression of thousands of genes, for identifying modification events and interactions in the plant proteome, and for measuring the abundance of many metabolites simultaneously. In addition, large-scale collections of bioresources, such as mass-produced mutant lines and clones of full-length cDNAs, together with their integrative databases, are now available [3,4]. The importance of crop plant genetic resources and the insights that have emerged in recent years through genomics are well reviewed [5,6].
Recent high-throughput technological advances have provided opportunities to develop collections of sequence-based resources and related resource platforms for specific organisms. Various bioinformatics platforms have become essential tools for accessing omics datasets, for efficiently mining and integrating biologically significant knowledge, and for depositing it in databases for public access (Figure 1).

Crop Plant Genome Sequence Resources
In recent years, many crop plant genomes have been sequenced and the data made available to the public (Table 1). The collected sequence data provide essential genomic resources for accelerating the molecular understanding of biological properties and for promoting the application of such knowledge to the benefit of humans. The recent accumulation of nucleotide sequences of model plants and other crop species has provided fundamental information for the design of sequence-based research applications in functional genomics. Species-specific nucleotide sequence collections also provide opportunities to identify the genomic basis of phenotypic characters through genome-wide comparative analyses and knowledge of model organisms [46].

Rationale of Genome Sequencing Projects.
The recent revolution in DNA sequencing technology has brought down the cost of sequencing several crop plant species and has made the sequencing of an increasing number of genomes both feasible and cost effective [46]. The first plant genome, that of Arabidopsis, was completely sequenced in December 2000; it was the third complete genome of a higher eukaryote, and further studies have since been carried out on Arabidopsis thaliana and Arabidopsis lyrata [30,31]. After Arabidopsis, several crop plants have been sequenced (Table 1). These genomes reveal numerous species-specific details, including genome size, gene number, patterns of sequence duplication, a catalog of transposable elements, and syntenic relationships.

Contribution of Whole-Genome Resequencing.
Advances in next-generation sequencing (NGS) technology, coupled with the sequence data of many reference genomes, allow the discovery of variation among many crop plants. A whole-genome resequencing project to discover sequence variation in 1,001 strains (accessions) of Arabidopsis resulted in a dataset that became a fundamental resource for future genetic studies aimed at identifying alleles associated with phenotypic diversity across the entire genome and across the entire species (http://1001genomes.org/) [47,48]. In rice, a high-throughput method for genotyping recombinant populations was developed using whole-genome resequencing data generated by the Illumina Genome Analyzer [18], and the recent resequencing of 50 accessions of cultivated and wild rice has yielded markers for identifying agronomically important genes [49].

Analyzing Crop Plant Genome Sequences.
Galaxy (http://galaxyproject.org) is a software system that supports researchers through a framework offering simple interfaces to powerful data interpretation tools. Galaxy is a web-based framework designed for experimental and computational biologists in all fields of biological science; with it, analysis tools can easily be used through a web-based interface [50]. Another tool, made available by the Sanger Institute (http://www.sanger.ac.uk/), is Artemis, a free genome browser and annotation tool that allows visualization of sequence features, next-generation data, and the results of analyses [51]. The Broad Institute's Genome Sequencing and Analysis Program (GSAP) plays a major role in providing analysis tools for genome sequences coming out of NGS platforms in all biological fields (http://www.broadinstitute.org/).

Crop Plant Genome Resources and Variation Analysis
Genome-wide structural variation and gene content variation are both hypothesized to drive important phenotypic variation within a crop plant species. Previous studies have assessed both types of variation in several crops using array hybridization and targeted resequencing. Genetic variation within and between species is most commonly quantified by single nucleotide polymorphisms (SNPs). In recent years there has been increased interest in resolving genetic differences in terms of structural variation (SV), which includes copy number variation (CNV) caused by large insertions and deletions, as well as other types of rearrangement such as inversions and translocations. CNV, together with other forms of SV, is thought to be an important factor in determining phenotypic variation for a wide range of traits in both crop plant and animal species (reviewed in [52]).

Role of Molecular Markers.
Among the various DNA markers available to the research community, simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) are the most widely used today. SSRs have been shown to have a high degree of transferability between species and can readily be transferred to related species to amplify the corresponding locus. SNPs represent the most frequent type of genetic polymorphism and may therefore provide a higher density of markers near a locus of interest than SSRs. The high density of SNPs makes them valuable for genome mapping; in particular, they allow the generation of ultrahigh-density genetic maps and haplotyping systems for genes or regions of interest, as well as map-based positional cloning in crop plants. SNPs are used routinely in crop breeding programs for genetic diversity analysis, cultivar identification, phylogenetic analysis, characterization of genetic resources, and association with agronomic and physiological traits in both cereals and legumes [53,54]. The application of SNP markers to the genetic dissection of complex traits such as δ13C and δ15N in legumes like soybean has also increased as high-density SNP chips have become available [55–57].
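As a rough illustration of how SNP calls might be screened before they are used as markers, the following minimal sketch computes the minor allele frequency (MAF) of biallelic SNPs from diploid genotype calls and keeps those above a threshold. The function name, the genotype encoding (two-letter strings, with None for a missing call), the example data, and the 0.05 cutoff are all assumptions made for illustration; they are not taken from the studies cited above.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the minor allele frequency of one biallelic SNP.

    `genotypes` is a list of diploid calls such as "AA", "AG", or "GG".
    Missing calls are represented by None and are skipped.
    """
    allele_counts = Counter()
    for call in genotypes:
        if call is None:
            continue
        allele_counts.update(call)  # counts each of the two alleles in the call
    total_alleles = sum(allele_counts.values())
    if total_alleles == 0 or len(allele_counts) < 2:
        return 0.0  # no data, or the site is monomorphic in this sample
    return min(allele_counts.values()) / total_alleles


# Hypothetical genotype calls for three SNPs scored on five lines.
snp_calls = {
    "snp_1": ["AA", "AG", "GG", "AA", None],
    "snp_2": ["CC", "CC", "CC", "CC", "CC"],
    "snp_3": ["TT", "TC", "TT", "TT", "TT"],
}

MAF_THRESHOLD = 0.05  # illustrative cutoff, not a recommendation

informative = [
    name
    for name, calls in snp_calls.items()
    if minor_allele_frequency(calls) >= MAF_THRESHOLD
]
print(informative)
```

In this toy set, the monomorphic snp_2 is dropped, which reflects why polymorphic, reasonably frequent SNPs are the ones that carry information for mapping and diversity analysis.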

Biparental QTL Mapping.
Quantitative trait loci (QTLs) identified for a trait of interest that account for a large share of the phenotypic variation are considered major QTLs. After validation in the desired germplasm, these QTLs can be used to introgress the trait from donor genotypes (generally those used to identify the QTL) into elite cultivars or breeding lines (recipient parents) without transferring undesirable genes from the donors (linkage drag). This process, commonly referred to as marker-assisted backcrossing (MABC), is widely employed by plant breeders. Superior lines or cultivars are developed that contain only the major QTL from the donor parent while retaining the whole genome of the recurrent parent [58]. MABC has been used extensively for introgression of resistance to biotic and abiotic stresses in crop plants. To overcome the limitations of MABC, particularly when multiple QTLs control the expression of a complex trait, the marker-assisted recurrent selection (MARS) approach, which involves intermating selected individuals in each selection cycle, has been recommended [59,60]. MARS generally uses an F2 base population and can be applied in self-pollinated crops such as wheat, barley, and chickpea to develop pure lines with superior per se performance (for more details, see [60]). MARS has the additional advantage of overcoming the inadequate improvement in the frequency of superior alleles seen with F2 enrichment, since marker-assisted selection (MAS) is practiced in each cycle following intermating to raise the frequency of favourable alleles [59].
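To make the idea of recovering the recurrent parent genome concrete, the sketch below uses the standard textbook expectation that, without any marker-based background selection, the average recurrent-parent share of the genome after n backcrosses is 1 - (1/2)^(n+1). This expectation applies to regions unlinked to the introgressed QTL and is an assumption brought in for illustration, not a result from the cited work; the function names are hypothetical.

```python
def expected_recurrent_parent_fraction(backcross_generations):
    """Expected recurrent-parent genome fraction after n backcrosses.

    n = 0 corresponds to the F1 (half of the genome from each parent).
    The value is an average over unlinked regions with no marker selection.
    """
    if backcross_generations < 0:
        raise ValueError("number of backcross generations must be >= 0")
    return 1.0 - 0.5 ** (backcross_generations + 1)


def backcrosses_needed(target_fraction):
    """Smallest number of backcrosses whose expectation reaches the target."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target fraction must be in [0, 1)")
    generations = 0
    while expected_recurrent_parent_fraction(generations) < target_fraction:
        generations += 1
    return generations


for n in range(1, 5):
    print(f"BC{n}: {expected_recurrent_parent_fraction(n):.4f}")

print(backcrosses_needed(0.99))
```

Under this expectation, BC1 carries 75% and BC3 about 94% of the recurrent parent genome on average, and six backcrosses are needed to exceed 99%. Selecting on background markers in each generation is what allows MABC to reach a similar recovery in fewer generations.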

Genome-Wide Association Analysis.
Genome-wide association analysis (GWAS) is a powerful approach for identifying the causal genetic polymorphisms underlying both simple and complex traits in crop plants. Advances in genomics have provided alternative tools to improve efficiency in plant breeding programs. Molecular markers linked to the causal genes and/or QTLs can be used for marker-assisted selection (MAS) [61]. Recent advances in genome sequencing and single nucleotide polymorphism (SNP) genotyping have increased the applicability of association analysis for QTL mapping in crop plants [62,63]. Genome-wide association analyses with SNP markers have been conducted for several important traits in many plant species, including Arabidopsis thaliana [64], maize [65], rice [66], and soybean [67–69], and also in tree crops such as peach [70].
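The core of an association scan can be sketched as a marker-by-marker test of whether allele dosage tracks the phenotype. The example below ranks SNPs by the squared Pearson correlation between dosage (0, 1, or 2 copies of one allele) and a trait value. It is a deliberately simplified sketch: the data are invented, the function names are hypothetical, and it omits the corrections for population structure and kinship, as well as the significance testing, that published GWAS rely on.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker or constant trait carries no signal
    return sxy / math.sqrt(sxx * syy)


def single_marker_scan(dosages, phenotypes):
    """Return (snp_name, r_squared) pairs, strongest association first."""
    results = []
    for name, dosage in dosages.items():
        r = pearson_r(dosage, phenotypes)
        results.append((name, r * r))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical trait values for eight lines and dosages at three SNPs.
phenotypes = [10.0, 10.5, 12.1, 11.9, 14.2, 13.8, 12.0, 10.2]
dosages = {
    "snp_a": [0, 0, 1, 1, 2, 2, 1, 0],  # tracks the trait closely
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # largely unrelated to the trait
    "snp_c": [1, 1, 1, 1, 1, 1, 1, 1],  # monomorphic in this panel
}

ranking = single_marker_scan(dosages, phenotypes)
for name, r_squared in ranking:
    print(name, round(r_squared, 3))
```

Markers at the top of such a ranking are candidates for linkage with causal polymorphisms, which is the property that makes them usable for MAS.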

Genomic Selection.
Genomic selection (GS) is a reliable, relatively simple, and powerful approach used in crop plant species, in which the breeding values of genotypes or cultivar lines are predicted from their marker genotypes and phenotypes [71]. GS captures the small-effect QTLs that govern the variation, including epistatic interaction effects. GS has been used successfully in wheat, maize, and soybean [71–73]. The accuracy of GS depends on genotype × environment (G × E) interaction, and a major challenge of GS is to arrive at accurate genomic estimated breeding values (GEBVs) in the presence of G × E interaction. In recent years, the application of GS has been extended to other plants such as Arabidopsis, sugarcane, and sugar beet.
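One way to picture how GEBVs are obtained is ridge regression on marker dosages, a commonly used family of GS models in which all marker effects are estimated together and shrunk toward zero. The sketch below is an illustration under that assumption and is not necessarily the model used in the cited studies; the training data, the penalty value, and all function names are hypothetical, and G × E interaction is ignored.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ridge_marker_effects(markers, phenotypes, ridge_penalty):
    """Estimate marker effects by solving (X'X + penalty * I) beta = X'y.

    Markers and phenotypes are centered first; the penalty must be > 0.
    """
    n_lines, n_markers = len(markers), len(markers[0])
    mean_y = sum(phenotypes) / n_lines
    means = [sum(row[j] for row in markers) / n_lines for j in range(n_markers)]
    xc = [[row[j] - means[j] for j in range(n_markers)] for row in markers]
    yc = [y - mean_y for y in phenotypes]
    xtx = [
        [
            sum(xc[i][j] * xc[i][k] for i in range(n_lines))
            + (ridge_penalty if j == k else 0.0)
            for k in range(n_markers)
        ]
        for j in range(n_markers)
    ]
    xty = [sum(xc[i][j] * yc[i] for i in range(n_lines)) for j in range(n_markers)]
    return mean_y, means, solve_linear_system(xtx, xty)


def predict_gebv(model, genotype):
    """Predict the breeding value of an unphenotyped line from its markers."""
    mean_y, means, effects = model
    return mean_y + sum((g - m) * e for g, m, e in zip(genotype, means, effects))


# Hypothetical training population: 8 inbred lines, 3 markers coded 0 or 2.
training_markers = [
    [0, 0, 0], [0, 0, 2], [0, 2, 0], [0, 2, 2],
    [2, 0, 0], [2, 0, 2], [2, 2, 0], [2, 2, 2],
]
training_phenotypes = [10.0, 10.0, 10.5, 10.5, 12.0, 12.0, 12.5, 12.5]

model = fit_ridge_marker_effects(training_markers, training_phenotypes, 2.0)
print([round(effect, 3) for effect in model[2]])
print(round(predict_gebv(model, [2, 2, 0]), 3))
```

In this toy case the first marker receives the largest estimated effect and the third receives none, so an unphenotyped candidate carrying the favourable alleles at the first two markers is ranked above one carrying neither. Because the penalty keeps the system solvable, the same approach still works when markers outnumber phenotyped lines, which is the usual situation in GS.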

Application of Molecular Platforms for Variation Analysis.
High-throughput polymorphism analysis is an essential tool for any genetic map-based approach, and a number of platforms have been developed and applied to genetic map construction, marker-assisted selection, and QTL cloning using multiple segregating populations in major crop plants. These genotyping systems have been used successfully in the post-genome-sequencing era, with projects extending to the genotyping of genetic resources, the identification of their population structure, and the association of phenotypic values with genomic regions. This recent expansion of analysis platforms provides an essential resource for the "variome" study of crop plants. The demand for high-throughput and cost-effective platforms for comprehensive variation analysis (also called variome analysis) has increased rapidly. Whole-genome resequencing approaches are already being realized as a direct solution for variome analysis in species for which reference genome sequence data are available [74,75]. Diversity Array Technology (DArT) is a high-throughput genotyping system based on a microarray platform (http://www.diversityarrays.com/index.html) [76]. In various crop species such as wheat, barley, and sorghum, DArT markers have been used together with conventional molecular markers to construct denser genetic maps and perform association studies [77–79]. The Illumina GoldenGate assay allows the simultaneous analysis of up to 1,536 SNPs in 96 samples and has been used to genotype segregating populations in order to construct genetic maps of SNP markers in crops such as barley, wheat, soybean [80–82], and peach [70,83]. More recently, Infinium iSelect HD and HTS custom genotyping BeadChips covering 3K to 700K markers have become available for the high-throughput genotyping of SNPs, indels, and CNVs.

Databases for Variation Analysis.
Characterizing the genetic basis of variation in crop plants and linking it to observable traits will provide an important framework for understanding evolutionary patterns and population structure and could increase the efficiency of selection in crop plant breeding programmes.
GRAMENE. The Genetic Diversity Database in GRAMENE specializes in the storage of genotypes, phenotypes and their environments, germplasm, and association data. It uses the Genomic Diversity and Phenotype Data Model (GDPDM) database schema, which efficiently stores anything from small-scale SSR diversity studies to large-scale SNP/indel-based genotype-phenotype studies with billions of allele calls [84].
The Plant Variation Mart Database. It holds a catalogue of DNA variants, covering single nucleotide polymorphisms (SNPs) and insertions/deletions (indels), for Arabidopsis, rice, and grape.

Crop Plant Comparative Genomics Resources
RadishBase: a database for genomics and genetics of radish. A database containing radish pathways predicted from unigene sequences is also included in RadishBase.
RoBuST: an integrated genomics resource for the root and bulb crop families Apiaceae and Alliaceae. The RoBuST database has been developed to initiate a platform for collecting and organizing genomic information useful for root and bulb vegetable (RBV) researchers.
WhETS (Wheat Estimated Transcript Server): a tool to provide the best estimate of hexaploid wheat transcript sequence.

... their analysis theoretically challenging and highly demanding of computational resources. Issues also include genome size, polyploidy, and the quantity, diversity, and dispersed nature of data in need of integration.
Plant Trait Database. The main aim of the TRY database (https://www.try-db.org/TryWeb/Home.php) is to bring together the different plant trait databases worldwide into a comprehensive web archive of the functional biodiversity of plants at the global scale by assembling, harmonizing, and distributing published and unpublished data on functional plant traits, as well as a wide range of ancillary methodological and environmental information. It contains 3 million trait records for 750 traits of 1 million individual plants, representing 69,000 plant species [85,86].
TransPLANT. Recently, 11 European partners came together to address growing database challenges and to develop a transnational database called "transPLANT" (http://www.transplantdb.eu/about) to help meet increasing database needs.
Bringing together groups with strengths in data analysis, plant science, and computer science and from the academic and commercial sectors, transPLANT has developed integrated standards and services and undertaken new research and development needed to capitalize on the sequencing revolution, across the spectrum of agricultural and model plant species.
PlantsDB. This is another database commonly used by researchers at various levels; it comprises database instances for tomato, Medicago, Arabidopsis, Brachypodium, Sorghum, maize, rice, barley, and wheat. Building on these, state-of-the-art comparative genomics tools such as CrowsNest are integrated to visualize and investigate syntenic relationships between monocot genomes. Results from novel genome analysis strategies targeting the complex and repetitive genomes of Triticeae species (wheat and barley) are provided and cross-linked with model species [87,88].

Application of Comparative Genomics Platforms.
Advances in genomic tools have given researchers in the plant science community a strong boost in understanding the functional roles of genes and their evolutionary histories. Recently, resequenced genomes of additional accessions of a reference species have been made available [89], improving the understanding of genomic variation. Comparing genomes gives insights into the evolution and adaptation of species to specific environments that the gene information provided by a single genome cannot. Comparative genomics studies incur additional cost, and as the number of available genomes increases, large-scale analyses become increasingly difficult for nonexperts, making computational biologists essential [17]. Furthermore, biological variation between species and differences in sequence quality add to the complexity of evolutionary analyses. Therefore, platforms for comparative genomics that address some of these challenges are valuable resources for experimental biologists [90,91]. Comparative genomics has proven to be a valuable approach to understanding biology, not only for dissecting patterns and processes of genome evolution but also for revealing aspects of gene function. Rapid advances in technology, both for sequencing and for determining expression and interaction patterns, will continue to propel comparative genomics research in the near future.

Emerging Databases for Comparative Genomics Analysis.
To cope with the increase in data resulting from the growing number of sequenced plant genomes and inexpensive NGS technologies, the recently developed and improved Phytozome database (http://www.phytozome.net) provides a comparative hub for the analysis of crop plant genome and gene family data. The number of sequenced crop plant genomes is increasing rapidly and, at the same time, comparative sequence analysis has significantly changed our view of the complexity of gene function, genome organization, and regulatory pathways. To explore all this genome information, a centralized infrastructure is required in which the data generated by different sequencing initiatives are integrated and combined with advanced methods for data mining.
PLAZA. PLAZA is an online platform for plant comparative genomics (http://bioinformatics.psb.ugent.be/plaza/) that integrates the functional and structural annotation of published crop plant genomes with a large set of interactive tools for studying gene function and gene and genome evolution. PLAZA provides precomputed datasets covering intraspecies dot plots, whole-genome multiple sequence alignments, homologous gene families, phylogenetic trees, and genomic colinearity between species. Overall, PLAZA provides a comprehensive and up-to-date research environment to aid researchers in the exploration of genome information [92].
GreenPhylDB. GreenPhylDB is a component of the South Green Bioinformatics Platform (http://southgreen.cirad.fr/) and is open to public access (http://greenphyl.cirad.fr). It is a database designed for functional and comparative genomics studies based on complete genomes. GreenPhylDB contains sixteen full genomes of members of the kingdom Plantae, ranging from algae to angiosperms, automatically clustered into gene families. The database offers various lists of gene families, including plant-, phylum-, and species-specific gene families. Gene families are manually annotated and then analyzed phylogenetically in order to elucidate orthologous and paralogous relationships. It enables comparative genomics in a broad taxonomic context to enhance the understanding of evolutionary processes and thus tends to speed up gene discovery [91].

Tools Needed for Data Interpretation and Utilization for Crop Improvement
All crop plant databases should be equipped with tools ranging from basic statistics to advanced sequence analysis. Because sequence information has been made publicly available for several crop plant genomes, data interpretation tools should be developed within the databases so that researchers can access them easily.
In reality, many potential users will not use the available resources, for reasons that include a lack of basic training in bioinformatics, resources from which it is too difficult to learn and extract data, and simple inertia in learning new tools. Training scientists for the current and future bioinformatics landscape is essential. Part of the solution is time, since younger researchers are more attuned to the importance of bioinformatics than many established researchers. But more formal training in all aspects of bioinformatics tools, including database essentials and use, should be provided for all future biological scientists. Built-in tools for QTL linkage mapping, association mapping, genomic selection, and more will help plant researchers use the tool of interest and speed up the process of crop improvement.

Need for More Applied Research in Crop Plants
Like quantitative trait locus (QTL) studies, genome sequencing projects have provided much of the raw data for most model and cultivated crops, shaping our view of genetics and evolution over the past two decades.
Although such data are valuable teaching material for understanding the complete architecture of an organism, no applied research has so far been undertaken in many of the sequenced crops whose genomes are already publicly available (i.e., the research impact is the same as in the presequencing era), and such work is now merely a pleasure to read, with beautiful chromosome maps and dizzying Venn diagrams. For instance, cereal genome sequencing (rice, wheat, sorghum, etc.) has been completed, yet no demonstrated work on cultivar development has been published or undertaken for wider applied research. Genome papers have been the bread and butter of evolutionary biologists and geneticists for decades [93]. Researchers jump from one genome sequence to the next, looking to score a major publication and aiming at long-running project funding, as some donors encourage them to do [93]. Everyone would like to view genome sequencing projects optimistically (any innovation takes time to influence the community), as something that can help break some of the genetic bottlenecks for crop improvement in the early part of the 21st century. We should all agree that every genome sequencing project should be deliberately designed to study gene function in addition to structural architecture, because applied research is badly needed to address the ongoing multisector crisis, including agricultural food production on marginal lands. Product-oriented research will have more impact than basic research alone. If more applied research is not undertaken, "genome-based research" could soon be dead, which would affect applied breeding for new cultivar development in food crops. Food security will remain a critical challenge in the coming decades: populations are growing unexpectedly fast in most developing countries, and a novel agricultural research system should be in place to feed more than 9 billion people around the world in 2050.

Major Limitation of the Databases
As new sequencing technologies come online and costs continue their downward trend, there will always be "more" worthy sequencing projects. Already we see multiple sequencing projects within the same genus or species, with both the japonica and indica genomes of rice (Oryza sativa) sequenced and additional Arabidopsis genome projects following that of Arabidopsis thaliana. Making crop plant databases and related bioinformatics tools easily accessible to the research community is going to be a continual problem. As the volume of data and the power of computers increase, what lags behind is the software needed to use that potential fully and the expertise of users in accessing it. The amount of sequence data generated in crop plant research has increased dramatically over the last few years and will continue to accelerate in the near future. Researchers will want the complete genome sequence of every line of every organism under study; thus, an effectively unlimited thirst for sequence information can be expected in the near future. There will be whole genomes of additional plants, the already mentioned sequences of additional versions of plant genomes, and intense resequencing of specific regions over tens, hundreds, and thousands of genomes. Custom microarrays are already made to resequence hundreds of thousands of dispersed DNA sequences. Resequencing to discover SNPs allows rapid genotyping through various array technologies. Currently, planning is based on the minimal number of SNPs necessary for a given program, but as costs decline and higher resolutions come within range of breeding programs, the density of desired SNPs may approach the whole-genome level. There will also be more integration of data as knowledge, databases, and analysis tools interlink. Functional genomics data on mRNA transcription and expression will tie in with proteomic analyses and metabolomics of entire plants.

Conclusions
The implications of genomics for crop production can be envisioned on many fronts, since fundamental advances in genomics will greatly accelerate the acquisition of knowledge, which in turn will directly affect many aspects of the processes associated with crop plant trait improvement and thereby productivity in a given environment. The complexity of possible higher-order interactions can only be speculated on until much more information is available, but a reasonable assumption is that it will dwarf our current limited views. More voluminous and complex data will require better visualization and final validation: better graphic tools to consolidate and summarize, and integration of data in a flexible manner that can be customized to each researcher's requirements. There will be more adoption of simultaneous data presentations, and the near future will involve ever more powerful computers, greater computational capability, sophisticated displays and interpretation tools, and greater practical expertise in the capabilities and exploitation of databases. Unless all these datasets are utilized in applied, product-oriented breeding programs, the sequence data will simply remain in the database network with their obituary notes. Hence, scientists need to give critical attention to, and hold discussions within and among disciplines on, the applied outcomes of these platforms, so that their novel research is better recognized as contributing to the betterment of humankind.

Conflict of Interests
Arun Prabhu Dhanapal and Mahalingam Govindaraj approve this paper and declare that they do not have any conflict of interests.