Extending MapMan Ontology to Tobacco for Visualization of Gene Expression

Microarrays are a large-scale expression profiling method which has been used to study the transcriptome of plants under various environmental conditions. However, manual inspection of microarray data is difficult at the genome level because of the large number of genes (normally at least 30 000) and the many different processes that occur within any given plant. MapMan software, which was initially developed to visualize microarray data for Arabidopsis, has been adapted to other plant species by mapping other species onto MapMan ontology. This paper provides a detailed procedure and the relevant computing codes to generate a MapMan ontology mapping file for tobacco (Nicotiana tabacum L.) using potato and Arabidopsis as intermediates. The mapping file can be used directly with our custom-made NimbleGen oligoarray, which contains gene sequences from both the tobacco gene space sequence and the tobacco gene index 4 (NTGI4) collection of ESTs. The generated dataset will be informative for scientists working on tobacco as their model plant by providing a MapMan ontology mapping file for tobacco, homology between tobacco coding sequences and those of potato and Arabidopsis, and a procedure and codes that can be adapted for other plant species where the complete genome is not yet available.


Introduction
Plants, being sessile organisms, must react and acclimatize to abiotic stresses to survive in various environmental conditions. Plants have developed various stress tolerance mechanisms, such as physiological and biochemical alterations, that result in adaptive or morphological changes. In crop production, understanding how cultivated crops respond to abiotic stress is crucial in developing new varieties that could tolerate stress without affecting potential yield. With the rapid development of technologies for functional genomics research, comprehensive analyses at the mRNA, protein, and metabolite levels have become possible. This is leading to increased understanding of the complex regulatory networks associated with stress adaptation and tolerance [1].
Currently, microarrays are one of the most popular technologies for large-scale expression profiling because they allow the simultaneous detection of tens of thousands of transcripts at a reasonable cost [2]. The development of gene chips for model plants like Arabidopsis and rice and other species that have a sequenced genome has led to genome-wide transcriptional profiling from diverse tissues. This is a key tool for the identification of novel target genes for functional genomics [3]. Studies using microarrays to characterize abiotic stress responses have been reported for model species such as the moss Physcomitrella patens [4], Arabidopsis thaliana [5,6], Medicago truncatula [7,8], and rice [9], as well as nonmodel species such as soybean [10] and Musa [11]. However, microarrays generate huge amounts of data, which are often in the form of lists of differentially expressed genes. Manual inspection of these data is time consuming, and the volume and variety of information create a problem in interpretation. This is compounded when transcriptomic analysis is combined with other OMICS data. Development of new, more reliable methods of data analysis and visualization will enable easier interpretation of results and thus a greater contribution to explaining the biological problem [12]. Prior to the release of MapMan in 2004 [13], several bioinformatics tools had been developed to visualize datasets in the context of biological pathways. These include GenMAPP (http://www.genmapp.org/) and BioMiner [14], among others. However, their application to plant datasets is limited for two reasons. Firstly, these tools were developed for microbial and animal systems and, secondly, flexibility is limited in terms of the display of family members (e.g., class of enzymes) [13]. This limitation was addressed by MapMan software [13], which relies on its own ontology to classify genes and metabolites and visualizes the pathways and processes in pictorial diagrams in a modular system [15]. Currently, there are several standalone or web-based visualization tools for biological pathways, such as Pathway Tools Omics Viewer (http://biocyc.org/) [16], KaPPa-View (http://kpv.kazusa.or.jp/en/) [17], Blast2Go (http://www.blast2go.com/b2ghome) [18], and KEGG Atlas (http://www.kegg.jp/kegg/atlas/) [19].
MapMan was initially developed to analyze two sets of 22K Affymetrix arrays that investigated the response of Arabidopsis rosettes to low sugar [13]. Rapid advances in sequencing have resulted in full-genome sequences for an increasing number of important crop species (e.g., soybean, rice, maize, papaya, sorghum, and corn). These genome sequences have facilitated the development of large-scale whole-genome arrays. MapMan software can be applied in new species by transferring the MapMan ontology to the transcripts and proteins of the studied species [15]. Several studies have been reported in extending MapMan ontology to other species such as soybean [20], cotton [21,22], grapevine [23], maize [15], Musa [11], potato [24], and tomato [25].
MapMan, as a complement to existing visualization tools, offers several advantages. Aside from the ease of use, MapMan can display time-based experiments in pictorial format. It can superimpose different datasets as overlay plots, which simplifies the identification of shared features on a global and gene-to-gene basis [26].
Tobacco is a popular model plant in recombinant technology because of its well-established gene transfer and regeneration methodologies as well as the availability of many robust expression cassettes for the control of transgene expression [27]. However, the disadvantage of tobacco is that it is an allotetraploid and its genome is not yet fully sequenced. A large gene space sequence project was performed for tobacco (http://www.tobaccogenome.org/), and the gene space reads have been deposited as individual unassembled reads at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This is an excellent resource but does not cover the entire genome. The tobacco cultivar Bright Yellow 2 (BY-2) cell line is an important model system to study cell physiology, hormone signaling, cell cycle, cell growth, and stress situations [28]. However, tobacco is still a mostly unsequenced and relatively unannotated plant system in which identification of the proteins and their interactions relies on cross-species identification based on homology and orthology [29].
In this paper, we extend MapMan ontology of sequenced dicot plants to generate a mapping dataset for tobacco. The dataset we have generated from this study will be a tool for scientists working on tobacco as their model plant by providing a MapMan ontology mapping file for tobacco and homology comparisons between tobacco coding sequences and those of potato and Arabidopsis. In addition, we provide a method and the required computer codes to generate MapMan mapping files which may be adapted for other plant species where the complete genome is not yet available. The code and data package can be downloaded from http://maurice.vodien.com/datasets/MapMan-Tobacco.rar.
from the identification of individual differentially regulated genes (Figure 3). The tobacco MapMan mapping file based on the tobacco gene index 4 (NTGI4) and tobacco genomic survey sequences therefore revealed new insights into water stress responses in tobacco.

Dataset Description
The dataset associated with this Dataset Paper consists of 9 items which are described as follows.

Dataset Item 2 (Nucleotide Sequences).
Tobacco genomic survey sequences used as an input file for nucleotide BLAST (blastn) and translated nucleotide BLAST (tblastx).

Dataset Item 3 (Nucleotide Sequences).
Potato DNA coding sequence used to generate nucleotide BLAST database.

Dataset Item 4 (Nucleotide Sequences).
TAIR9 coding sequence used to generate nucleotide BLAST database.

The processed BLAST result file for regenerating the required MapMan ontology map file. The Query ID column is the identifier for the query sequence from either NTGI transcript or Tobacco GSS, which is also the source of the Identifier column in the ontology map file, while the Mapped ID column is the identifier of the hit sequence from either TAIR9 CDS or Potato CDS. Query Length, Global Identity, E-Value, Query Source, BLAST Database, and BLAST Method are attributes containing the source data for concatenation into the Description column in the ontology map file.

The resulting MapMan ontology map file, containing 5 columns as per the MapMan ontology map file format: Bincode, Name, Identifier, Description, and Type. Through our own map file generation and use, we found two limitations in the format of the map file. Firstly, only the Bincode, Identifier, and Type columns are mandatory and used by MapMan [13]. The Bincode and Identifier columns are the MapMan ontology bin identifier and microarray probe identifier, respectively. In our case, the Identifier column refers to the NTGI4 or tobacco genomic survey sequence identifier, which is also used as a probe identifier in our custom microarray. The Type column defaults to "T". Secondly, the Name column is the name descriptor used by MapMan [13] for displaying the ontology in a tree format, together with the Bincode, even though it is not mandatory. However, if the Name column is used, each Bincode can only be mapped to one Name. As we had combined two primary map files (potato and Arabidopsis) to generate a tobacco map file, we found that the Name column may not be consistent with the Bincode, which resulted in errors. Thus, the Name column is not used and is left blank. We used the Description column as a composite of 6 attributes to describe the BLAST process. The six attributes are as follows: Query Length to denote the length of the query sequence in number of bases; Global Identity to denote the global sequence identity between the query sequence and the matched sequence in the BLAST database; E-value to denote the expectation value from the BLAST hit; Query Source to denote the source of the query sequence (and hence of the Identifier), which is either "NTGI transcript" or "Tobacco GSS"; BLAST Database to denote the source of sequences used to generate the custom BLAST database, which is either "TAIR9 CDS" or "Potato CDS"; and BLAST Method to denote the BLAST program used, which is either "blastn" or "tblastx".
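The map file layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code shipped with the dataset: the function names, the tab delimiter, the "name: value" layout of the Description string, and the example identifier and bin code are all hypothetical. Only the five column names, the blank Name column, the Type value "T", and the six Description attributes come from the text.

```python
import csv
import io

# Column order as listed in the text; header spelling is an assumption.
COLUMNS = ["Bincode", "Name", "Identifier", "Description", "Type"]

# The six attributes that the text says are concatenated into Description.
DESCRIPTION_FIELDS = [
    "Query Length", "Global Identity", "E-Value",
    "Query Source", "BLAST Database", "BLAST Method",
]


def make_description(hit):
    """Join the six BLAST attributes into one string (layout is assumed)."""
    return "; ".join(f"{name}: {hit[name]}" for name in DESCRIPTION_FIELDS)


def make_row(bincode, hit):
    """Build one map file row; Name is left blank and Type is 'T'."""
    return [bincode, "", hit["Query ID"], make_description(hit), "T"]


def write_map_file(rows, handle):
    """Write a header plus rows as tab-delimited text (delimiter assumed)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical processed BLAST record and bin code, for illustration only.
    example_hit = {
        "Query ID": "EXAMPLE_ID_1",
        "Query Length": 450,
        "Global Identity": 87.5,
        "E-Value": 1e-40,
        "Query Source": "NTGI transcript",
        "BLAST Database": "Potato CDS",
        "BLAST Method": "blastn",
    }
    buffer = io.StringIO()
    write_map_file([make_row("29.4", example_hit)], buffer)
    print(buffer.getvalue())
```

Leaving Name empty in every row sidesteps the one-Name-per-Bincode constraint that arises when two primary map files are merged.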

Concluding Remarks
Figure 2 shows that the MapMan ontology mapping file for tobacco that we have generated does indeed work. It shows the changes in expression level of genes associated with regulatory processes after water stress in both roots and leaves.
The blue color shows genes that are upregulated at the mRNA level and the red color shows genes that are downregulated. The darkest color represents at least 8-fold change in mRNA level. Based on the MapMan results, we can clearly identify areas of primary and secondary metabolisms that are subject to regulation during water stress. These genes are therefore identified as potential targets for improving drought responses.
The successful use of the MapMan ontology mapping file for tobacco (Figure 2) illustrates that our strategy of going via potato has been a good one. This is because each unassembled gene space read from tobacco that is present on the oligo array may only contain a short part of an exon and this may not correspond to any protein sequence in the more distantly related Arabidopsis proteome. Potato is much more closely related to tobacco as it is also a member of the Solanaceae. This means that most fragmentary tobacco sequences can be assigned to a corresponding full-length potato sequence that will contain conserved domains that allow identification of the protein. This full-length potato protein sequence will in the majority of cases have a similar type of protein in Arabidopsis for mapping purposes. We propose that adapting our procedure and codes for other plant species where the complete genome is not yet available will facilitate MapMan ontology mapping for those plant species.
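The transfer of bins through reference species can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, sequence identifiers, and bin codes are hypothetical, and pooling the potato-based and Arabidopsis-based assignments by set union is an assumption about how the two primary map files were combined.

```python
def transfer_bins(best_hits, reference_bins):
    """Give each tobacco sequence the bins of its best reference hit.

    best_hits: {tobacco_id: reference_id}, e.g. from a BLAST search.
    reference_bins: {reference_id: [bincode, ...]} from a primary map file.
    Tobacco sequences whose hit has no bin are skipped.
    """
    transferred = {}
    for tobacco_id, reference_id in best_hits.items():
        bins = reference_bins.get(reference_id)
        if bins:
            transferred[tobacco_id] = set(bins)
    return transferred


def combine_assignments(*assignments):
    """Pool bin assignments from several reference species by set union."""
    combined = {}
    for assignment in assignments:
        for tobacco_id, bins in assignment.items():
            combined.setdefault(tobacco_id, set()).update(bins)
    return combined


if __name__ == "__main__":
    # Hypothetical data: a short read that only matches potato, and a
    # transcript that matches both reference species.
    via_potato = transfer_bins(
        {"tob_read_1": "pot_1", "tob_tc_2": "pot_2"},
        {"pot_1": ["29.4"], "pot_2": ["20.2"]},
    )
    via_arabidopsis = transfer_bins(
        {"tob_tc_2": "at_1"},
        {"at_1": ["20.2", "17.1"]},
    )
    print(combine_assignments(via_potato, via_arabidopsis))
```

In this sketch a fragmentary read that has no usable Arabidopsis hit still receives a bin through its potato match, which is the situation the paragraph above describes.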

Figure 2: MapMan visualization of changes in expression levels of genes associated with regulatory processes using our MapMan ontology mapping file for tobacco (Nicotiana tabacum L.). Blue denotes upregulation and red denotes downregulation. Intense blue or red denotes fold changes of 8-fold or more.
Processing BLAST XML to comma-delimited file.
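One way to carry out this step is sketched below with the Python standard library. It is not the script distributed with the dataset. It assumes the BLAST results are in the standard NCBI BLAST XML output, that the first Hit and first Hsp of each Iteration represent the best match, and that the sequence identifier is the first token of the query and hit definition lines. The function names and output column choice are hypothetical. Global Identity is not computed because the text does not state how it was derived; the raw identity count and alignment length are written instead.

```python
import csv
import xml.etree.ElementTree as ET

HEADER = ["Query ID", "Mapped ID", "Query Length",
          "E-Value", "Identities", "Alignment Length"]


def best_hits(xml_source):
    """Yield one row per query that has at least one hit.

    xml_source is a file name or a binary file object holding BLAST XML.
    Only the first Hit and its first Hsp are used (assumed best match).
    """
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        hit = elem.find("Iteration_hits/Hit")
        hsp = hit.find("Hit_hsps/Hsp") if hit is not None else None
        if hsp is not None:
            yield [
                elem.findtext("Iteration_query-def").split()[0],
                hit.findtext("Hit_def").split()[0],
                int(elem.findtext("Iteration_query-len")),
                float(hsp.findtext("Hsp_evalue")),
                int(hsp.findtext("Hsp_identity")),
                int(hsp.findtext("Hsp_align-len")),
            ]
        elem.clear()  # release the finished Iteration to limit memory use


def write_csv(rows, handle):
    """Write the header and rows as a comma-delimited file."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
```

A call such as `write_csv(best_hits("results.xml"), open("results.csv", "w", newline=""))` would convert one result file, where both file names are placeholders.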
will visualize microarray data, the mapping file was added into MapMan. The microarray data that was used to test the mapping file is the transcriptional profile of tobacco plants subjected to time-based dehydration. Figure 2 shows the MapMan overview of genes involved in regulation and water stress signaling. The changes in transcript levels in different tissues and at varying times of dehydration stress can easily be compared in MapMan since the data is presented in a pictorial format. MapMan visualized the changes in gene expression in the families of transcription factors and interestingly revealed the involvement of calcium signaling, receptor kinases, and the plant hormones abscisic acid, jasmonates, and ethylene in tobacco responses to water stress. The visualization provided by MapMan concurred with the results
Figure 3: Differentially regulated genes in roots and leaves under different dehydration timepoints. Student's t-test (with false discovery rate correction using the Benjamini-Hochberg method) was conducted and genes with a p value of ≤0.05 were selected.
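The Benjamini-Hochberg correction mentioned above can be illustrated with a short, dependency-free sketch. It is not the analysis code used for the study: the function names and gene labels are hypothetical, and applying the 0.05 cutoff to the adjusted (rather than raw) p values is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(p_by_gene, alpha=0.05):
    """Keep genes whose adjusted p value is at most alpha."""
    genes = list(p_by_gene)
    adjusted = benjamini_hochberg([p_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q <= alpha]


if __name__ == "__main__":
    # Hypothetical raw p values from per-gene t-tests.
    raw = {"gene_a": 0.01, "gene_b": 0.04, "gene_c": 0.03,
           "gene_d": 0.005, "gene_e": 0.6}
    print(select_genes(raw))
```

Each raw p value is scaled by the number of tests divided by its rank, so genes are retained only if they remain significant after accounting for the number of genes tested.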