GeneSpeed Beta Cell: An Online Genomics Data Repository and Analysis Resource Tailored for the Islet Cell Biologist

Objective. We here describe the development of a freely available online database resource, GeneSpeed Beta Cell, which has been created for the pancreatic islet and pancreatic developmental biology investigator community. Research Design and Methods. We have developed GeneSpeed Beta Cell as a separate component of the GeneSpeed database, providing a genomics-type data repository of pancreas and islet-relevant datasets interlinked with the domain-oriented GeneSpeed database. Results. GeneSpeed Beta Cell allows the query of multiple published and unpublished select genomics datasets in a simultaneous fashion (multiexperiment viewing) and is capable of defining intersection results from precomputed analysis of such datasets (multidimensional querying). Combined with the protein-domain categorization/assembly toolbox provided by the GeneSpeed database, the user is able to define spatial expression constraints of select gene lists in a relatively rigid fashion within the pancreatic expression space. We provide several demonstration case studies of relevance to islet cell biology and development of the pancreas that provide novel insight into islet biology. Conclusions. The combination of an exhaustive domain-based compilation of the transcriptome with gene array data of interest to the islet biologist affords novel methods for multidimensional querying between individual datasets in a rapid fashion, presently not available elsewhere.


INTRODUCTION
Genomics plays a growing role in almost all biological experimentation. With presently available commercial expression array technologies, an investigator obtains almost full-genome coverage of transcriptional changes, which opens novel routes to gene identification and validation. However, exhaustive data mining of genomics datasets is cumbersome and, to a large extent, outside the expertise of the individual experimenter. The greatest strength of genomics data analysis stems from multidimensional analysis, as such orthogonal comparison can bring out biologically relevant information that cannot be extracted from the individual datasets alone. However, such multidimensional querying is often advised against, as individual genomics experiments are performed in different laboratories using dissimilar methodologies. Such array data should not be uploaded and analyzed together in the commonly used data analysis programs. Prudent analysis of multiexperimental results therefore calls for analyzing each experimental set individually and parsing only the resulting gene lists for intersections and exclusions. This is possible in genomics analysis platforms that allow gene lists to be saved separately. However, the process is burdensome: analyzing a relevant dataset for orthogonal querying requires learning that the data exist; uploading, normalizing, and scaling the individual DNA chip scan files; selecting and executing a valid analysis for the particular dataset; and storing the results. In practice, this is too time consuming and overwhelming for most biologists.
In the islet and islet developmental biology research fields, a continuously growing set of public genomics data is becoming available. Also, initial problems in both genomics chip design and experimental execution are gradually being overcome. Genomics analyses performed in islet-focused laboratories around the world therefore represent a large, generally untapped resource. Two online databases, T1Dbase [1] and EpconDb [2], contain genome data repository components within their sites. However, they do not provide advanced multiexperimental querying options that would allow gene lists to be generated across experiments. Acknowledging this, we set out to create a resource that would consolidate genomics data relevant to diabetes research and allow rapid multidimensional analysis between such datasets. To do this, we created an online genomics data repository, which we term GeneSpeed Beta Cell. It was developed as an additional component of the GeneSpeed resource [3] (see Figure 1). GeneSpeed Beta Cell (http://genespeed.ccf.org/betaCell/) contains two forms of data. First, it contains normalized and similarly scaled genomics data relevant to the islet or pancreatic developmental biologist. Second, it contains precalculated analyses, which include pairwise and self-organizing neural network clustering results applied to relevant data series. On the analysis side, GeneSpeed Beta Cell provides "My Gene Workspace," where gene list overlap can be evaluated. It also provides access to any search parameter in the GeneSpeed environment, including precalculated data on tissue specificity (Shannon entropy) and wide-tissue batch expression queries. Together, the unified environment within the GeneSpeed database provides some unique capacities not found elsewhere. We here describe the use of GeneSpeed Beta Cell by addressing a set of novel and biologically relevant questions of interest to the islet biologist.

Multiexperimental viewing
"GeneSpeed Beta Cell" is a gene array data repository linked to the GeneSpeed environment. For a more detailed descrip-tion of the domain-based gene categorization afforded by GeneSpeed, please refer to [3] and the online background and tutorials. GeneSpeed Beta Cell consists of a central "experiment selector page" (Figure 2), listing experiments by relevance to embryonic development, adult islet studies, adult whole pancreas studies, experiments using cell lines, and solid tumor data. Currently, the site is being expanded with gene chip data of developing nonpancreatic endoderm. For each group, experiments are separated into human and mouse studies. At current, Affymetrix-type data is supported, given that the body of relevant genomics data is the largest on this particular platform; but as the parent GeneSpeed database is not platform-specific, we have also made it capable of operating with the Illumina BeadChip type datasets. All current Affymetrix-type datasets in the repository were obtained as unnormalized raw cel files, and were normalized using MAS5.0 using identical settings and similarly scaled for cross-experimental comparisons (see methods). The available data can be viewed through multiexperimental viewing. Any saved gene list can be viewed for any of the available datasets. This is a fast and convenient way to display the normalized expression values of defined gene lists between independent experiments performed in different laboratories. The resulting display page is constructed to facilitate horizontal glancing of expression values, while maintaining the individuality of experiments. As there is no limit as to the number of genes shown or number of experiments selected, the resulting page view can be quite large.
To assist in identifying the respective column (tissue type/experimental condition) and row (gene symbol), a hovering tool supplying this information was implemented. Also, for quick lookup of the gene ID, each cell is hyperlinked to the respective Unigene page for that gene. If a gene within a gene list has no corresponding probeset, the cell content is displayed as N.A. The multiexperiment viewer supports table sorting on each component of the selected datasets. This makes it possible, for example, to quickly arrange the genes in a larger gene list according to expression level for any tissue/condition selected (e.g., Figure 3 shows a list of homeodomain-class genes sorted by expression at the E12.5 gestational time point in pancreatic development).

Multidimensional analysis in "My Gene Workspace"
To enhance online capabilities, we developed tools for multidimensional analysis. The multidimensional analysis tool operates within a "My Gene Workspace" environment (Figure 4), which is array-platform independent as it stores genes by Unigene identifier. "My Gene Workspace" allows gene lists to be stored temporarily, named, and selected individually for combination using the Boolean operators AND or OR. In this way, intersections (AND) or additive combinations (OR) can be computed on the selected gene lists, and the results can be used in further logical operations or visualized using the multiexperimental viewer. There are several means of populating the workspace. The user is provided with a "permanent list" account, in which work can be saved between sessions. Lists can here be grouped according to project name. Gene lists from the permanent account can be ported to the workspace, and gene family choices from a concurrent GeneSpeed query can be imported directly.
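The AND/OR combination of gene lists described above can be illustrated with ordinary set operations. The following is a minimal sketch, not the GeneSpeed implementation: the list names, the `combine` helper, and all Unigene identifiers are hypothetical placeholders chosen for illustration.

```python
# Gene lists keyed by a user-chosen name; members are Unigene cluster IDs.
# All names and IDs below are made-up placeholders, not real query results.
workspace = {
    "All TFs": {"Mm.1001", "Mm.1002", "Mm.1003", "Mm.1004"},
    "Down in Ngn3 null": {"Mm.1002", "Mm.1004", "Mm.2001"},
    "Kinase all": {"Mm.3001", "Mm.2001"},
}


def combine(workspace, names, operator):
    """Combine the named gene lists with AND (intersection) or OR (union)."""
    selected = [workspace[name] for name in names]
    if not selected:
        return set()
    if operator == "AND":
        return set.intersection(*selected)
    if operator == "OR":
        return set.union(*selected)
    raise ValueError(f"unsupported operator: {operator}")


# AND keeps only genes present in every selected list.
tf_in_endocrine = combine(workspace, ["All TFs", "Down in Ngn3 null"], "AND")

# OR pools the selected lists without duplicates.
either = combine(workspace, ["Down in Ngn3 null", "Kinase all"], "OR")

# A result can be stored under a new name and reused in later operations.
workspace["TFs down in Ngn3 null"] = tf_in_endocrine
```

Because the lists hold identifiers rather than expression values, combining them never pools measurements from different experiments.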
In addition, gene list results from precalculated analyses based on the available datasets can be added. This last option provides a highly useful method to dynamically aggregate and compare results from individual experimental data that were not initially designed for a combined analysis. Such comparisons can be highly scientifically relevant, and examples are provided later. As this method is based on precalculated analysis of available datasets, a certain level of a priori choice has been necessary, as all permutations of possible data analysis could not be practically implemented. Consequently, depending on the underlying experimental conditions, the precomputed analysis is restricted to pairwise analysis (although multiple pairwise comparisons are often provided for a given dataset) or a self-organizing cluster analysis (for series-type data such as experimental time, drug concentration, or developmental time). A graphical presentation of each analysis is provided to help the user gauge gene numbers under the conditions chosen. For pairwise analysis, a volcano plot (plotting significance (P-value) versus fold change) of the result is shown. The default cutoff for gene selection is set at a false discovery rate of 0.1, but can be changed to the user's preference. Similarly, the fold-change range can be freely set, allowing the user to port, for example, genes upregulated >2-fold in a given condition into the workspace. The graphical presentation of cluster analyses shows a cluster number and the number of genes within each cluster. The user is free to select any number of clusters and port them to the workspace. In this manner, results from various experimental conditions can be continuously ported to the workspace, where the multi-intersectional analysis takes place. There is no limit to the number of gene lists present in the workspace.
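The gene selection step on a pairwise result (an FDR cutoff combined with a fold-change range) amounts to a simple filter. The sketch below assumes that each probeset carries a signed fold change (+2.0 for 2-fold up, -2.0 for 2-fold down) and an already FDR-corrected P-value; the `PairwiseResult` record, the `select_probesets` function, and the example values are hypothetical and are not taken from the GeneSpeed code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseResult:
    probeset: str
    fold_change: float  # signed: +2.0 means 2-fold up, -2.0 means 2-fold down
    fdr: float          # FDR-corrected P-value from a precomputed analysis


def select_probesets(results, max_fdr=0.1, fold_above=None, fold_below=None):
    """Return probesets passing the FDR cutoff and the optional fold-change limits."""
    selected = []
    for r in results:
        if r.fdr > max_fdr:
            continue
        if fold_above is not None and not r.fold_change > fold_above:
            continue
        if fold_below is not None and not r.fold_change < fold_below:
            continue
        selected.append(r.probeset)
    return selected


# Made-up example values for illustration only.
results = [
    PairwiseResult("p1_at", 3.1, 0.01),
    PairwiseResult("p2_at", 2.4, 0.30),   # large change, but fails the FDR cutoff
    PairwiseResult("p3_at", 1.3, 0.02),   # significant, but below 2-fold
    PairwiseResult("p4_at", -2.8, 0.05),
]

up = select_probesets(results, fold_above=2.0)     # >2-fold upregulated
down = select_probesets(results, fold_below=-2.0)  # >2-fold downregulated
```

The default `max_fdr=0.1` mirrors the default cutoff stated in the text; the fold-change limits are left open unless the user sets them.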
We should note that for both the multidimensional viewing page, and for the multidimensional query form in the workspace, individual datasets are always kept separate (viewer), treated as such (query page), and are never pooled. Cross-experimental pooling is not tolerable due to varying conditions in different laboratories during data generation.

Current experimental content of GeneSpeed Beta Cell
As the available datasets and analyses grow on a daily basis, users should visit the site for a list of currently available datasets and analyses.

GeneSpeed Beta Cell use-case scenarios
Some biologically relevant use-case scenarios for the islet cell biologist are described in the following. Each of these is available also as online tutorials at GeneSpeed Beta Cell at http://genespeed.ccf.org/betaCell/tutorial.jsp. As for any bioinformatics-based method application, the results are provided as candidate gene lists, corresponding to genes/probesets fulfilling input criteria. The further validation of such lists using noninformatics-based techniques is a general requirement. In the following demonstrations, the end-result gene lists are often supported by previous published data from other sources, hereby providing the validation required for the particular demonstration scenarios.
Example 1 (Compiling lists of islet-expressed transcription factors (online tutorial 1)). We wish to define islet-expressed transcription factor (TF) encoding genes. To do this, we will utilize the predefined transcription factor categorization provided by the GeneSpeed database, assemble a nonredundant list of TF-encoding genes, and find those with reduced expression in the Ngn3-null pancreas. First, we select "new search" and the desired species, "mouse," from the dropdown menu. Next, we select "search by transcription factor classification" within the GeneSpeed search options. As 5 major domain family groupings exist for the transcription factor-type genes, we would need to repeat the following procedure for each, but will here limit the families to the "Basic," "Beta-Scaffold," and "HTH" superfamilies. These families contain, for example, the leucine zipper, bHLH, and homeodomain transcription factor families, but not the Zn-finger class. Selecting "Basic" as the first type, we Ctrl-select all the subfamily members of the basic TF superfamily.
Displaying the result provides 685 hits. These correspond to every instance where the mouse Unigene database contains a homology hit for any of the domain types associated with the "Basic" superfamily. However, as the database has no preset lower E-score cutoff, several false positives exist in this list (see the description pages at GeneSpeed for a full explanation of how to set an E-score cutoff).
To eliminate low-scoring similarity hits, we set the E-score cutoff at 1E-6 and redo the search. Now, a resulting list of 167 genes is detected. We save these to the user account under an arbitrary name ("All TFs"). This process is repeated for the TF superfamilies mentioned above, and the individual results are added to the "All TFs" list, consequently providing a list of >1600 individual Unigenes. These are next imported into the "My Gene Workspace." To extract genes unique to pancreatic islets in the developing pancreas, we will take advantage of the available dataset for Ngn3-null embryonic pancreas, which is listed under the experiment listing page of GeneSpeed Beta Cell. A pairwise analysis is provided comparing E15.5 Wt and E15.5 Ngn3-null pancreas. The Ngn3-deficient pancreas is excellent for defining endocrine specificity, as the organ is devoid of endocrine cells. Selecting genes upregulated >1.5 ...
Example 2 (Multidimensional intersection analysis to define developmentally regulated expression of protein kinase-encoding genes (online tutorial 2)). We here seek to discover kinase-encoding genes that are enriched in either early or late pancreatic development. A similar study has not been done before. To perform this task, we first need to compile a list of all protein kinase-type genes in the mouse transcriptome. Using a text search for a gene known to be a protein kinase (e.g., insr), we obtain two hits: Insr and Insrr. Both of these are receptor tyrosine kinases and display the presence of the Tyr pkinase domain (IPR001245) with an E-score of 1E-145. We also note that the generic kinase domain (IPR000719) is detected in both at 1E-24. By checking the "InterPro sub-search" box for the IPR001245 domain and executing the search "refine by subsearch," we obtain a nonredundant list of Unigene clusters having similarity to the IPR001245 domain. This provides 480 hits, covering all kinase-domain forms (S/T as well as Y-kinase types).
To curate against low-similarity hits, we manually set the E-score threshold at <1E-6. The resulting list contains 432 bona fide kinase-containing genes, which we subsequently save as "Kinase all" to our account. Many of these genes have no previous annotation as being of the kinase-domain containing type and may not have been named yet. Next, we wish to identify which of these kinase-encoding genes display a downward trend during pancreatic development. To do this, we move to "search GeneSpeed Beta Cell" and expand the "Embryonic studies" dataset tab. Selecting the "kinetic series of mouse pancreatic development 1" precomputed cluster analysis, we are provided with the results of a Kohonen's self-organizing cluster analysis in a graphical format. ... genes are identified. We can conclude that more kinase signaling diversity exists prior to, rather than after, the secondary transition in the mouse pancreas.
Example 3 (Defining human islet-specific expression using Shannon Entropy with exocrine elimination (online tutorial 3)). This example uses the available dataset on human tissues, as provided by the Novartis Genomics Institute (http://symatlas.gnf.org/SymAtlas/about.jsp). A tissue set consisting of 79 human tissues and 61 different mouse tissues, mostly adult solid organs, has been generated in duplicate using the Affymetrix GNF1 platform. To provide a measure of tissue expression selectivity, we adopted the method of Shannon entropy determination, as previously described by Schug et al. [4]. Shannon entropy provides quantitative measures of expression using a bit-rate scale. For each gene, the Shannon entropy (H_gene) defines the degree of ordered expression; as a rule, the lower the H_gene, the fewer tissues in the total set express the gene in question. To identify those tissues showing uniqueness in expression, the measure Q_tissue can be used. Again, as a rule, the lower the Q_tissue value, the more specifically the gene is expressed in that particular tissue. A rank order of the lowest Q_tissue values thus provides a list of those genes that have the highest selectivity for the tissue in question. Shannon entropy computations were performed for all tissues in the above datasets. As the human, but not the mouse, datasets contain array data for islets, the present example is currently limited to human. From the GeneSpeed search query, we select species "homo sapiens," and thereafter "search by expression." We select the human GNF1A chip and "calculated Shannon entropy" using the drop-down boxes. ...
Example 4 (... (online tutorial 4)). It is known that neuroendocrine cell types share certain characteristics related to the production and release of secreted products. The pituitary and islets are highly enriched in cells producing polypeptide hormones.
Using Shannon entropy, we here ask which genes are shared between the pituitary and pancreatic islets and not expressed widely elsewhere. As above, we select the Shannon entropy query in GeneSpeed and input a slightly relaxed Q_tissue value of 1.8 for both pituitary and pancreatic islets. Individually, Q_islet < 1.8 × H_g and Q_pituitary < 1.8 × H_g identify 292 and 222 probesets, respectively. The intersection is 21 probesets, corresponding to 19 individual genes (Table 2; 3 probesets were identified for GNAS, guanine nucleotide binding protein). Three genes encode known granule-type proteins (ChgrA, secretogranin 2 (SCG2), secretogranin 5 (SCG5)). Two transcription factors are found: InsM1 and ZNF91. The proprotein convertase subtilisin/kexin type-1 inhibitor (PCSKN1) and the peptidylglycine alpha-amidating monooxygenase are also present. Other products include CACNA1F (calcium channel, voltage-dependent, alpha 1F), CNGA3 (cyclic nucleotide gated channel alpha 3), and the transmembrane protein TMEM30, as well as several uncharacterized genes. Many of these genes represent expected hits and show the value of combining parameters such as tissue uniqueness and overlapping gene expression to derive a meaningful candidate repertoire for further scrutiny.
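The entropy-based selection used in Examples 3 and 4 can be sketched in a few lines. The sketch assumes the definitions of Schug et al. [4] as we understand them: for a gene with relative expression p_t in tissue t, H_gene = -Σ p_t log2(p_t) and Q_tissue = H_gene - log2(p_t). The threshold is read from the text as Q_tissue < 1.8 × H_gene. The function names, tissue panel, gene labels, and expression values are all made up for illustration and are far smaller than the 79-tissue set.

```python
import math


def shannon_specificity(expression):
    """Return (H_gene, {tissue: Q_tissue}) for one gene.

    expression maps tissue name -> non-negative expression level.
    """
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("gene has no expression in any tissue")
    rel = {tissue: level / total for tissue, level in expression.items()}
    # Tissues with zero expression contribute nothing to the entropy.
    h_gene = -sum(p * math.log2(p) for p in rel.values() if p > 0)
    q = {
        tissue: (h_gene - math.log2(p)) if p > 0 else math.inf
        for tissue, p in rel.items()
    }
    return h_gene, q


def selective_genes(profiles, tissue, factor=1.8):
    """Genes whose Q value for `tissue` is below factor * H_gene."""
    hits = set()
    for gene, expression in profiles.items():
        h_gene, q = shannon_specificity(expression)
        if q[tissue] < factor * h_gene:
            hits.add(gene)
    return hits


# Hypothetical six-tissue panel with invented expression levels.
background = {"liver": 5, "kidney": 5, "muscle": 5, "brain": 5}
profiles = {
    "G1": {"islet": 40, "pituitary": 40, **background},   # shared by both
    "G2": {"islet": 70, "pituitary": 6, "liver": 6, "kidney": 6,
           "muscle": 6, "brain": 6},                      # islet only
    "G3": {"islet": 10, "pituitary": 10, "liver": 10, "kidney": 10,
           "muscle": 10, "brain": 10},                    # ubiquitous
}

islet_hits = selective_genes(profiles, "islet")
pituitary_hits = selective_genes(profiles, "pituitary")
shared = islet_hits & pituitary_hits
```

The final intersection corresponds to the AND operation in the workspace: each tissue is queried separately, and only the resulting gene lists are combined.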

DISCUSSION
Of the currently available genomics data repositories, the NCBI GEO (gene expression omnibus, [5]) is presently the most exhaustive. The development of GEO is proceeding to include data analysis of public array-type experiments, among them those performed on islets or developing pancreas. The tools are currently limited to analyses performed within individual experiments, and results cannot be ported between experiments. However, no other resource exists with a similarly exhaustive compilation of DNA microarray-type datasets, and as such, GEO represents a growing and increasingly important pillar for array data compilation. In contrast to the more universal user base that GEO seeks to cover, certain resources dedicated to the islet community have also been made available. T1Dbase (http://www.t1dbase.org/) was specifically developed to catalogue information on the genetics of type-I diabetes and contains extensive information on candidate gene regions [1]. It also contains a microarray repository and a recently developed Gene Atlas search function, aimed at providing a rapid visualization of gene expression in islets. The strength of the environment lies in the use of Gaggle [6], which is a Java-based communicator interface to several bioinformatics tools. However, its use requires significant knowledge of the Gaggle-implemented tools, such as the TIGR tmev (http://www.tm4.org/mev.html) or R (http://www.r-project.org/), in addition to a rather complicated data upload scheme. Despite its strength, it may therefore present a time-consuming and intellectual barrier to most biologists who use the resource irregularly. Another comparable resource, EpconDb (http://www.cbil.upenn.edu/epcondb42/) [2], originally generated by the Endocrine Pancreas Consortium and funded through the NIH Beta Cell Biology Consortium (http://www.betacell.org/), also provides microarray chip repository support.
Recently, precalculated analysis results for select experiments have also been provided. The structure of the EpconDb resource centers on the GUS (genome unified schema), which includes the DOTS database. DOTS shares significant similarities with the NCBI-devised Unigene EST database, but extends to include splice site data as well as promoter definition. The GeneSpeed Beta Cell site seeks to complement these resources on two fronts in particular: to provide more extensive orthogonal analysis between array experiments and to provide a functional gene list operator workspace, which neither the T1Dbase nor the EpconDb site offers. To achieve the former, we focused on providing a larger number of relevant precomputed analyses of array experiments in an easy-to-query format. To achieve the latter, we developed a gene list workspace that allows for platform-to-platform compatibility using the common Unigene denominator, which is the nexus of the GeneSpeed MySQL database. The current version of the database provides certain features not found elsewhere, some of which have been addressed through the demonstration cases. Yet the database is still under development: it is useful in its present form, but improvements are easy to imagine. Therefore, we are currently focusing on key aspects for the further development of the GeneSpeed environment. These include the identification of additional relevant microarray experiments; filling in "missing links" by performing stopgap-type microarray experiments to populate critical, but missing, areas of the pancreatic expression space; improving the user-friendliness of the search and query formats; and finally developing an export/import interface for pathway analysis programs such as Ingenuity Pathway Analysis (IPA).
The usefulness of the GeneSpeed Beta Cell database depends on the amount of available genomics data content. A linear increase in the number of available datasets and accompanying precomputed analyses translates into an exponentially growing set of query combinations. There are obvious gaps in the available datasets: null mutations have been created for several key regulators of pancreatic development, and several mutant models resulting in diabetes due to beta-cell dysfunction have also been reported, all of which would represent valuable data in the present environment. Therefore, we are asking the islet research community to share available datasets for multidimensional analysis. Also, we will continue to upload publicly available datasets from the GEO environment.
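One way to read the growth claim above is to count how many distinct selections of two or more precomputed gene lists can be combined in the workspace. The short sketch below makes that assumption explicit; the function name is hypothetical, and the count ignores the further choice between AND and OR for each selection.

```python
def multi_list_selections(n_lists):
    """Number of ways to select at least two of n_lists gene lists.

    All subsets (2**n) minus the empty selection (1) and the
    single-list selections (n).
    """
    return 2 ** n_lists - n_lists - 1


# Each added list roughly doubles the number of possible selections.
growth = {n: multi_list_selections(n) for n in (3, 10, 20)}
```

With 3 lists there are 4 such selections (three pairs and one triple); with 10 lists there are already 1013.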
For a wet-biology laboratory like our own, the present incarnation of the database has provided a means of moving forward on otherwise difficult-to-execute bioinformatics-based questions. We hope that it will likewise help other laboratories tackle gene identification challenges, thereby helping the diabetes research community.

Genomics data incorporation and analysis
The "GeneSpeed Beta Cell" environment was developed using the J2EE platform on a Linux server. For Affymetrix-type genomics data, we compiled the CEL files (raw data) associated with different experiments from different sources and normalized them locally using the MAS5.0 algorithm with an identical scaling factor of 500, to ensure optimal comparability in a cross-experimental setting.
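The idea behind scaling every array identically can be sketched as follows. This is a simplified illustration, not the MAS5.0 implementation: it assumes that the value of 500 acts as a common target for the trimmed mean signal of each array (with the lowest and highest 2% of signals excluded), and it omits the interpolation that the reference algorithm applies at the trimming boundaries. Function names and signal values are hypothetical.

```python
def trimmed_mean(values, lower=0.02, upper=0.98):
    """Mean of the values after dropping the lowest and highest fractions."""
    ordered = sorted(values)
    n = len(ordered)
    kept = ordered[int(n * lower):int(n * upper)]
    if not kept:
        raise ValueError("not enough values to compute a trimmed mean")
    return sum(kept) / len(kept)


def scale_to_target(signals, target=500.0):
    """Scale one array so that its trimmed mean signal equals `target`."""
    factor = target / trimmed_mean(signals)
    return [s * factor for s in signals], factor


# Two made-up arrays measuring the same pattern at different overall brightness.
array_a = [float(v) for v in range(1, 101)]
array_b = [3.0 * v for v in array_a]

scaled_a, factor_a = scale_to_target(array_a)
scaled_b, factor_b = scale_to_target(array_b)
```

After scaling, both arrays share the same trimmed mean, so expression values can be displayed side by side even though the experiments themselves are never pooled.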
The microarray experiments currently available can be grouped, and hence analyzed, according to experimental design type. For time-series experiments (and drug-effect studies), an SOM neural network clustering algorithm was applied. The number of clusters was chosen empirically from the individual results, selecting the minimal number that adequately describes the data complexity. A graphical presentation is provided of the log-transformed expression averages of the genes within each cluster. Also, the total number of genes contained in each cluster is provided. R and Bioconductor [7] were used to accomplish this task.
For multicomponent analysis, which also includes single pair-wise analysis, ANOVA testing was performed. For multiple-condition datasets, several pair-wise analyses are provided. These results are depicted through volcano plots. A false discovery rate (FDR) test correction on the ANOVA result at 10% significance is provided for each plot as the default P-value setting. On the volcano plots, the boxed areas outlining a 10% FDR corrected P-value and the −2 to +2 fold regions of change are shown.
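The text does not name the FDR procedure used. As an illustration only, the sketch below applies the widely used Benjamini-Hochberg step-up correction to a list of raw P-values and then applies the 10% default cutoff; the function name and the P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up raw P-values for four probesets.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)

# Probeset positions passing the default 10% FDR cutoff.
passing = [i for i, q in enumerate(adjusted) if q <= 0.10]
```

In a volcano plot, the adjusted cutoff corresponds to the horizontal boundary of the boxed region, while the fold-change limits give the vertical boundaries.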

Account functionality
There are two account types in GeneSpeed: "guest" and "registered user." Registration is required to use the workspace environment. A "registered user" can log back into their account to gain access to saved studies. Establishing a registered account is free and can be done on the GeneSpeed registration page (http://genespeed.ccf.org/loginReq.php). An automatically generated password will be sent to the newly registered user. The confidentiality of all registration information is strictly maintained, and we will only use such information to notify our users of any disruptions or modifications of the GeneSpeed service. At present, only registered users are allowed to use the GeneSpeed Beta Cell database.
The GeneSpeed account allows registered users to save gene lists into a private account that is permanent and may only be viewed by the owner. The "My Gene Workspace," on the other hand, maintains gene lists temporarily during the current login session; upon logging out the content of the "workspace" will be deleted.

Functional implementation of "My Gene Workspace"
The "My Gene Workspace" logic was developed using J2EE and SQL-type queries. To facilitate cross-platform comparisons, the workspace utilizes Unigene cluster IDs (UIDs). Consequently, the probeset identifiers resulting from the experimental analyses are translated into the corresponding UIDs upon transfer to the workspace. As a result, if more than one probeset is detected for a given gene in the analysis, these probesets collapse into the UID of that gene. Conversely, when the content of a gene list in the workspace is selected and shown in the expression space, all probesets corresponding to each selected Unigene are displayed. To reduce ambiguities, we update the system whenever updated mapping files become available from the NetAffx Analysis Center server. Given that the NCBI Unigene dataset is constantly evolving, mapping to the most recent UIDs is updated every 6 months.
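The two directions of the probeset/UID translation described above can be sketched with a small lookup table. The mapping, probeset names, and function names below are hypothetical placeholders; the actual system uses SQL-type queries against mapping files, which are not reproduced here.

```python
from collections import defaultdict

# Hypothetical probeset -> Unigene cluster ID mapping (placeholders only).
PROBESET_TO_UID = {
    "1001_at": "Mm.1001",
    "1002_at": "Mm.1001",    # second probeset for the same gene
    "1003_s_at": "Mm.2002",
}


def to_unigene_list(probesets, mapping):
    """Collapse probesets from an analysis into a set of Unigene IDs.

    Probesets without a mapping are skipped.
    """
    return {mapping[p] for p in probesets if p in mapping}


def probesets_for(uids, mapping):
    """Expand Unigene IDs back into all probesets that map to them."""
    by_uid = defaultdict(list)
    for probeset, uid in mapping.items():
        by_uid[uid].append(probeset)
    return {uid: sorted(by_uid.get(uid, [])) for uid in uids}


# Both probesets for Mm.1001 were detected; they collapse into one workspace entry.
gene_list = to_unigene_list(["1001_at", "1002_at", "9999_at"], PROBESET_TO_UID)

# Viewing the list in the expression space brings back every mapped probeset.
display = probesets_for(gene_list, PROBESET_TO_UID)
```

Storing only Unigene IDs in the workspace is what allows lists derived from different array platforms to be intersected directly.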