ArraySearch: A Web-Based Genomic Search Engine

Recent advances in microarray technologies have resulted in a flood of genomics data. This large body of accumulated data could be used as a knowledge base to help researchers interpret new experimental data. ArraySearch finds statistical correlations between newly observed gene expression profiles and the huge source of well-characterized expression signatures deposited in the public domain. A search query of a list of genes will return experiments on which the genes are significantly up- or downregulated collectively. Searches can also be conducted using gene expression signatures from new experiments. This resource will empower biological researchers with a statistical method to explore expression data from their own research by comparing it with expression signatures from a large public archive.


Introduction
In recent years, there has been a massive influx of data from DNA microarray experiments into publicly available repositories such as the National Center for Biotechnology Information (NCBI). Many of these datasets represent genomewide expression profiles generated from experiments investigating controlled perturbations on molecular pathways, responses to chemical stimuli, responses to pathogen infections, or mutant behavior when certain genes are made inoperative. The largest repository of gene expression data is NCBI's Gene Expression Omnibus (GEO). As of July 1, 2011, GEO contains the expression profiles of 587,214 biological samples from different organisms. It is important to take advantage of this large body of data for the purpose of new genomics studies.
A number of sites and tools have become available that attempt to utilize the data in ways that will guide future research. Although GEO is primarily designed for information storage and downloading, GEO profiles provide methods for querying the expression profiles of genes (see http:// www.ncbi.nlm.nih.gov/geoprofiles).
The Nottingham Arabidopsis Stock Center microarray service (NASCArrays) [1] also provides a number of useful data mining tools for exploring a large collection of gene expression data. The Two Gene Scatter Plot tool allows a user to visualize gene expression over all microarray slides for two genes as a scatter plot. The Spot History tool allows a user to see patterns of gene expression over all slides for specific genes.
Genevestigator [2] provides online tools to access large and well-annotated datasets of curated microarray samples. The meta-profile analysis function of Genevestigator summarizes gene expression data across many samples according to the biological context of the sample. Each gene signal in a meta-profile (in contrast to a normal expression profile) is the average of gene expression levels across samples sharing the same biological context (anatomy, development, stimulus, and mutation). Genevestigator also provides a useful method for single or biclustering of gene expression values based upon sample conditions, anatomy, genotype, or developmental stage.
Other sites, such as the Arabidopsis Information Resource Center (TAIR) [3] and plexDB.org [4], provide standard analytical tools such as gene expression graphs across samples for genes of interest, hierarchical and partitional clustering analysis, and the display of heat maps for genes across samples.
The web service Oncomine (http://www.oncomine.org/) created by Rhodes et al. [5,6] provides an interface to a collection of cancer-related gene expression profiles and allows users to search a gene's expression across the entire collection. This tool has enabled researchers to make new discoveries, such as finding a common cancer signature defined by groups of genes that are consistently overexpressed in all cancers [7]. A newer version of the Oncomine site allows for the analysis of transcription factor binding motifs and has been instrumental in the identification of hundreds of gene regulatory programs associated with cancer [8].
The connectivity map (C-map) database collects the expression signatures of small molecule drugs [9]. C-map contains expression profiles of cell lines before and after the administration of hundreds of different drugs. It can be used to query an expression signature against these expression profiles to identify connections among small molecules sharing a common mechanism of action.
ArrayExpress [10] is a database for high-throughput functional genomics data for a number of different species. The database is comprised of two parts-the ArrayExpress repository which is a public data store of microarray data; and the ArrayExpress data warehouse, which is a collection of gene expression profiles taken from the repository. Gene expression profiles can be searched by gene names and properties, such as gene ontology terms, and can be visualized using the web interface: http://www.ebi.ac.uk/gxa/.
Although Oncomine, Genevestigator, and ArrayExpress provide facilities for searching through a database of gene expression profiles, they do not provide the ability for a researcher to submit queries with expression profiles from their own experiments and find contrasts highly correlated with their search query. The purpose of the ArraySearch website (http://ArraySearch.org/) is to leverage the existing large repository of genomics data by using it as a knowledge base for interpreting new data. ArraySearch achieves this goal by calculating statistical correlations between publicly available genomic data, which is well-characterized, and new experimental data. Table 1 displays a comparison of features between ArraySearch, Genevestigator, plexDB, Oncomine, and ArrayExpress.

Downloading, Parsing, and Normalizing the Genomics
Expression Data. Gene expression datasets from 160 experiments were downloaded from NCBI (http://www.ncbi .nlm.nih.gov/gds). Of these, 89 were then parsed and normalized using the Bioconductor package for R [11] to build the database ArraySearch uses. This is a multipart process, and the steps are summarized below for a single experiment.
(1) Each experiment in the GEO Omnibus is referenced by an accession number. This number is used by the GEOquery library in bioconductor to download the raw data files for each sample in the experiment (referred to by their extensions as CEL files) from the NCBI ftp site. The files are then decompressed.
(2) The gcrma package for R (also in bioconductor) is then used to process the raw data files for each sample in the experiment. This is done separately (3) The fold change (contrast) for each gene g in the genome is calculated as follows: where g treated = average gene expression for treated samples, g control = average gene expression for control samples, (4) A Bayesian t-test [12,13] is used to compute the P values for each gene. Low P values are of interest because they inform us that the gene was upregulated or downregulated for this experiment. The fold change table reveals the direction of regulation.

The Technology behind ArraySearch.
The ArraySearch website is composed of two parts: a front-end web interface created using the PHP scripting language and a backend database created using MySQL. It was designed to be deployed on computers using Linux and running the Apache HTTP web server. The heat maps and bar graphs that ArraySearch generates for the different user queries were created using extensive modifications to a graphing library called pChart (http://pchart.sourceforge.net/) [14].

Results and Discussion
We downloaded gene expression datasets from 160 microarray experiments from NCBI (http://www.ncbi.nlm .nih.gov/gds). Only experiments using the Affymetrix ATH1 array were selected. Out of these experiments, only those with a minimum of two control and two treated arrays were considered for further analysis, resulting in 89 experiments. The samples within each experiment were divided into groups, and the groups were annotated with descriptions (e.g., WT root versus root: Sol2 mutant). A paired Bayesian t-test using the Cyber-T R script (http://cybert.ics.uci.edu/) [12,13]   Rhodes et al. [5] Wise et al. [4] H r u z e t a l . [ 2] 4 Comparative and Functional Genomics     for each experiment were calculated. This resulted in 256 different contrasts.  The fold change is a measure of how much gene g is up/downregulated in the treated samples relative to the control samples. The details of this process are described for one experiment in Section 2. In Figure 1, a flowchart shows the various modules involved in creating the ArraySearch website and the flow of information between one module and the next. (1) When searching by expression signature, finds contrasts which are highly correlated with the expression profile the user entered,  (2) When searching single/multiple gene, finds contrasts which are significantly up-or downregulated for the gene or list of genes entered.
The GO table in Figure 2 contains information about gene ontologies. Each gene ontology is referenced by a GOID number (e.g., GOID 15886 refers to "heme transport"). All genes in the tables are referenced by their Affymetrix probe ids, and the GOID2Probe ID table takes GOIDs and associates them with a list of probe IDs that correspond to genes that are actively expressed under the GOID. The GenBank2Unigene table is another translation table used to  convert GenBank IDs into Unigene IDs. The annotation table  contains information and aliases for all genes in the ATH1  array and is used to convert the Unigene IDs coming from the  GenBank2Unigene table to their respective Affymetrix probe  IDs. The contrast table contains information about each contrast in the database (study title, number of samples in the control and in the experiment, the study design, etc.). The P value table contains the P values for all of the contrasts in the database for every probe ID.
The most important table in the ArraySearch database is  the expression signature table. This table contains   (2) Gene Ontology Search finds expression profiles for genes active in a gene ontological (GO) category search term (e.g., "defense response"); (3) Gene Title Search finds expression profiles for single genes; (4) GenBank Search returns expression profiles for a list of genes specified by their GenBank IDs.
Gene expression data currently come from one species: Arabidopsis thaliana. In the future, expression data from other species will be searchable.

Using the Expression Profile Search.
As an example, a researcher is analyzing the results of a microarray experiment on Arabidopsis, investigating the effect of salt stress on gene expression along the radial axis of the root. In this experiment, three replicates of Arabidopsis root cells (from the columella root cap) were treated with 140 mM (millimolars) of NaCl. For the control, three replicates of Arabidopsis columella root cells were left untreated. The fold change value for each gene is calculated. Sorting the fold change values for all of the genes and looking at the top 10 most up-/downregulated genes gives the expression signature in Table 3.
A researcher wants to know other microarray experiments in which this pattern of up-/downregulated genes occurs. ArraySearch can answer this question. Entering this expression signature as an input query returns a list of contrasts highly correlated with the signature (see Table 4).
The actual search returns a list of the top 30 most highly correlated contrasts, along with summary information and links to NCBI where more information about the samples and the experiments can be obtained. The search results displayed in Table 4 show that the same genes in the input query exhibit similar expression patterns when ethylene and auxin interactions in the roots of Arabidopsis seedlings are triggered.
A heat map is also returned to the user. The color of each cell in the heat map is based on the expression value for that particular gene-sample pair. If a cell is bluish in color, then the gene is downregulated for that sample. If the cell is reddish in color, then the gene is upregulated for that sample (see Figure 3).  For this input, a two-tailed t-test is computed using the expression values for each gene across all contrasts in the database. All contrasts with P-values less than α = 0.05 are returned to the user, along with the t-statistic used to infer whether the genes as a group were downor upregulated in an experiment. Summary information about each experiment, as well as links to NCBI, is also provided (see Table 5). Because this query involves multiple comparisons, the possibility of a number of false positives arising is a concern. To handle this, FDR q-values are also calculated for each contrast based on the method outlined in Benjamini and Hochberg [15].

Gene Ontology
Search. The Gene Ontology Search allows users to find genes active for specific ontologies. To use this search, a gene ontological term such as "defense response" is entered. ArraySearch returns a list of Genome Ontology (GO) IDs along with their GO categories, which contain the search term. Searching for "defense response" returns Table 6.
Each GO ID maps to one or more genes (denoted by their Affymetrix probe IDs) actively expressed in their respective ontologies. The output is displayed in a heat map (Figure 4), identical to the expression profile search. The heat map is 10 × 30, representing the expression values for the 10 most active genes and the 30 most active contrasts.

GenBank Search.
GenBank is the NIH DNA sequence database. It is a collection of all publicly released DNA sequences. Researchers often use GenBank IDs to refer to genes of interest; however, internally, ArraySearch refers to all genes by their Affymetrix probe IDs. This search function takes as input a tab-delimited list of one or more genes referred to by their GenBank Ids, for example, BP799393, DR371397, ES108934, and ES014211. It returns   the expression profile for the gene set as a bar graph across all samples (see Figure 5).
3.6. Gene Title Search. The Gene Title Search function searches the ArraySearch database for all genes entered in the search term title. For example, a search for all genes with "Cold" in the title would return the list shown in Table 7 (which has been truncated slightly).
By default, this search returns a heat map identical to the gene ontology search function. That is, it displays the heat map showing the expression values of the ten most active genes over the thirty most active experimental samples in the ArraySearch database. Provided the number of genes returned by this search is reasonable (less than ten genes), a user may also elect to display a bar graph in addition to the heat map.

Conclusions
Genes within an organism seldom act individually, but work in concert with other genes in a complex network of interaction. Often the same group of genes behaves similarly in the presence of related biological stimuli. The major purpose of the ArraySearch website is to act as a genomic search engine that returns expression signature sets from different experiments when provided with an expression profile query of gene identifiers and corresponding fold change values. We hope this resource will empower biological researchers to explore expression data from their own research by comparing it alongside expression signatures from a large public archive.