Navigating Public Microarray Databases

With the ever-escalating amount of data being produced by genome-wide microarray studies, it is increasingly important that these data are captured in public databases so that researchers can use this information to complement and enhance their own studies. Many groups have set up databases of expression data, ranging from large repositories, which are designed to comprehensively capture all published data, through to more specialized databases. The public repositories, such as ArrayExpress at the European Bioinformatics Institute, contain complete datasets in raw format in addition to processed data, whilst the specialist databases tend to provide downstream analysis of normalized data from more focused studies and data sources. Here we provide a guide to the use of these public microarray resources.


Introduction
Microarrays have rapidly become the tool of choice for monitoring genome-wide levels of cellular gene expression [10,14,37,38]. The main reasons for this growth in usage are as follows: (a) as a result of the many genome- and EST-sequencing projects, there have been large increases in the number of DNA sequences that require functional information [39]; (b) the technology has become more accessible, allowing many groups to harness its power, in particular (i) the technology has improved over time and become easier to use, and (ii) it is less expensive to run; (c) finally, the informatics infrastructure, through both hardware and software improvements, has advanced quickly enough to keep pace with these new challenges in molecular biology [16].
There are two major types of informatics solutions that are required for the management of microarray data: (a) software, usually installed locally, that allows storage, querying and analysis of data captured on-site [2,17,19]; and (b) databases that act as repositories for publicly available data and, in particular, published data [2,9,18,19,25,30,35,40]. In this article, we shall briefly address the local packages by giving a list of the main ones that are currently available, and then we shall concentrate on the latter databases by giving details about how to use them and what data they contain.

Local packages
In the late 1990s, as a rising number of both academic and commercial groups began to use commercial or home-made microarray platforms in earnest, a number of small bioinformatics companies started to develop software packages to capture and analyse the data being collected. One of the first companies in this field was Silicon Genetics, who developed the GeneSpring expression analysis software. This package took off rapidly as it was easy to use and contained most of the simple analysis features that users required.
Additional academic and commercial software packages that are currently available for microarray analysis are shown in Table 1. Packages that are principally available as web tools or desktop applications are shown in Section 1 of the table, while Section 2 shows software that also possesses databases for the storage of microarray data (reviewed in [2,17]). In addition, statistical software can be tailored for the analysis of microarray data (listed in Section 3 of Table 1); e.g. there is the BioConductor package, which is a suite of tools for use with the freely available R package [17,47]. Links to many of these packages and other tools are provided at the following websites: http://genex.sourceforge.net/othertools.html; http://www.ifom-firc.it/MICRO-ARRAY/data analysis.htm; and http://microarray.ccgb.umn.edu/smd/html/MicroArray/SMD/restech.html.

Public repositories
The larger microarray groups tended to develop databases and software for the storage and analysis of their own data [1,2,13,17,19,20,22,27]. When data were published by these groups, they were made publicly available in different formats, usually over the web. With an ever-increasing amount of microarray data being published, it became clear that there was a need to store these data in central repositories, as had been the case in the sequencing and protein structure fields in previous decades.
Public repositories were initiated by the organizations that had previously collaborated to create the main sequence databases (see Section 1 of Table 2), viz. the NCBI (National Center for Biotechnology Information) in the US with its GEO database [18] (Gene Expression Omnibus), the EBI (European Bioinformatics Institute) in Europe with ArrayExpress [9,35], and the NIG (National Institute of Genetics) in Japan with CIBEX [25] (Center for Information Biology Gene EXpression database).
The aim of these repositories is to store the 'Minimum Information About a Microarray Experiment' (MIAME) [8], to allow researchers to replicate experiments. This standard has been developed by the international Microarray Gene Expression Data Society [5,6,31] (MGED Society; http://www.mged.org/). To the investigator, MIAME represents a checklist of information to be supplied during the experiment submission process. Initially, there had been a hope to save the raw array images centrally, but the acceleration in the amount of data collected meant that this was not going to be feasible in the longer term. Instead, all raw data from the files output by the image analysis programs are archived in these repositories. Additionally, a large amount of information is stored about the microarrays used in the experiments and the way they were produced, how the samples were obtained, protocols for RNA extraction and labelling, as well as methods used for sample hybridization, slide scanning and data normalization.
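The checklist idea can be illustrated with a minimal Python sketch. The section names below are hypothetical labels that we have chosen to mirror the categories listed above; they are not the official MIAME field names, and the example submission is invented purely for illustration.

```python
# Hypothetical section names mirroring the categories described in the text;
# the authoritative checklist is the one published by the MGED Society.
MIAME_SECTIONS = (
    "array_design",
    "sample_source",
    "rna_extraction_protocol",
    "labelling_protocol",
    "hybridization_protocol",
    "scanning_protocol",
    "normalization_method",
    "raw_data_files",
)


def missing_sections(submission):
    """Return the checklist sections that are absent or empty in a submission."""
    return [name for name in MIAME_SECTIONS if not submission.get(name)]


# Invented example: one section is left blank and two are omitted entirely.
example_submission = {
    "array_design": "custom two-colour array",
    "sample_source": "wild-type cells, exponential growth",
    "rna_extraction_protocol": "hot phenol extraction",
    "labelling_protocol": "",
    "hybridization_protocol": "overnight hybridization",
    "raw_data_files": ["slide01.txt", "slide02.txt"],
}

print(missing_sections(example_submission))
```

A check of this kind, run before submission, would simply report which parts of the experimental description still need to be supplied.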
As researchers have become increasingly bold in utilizing microarray technologies, a variety of different experimental methods have evolved in addition to the usual transcription profiling technique; notably, these include comparative genomic hybridization (CGH [29]), chromatin immunoprecipitation combined with arrays (ChIP-chip [33]), and both toxico- and nutri-genomics (http://www.ebi.ac.uk/microarray/Projects/tox nutri/index.html) methods. In addition, data are collected for a variety of organisms using different expression platforms. The databases need to be sufficiently flexible to cope with these variables, so the database groups have come up with different approaches to overcome these challenges [40]; e.g. GEO was primarily designed to act simply as a repository for public data, whilst an aim of ArrayExpress was to allow users to query and download datasets, and to compare data from different experiments. In addition, the GEO database is flexible enough to store a variety of other high-throughput experimental data including serial analysis of gene expression (SAGE) and proteomics data [32].
Following the model of the sequence databases, the three international microarray database groups intend to exchange and share their data [9,25,35,40]. This has turned out to be much more difficult to achieve than for sequence data, due to the increased complexity of the data. The method that the groups have developed, in common with the way many informatics groups share data, is to use a common mark-up language called MicroArray Gene Expression Markup Language (MAGE-ML [41]), which is derived from the more general XML format. This means that any group using a database that can export in MAGE-ML format should be able to transfer their data, with relative ease, into one of the central repositories. The use of controlled nomenclature or ontologies can also ease the data-sharing process. A working group has been set up to standardize common terms and phrases [42] (see http://mged.sourceforge.net/ontologies/index.php); this will probably remain an ongoing process as the technology develops and finds new applications.
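To show why a shared XML-based format eases exchange, the following Python sketch exports a small dataset to XML and reads it back. The element and attribute names (Experiment, Measurement, gene, value) are our own simplified assumptions and do not follow the real MAGE-ML schema, which is considerably richer; the gene names and values are invented.

```python
import xml.etree.ElementTree as ET


def export_experiment(name, measurements):
    """Serialize measurements to a simplified, hypothetical XML layout
    (illustrative only; this is NOT the MAGE-ML schema)."""
    root = ET.Element("Experiment", {"name": name})
    for gene, value in measurements.items():
        # XML attribute values must be strings; repr() keeps full float precision.
        ET.SubElement(root, "Measurement", {"gene": gene, "value": repr(value)})
    return ET.tostring(root, encoding="unicode")


def import_experiment(xml_text):
    """Parse the XML produced by export_experiment back into Python objects."""
    root = ET.fromstring(xml_text)
    data = {m.get("gene"): float(m.get("value")) for m in root.iter("Measurement")}
    return root.get("name"), data


# Invented example data.
xml_text = export_experiment("heat_shock", {"geneA": 1.5, "geneB": -0.25})
print(xml_text)
print(import_experiment(xml_text))
```

Because both sides agree on the document structure, the importing database needs no knowledge of how the exporting database stores its data internally.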

Data submission to public repositories
Currently, the major issue for public repositories is the submission process, because of the variation between datasets from different groups and studies. For example, experimental designs vary widely: some researchers do time courses, perhaps with a pooled reference sample, whereas others do studies comparing normal against mutant or diseased cells/tissues. There is a wide variety of microarray platforms, all of which need to be described in detail. Furthermore, some of these technologies record different types of data: an Affymetrix GeneChip records one sample per chip and has several perfect match and mismatch probes per gene, whilst a two-colour microarray captures competitive hybridization between two samples, often with multiple replicates for many genes. Inevitably, the complexity of all these features makes a system that attempts to facilitate the conversion of this information into MAGE-ML somewhat unwieldy.
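The difference between the two kinds of record can be made concrete with a short Python sketch. The class and field names are hypothetical, the numbers are invented, and the summaries shown (a log2 ratio for a two-colour spot, a mean perfect-match minus mismatch difference for a probe set) are deliberately simplified illustrations rather than any vendor's or repository's actual algorithm.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TwoColourSpot:
    """One spot on a two-colour array: two samples compete for the same probe."""
    gene: str
    sample_intensity: float
    reference_intensity: float

    def log2_ratio(self) -> float:
        # Relative expression of the sample versus the reference.
        return math.log2(self.sample_intensity / self.reference_intensity)


@dataclass
class ProbeSet:
    """One gene on a single-sample chip: several (perfect match, mismatch) pairs."""
    gene: str
    probe_pairs: List[Tuple[float, float]]

    def mean_difference(self) -> float:
        # Simplified summary: average of PM minus MM over all probe pairs.
        return sum(pm - mm for pm, mm in self.probe_pairs) / len(self.probe_pairs)


# Invented example values.
spot = TwoColourSpot("geneA", sample_intensity=800.0, reference_intensity=200.0)
probe_set = ProbeSet("geneA", [(500.0, 100.0), (300.0, 100.0)])
print(spot.log2_ratio(), probe_set.mean_difference())
```

A submission system has to describe both shapes of data, and their associated array designs, within a single exchange format, which is one source of the complexity noted above.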
A few organizations have developed their own pipelines for the creation of MAGE-ML, which allow them to submit data in a more automated fashion [6,9,35]. Hopefully, as MAGE-ML and the ontologies mature, software manufacturers will develop user interfaces that permit an easier and more uniform submission process. The ArrayExpress group have been developing a web application, MIAMExpress, which allows researchers to submit their data to ArrayExpress. The submission process has been designed by a team of software developers working closely with data curators [9,35]. This software allows for the importing of experiments of more than 100 arrays into ArrayExpress. Initially, it is possible to keep the data private, and the investigator can specify when to make the data publicly available. This means that data can be submitted as they are collected, so that the system can trace experiments electronically and independently of a lab book. However, in our experience, except for smaller datasets, it helps to have some familiarity with the MIAMExpress front-end before submitting data through it; e.g. an understanding of the whole submission process makes it easier to keep track of where you are in it.
Since the release of MAGE-ML, other academic groups have been working to establish fully working MAGE-ML-based pipelines for importing data automatically into ArrayExpress from their databases; these databases include: Stanford Microarray Database, Stanford University (SMD [20]); RNA Abundance Database, University of Pennsylvania (RAD [43]); the TM4 Microarray Software Suite from The Institute for Genomic Research (TIGR [36]); and the microarray database at the German Resource Center for Genome Research (RZPD; http://www.rzpd.de/submit/). At the Sanger Institute, Matloob Qureshi has been working on a user-friendly Java application that will allow members of both the Pathogen Microarray Group and our group to submit their data in MAGE-ML format to ArrayExpress. The advantage of this type of software is that, with the increased flexibility of Java applications compared to web applications, it is easier to have an overview of the submission process. The main reason that the ArrayExpress team adopted a web application is that this method supports data submission from remote laboratories with little local computing support [35]. Hence, for those with good local IT support, tools that make the process easier and more automated will become of increasing value if they wish to submit large amounts of data to ArrayExpress or the other central repositories.

Accessing data from public repositories
The value of the central resources (ArrayExpress, GEO and CIBEX) will increase as more datasets are submitted to them. Prior to the establishment of these repositories, data were not always or continuously available, were hard to find on the Internet, and were stored in varying formats. Clearly, databases with professional support that provide standardized datasets and stable URL addresses are essential for the long-term benefit of obtaining and re-analysing data from published microarray studies. There are now a variety of datasets that are available at ArrayExpress, including data for humans, human cell lines, rodents, plants, yeast and bacteria; this number is set to increase rapidly, e.g. in February 2004 there were only four datasets for the yeast Saccharomyces cerevisiae, but 5 months later the number of datasets stands at 16.
An ever-changing aspect of microarray data is the sequence annotation on the arrays, which is being constantly updated and improved. ArrayExpress solves this problem by linking from the sequences on each array to external annotation databases, e.g. we include links to Schizosaccharomyces pombe GeneDB [24] (a public database for fission yeast genes: http://www.genedb.org/) when we submit a newly designed Sz. pombe array to the database in array description file (ADF) format.

Specialized public databases
Public datasets are stored in a number of other databases besides the public repositories. The less stringent requirements of those databases that do not use the MIAME checklist present a smaller submission burden for researchers, hence they often contain more datasets than the main repositories. The nature of these databases means they have differing priorities and structures (Table 2; and [30]), e.g. some databases contain large amounts of publicly available data, but are only open to submission by a defined group of researchers (Section 2 of Table 2); SMD is perhaps the most established of these and includes many valuable analysis features [20].
Other databases relate to specific projects or biological processes (Section 3 of Table 2), e.g. GermOnline contains expression data relevant to the mitotic and meiotic cell cycle in both yeast and higher eukaryotes [46]. Additionally, there are a number of databases that store tissue distribution information for human and mouse cell lines (Section 4 of Table 2). One easily searchable example is the Gene Expression Atlas database, which contains Affymetrix profiles for 79 human and 61 mouse tissues collected for a variety of up-to-date standard and custom chips [44,45]; it is possible to search for genes with similar expression profiles, and extensive gene annotation is provided alongside the expression data. Finally, some databases store data for related organisms (Section 5 of Table 2); an example is the yeast microarray database, yMGV [28]. This database provides useful tables and graphs of datasets from different yeast species.
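A search for genes with similar expression profiles, of the kind mentioned above, can be sketched in Python as a ranking by Pearson correlation. This is one common similarity measure that we assume here for illustration; it is not necessarily the method used by any of the databases described, and the gene names and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def most_similar(query, profiles, top=3):
    """Rank all other genes by correlation with the query gene's profile."""
    scores = [
        (pearson(profiles[query], profile), gene)
        for gene, profile in profiles.items()
        if gene != query
    ]
    scores.sort(reverse=True)  # highest correlation first
    return [gene for _, gene in scores[:top]]


# Invented profiles: one value per tissue for each gene.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
print(most_similar("geneA", profiles, top=2))
```

Correlation compares the shape of two profiles rather than their absolute levels, which is why geneB, a scaled copy of geneA, ranks first in this example.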
Lists of such specialized databases are available from: http://www3.oup.co.uk/nar/database/cat/9 and http://ihome.cuhk.edu.hk/b400559/arraysoft public.html. However, it is apparent that some older sites have broken links or are out-of-date, leading to loss of access; this further demonstrates the importance of maintaining central repositories.

Conclusions
Since the launch of the central microarray repositories, the submission procedures for these resources have been made easier and more accessible. The value of centralized warehouses that contain standardized, well-structured data is clear, and has been established for some time now in other fields.
Although submission is not yet straightforward, the time is fast approaching when all researchers should submit their data to public repositories so that all expression data are kept together, as the current open letter from the MGED Society suggests [6]. See also the recent article by Ball et al. [48].
It is also valuable to have specialized databases of microarray data, which are available through the web and designed with expert local biological knowledge, as these databases will allow researchers to focus on the data that are most relevant to them. Such resources will also benefit from being able to obtain all the data they require from the central public repositories.