Standardization Initiatives in the (eco)toxicogenomics Domain: A Review

The purpose of this document is to provide readers with a resource on the different ongoing standardization efforts within the 'omics' (genomics, proteomics, metabolomics) and related communities, with particular focus on toxicological and environmental applications. The review includes initiatives within the research community as well as in the regulatory arena. It addresses data management issues (format and reporting structures for the exchange of information) and database interoperability, highlighting key objectives, target audiences and participants. A considerable amount of work still needs to be done and, ideally, collaboration should be optimized and duplication and incompatibility avoided where possible. The consequence of failing to deliver data standards is an escalation in the burden and cost of data management tasks.


Introduction
Molecular-based approaches, such as transcriptomics, proteomics, metabolomics and metabonomics, are being used to study the impact of chemicals on human and wildlife populations. These high-throughput (eco)toxicogenomics investigations are information-intensive and, by producing massive amounts of data, have placed the informatics challenge under the spotlight. The need to provide easy access to integrated data in a structured, standard format is clear. Several efforts are already under way to promote standardization, tackle data management issues and develop databases that facilitate data exchange, and the value of such collaborative efforts has already been demonstrated. The Microarray Gene Expression Data (MGED; http://www.mged.org) Society has been successful in developing the MIAME standard and related ontology and object models for microarray data (reviewed in Quackenbush, 2004). The Reporting Structure for Biological Investigations (RSBI; http://www.mged.org/Workgroups/rsbi) is a new working group formed under the MGED Society umbrella, intended to act as a 'single point of focus' for the toxicogenomics, environmental genomics and nutrigenomics communities working towards an international, compatible informatics platform for data exchange. Discipline-specific initiatives are important because they target 'real world' data capture requirements for the particular omics technologies being used. A consequence, however, is that by remaining within each discipline the standardization effort fragments, resulting in duplication and in divergent terminology and data models, thereby limiting the potential for data exchange. One objective of the RSBI working group is to ensure that these initiatives are coordinated, so that synergy and cross-discipline communication are maximized and duplicated effort is minimized.

S. A. Sansone et al.
To capitalize on these efforts, representatives of the RSBI working group are also directly participating in certain initiatives and, by fostering interactions, are laying the groundwork for further collaborations. One forum for such interaction is the Standards and Ontologies for Functional Genomics (SOFG; http://www.sofg.org) conference. We invite comments on the work of the RSBI at mged-rsbi@lists.sourceforge.net.

Standardization initiatives
Data standardization now extends beyond the research application of high-throughput technologies (reviewed in Quackenbush, 2004): regulatory bodies, such as the US Food and Drug Administration (FDA) and the Environmental Protection Agency (EPA), are developing policy or guidance on genomics data submissions (http://www.fda.gov/cder/guidance/5900dft.doc; http://www.epa.gov/osa/genomics.htm). Several organizations and committees are tackling data standardization; however, there is a fundamental difference in both the design and the objectives of efforts aimed at regulatory submission of data vs. the needs of the research community, which requires databases and tools for discovery. The former aim to accelerate the review process, facilitate proprietary data submission and optimize data visualization in a way that does not constrain the vocabulary used by the individual submitter. The research community needs to ease deposition in public databases and to facilitate data mining through common annotation standards and ontologies. There is some overlap between the needs of these communities, and some level of interaction. Thus, there is value in assessing the commonality between the objectives of regulators, the research community and database designers when designing data standards. Specifically, a unified approach to describing and reporting the experimental biological metadata common to the different 'omics' technologies (transcriptomics, proteomics and metabonomics/metabolomics) and disciplines (e.g. pharmacogenomics, toxicogenomics, environmental genomics) is a goal of the RSBI. Certain applications undoubtedly need specialized information, but a high-level unified model for the description of metadata could encompass them. Here, metadata refers to biological information about samples and the experimental design; data refers to values measured on samples (e.g. toxicological endpoints and gene expression) under given experimental conditions. This paper is not an exhaustive list of all activity, but provides a summary of standardization efforts for toxicological and environmental applications that address reporting standards (e.g. what should be reported) and management issues (e.g. how reported information should be stored and exchanged, and which ontologies should be used to annotate data and metadata). The various initiatives fall into six broad categories, summarized in Table 1 and explored in detail below.
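The metadata/data distinction above can be made concrete with a small sketch. The field names below are hypothetical, drawn from no published standard; the sketch only illustrates separating sample and study-design descriptors (metadata) from measured values (data).

```python
# Illustrative sketch only: all field names are hypothetical, not taken from
# MIAME, SMRS or any other standard discussed in this review.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SampleMetadata:
    """Biological information about a sample and its experimental context."""
    organism: str
    tissue: str
    treatment: str   # e.g. compound name
    dose: str        # e.g. "50 mg/kg"
    time_point: str  # e.g. "24 h"

@dataclass
class Measurement:
    """A measured value for a sample under the conditions in its metadata."""
    endpoint: str    # e.g. a toxicological endpoint or a probe identifier
    value: float
    unit: str

@dataclass
class StudyRecord:
    """One sample's metadata paired with its measured data."""
    metadata: SampleMetadata
    data: List[Measurement] = field(default_factory=list)

# A single record combining a toxicological endpoint with an expression value:
record = StudyRecord(
    metadata=SampleMetadata("Rattus norvegicus", "liver",
                            "compound X", "50 mg/kg", "24 h"),
    data=[Measurement("ALT activity", 132.0, "U/L"),
          Measurement("probe_001 log2 ratio", 1.8, "dimensionless")],
)
```

A unified high-level model of this kind is what would let the same metadata description travel with transcriptomics, proteomics or metabonomics measurements alike.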

'Omics' technology communities
These are academic grassroots communities that have joined forces with commercial vendors to address content standards and reporting needs for a single high-throughput technology.

MGED Society
The MGED Society has established standards for microarray data annotation (MIAME; Brazma et al., 2001; Ball et al., 2002) and exchange (MAGE-ML; Spellman et al., 2002) that have facilitated the creation of microarray databases and related supporting software (MAGE-OM; Spellman et al., 2002). The response from the scientific community to these community standards has been extremely positive (Editorial, 2002). Most of the major scientific journals and some funding agencies require publications describing microarray experiments to comply with MIAME and the data to be submitted to public repositories, such as ArrayExpress (Brazma et al., 2003), GEO (Edgar et al., 2002) and CIBEX (Ikeo et al., 2003). Consequently, the MIAME model has been adopted by other communities (Quackenbush, 2004). MGED is now working with other initiatives, such as HUPO-PSI in the proteomics field and SMRS (see below). There have been several extensions to MIAME: MIAME/Tox, an array-based toxicogenomics standard developed by the ILSI Health and Environmental Sciences Institute (HESI) (http://hesi.ilsi.org/index.cfm?pubentityid=120) and the National Institute of Environmental Health Sciences (NIEHS; Mattes et al., 2004); and MIAME/Env, an environmental genomics extension used by the NERC Environmental Genomics Thematic Programme Data Centre (EGTDC; see maxd, below).
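To illustrate how a content standard such as MIAME works in practice, the sketch below checks a draft submission against a minimal checklist. The six section names reflect MIAME's commonly cited components (Brazma et al., 2001), but the checker itself and its dictionary keys are our own simplification, not an official MGED tool.

```python
# Hypothetical simplification of a MIAME-style completeness check; the six
# section names follow the commonly cited MIAME components, everything else
# here is illustrative.
MIAME_SECTIONS = [
    "experimental_design",  # overall study aims and design
    "array_design",         # the array and its reporter sequences
    "samples",              # biological source, treatments, extraction
    "hybridizations",       # laboratory hybridization procedures
    "measurements",         # raw data and processed values
    "normalization",        # controls and normalization method
]

def missing_sections(submission: dict) -> list:
    """Return the checklist sections absent or empty in a draft submission."""
    return [s for s in MIAME_SECTIONS if not submission.get(s)]

draft = {
    "experimental_design": "dose-response study of compound X in rat liver",
    "samples": "liver RNA from treated and vehicle-control animals",
    "measurements": "raw scanner output plus normalized log ratios",
}
incomplete = missing_sections(draft)  # sections still to be documented
```

Submission tools built on MIAME (and extensions such as MIAME/Tox) apply essentially this kind of gate, albeit against a far richer specification, before data enter a public repository.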

Proteomics Standardization Initiative (PSI)
The HUPO (Human Proteome Organization; http://www.hupo.org) PSI (http://psidev.sourceforge.net) brings together the major protein databases, government and industry, and is defining standards for data representation in proteomics to facilitate data comparison, exchange and verification. The current focus is on mass spectrometry and protein-protein interaction data. A set of open-source standards is being developed along MIAME lines, including a content standard, the Minimum Information About a Proteomics Experiment (MIAPE), an XML data exchange format (Hermjakob et al., 2004) and an ontology of clearly defined general proteomics terms.

Standard Metabolic Reporting Structure (SMRS)
SMRS (http://www.smrsgroup.org) comprises industry, software developers, governmental representatives and academia, and is investigating the reporting and design of metabonomics and metabolomics studies across plants, microbial systems, the environment, in vivo and in vitro applications, and human studies. A set of draft recommendations has been produced as a discussion document. It considers the factors in a metabolic study that could be recorded and standardized, including the origin of a biological sample, the technologies and methods of analysis, and the chemometric and statistical approaches. The recommendations also touch on the granularity of information required for different reporting needs, including journal submissions, public databases and regulatory submissions.

Measurement and methods validations
As high-throughput technologies are used in industry and considered by regulatory agencies, the methodology itself comes under scrutiny. Agreement on data formats will do little good if experimental protocols are inconsistent, so standardization of microarray experimental procedures is key to the broad acceptance and use of these data. The variability of microarray data generation and analysis, the validation of the technology and the production of standard reference materials are now the focus of many initiatives.

MfB (Measurements for Biotechnology) program
MfB (http://www.mfbprog.org.uk) is a UK programme that addresses bio-measurements of importance for industry. The 'Comparability of Gene Expression Measurements on Microarrays' project is an industry-based consortium led by LGC (http://www.lgc.co.uk). The project is designed to determine the accuracy and comparability of gene expression measurements made on different array platforms, and also evaluates data analysis methods. A second phase is now looking at the standardization of array-based toxicogenomics and will build on the analysis framework to develop a panel of quality metrics for validating and standardizing array-based toxicogenomics measurements.

External RNA Controls Consortium (ERCC)
ERCC (http://www.cstl.nist.gov/biotech/workshops/ERCC2004) originated at a US National Institute of Standards and Technology (NIST; http://www.nist.gov) meeting and is composed of representatives from the public, private and academic sectors, addressing experimental control and performance evaluation for gene expression analysis. ERCC is considering the utility of universal (platform-independent) spike-in controls, protocols and informatics tools intended for use across one- and two-channel microarray platforms and quantitative RT-PCR (qRT-PCR). The outcomes of this work will be published and the resulting data submitted to a public database.

Regulatory-driven fora
To streamline regulatory electronic submissions, a number of technical issues need to be addressed. These efforts aim to identify the kinds of data that should be included in submissions to regulatory bodies and to automate the largely paper-based clinical-trial and non-clinical research processes.

Clinical Data Interchange Standards Consortium (CDISC)
CDISC (http://www.cdisc.org) is an open, multidisciplinary, non-profit organization committed to the development of worldwide pharmaceutical industry standards: vendor-neutral, platform-independent data models to support the electronic acquisition, exchange, submission and archiving of clinical trials data and metadata.

Standard for Exchange of Non-clinical Data (SEND)
SEND (http://www.cdisc.org/models/send/v1.5) is a consortium formed among the pharmaceutical industry, contract laboratories, software developers and the FDA. The goal of SEND is to develop a common format for the electronic submission of animal toxicity data and study description to a regulatory agency. Once the SEND standard is finalized, it will be merged with CDISC's model to form the Study Data Tabulation Model (SDTM).

Domain-driven fora
These toxicoinformatics- and ecoinformatics-specific initiatives are examples of international coordination in the development and adoption of controlled vocabularies and formats for exchanging chemical toxicity, ecological and environmental data.

The Distributed Structure-Searchable Toxicity (DSSTox)
DSSTox (http://www.epa.gov/nheerl/dsstox) is a network project of the US EPA, providing a community forum for publishing standard-format, structure-annotated chemical toxicity data files for open public access. Although a primary focus of this effort is the inclusion of chemical structures and standardized chemical fields, DSSTox will also promote the use of a controlled vocabulary, i.e. common data field names and entry formats for the same types of toxicity data across databases. It will link to public toxicity data by incorporating DSSTox Standard Fields and Indices into custom databases, making common queries possible using a standard DSSTox identifier. DSSTox is collaborating with, or using standards from, several other efforts, including the LeadScope In Silico Tox (LIST) Focus Group, the National Cancer Institute (NCI), NIEHS's National Center for Toxicogenomics and the National Toxicology Program, the National Library of Medicine (NLM) TOXNET, the International Union of Pure and Applied Chemistry (IUPAC), the National Institute of Standards and Technology (NIST), the ILSI HESI SAR Toxicity Database Project and MGED's MIAME/Tox, as well as numerous vendors and consortia (http://www.epa.gov/nheerl/dsstox/Co-ordinatingPublicEfforts.html).

The Science Environment for Ecological Knowledge (SEEK)
SEEK (http://seek.ecoinformatics.org) is a multidisciplinary initiative designed to create cyberinfrastructure for ecological, environmental and biodiversity research and to educate the ecological community about ecoinformatics. SEEK participants are building an integrated data grid (EcoGrid) for accessing a wide variety of ecological and biodiversity data, together with analytical tools (Kepler; http://kepler-project.org). Ecological Metadata Language (EML) is a metadata specification, developed in association with SEEK and the Knowledge Network for Biocomplexity (KNB; http://knb.ecoinformatics.org), that can be used in a modular and extensible manner to document ecological data.
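To make the idea of modular, machine-readable ecological metadata concrete, the sketch below assembles a small XML document in the spirit of EML. The element names and content are simplified illustrations only; the real EML schema defines a much richer, namespaced structure.

```python
# Illustrative only: element names echo EML's modular style (dataset, title,
# creator, coverage) but do not follow the official, namespaced EML schema.
import xml.etree.ElementTree as ET

dataset = ET.Element("dataset")
ET.SubElement(dataset, "title").text = "Stream invertebrate counts, 2003"

creator = ET.SubElement(dataset, "creator")
ET.SubElement(creator, "organization").text = "Example Field Station"  # hypothetical

coverage = ET.SubElement(dataset, "coverage")
ET.SubElement(coverage, "temporal").text = "2003-05-01/2003-09-30"
ET.SubElement(coverage, "geographic").text = "47.6N 122.3W"

xml_text = ET.tostring(dataset, encoding="unicode")
```

Because each module (here, creator and coverage) is self-contained, tools such as EcoGrid can index and query documents that describe very different kinds of ecological data.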

World-wide organizations
Global organizations have initiated a dialogue between technological experts, regulators and the principal validation bodies to draw road maps for development, validation and regulatory use of omics-based technologies in chemical assessment. Others are liaising with different life sciences disciplines, offering support, mediation and consultancy to speed up the standards development process.
Organization for Economic Co-operation and Development (OECD) and the International Program on Chemical Safety (IPCS)
IPCS (http://www.who.int/ipcs/en/) is a joint program of three cooperating organizations (the International Labour Organization, the United Nations Environment Programme and the World Health Organization) implementing activities related to chemical safety. In collaboration with the Organization for Economic Co-operation and Development (OECD; http://www.oecd.org), the IPCS has organized a series of workshops to identify possible applications of (eco)toxicogenomics-based methods in regulatory hazard assessment, to determine the current limitations to their use in regulatory assessment and develop a plan to overcome those limitations, and to identify needs for future activities regarding the use of these methods in test guidelines and in new and existing chemicals, pesticides and biocides programs. Recommendations are now being prepared for publication. In view of these recommendations, a coordinated international research program on (eco)toxicogenomics will be initiated, aiming to optimize the integration of genomic techniques into (eco)toxicology and their use in ecological and human health risk assessment.

The National Academy of Sciences (NAS)
The NAS Committee on Emerging Issues and Data on Environmental Contaminants (http://dels.nas. edu/emergingissues) is a public forum for communication among government, industry, environmental groups and the academic community about emerging evidence and issues in toxicogenomics, environmental toxicology, risk assessment and exposure assessment. The Committee will develop a framework for how the emerging field of genomics will be incorporated into risk assessment.

The Institute of Electrical and Electronics Engineers (IEEE)
The IEEE has formed a Bioinformatics Standards Committee (BSC), bridging standards activities in the life sciences disciplines and the IEEE Standards Association. BSC will provide a neutral forum for the global bioinformatics community to work towards common agreements on standards in new areas and integration between established standards.

Standard(s)-compliant infrastructure
This section provides a short review of public infrastructure currently available for toxicogenomics and environmental genomics data. These efforts are in different stages of development, serving specific needs of their user community and relying on diverse types of funding support. Nevertheless, these are examples of institutions working together, sharing expertise and moving towards an internationally compatible informatics platform for data exchange, interacting closely with standardization initiatives listed here.

ArrayExpress and Tox-MIAMExpress
ArrayExpress (http://www.ebi.ac.uk/arrayexpress) (Brazma et al., 2003) is a MGED standards-compliant, public infrastructure for microarray-based gene expression data at the EBI. The infrastructure has been extended to link biological endpoint values with gene expression data as a result of a collaborative undertaking with the ILSI HESI Committee on the Application of Toxicogenomics Data to Mechanism-based Risk Assessment (http://www.ebi.ac.uk/microarray/Projects/tox-nutri). Their toxicogenomics datasets (Pennie et al., 2004) have been submitted to ArrayExpress using Tox-MIAMExpress, the online MIAME/Tox-compliant data input tool (Mattes et al., 2004) (http://www.ebi.ac.uk/tox-miamexpress). The ILSI HESI Committee research programme has provided the first large array-based toxicogenomics dataset in the public domain annotated according to the MGED standards.

Chemical Effects in Biological Systems (CEBS) Knowledgebase
CEBS (http://cebs.niehs.nih.gov) (Waters et al., 2003) is a public toxicogenomics knowledgebase in year two of its 10 year development at the NIEHS's NCT. CEBS aims to integrate omics datasets in the context of toxicology to advance knowledge discovery about toxicity (Waters et al., 2003;Waters and Fostel, 2004;Mattes et al., 2004). CEBS implements standards developed by the MGED Society and the HUPO PSI in the CEBS SysBio object model (Xirasagar et al., 2004). CEBS is designing an ontological representation of data and terms used by its collaborators, which includes descriptors for different study design types and metadata vocabularies.

maxd
maxd (http://bioinf.man.ac.uk/microarray/maxd) is an open-source data warehouse and visualization environment for genomic expression data employed by the NERC EGTDC. The maxd software suite includes two major components. The first, maxdLoad2, is a database schema and data loading and curation application designed to enable biologists to store expression data, annotate it according to the MIAME and MIAME/Env standards, and export it in MAGE-ML format to ArrayExpress. The second, maxdView, is a modular analysis and visualization environment for interactive exploration of transcriptomics data and associated metadata.

Toxicoinformatics Integrated System (TIS)
ArrayTrack (http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack; Tong et al., 2003) is an integrated software system for managing, mining and visualizing microarray gene expression data at NCTR-FDA. The system has three integrated components: a MIAME-compliant database storing array-based toxicogenomics data; a set of tools providing data visualization and analysis capability; and a library containing functional information about genes, proteins, pathways and toxicants. ArrayTrack is the first module of TIS, a system to integrate genomic, proteomic and metabonomic data with data from the public repositories, as well as conventional in vitro and in vivo toxicology data. TIS will serve as a general toxicogenomics repository for diverse data sources, supporting broad data mining and meta-analysis activities, as well as the development of robust and validated predictive toxicology systems.

The Comparative Toxicogenomics Database (CTD)
The CTD (http://ctd.mdibl.org) promotes understanding about the effects of environmental chemicals on human health by facilitating cross-species comparative studies of toxicologically important genes and proteins. CTD is now publicly available as a prototype. It provides annotated associations between genes, proteins, sequences, references and chemicals in vertebrates and invertebrates; integrates molecular and toxicology data; implements ontologies; and will describe gene-chemical interactions in diverse organisms. These data provide insight into the genetic basis of variable sensitivity to chemicals and complex interactions between the environment and human health.

Conclusions
Data produced by (eco)toxicogenomics investigations are growing in volume and complexity at a staggering rate. Defining precise data content, presentation and exchange formats is not trivial. However, there is a growing realization within the (eco)toxicogenomics community that, if we are to realize the opportunities offered by omics-based technologies, we will need to change our approach to data handling and work more collaboratively. The authors, also moderators of the RSBI working group, would like to emphasize the need for community participation in the integration of these standardization initiatives. It is hoped that highlighting these different initiatives will help to assess their commonality and optimize harmonization, thus minimizing duplication and incompatibility and achieving cost-effective results in a timely manner.