High-throughput “omics” technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, impose a major challenge that requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput “omics” data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput “omics” data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied “omics” data from different laboratories to make useful connections that could lead to new biological knowledge.
Unlike the traditional one-gene-at-a-time approach, which provides the detailed molecular functions of individual genes, the advances of high-throughput technologies in the study of molecular biology systems in the past decades marked the beginning of a new era of biological and biomedical research, in which researchers systematically study organisms on the levels of genomes (complete genetic sequences) [
Genomics analysis tells us the complete genetic sequences and the intragenomic interactions within the genomes. The sequences only tell us what a cell can potentially do. In order to know what a cell is doing, DNA microarray technologies [
The genome of an organism is relatively constant, while the proteome of an organism, a set of expressed proteins under varied conditions, can be quite different for different cell types and conditions. Because the expression profiling at the transcript level can only give a rough estimate of the concentration of expressed proteins, high-throughput profiling at the protein level using mass spectrometry technologies has been widely used to identify, characterize, and quantify all the proteins and their functions in cells under a variety of conditions [
The rapid growth of high-throughput genomics, proteomics, and other large-scale “omics” data presents both opportunities and challenges in terms of data integration and analysis. Many bioinformatics databases and repositories have been developed to organize and provide biological annotations for individual genes and proteins to facilitate the sequence, structural, functional, and evolutionary analyses of genes and proteins in the context of pathway, network, and systems biology. In addition, a rapidly growing number of quantitative methods and tools have been developed to enable efficient use and management of various types of “omics” data and analyses of large data sets for different biological problems, including biomarker discovery for diagnosis and early detection of disease. A few examples include (1) Bioconductor [
The richness of “omics” data allows researchers to ask complex biological questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, impose a major challenge for all parties involved, from lab technicians and data analysts to biomedical researchers who are trying to interpret the final results of “omics” experiments. Therefore, advanced computational methods and bioinformatics infrastructures are needed for integration, mining, visualization, and comparative analysis of multiple high-throughput “omics” data to facilitate data-driven hypothesis generation and biological knowledge discovery.
In this paper, we present the challenges in high-throughput “omics” data integration and analysis in Section
The most commonly used molecular biology databases for functional analysis of gene and protein expression data are listed in Table
Commonly used molecular biology databases for functional analysis of gene and protein expression data.
Database name | Database content | Data access and analysis support | URL |
---|---|---|---|
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, UniProt Archive (UniParc) [ | UniProt protein sequences and functional information, comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world | Text search; Blast sequence similarity search; Sequence alignment; Batch retrieval; Database ID mapping; FTP download | |
NCBI Reference Sequence (RefSeq) [ | Non-redundant collection of richly annotated DNA, RNA, and protein sequences | Entrez query access; Searching Nucleotide or Protein; Searching Genome; BLAST; FTP download; Sequence Homology searches and retrieval | |
GenBank [ | Genetic sequence database, an annotated collection of all publicly available DNA sequences databases | Database query; Phylogenetics; Genome Analyses; FTP download | |
EMBL [ | |||
DDBJ [ | |||
UniGene [ | Non-redundant set of eukaryotic gene-oriented clusters of transcript sequences, together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location | Entrez query; Library browse; Digital Differential Display; FTP download | |
FlyBase [ | Aberration Maps; Batch download; BLAST; Chromosome Maps; Coordinate Converter; CytoSearch; GBrowse; ID Converter; ImageBrowse; Interactions Browser; QueryBuilder; TermLink; FTP download | ||
Mouse Genome Database (MGD) [ | Gene characterization, nomenclature, mapping, gene homologies among mammals, sequence links, phenotypes, allelic variants and mutants, and strain data | Genes & Markers Query; Sequence Query; MouseBLAST; Graphical Map Tools; Mouse Genome Browser; Batch Query; MGI Web Service | |
Saccharomyces Genome Database (SGD) [ | Genetic and molecular biological information about | Search Gene function information and Protein information; Specialized Gene and Sequence Searches; Search Yeast Literature; BLAST; Batch download; Pattern Matching; Genome Restriction Analysis; PDB Homology Query; Yeast Protein Motif Query; Yeast Biochemical Pathways; Gene Expression Connection | |
WormBase [ | Data repository for | Gene, Phenotype, protein, and Genetics Search; Microarray Expression download and Pattern search; Ontology Search | |
The Arabidopsis Information Resource (TAIR) [ | The genetic and molecular biology information resource about | Synteny Viewer; MapViewer; Pattern Matching; Motif Analysis; Bulk Data Retrieval; Chromosome Map Tool; Restriction Analysis | |
NCBI Taxonomy [ | Names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence | Browse; Retrieve and FTP download | |
UniProt Taxonomy [ | UniProt taxonomy database, which integrates taxonomy data compiled in the NCBI database and data specific to the UniProt Knowledgebase | Query the database by keywords (species name) or NCBI taxonomic identifier | |
Gene Expression Omnibus (GEO) [ | Public repository for high-throughput microarray experimental data | Search by accession number; Search Entrez GEO DataSets or Entrez GEO Profiles with keywords; Visualize cluster heat map images; Retrieve other genes with similar expression patterns; Retrieve chromosomally closest 20 genes; FTP download | |
CleanEx [ | Expression reference database that facilitates joint analysis and cross-dataset comparisons | Search by ID, Gene symbol and target ID; List expression datasets; Text search in expression datasets description lines; Extract all features of common genes between datasets; Experiments pools comparison; Batch retrieval; FTP download | |
SOURCE [ | Functional genomics resource for human, mouse and rat to facilitate the analysis of large sets of data using genome-scale experimental approaches | Search by CloneID, Database Accession, Gene name/Symbol, UniGene ClusterID, Probe ID, and Entrez GeneID; Batch retrieval | |
ArrayExpress [ | Public repository for well-annotated data from array based platforms, including gene expression, comparative genomic hybridization (CGH) and chromatin-immunoprecipitation (ChIP) experiments, tiling arrays, and so forth | Web-based query interface; REST and Web-services access; FTP download; Web-based online microarray analysis tool—Expression Profiler | |
Global Proteome Machine Database (GPMDB) [ | Global Proteome Machine Database, which utilizes the information obtained by GPM servers to aid in peptide validation as well as protein coverage patterns | Search by protein description keywords, and data set keywords | |
PRoteomics IDEntifications Database (PRIDE) [ | PRIDE database provides public data repository for proteomics data | Search by PRIDE Experiment accession number and Protein accessions; Browse experiments by project name or categories such as species, tissue, cell type, GO terms and disease; Ontology Lookup Service (OLS); Protein Identifier Cross Reference (PICR) service; Database on Demand (DOD) | |
Peptidome [ | Public repository that archives and freely distributes tandem mass spectrometry peptide and protein identification data | Search by Accession, Author, Description, MeSH Terms, Organism, Peptide Count, Platform, Protein Count, Protein GI, Publication Date, Search Engine, Spectra Count, Submitter Institute, Title, Update Date | |
PeptideAtlas [ | Peptide database identified by Tandem Mass Proteomics experiments | Search by Protein/Gene Name, Protein/Gene ID, Protein/Gene Symbol, Accession, Refseq, Sequence and Peptide Accession; Browse Peptides; Browse Proteins; FTP download | |
Swiss-2DPAGE [ | Annotated 2D gel electrophoresis database contains data on proteins identified on various 2D PAGE and SDS-PAGE reference maps | Search by description, accession number, author, spot serial number, experimental pI/Mw range and experimental identification methods; Retrieve all the protein entries identified on a given reference map; Compute estimated location on reference maps for a user-entered sequence; FTP download | |
Kyoto Encyclopedia of Genes and Genomes (KEGG) [ | Integrated database resource consisting of 16 main databases, broadly categorized into systems information, genomic information, and chemical information | Access by KEGG object identifier; KEGG Web Services and KEGG FTP download; Pathway Mapping; Brite Mapping; KegHier for browsing and searching functional hierarchies in KEGG BRITE; KegArray for analysis of transcriptome data (gene expression profiles) and metabolome data (compound profiles) | |
BioCyc [ | Microbial pathway/genome databases | Visualize individual metabolic pathways; View the complete metabolic map of an organism; Genome browsing capabilities and comparative analysis tools | |
Online Mendelian Inheritance in Man (OMIM) [ | A catalog of human genetic and genomic phenotypes | Entrez search at basic, advanced, or complex Boolean levels; Browse entries; Build query; Combine search results; Store search results in Clipboard; FTP download | |
HapMap [ | Resource for human genetic variation | Browse data; Bulk data download; HapMart—a data mining tool for retrieving data from the HapMap database | |
Gene Ontology (GO) [ | Gene Ontology database provides controlled vocabulary of terms describing Biological process, Cellular component, and Molecular function of gene and gene product annotation data | Tools include Browsers, Microarray tools, Annotation tools, Mapping to other databases, FTP download in Flat file, MySQL or RDF XML format | |
IntAct [ | Protein-protein interaction data | Browse by UniProt Taxonomy, Gene Ontology, Interpro Domain, Reactome Pathway, Chromosomal Location, and mRNA expression, FTP download in PSI-MI and PSI-MI TAB format | |
Database of Interacting Proteins (DIP) [ | Database of experimentally determined interactions between proteins with curator or computational methods generated annotations | Search by protein entry, BLAST, Motif, Article and pathBLAST; Data analysis services include Expression Profile Reliability Index, Paralogous Verification, and Domain Pair Verification | |
RESID [ | Collection of annotations and structures for Protein Pre-, Co- and Post-translational modifications | Web-based search interface; FTP download database entries in XML format, and associated files containing XML DTD, graphic images, and molecular models | |
Phosphosite [ | Database of phosphorylation sites and other Post-translational modifications | Search by Protein, Sequence, or Reference; Browse MS data by Disease, Cell Line, and Tissue | |
Protein Data Bank (PDB) [ | Database of experimentally-determined structures of proteins, nucleic acids, and complex assemblies | Web-based search and browsing interface; File download via http and FTP services in PDB, mmCIF, and PDBML/XML format | |
Structural Classification of Proteins (SCOP) [ | Comprehensive ordering of all proteins of known structure according to their evolutionary and structural relationships | Keywords-based search | |
CATH [ | Protein domain structures database | Search by ID/Keywords and FASTA sequence; BLAST; Cathedral server, and SSAP server for query and analysis CATH data; FTP download | |
Molecular Modeling Database (MMDB) [ | Database of 3D structures | Search by UID/text term, protein sequence and 3D coordinates; FTP download | |
PDBsum [ | Summaries and analyses of PDB structures | Search by text or sequence; Browse by Highlights, List of PDB codes, Het Groups, Ligands, Enzymes, ProSite and Species; Download data file for protein names, protein sequences, protein annotations, Enzymes, Het Groups, and Ligands | |
Protein Structure Model Database (Modbase) [ | Annotated comparative protein structure models and related resources | Search by model or sequence similarity and properties | |
PIRSF [ | Family/superfamily classification of whole proteins | Batch retrieval using UniProtKB AC, PIRSF ID, Pfam ID, COG ID, EC Number, GO ID, KEGG Pathway ID, PDB ID; PIRSF scan by sequence or UniProtKB identifier; FTP download | |
UniProt Reference Clusters (UniRef) [ | UniProt non-redundant reference clusters | Searches on various attributes of the UniRef clusters, including UniRef cluster ID, protein names, organism names and database identifiers; Direct web access in HTML, XML and FASTA format; FTP download in XML format | |
Pfam [ | Protein families of domains each represented by multiple sequence alignments and hidden Markov models (HMMs) | Search by Sequence, Functional similarity, Keyword, Domain, DNA, and Taxonomy; Browse by Families, Clans, Proteomics; FTP download | |
InterPro [ | Integrated resource of protein families, domains, and functional sites | Text search; SRS text search; InterPro Scan; InterPro BioMart; Web services; FTP download | |
Protein ANalysis THrough Evolutionary Relationships (PANTHER) Classification System [ | Gene products organized by biological function | Search; Browse; Batch search; Gene expression data analysis; Evolutionary analysis of coding SNPs; HMM sequence scoring; FTP download | |
Simple Modular Architecture Research Tool (SMART) [ | Resource for protein domain identification and the analysis of protein domain architectures | Sequence analysis; Architecture analysis; Domain detection |
In many cases, the most difficult task is not mapping biological entities from different sources or managing and processing large sets of experimental data, such as raw microarray data, 2D gel images, and mass spectra. The problem is recording the detailed provenance of those data, that is, what was done, why it was done, where it was done, which instrument was used, what settings were used, how it was done, and so forth. The provenance of experimental data is an important aspect of scientific best practice and is central to scientific discovery [
General biomedical scientists are more interested in finding and viewing the “knowledge” contained in an already analyzed data set. However, in high-throughput research much of the gene/protein data generated is insignificant to the conclusions of an analysis. Of the thousands of genes examined in a microarray experiment, only relatively few show significant responses relevant to the treatment or condition under study. Unfortunately, this information seldom comes with the standard data files and formats and is usually not easily found in “omics” repositories unless a re-analysis is performed or the data are annotated by a curator. For example, tables of proteins present in a given proteomics experiment or genes found to be up- or down-regulated under defined conditions are routinely found as supplemental data in scientific publications but are not available in a searchable or easily computable form anywhere else. This is unfortunate, as this supplemental information is the result of considerable analysis by the original authors of a study to minimize false positive and false negative results and often represents the “knowledge” that underlies additional analysis and conclusions reached in a publication.
Recently, “omics” data analysis has focused on information integration of multiple studies including cross-platform, cross-species, or cross-disease-type analyses [
As the volume and diversity of data and the desire to share those data increase, we inevitably encounter the problem of combining heterogeneous data generated from many different but related sources and providing the users with a unified view of this combined data set. This problem emerges in the life sciences research community, where research data from different bioinformatics data repositories and laboratories need to be combined and analyzed. The benefit of developing a data integration system is that it can facilitate information access and reuse by providing a common access point. It also provides users with a more complete view of the available information.
Lenzerini [
In our experience, the users of microarray, proteomics, and other “omics” data can be broadly divided into two groups: (1) bioinformaticians or biostatisticians who routinely develop tools to handle large and complex data sets; (2) general biomedical scientists who lack the expertise or tools to do “omics” data analysis but still want to analyze the data sets and find the biological knowledge related to the set of genes or proteins they are studying. Considering these target user groups, our approach for integration of diverse high-throughput “omics” data is to construct a relatively lightweight data warehouse to capture the key information or “knowledge” our users are likely to need.
In our approach, the original data may reside in other databases or repositories that are managed and optimized for a particular type of “omics” data such as microarray and mass spectrometry data. Our warehouse uses Web Services, database downloads, and other means to make regular updates, with web links back to the original data sources. This approach requires fewer computational resources and less human involvement while still providing usability, flexibility, reliability, and performance. As proteins occupy a middle ground molecularly between gene and transcript information and higher levels of molecular and cellular structure and organization, the key design principle of our data integration approach is to integrate diverse “omics” data and present them in a protein-centric fashion, where information queries are conducted via common proteins and their large set of attributes such as families, functions, and pathways.
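The protein-centric design described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `ProteinRecord` and `ProteinWarehouse`, the attribute names, and all accessions and links used with it are hypothetical. It only shows the idea of keying every data type on a common protein identifier while keeping links back to the original sources rather than copying the raw data.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProteinRecord:
    """One warehouse entry: key attributes plus links back to source data."""
    accession: str                                   # protein accession used as the join key
    attributes: dict = field(default_factory=dict)   # e.g. family, function, pathway
    evidence: list = field(default_factory=list)     # (data type, link to original source)


class ProteinWarehouse:
    """Hypothetical lightweight warehouse indexed by protein and by attribute."""

    def __init__(self):
        self._records = {}
        self._index = defaultdict(set)  # (attribute name, value) -> accessions

    def _record(self, accession):
        # Create the record on first use so every data type shares one entry.
        return self._records.setdefault(accession, ProteinRecord(accession))

    def add_attribute(self, accession, name, value):
        record = self._record(accession)
        old = record.attributes.get(name)
        if old is not None:
            # Keep the attribute index consistent when a value is updated.
            self._index[(name, old)].discard(accession)
        record.attributes[name] = value
        self._index[(name, value)].add(accession)

    def add_evidence(self, accession, data_type, source_link):
        # Only a link is stored; the raw data stay in the source repository.
        self._record(accession).evidence.append((data_type, source_link))

    def query(self, name, value):
        """Return accessions of all proteins sharing the given attribute value."""
        return sorted(self._index.get((name, value), set()))

    def data_types(self, accession):
        """Return the set of data types that report this protein."""
        record = self._records.get(accession)
        return {data_type for data_type, _ in record.evidence} if record else set()
```

A query such as `warehouse.query("pathway", "glycolysis")` would then return every protein annotated with that pathway, regardless of which laboratory or platform reported it.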
The use of different data sources and identifiers in analysis pipelines is a common problem encountered when we try to combine the data across multiple laboratories or research centers. One of the most difficult problems in “omics” data integration and analysis is to maintain the correspondence of IDs for genes and proteins and their high-level functional attributes such as modifications, pathways, structures, and interactions. The ID or name mapping [
The Protein Information Resource (PIR) provides an ID mapping service (
Database identifiers supported by PIR ID mapping service.
From | To |
---|---|
FLY ID, GenBank AC, Genpept AC, GI Number, IPI ID, MGI ID, NREF ID, PIR-PSD ID, PIR-PSD AC, Refseq AC, SGD ID, TIGR ID, UniParc AC, UniProtKB AC, UniProtKB ID | FLY ID, GenBank AC, Genpept AC, GI Number, IPI ID, MGI ID, NREF ID, PIR-PSD ID, PIR-PSD AC, Refseq AC, SGD ID, TIGR ID, UniParc AC, UniProtKB AC, UniProtKB ID, UniRef50, UniRef90, UniRef100 |
BLOCKS ID, COG ID, Pfam ID, PIRSF ID, PRINTS ID, PROSITE ID, UniRef50, UniRef90, UniRef100 | BLOCKS ID, COG ID, Pfam ID, PIRSF ID, PRINTS ID, PROSITE ID |
BIND ID, EC Number, GO ID, KEGG Pathway ID, RESID ID | BIND ID, EC Number, GO ID, KEGG Pathway ID, RESID ID |
Taxon Group ID, Taxon ID | Taxon Group ID, Taxon ID |
Entrez Gene ID, OMIM ID, PDB ID, PubMed ID, Gene Name | Entrez Gene ID, OMIM ID, PDB ID, PubMed ID, Gene Name |
PIR ID mapping service maps a set of NCBI GI numbers to UniProt accession numbers.
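To make the ID mapping step concrete, the following is a minimal sketch that works on a locally loaded table of identifier pairs. It does not call the PIR service, whose interface is not described here; the function names `build_mapping` and `map_ids` and the placeholder identifiers are assumptions made for illustration. The sketch keeps one-to-many mappings and reports identifiers that could not be mapped, two cases that any such pipeline has to handle.

```python
from collections import defaultdict


def build_mapping(pairs):
    """Build a source-ID -> set of target-IDs lookup from (source, target) pairs."""
    mapping = defaultdict(set)
    for source_id, target_id in pairs:
        mapping[source_id].add(target_id)
    return dict(mapping)


def map_ids(source_ids, mapping):
    """Map a batch of identifiers.

    Returns (mapped, unmapped): `mapped` maps each source ID to a sorted list
    of target IDs (one source may map to several targets), and `unmapped`
    lists source IDs with no entry in the table, in input order.
    """
    mapped, unmapped = {}, []
    for source_id in source_ids:
        targets = mapping.get(source_id)
        if targets:
            mapped[source_id] = sorted(targets)
        else:
            unmapped.append(source_id)
    return mapped, unmapped
```

Reporting the unmapped identifiers separately, rather than silently dropping them, lets the analyst see how much of a data set is lost at the mapping step.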
PIR provides iProClass (
iProClass database contains full descriptions of all known proteins with up-to-date information from many sources (Figure
The overview of PIR iProClass data warehouse.
iProClass database is implemented in Oracle and updated every three weeks. The underlying database schema and update procedures have been modified to interoperate with UniProtKB. iProClass also provides comprehensive views for more than 35,000 PIRSF protein families [
iProClass provides a set of data search and retrieval interfaces and value-added views for UniProtKB protein entries and PIRSF family entries with extensive annotations and graphical display of reported information.
The iProClass website provides a very simple way to retrieve protein entries by a single protein ID or one of many other sequence database identifiers. It also allows retrieval of protein entries using a batch of database identifiers. The batch retrieval tool (
iProClass data warehouse batch retrieval tool web form and result page. (1)
Peptide sequences, such as those obtained by MS/MS proteomics experiments, can be used as queries to search the UniProtKB database for proteins containing exact matches to the peptide sequence. The search can be performed on the whole set of proteins or restricted to a taxonomy group or a specific organism, as in the example shown in Figure
iProClass data warehouse peptide match tool web form and result page. (1)
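The exact peptide matching described above can be sketched as a substring search over protein sequences with an optional organism filter. This is an illustration only, not the iProClass implementation: the function name `peptide_match`, the in-memory dictionary layout, and the toy sequences used with it are assumptions, and a production tool would need an index to search a database of this size efficiently.

```python
def peptide_match(peptide, proteins, organism=None):
    """Find proteins containing an exact match to a peptide sequence.

    `proteins` maps accession -> (organism name, amino acid sequence).
    If `organism` is given, only proteins from that organism are searched.
    Returns a list of (accession, [1-based start positions]) for each hit.
    """
    peptide = peptide.strip().upper()
    if not peptide:
        raise ValueError("peptide sequence must not be empty")

    hits = []
    for accession, (source_organism, sequence) in proteins.items():
        if organism is not None and source_organism != organism:
            continue
        positions = []
        start = sequence.find(peptide)
        while start != -1:
            positions.append(start + 1)                # report 1-based positions
            start = sequence.find(peptide, start + 1)  # allow overlapping matches
        if positions:
            hits.append((accession, positions))
    return hits
```

Restricting the search by organism reduces the number of candidate proteins and avoids reporting matches in unrelated species when the sample origin is known.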
The iProClass integrated database presents information in two types of summary report: the Protein summary report and the Family summary report. The Protein summary report contains information about protein ID and name, source organism taxonomy, sequence annotations, data cross-references, family classification, and a graphical display of domains and motifs on the amino acid sequence. A sample Protein summary report can be viewed here (
In this section, we use the NIAID (National Institute of Allergy and Infectious Diseases) Biodefense Proteomics Resource (
The NIAID Biodefense program consists of seven Proteomics Research Centers (PRCs) conducting state-of-the-art high-throughput research on pathogens of concern in biodefense and emerging infectious diseases as well as a Biodefense Resource Center for public dissemination of the pathogen and host data, biological reagents, protocols, and other project deliverables (Table
NIAID biodefense proteomics resource catalog summary.
Organism | PRC | Data Type | SOPs | No. of proteins in Master Protein Directory (MPD) | No. of reagents in Master Reagent Directory (MRD) | No. of proteins in Complete Predicted Proteome (CPP) |
---|---|---|---|---|---|---|
Caprion Proteomics Inc. | Mass spectrometry | 23 | 4963 | — | 6070 | |
Einstein Biodefense Proteomic Research Center | Mass spectrometry | 4 | 609 | Antibodies (68) | — | |
Myriad Genetics | Protein interaction | 4 | 62 | Clone (4379) | 4629 | |
Pacific Northwest National Laboratory | Mass spectrometry | 2 | 2958 | Antibodies (1) | 187 | |
Scripps Research Institute | Protein structure | 5 | 6 | Clone (7) | — | |
Einstein Biodefense Proteomic Research Center | Mass spectrometry | 5 | 6678 | Antibodies (101) | — | |
Myriad Genetics | Protein interaction | 2 | 33 | Clone (315) | 254 | |
Pacific Northwest National Laboratory | Mass spectrometry | 2 | 2973 | — | — | |
Harvard Institute of Proteomics | Clone | 4 | 3731 | Bacteria (627) Clone (7172) | 11208 | |
Myriad Genetics | Protein interaction | 5 | 75 | Clone (9900) | 5966 | |
Harvard Institute of Proteomics | Clone | 3 | 5342 | Clone (5344) | ||
University of Michigan | Mass spectrometry | — | 5851 | Bacteria (22) ArrayChip (1) | ||
16686 | ||||||
University of Michigan | Microarray | 2 | 6378 | — | ||
Myriad Genetics | Protein interaction | 5 | 84 | Clone (7884) | ||
Pacific Northwest National Laboratory | Mass spectrometry | — | 2061 | Bacteria (38) | — | |
Pacific Northwest National Laboratory | Protein interaction | — | 3 | — | ||
Pacific Northwest National Laboratory | Mass spectrometry | 12 | 3753 | — | 4532 | |
Pacific Northwest National Laboratory | Microarray | — | 653 | — |
Based on the functional requirements of the Resource Centers, we developed a protein-centric bioinformatics infrastructure for integration of diverse data sets. Multiple data types from PRCs are submitted to the center using a data submission protocol and standard exchange format, with the metadata using controlled vocabulary whenever possible. Underlying the protein-centric data integration is a data warehouse called the Master Protein Directory (MPD) [
The MPD focuses on capturing significant results that are usually only available in the supplementary tables provided by the primary authors. To enable searching on these results, they need to be converted into a searchable and digested form and mapped to the gene or protein of interest. To achieve this goal we developed a simple set of defined fields called
We have developed methods and prototype software tools specifically designed to provide functional and pathway discovery of large-scale “omics” data in a systems biology context with rich functional descriptions for individual proteins and detecting functional relationships among them. A prototype expression analysis system, integrated Protein eXpression (iProXpress) (
After the ID mapping of proteins, rich annotation can be fully described in a protein information matrix based on sequence analysis and integration of information from the MPD. We precompute and regularly update sequence features of functional significance for UniProt proteins, and make the sequence analysis tools available for online analysis of proteins/sequence variations not in UniProt database. Sequence features precomputed include homologous proteins in KEGG [
Functional profiling analysis aims at discovering the functional significance of expressed proteins, the plausible functions and pathways, and the hidden relationships and interconnecting components of proteins, such as proteins sharing common functions, pathways, or cellular networks. The extensive annotation in the protein information matrix allows functional categorization and detailed analysis of expressed proteins in a given data set as well as cross-comparison of co-expressed or differentially-expressed proteins from multiple data sets. For functional categorization, proteins are grouped based on annotations such as GO [
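The functional categorization and cross-comparison described above can be sketched as grouping proteins by annotation term and then comparing the groups between two data sets. This is a minimal illustration under stated assumptions, not the iProXpress implementation: the function names `categorize` and `compare_categories`, the annotation dictionary layout, and the example terms used with it are hypothetical.

```python
from collections import defaultdict


def categorize(proteins, annotations):
    """Group proteins by annotation term.

    `annotations` maps accession -> iterable of terms (for example, GO terms
    or pathway names). Proteins without annotations are simply not grouped.
    """
    groups = defaultdict(set)
    for accession in proteins:
        for term in annotations.get(accession, ()):
            groups[term].add(accession)
    return dict(groups)


def compare_categories(dataset_a, dataset_b, annotations):
    """Cross-compare two protein lists category by category.

    For each term seen in either data set, report which proteins are shared
    and which are specific to one data set.
    """
    groups_a = categorize(dataset_a, annotations)
    groups_b = categorize(dataset_b, annotations)
    table = {}
    for term in sorted(set(groups_a) | set(groups_b)):
        in_a = groups_a.get(term, set())
        in_b = groups_b.get(term, set())
        table[term] = {
            "shared": sorted(in_a & in_b),
            "only_a": sorted(in_a - in_b),
            "only_b": sorted(in_b - in_a),
        }
    return table
```

A category in which most proteins appear in only one data set is a natural starting point for asking which functions or pathways differ between the two conditions.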
In the NIAID proteomics resource center project, our support for data mining and analysis was designed to ensure that (1) all project data and other deliverables are available via browsing and simple keyword search; (2) the data and information are sufficient for re-analysis or mining by a skilled researcher; and (3) the data, procedures, publications, and general results and conclusions of an analysis are easily searchable for a biomedical scientist who is not familiar with the details of the particular technologies used to generate them. We focused on providing simple, yet powerful, queries of experimental summaries where a user can ask whether a gene/protein was present in the results. Once a set of proteins of interest is identified, the user can further view the specific experimental values, methods used to generate the particular data set, and all protein attributes such as protein names, accessions, or project data, and search pathways, protein families, Gene Ontology (GO) [
The MPD web interface with its ability to mine the data and download information to other analysis tools has been used to identify and rank potential targets for therapeutics and diagnostics [
Protein-centric query across multiple data types in the NIAID Biodefense Master Protein Directory. (a) Search for
A single experiment, Myriad_Bac_07, contains interactions between 497
Pathogen-Host Y2H protein interaction data, from Figures
Inspection of the protein interaction data showed that it contained a total of 84 bacterial proteins interacting with 412 human proteins (Figure
We downloaded the UniRef_90 [
The availability of voluminous, complex, and context-dependent high-throughput “omics” data brings both challenges and opportunities for bioinformatics research. The integrative analysis across multiple data sets can reveal the potential functional significance and hidden relationships between different biological entities, which requires advanced computational methods and bioinformatics infrastructures to support integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery.
Our protein-centric integration approach, based on the protein ID mapping service, the iProClass data warehouse, and the iProXpress discovery platform, provides a simple but powerful bioinformatics infrastructure for scientific discovery and hypothesis generation. The case study using the NIAID Biodefense Proteomics Resource as an example illustrates that our protein-centric data integration allows query and analysis across different data types and pathogen-host systems that can lead to new biological knowledge. It is also a practical approach to integrating and navigating diverse sets of “omics” data in a manner useful for systems biology study.
As future work, the prototype system iProXpress will be further developed into a pipelined analysis tool to allow direct integration of multiple high-throughput “omics” experimental results. Moreover, network modeling methods will be incorporated for functional and pathway analysis in a broader range of biological systems. We will also explore the use of ontologies and Semantic Web technologies to facilitate the semantic integration of high-throughput “omics” experimental data.
This study is supported in part by Grants U01HG02712 and HHSN266200400061C. The authors would like to thank the anonymous reviewers for their constructive comments on the manuscript.