Enabling Proteomics: The Need for an Extendable ‘Workbench’ for User-Configurable Solutions
R. J. Beynon

Proteomics has the capability to generate overwhelming quantities of data in relatively short timescales, and it is not uncommon to see experimenters investing substantially more time in data analysis than in data gathering. Although several sophisticated tools for data reduction and analysis are available, they lack the flexibility to cope with increasingly innovative experimental strategies and new database resources that encode both qualitative and quantitative data. I will outline a specification of a flexible proteomics tool that could address many current bottlenecks and deficiencies.


Introduction
The science of proteomics is often seen as a large-scale exercise in systems biology. The sequential processes in proteomics are: simplification of the proteome (organelle purification, 2D gels, protein chromatography, 1D and 2D peptide chromatography); mass spectrometric analysis of the ensuing peptides (using one or more of the impressive variety of the current generation of mass spectrometers); and bioinformatic analysis of the resultant mass spectrometric data. One of the major challenges in proteomics is the identification and quantification of every protein in a complex mixture. This goal has been a major driver in the evolution of high-throughput systems, in novel strategies for proteome simplification and in the construction of powerful bioinformatics platforms.
Yet there are many studies in proteomics that have a rather different perspective. The focus can be constrained to a single protein or a small group of proteins. It is possible to entertain the idea of proteomics experiments without a single gene, cDNA, EST or protein sequence being present in any database. Some experimenters seek to extend proteomics by the use of clever chemical or stable isotope-based methods that greatly enhance simplification or subsequent analysis of the analyte. Whether such studies are all deserving of the title of 'proteomics' is irrelevant; they represent cutting-edge thinking in analytical and preparative protein chemistry. Imaginative as such studies are, I suggest that a major constraint on such developments is the absence of software tools to analyse the subsets and modifications to the analyte. I will present a case, based on the perspective of a protein chemist and an end-user of such software tools, for a configurable proteomics platform that allows individual experimenters to define the nature of their analyses.

Current needs and frustrations
My research group is motivated by a range of proteomics studies that cover such diverse areas of biology as chemical communication, copper toxicity, cross-species matching and proteome dynamics. Of these studies, relatively few are driven by the need for global protein identification or by comparative proteomics. In all of the studies, I have been a little surprised by the investment of time in data analysis relative to the duration of the biological experiment. Whilst simple database searches (e.g. for peptide mass fingerprinting against known proteomes) are rapid and efficient, other analyses require a high degree of manual intervention and, indeed, time-consuming visual inspection of the data.
As an illustrative example, we are motivated to extend the description of a proteome to include an understanding of proteome dynamics, defined as the rates of synthesis and degradation of any protein within the proteome. Comparative proteomics is set firmly in the arena of changing proteomes, but an understanding of changes in the amount of any protein in the cell requires definition of both the rate of synthesis and the rate of breakdown. This acquires particular significance when attempts are made to correlate proteome and transcriptome data, as such studies implicitly assume that an increase in the level of protein reflects, through an increase in the mRNA level, a corresponding increase in the rate of synthesis of the proteins. However, a protein can also increase in concentration if the rate of degradation is reduced. Parameterization of proteome changes into rate of synthesis and degradation is of fundamental biological importance. Changes in rates of synthesis are very likely to correlate directly with transcriptome changes, whilst changes in degradation may reflect interactions at the substrate level, and may thus connect more immediately with the metabolome.
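One way to make this argument concrete is a deliberately simple kinetic model. The model is an assumption introduced here for illustration, not something specified in the text: synthesis is taken as zero-order with rate $k_{s}$, and degradation as first-order with rate constant $k_{d}$ acting on the protein pool $P$.

```latex
\begin{align}
\frac{dP}{dt} &= k_{s} - k_{d}\,P \\
P_{ss} &= \frac{k_{s}}{k_{d}} \qquad \text{(steady state, } dP/dt = 0\text{)}
\end{align}
```

Under these assumptions, a doubling of $P_{ss}$ follows equally from doubling $k_{s}$ or from halving $k_{d}$, so a measurement of protein abundance alone cannot distinguish a change in synthesis from a change in degradation.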
To analyse proteome dynamics, we use stable isotope-labelled amino acids, which are incorporated into proteins as they are synthesized de novo. Note that we design experiments to avoid complete labelling of the proteins, as this would eliminate any information about the relative rates of synthesis. The proteins are then separated and analysed, most simply by MALDI-TOF mass spectrometry of a tryptic peptide mixture. The relative amounts of 'heavy' and 'light' peptides define the rate of replacement of that particular protein in the system. The 'heavy' and 'light' variants of the peptide are clearly identifiable, and the mass offset incidentally reveals the number of labelled amino acid residues in the peptide. However, to determine the rate of turnover we need to calculate the areas under the 'heavy' and 'light' peptide peaks. Further, we would ideally have a tool that would scan a MALDI-TOF mass spectrum, identify those peptides that exist as heavy-light pairs and calculate the rate of turnover automatically [1]. No such tool exists in any readily accessible form.
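The kind of tool described above might be sketched as follows. This is a minimal illustration under stated assumptions, not an existing package: the input is an already centroided list of singly charged peaks with integrated areas, the names `Peak`, `Pair` and `find_pairs` are hypothetical, and the mass shift per labelled residue is a user-supplied parameter. The sketch reports only the heavy fraction of each pair; converting that into a rate of turnover would need a labelling time course and a kinetic model, which are not attempted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float    # m/z of a singly charged peptide ion
    area: float  # integrated peak area, arbitrary units


@dataclass(frozen=True)
class Pair:
    light: Peak
    heavy: Peak
    n_labels: int  # number of labelled residues implied by the mass offset

    @property
    def heavy_fraction(self) -> float:
        """Heavy area as a fraction of the combined heavy + light area."""
        return self.heavy.area / (self.heavy.area + self.light.area)


def find_pairs(peaks, shift_per_label, max_labels=3, tol=0.05):
    """Return heavy-light pairs whose m/z offset is a whole multiple
    (1..max_labels) of shift_per_label, within +/- tol."""
    ordered = sorted(peaks, key=lambda p: p.mz)
    pairs = []
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            offset = heavy.mz - light.mz
            n = round(offset / shift_per_label)
            if 1 <= n <= max_labels and abs(offset - n * shift_per_label) <= tol:
                pairs.append(Pair(light, heavy, n))
    return pairs


# Toy spectrum; the shift of 10.06 per labelled residue is an
# illustrative placeholder, not a value taken from the text.
spectrum = [
    Peak(1000.50, 300.0), Peak(1010.56, 100.0),  # one labelled residue
    Peak(1500.80, 50.0), Peak(1520.92, 150.0),   # two labelled residues
    Peak(1234.60, 80.0),                         # unpaired peak
]
for pair in find_pairs(spectrum, shift_per_label=10.06):
    print(pair.light.mz, pair.heavy.mz, pair.n_labels, pair.heavy_fraction)
```

A real implementation would also have to integrate peak areas from the raw spectrum and deal with overlapping isotope envelopes, both of which are ignored in this sketch.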
I suggest that the lack of such simple and targeted tools is an obstacle to novel and imaginative approaches in proteomics. It is unlikely that any one group would produce a full range of such tools, and I am persuaded by the opportunities for construction of an open, extensible platform onto which new tools and modules can be bolted as required, and offered as a service to the entire proteomics community. I refer to this concept as a 'workbench' to assist proteomics and, rather than coin another acronym (there are already too many of these in proteomics), will refer to 'workbench' throughout this article.
One of the better analogous systems that might serve as a model for the workbench is the AVS package (http://www.avs.com) designed for visualization of scientific data and presentation of image data. This comprises a core product onto which can be introduced task-specific tools, written by the scientific community. The tools are assembled in a modular format, with defined connectivities to the core and to other tools. Modules are assembled in a graphical environment where modules, represented as building blocks, are linked to define complete analyses. I am intrigued by the possibility that proteomics, or protein mass spectrometry, might also benefit from the availability of such an open environment into which new tools can be slotted without the added overhead of the need to write a complete application. The behaviour of each tool or module is either inflexible, performing a single invariant function, or is modified by a set of parameters that are adjusted by the user through control panels (using such visual devices as sliders, dials and check boxes) or through text commands. The workbench would have a scripting language underlying each module, and it might be possible to dispense with the visual metaphor and cast an analytical process as a script.
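To illustrate what casting an analytical process as a script might look like, the sketch below chains three modules into one analysis. Everything in it is hypothetical: the names `pipeline`, `species_filter`, `sequence_lengths` and `tabulate`, and the use of plain dictionaries as database entries, are assumptions made for the example and do not describe AVS or any existing proteomics package.

```python
from functools import reduce


def pipeline(*modules):
    """Link modules so that the output of each feeds the next."""
    def run(data):
        return reduce(lambda acc, module: module(acc), modules, data)
    return run


def species_filter(species):
    """Parameterized filter module: keep entries from one species."""
    def module(entries):
        return [e for e in entries if e["species"] == species]
    return module


def sequence_lengths(entries):
    """Invariant process module: no parameters, always the same function."""
    return [(e["id"], len(e["sequence"])) for e in entries]


def tabulate(sep="\t"):
    """Parameterized output module: render rows as delimited text."""
    def module(rows):
        return "\n".join(sep.join(str(v) for v in row) for row in rows)
    return module


# Toy entries standing in for records drawn from a sequence database.
entries = [
    {"id": "P1", "species": "Gallus gallus", "sequence": "MKTAYIAK"},
    {"id": "P2", "species": "Mus musculus", "sequence": "MSDNE"},
    {"id": "P3", "species": "Gallus gallus", "sequence": "MAGR"},
]

analysis = pipeline(species_filter("Gallus gallus"), sequence_lengths, tabulate())
print(analysis(entries))
```

The arguments passed to `species_filter` and `tabulate` play the role of the control-panel parameters, while `sequence_lengths` stands for a module with a single invariant function.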

A proteomics workbench
The workbench would not be intended to replace or compete with other developments for management of complete proteomics experiments, including PEDRO (http://pedro.man.ac.uk [2]) and the Human Proteome Organisation (HUPO) proposals (http://psidev.sourceforge.net/ [3,4]). Rather, I see it as a set of tools that at least in part can precede experimental design, encourage an analytical approach to the development of novel strategies and provide customizable modules for analysis of novel, and sometimes unique, proteomics data (Table 1).
The two major sources of data for the workbench are sequence databases and mass spectrometric data. Each covers a range of specific data types. Sequence databases can be protein sequences (SWISSPROT, TREMBL), untranslated gene or cDNA sequences (EMBL-Bank, GenBank) or EST resources (dbEST). In all instances, these datasets have utility in proteomics studies. They differ in the degree of error that they manifest (e.g. single-pass vs. multiple-pass sequencing of DNA) but, with appropriate tools, can generate a search space against which proteomics data can be matched. Typical tools that might fall under the aegis of database manipulation include extraction of a subset of sequences from multiple data sources to create a local, private database or subproteome, or generation of a summary analysis of the members of a proteome or subproteome.

Table 1. Some modules that might form part of a core proteomics workbench

Filters
• Selective recovery of entries from protein or DNA databases according to user-specified criteria to form a subproteome
• Filtering of a proteome according to user-defined criteria, such as presence of specific pairs of amino acids, post-translational modification sites

Processes
• Six-frame translation and recovery of all putative ORFs according to pre-defined criteria
• Scanning of mass spectra for stable isotope-labelled duplexes or multiplexes
• Summary statistics pertaining to a local subproteome
• Chemical modification strategies
• Proteolytic digestion to generate a database of fragments
• Detailed queries of proteome, subproteomes or fragment sets
• Shotgun sequencing assembly of overlapping MS/MS data
• Searching private databases using experimentally derived data (possibly externally computed using GRID-like capabilities)

Outputs
• Plot a distribution of a range of parameters, define a proteome, subproteome
• Presentation of detailed mass spectrometric coverage diagrams
• Tabulate and export database sets in text or XML format

Mass spectrometric data would be more problematical, because several instrument manufacturers use proprietary data formats that are not as readily accessible. There is a need for some intermediate mass spectrometric data format to which all instrument suppliers adhere, at least as an exportable format. This is a topic of active debate and development, and one that can build upon existing programmes in analytical science (http://psidev.sourceforge.net/ms/docs/030611 PSI ASMS.pdf).

Representative modules
The modules that could be built into the workbench are limited only by the imagination of the investigator and the availability of appropriate programming skills. However, careful description of the scope and behaviour of some primitive tools should permit a hierarchical construction of task-specific tools that could be shared. Rather than devise a tool to define proteome-relevant analysis of a single database, a generic tool should be defined to operate on any global or local database. Perhaps the most appropriate way forward in defining the functionalities of the modules is by direct interaction with end-users, who will be most able to define their tasks in terms of natural language specifications.
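As a hedged illustration of hierarchical construction, the sketch below builds one task-specific query from two primitive tools. The function names are hypothetical. The cleavage rule (cut after K or R unless the next residue is P, no missed cleavages) is a common simplification of trypsin specificity, and the 'database' is taken to be any iterable of sequence strings so that the same tool applies to a global or a local source. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) and the mass of water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def digest(sequence, cleave_after="KR", blocked_by="P"):
    """Primitive: cleave C-terminal to the given residues, unless the
    following residue is in blocked_by. No missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if (residue in cleave_after and not is_last
                and sequence[i + 1] not in blocked_by):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Primitive: monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def fraction_with_nterm_peptide_in_range(sequences, low, high):
    """Task-specific tool: fraction of proteins whose N-terminal
    tryptic peptide has a mass between low and high (inclusive)."""
    sequences = [s for s in sequences if s]
    if not sequences:
        return 0.0
    hits = sum(low <= peptide_mass(digest(s)[0]) <= high for s in sequences)
    return hits / len(sequences)


# Toy sequences standing in for a local subproteome.
toy_proteome = ["MKAAAR", "MAGWFKAAR"]
print(digest("MKRPAKGG"))
print(fraction_with_nterm_peptide_in_range(toy_proteome, 400.0, 4000.0))
```

Because the cleavage rule is a parameter, the same primitive can stand in for other proteases; for example, `digest(seq, cleave_after="K", blocked_by="")` approximates endopeptidase LysC under the same simplifying assumptions.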
Many workbench specifications could be initiated with three types of modules: filters, processes and outputs. The filters work on external data sources or on internally created local or private datasets, and offer rule-based simplification of the data sources. A filter might, for example, support SQL statements or allow more flexible user control via an intuitive control panel. Filters are equivalent to searches, and could be applied at an early stage or intermediate stage of any workbench application. In natural language, a filter might 'prepare a local, temporary proteome database of all proteins in TREMBL or SwissProt that are derived from chicken'. An investigator should be able to pose the task 'plot the distribution of masses of endopeptidase LysC peptides from rodent skeletal muscle, irrespective of species, and split according to whether the peptides contain no, one, two or more than two valine residues', or 'what percentage of human liver proteins have a tryptic N-terminal peptide that is between 400 Da and 4000 Da?'. These