The BioLink SIG Workshop at ISMB2004

The Special Interest Group (SIG) on Text Mining (or BioLINK — Biological Literature, Information and Knowledge; http://www.pdg.cnb.uam.es/BioLINK/) was created to address the need for communication and interchange of ideas in the field of text mining and information extraction applied to biology and biomedicine. Information extraction (IE) is an outgrowth of work in automated natural language processing, which began in the 1950s with work on transformational grammar by Zellig Harris [5,6] and later Noam Chomsky [3,4]. Information extraction technology made rapid progress starting in the late 1980s, thanks to a series of conferences focused on the evaluation of IE: the Message Understanding Conferences [1]. There is also a long history of research on applications in medicine, which focus on two distinct sub-problems: improved access to the medical literature and extraction of information from patient records.

Despite these successes in other fields, natural language processing (NLP) techniques were not introduced in biology until the late 1990s. Even today, there are two distinct groups: researchers with a background in computer science on the one hand, and colleagues with a background in the life sciences on the other, with only limited interaction between the two. To improve this situation, the BioLINK group holds regular open meetings that bring together researchers developing text data mining tools and related language processing methods to manage the information explosion in the biomedical field. These meetings include invited and contributed papers, with a focus on developing shared infrastructure (tools, corpora, ontologies) and challenge evaluations, in the style of the KDD Challenge Cups [2]. This year, the BioLINK SIG meeting focused on resources and tools for text mining, with special emphasis on the evaluation of these tools. Speakers from the following areas were invited:


Overview: contributed papers
The contributed papers reflect the importance currently given to biological named entity detection in the literature. Four of the five publications relate to this issue and to the associated issues of resources, infrastructure, and evaluation.

The first BioCreAtIvE Workshop was held in Granada, Spain, 28-31 March 2004. The goal of the workshop was to provide a set of common challenge evaluation tasks to assess the state of the art for text mining applied to biological problems. The assessment focused on two tasks. The first dealt with the extraction of gene or protein names from text and their mapping onto standardized gene identifiers for three model organism databases (fly, mouse, yeast). The second task addressed issues of functional annotation, requiring systems to provide Gene Ontology (GO) annotations for proteins, given full-text articles. Overall, 27 groups participated in the assessment, including 18 for gene/protein name extraction and nine for the GO functional annotation task.
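The first BioCreAtIvE task (finding gene mentions and mapping them to standardized identifiers) can be illustrated with a minimal dictionary-lookup sketch. The synonym lexicon and identifiers below are illustrative stand-ins, not an actual model organism database; real systems must also handle ambiguity and variant spellings.

```python
# Hypothetical sketch of gene-mention normalization: match tokens against
# a synonym lexicon and return standardized identifiers. The entries are
# illustrative examples, not guaranteed-current database records.

SYNONYMS = {
    "white": "FBgn0003996",   # illustrative fly gene identifier
    "w": "FBgn0003996",
    "rosy": "FBgn0003308",
}

def normalize_mentions(text):
    """Return (mention, identifier) pairs for known gene names in text."""
    hits = []
    for token in text.replace(",", " ").split():
        gene_id = SYNONYMS.get(token.lower())
        if gene_id:
            hits.append((token, gene_id))
    return hits

print(normalize_mentions("Mutations in white and rosy affect eye colour"))
# → [('white', 'FBgn0003996'), ('rosy', 'FBgn0003308')]
```

In practice this simple exact-token lookup is only a baseline; BioCreAtIvE participants combined such lexica with classifiers and rules to handle the many synonyms and ambiguous short names in the biological literature.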

Enhancing access to the bibliome: the TREC genomics track - William R. Hersh
The Text Retrieval Conference (TREC) is an annual activity of the information retrieval (IR) research community sponsored by the National Institute of Standards and Technology (NIST). TREC aims to provide a forum for evaluation of IR systems and users. Activity is organized into 'tracks' of common interest, such as question-answering, multi-lingual IR, web searching, interactive retrieval and, as started in 2003, IR in the genomics domain. The genomics track is sustained by a National Science Foundation Information Technology Research grant that provides funding through 2008. Background on the motivation and evolution of the track can be found on the track website (http://medir.ohsu.edu/∼genomics/). The website also contains an overview paper from the 2003 track as well as the protocol for the 2004 track.
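TREC-style evaluations typically score ranked retrieval output with metrics such as average precision: the mean of the precision values at each rank where a relevant document appears. The sketch below uses a made-up ranking and relevance set purely for illustration.

```python
# A minimal sketch of average precision, one standard metric in
# TREC-style retrieval evaluation. Document IDs and relevance
# judgements here are invented toy data.

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values at the ranks of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d2"}
print(average_precision(ranking, relevant))  # (1/1 + 2/4) / 2 = 0.75
```

Averaging this score over all topics in a track gives mean average precision (MAP), a common summary figure in TREC result tables.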

BioMinT: a database curator's assistant for biomedical text processing - Anne-Lise Veuthey
The goal of the BioMinT project is to develop a generic text mining tool that assists manual database annotation by: (a) interpreting diverse types of query; (b) retrieving relevant documents from the biological literature; (c) extracting the required information; and (d) providing the result as a database slot filler or as a structured report.

L Hirschman, C. Blaschke and A. Valencia
The development of the BioMinT system has followed a strictly problem-oriented approach. All decisions relative to prototype design have been based on requirements from those who will use the final product in their daily work, i.e. the curators of Swiss-Prot (the knowledgebase component of the UniProt resource) and PRINTS (the protein family fingerprint database), as well as biological researchers.
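The four-stage pipeline described above (query, retrieval, extraction, slot filling) can be sketched in miniature. Everything here is hypothetical: the documents, the extraction pattern, and the slot name are illustrative inventions, not the actual BioMinT design.

```python
# Toy sketch of a curator's-assistant pipeline: retrieve documents for a
# query, extract a fact with a crude pattern, and return it as a
# database-style slot filler. All data and patterns are hypothetical.

import re

DOCUMENTS = [
    "P12345: This protein is localized to the mitochondrion.",
    "Q99999: The enzyme catalyses ATP hydrolysis in the cytoplasm.",
]

def retrieve(query):
    """(b) Retrieve documents mentioning the query term."""
    return [d for d in DOCUMENTS if query.lower() in d.lower()]

def extract_localization(doc):
    """(c) Extract a subcellular-location word with a crude pattern."""
    m = re.search(r"localized to the (\w+)|in the (\w+)", doc)
    return next(g for g in m.groups() if g) if m else None

def fill_slot(query):
    """(d) Package the extracted facts as a structured slot filler."""
    return {"query": query,
            "subcellular_location": [extract_localization(d)
                                     for d in retrieve(query)]}

print(fill_slot("protein"))
# → {'query': 'protein', 'subcellular_location': ['mitochondrion']}
```

A production assistant would of course replace the keyword match with proper IR ranking and the regular expression with trained extraction components, but the slot-filler output format is what makes the result directly usable by database curators.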

CASP: critical assessment of techniques for protein structure prediction - Anna Tramontano
The CASP community-wide experiment critically assesses the state of the art in the prediction of protein structure from sequence; it has been conducted on a two-year cycle since 1994. The primary goals are to establish the capabilities and limitations of current methods of modelling protein structure from sequence, to determine where progress is being made and where the field is held back by specific bottlenecks, and to compare the results of automatic prediction servers with manually submitted predictions. Methods are assessed on the basis of the analysis of tens of thousands of blind predictions of protein structure submitted by a large number of prediction teams from around the world. CASP provides a forum for a thorough examination of the outcome of the predictions: what went right, what went wrong and, where possible, why. For members of the structural biology community not directly involved in structure prediction, the results provide a reasonable guide to the current state of the art. For the prediction community, the results provide a new and sharper sense of direction. Finally, we can begin to measure progress in the field over time.

EVA: automatic system for the evaluation of structure prediction servers - Burkhard Rost
EVA (http://www.rostlab.org/eva/) is a web server for evaluating the accuracy of automated protein structure prediction methods. The evaluation is updated automatically each week, to cope with the large number of existing prediction servers and the constant changes in the prediction methods. EVA currently assesses servers for secondary structure prediction, contact prediction, comparative protein structure modelling, and threading/fold recognition. Every day, sequences of newly available protein structures in the Protein Data Bank are sent to the servers and their predictions are collected.
The predictions are then compared to the experimental structures once a week, and the results are published on the EVA web pages. Over time, EVA has accumulated prediction results for a large number of proteins (from hundreds to thousands, depending on the prediction method). This large sample ensures that methods are compared reliably. As a result, EVA provides useful information to developers as well as users of prediction methods.
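The weekly comparison of predicted and experimental structures can be illustrated with the common Q3 score for secondary structure: the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the observed one. The two sequences below are invented toy data, not real EVA output.

```python
# Sketch of the Q3 three-state accuracy measure used to score secondary
# structure predictions against experimentally observed structures.

def q3(predicted, observed):
    """Percentage of residues with matching H/E/C state."""
    assert len(predicted) == len(observed)
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

pred = "HHHHCCEEEC"   # toy predicted states for a 10-residue protein
obs  = "HHHCCCEEEE"   # toy observed states from the solved structure
print(q3(pred, obs))  # 8 of 10 residues agree → 80.0
```

Averaged over the hundreds to thousands of proteins EVA accumulates per method, per-protein scores like this one are what allow the server rankings to be statistically meaningful.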