A Learning-Based Approach for Biomedical Word Sense Disambiguation

In the biomedical domain, word sense ambiguity is a widespread problem, yet the bioinformatics research effort devoted to it is not commensurate with its extent and leaves room for further development. This paper presents and evaluates a learning-based approach for sense disambiguation within the biomedical domain. The main limitation of supervised methods is the need for a corpus of manually disambiguated instances of the ambiguous words. However, advances in automatic text annotation and tagging techniques, together with the wealth of knowledge sources in the biomedical domain such as ontologies and text literature, will help lessen this limitation. The proposed method uses the interaction model (mutual information) between the context words and the senses of the target word to induce reliable learning models for sense disambiguation. The method has been evaluated on the benchmark dataset NLM-WSD under various settings and on biomedical entity species disambiguation. The evaluation results show that the approach is very competitive and outperforms recently reported results of other published techniques.


Introduction
Word sense disambiguation is the task of determining the correct sense of a given word in a given context. In the general language domain, and within natural language processing (NLP), the word sense disambiguation (WSD) problem has been studied extensively over the past few decades [1,2]. In the biomedical domain, on the other hand, word sense ambiguity is more widespread in biological and medical texts, sometimes with more severe consequences, and the amount of WSD research is not proportional to the extent of the problem. As an example, in biomedical texts, the term "blood pressure" has three possible senses according to the Unified Medical Language System (UMLS) [3]: organism function, diagnostic procedure, and laboratory or test result. Thus, if the term blood pressure is found in a medical text, the reader has to judge manually which of these three senses is intended in that text. Word sense disambiguation contributes to many important applications, including text mining, information extraction, and information retrieval systems [1,2,4]. It is also considered a key component in most intelligent knowledge discovery and text mining applications.
The main classes of approaches to word sense disambiguation are supervised methods and unsupervised methods. The supervised methods rely on training and learning phases that require a dataset or corpus containing manually disambiguated instances to train the system [5,6]. The unsupervised methods, on the other hand, are based on knowledge sources such as ontologies (for example, from UMLS) or text corpora [2,4,7,8]. In this paper, we present and evaluate a supervised method for biomedical word sense disambiguation. The method is based on machine learning and uses feature selection techniques in constructing feature vectors for the words to be disambiguated. We conducted the evaluation using the NLM-WSD benchmark corpus and a species disambiguation dataset. The evaluation results demonstrate the competitiveness of the proposed approach, as it outperforms some recently published techniques, including supervised techniques.

Related Work
In the biomedical domain, the applications of text mining and machine learning techniques have been quite successful and encouraging [6]. Most of the methods for biomedical entity name recognition, classification, or disambiguation can be roughly divided into three categories: (i) supervised and machine-learning-based techniques, (ii) statistical and corpus-based techniques, and (iii) syntactic and rule-based techniques [9][10][11]. Moreover, the bioinformatics literature shows that biomedical WSD has been quite an active area of research, with a number of approaches proposed and applied to biomedical data [1,2,4,8,12,13].
Agirre et al. proposed a graph-based WSD technique that is considered unsupervised but relies on UMLS [2]. The concepts of UMLS are represented as a graph, and WSD is performed using the Personalized PageRank algorithm [2].
In related research, Jimeno-Yepes and Aronson [4] presented a review and evaluation of four WSD approaches that rely on UMLS as the source of knowledge for disambiguation. In [1], Stevenson et al. use supervised learners with linguistic features extracted from the context of the word, in combination with MeSH terms, for disambiguation.
Humphrey et al. used the UMLS as a knowledge source for assigning the correct sense to a given word [13]. They used journal descriptor indexing of the abstract containing the term to assign a semantic type from the UMLS Metathesaurus [3,13].
In bioinformatics and computational biology, there are quite a few tasks similar to WSD, such as biomedical term disambiguation, gene and protein name disambiguation, and species disambiguation for biomedical named entities [9][10][11]. The task of biomedical named entity disambiguation or classification is an extension of the well-known task of biomedical named entity recognition (NER). In NER, biomedical entity names, for example, gene names, are recognized and extracted from the text. In biomedical named entity disambiguation, each occurrence of an extracted entity name (e.g., a gene product name) is disambiguated as either a gene name or a protein name, since the same name can refer to a gene or a protein. For example, the biomedical entity name SBP2 can be a gene name or a protein name depending on the context [10,11]. Furthermore, in species disambiguation, the term c-myc is a gene, but it can be either a human gene (homo sapiens) or a mouse gene (mus musculus) depending on the context [9][10][11][14][15][16].
In [9], Wang et al. devised a rule-based system to disambiguate biomedical entity names, such as gene products, based on species. In that approach [9], parsing techniques are used to build syntactic parse trees, and the paths between words are examined to determine whether a path exists between a species word and the entity name. They employed and examined several parsers in the task, including C&C, Enju, Minipar, and Stanford-Genia [9,15,16].

A Method for WSD
A word sense disambiguation method is an algorithm that assigns the most accurate sense to a given word in a given context. Our method is a supervised method requiring a training corpus that contains manually disambiguated instances of the ambiguous words. The method is based on a word classification and disambiguation technique that we proposed in a preliminary work [17]. In that work [17], we introduced a method for term disambiguation and evaluated it on biomedical terms to disambiguate gene and protein names in medical texts.
The method represents each instance of the word to be disambiguated, $w_x$, as a feature vector whose components are neighborhood context words in the training instances. From the context of the target word $w_x$, we select the words with high discriminating capability as the components of the vectors. As a supervised technique, this method consists of two stages: a learning (or training) stage and a testing (or application) stage. During the learning phase, the constructed feature vectors of the training instances are used as labeled examples to train classifiers. The trained models (classifiers) produced by the learning phase are then used to disambiguate unseen and unlabeled examples in the testing phase. One of the main strengths of this method is that the features are selected specifically for learning and classification.
Feature Selection. The features selected from the training examples have a great impact on the effectiveness of the machine learning technique. Extensive research effort has been devoted to feature selection in machine learning research [18][19][20][21]. The labeled training instances are used to extract the word features for the feature vectors.
Suppose the word $w_x$ has two senses $s_1$ and $s_2$. Let $C_1$ be the set of $w_x$ instances labeled with $s_1$, and let $C_2$ contain the instances of $w_x$ labeled with sense $s_2$. Each instance of $w_x$ labeled with sense $s_1$ or $s_2$ (i.e., in the set $C_1$ or in the set $C_2$) can be viewed as

\[ p_n \, \ldots \, p_2 \, p_1 \;\; w_x \;\; f_1 \, f_2 \, \ldots \, f_n \]

where the words $p_1, p_2, \ldots, p_n$ and $f_1, f_2, \ldots, f_n$ are the context words surrounding this instance, and $n$ is the window size. Next, we collect all the context words $p_i$ and $f_i$ of all instances in $C_1$ and $C_2$ in one set $W$ (s.t. $W = \{w_1, w_2, \ldots, w_m\}$). Each context word $w_i \in W$ may occur in the contexts of instances labeled with $s_1$, with $s_2$, or with both, in any distribution. We want to determine, when we see a context word $w_q$ in an ambiguous instance, to what extent this occurrence of $w_q$ suggests that the instance belongs to $C_1$ or to $C_2$. Thus, we use as features those context words $w_i$ that can highly discriminate between $C_1$ and $C_2$. For that, we use feature selection techniques such as mutual information (MI) [19,20] as follows. For each context word $w_i \in W$ in the labeled training examples, we compute four values $a$, $b$, $c$, and $d$. The mutual information (MI) of $w_i$ is then defined from these values in (2), where $N$ is the total number of training examples.

MI is a well-known concept in information theory and statistical learning; it is a measure of interaction and common information between two variables [22]. In this work, we adapted MI to represent the interaction between the context words $w_i$ and the class label based on the values $a$ through $d$ as defined above. We used the training corpus of labeled instances of the word to be disambiguated to compile the list of all context words ($W = \{w_1, w_2, \ldots, w_m\}$) as explained above; all instances of one sense are under one class label. If the context word $w_i$ occurs mostly in class $C_1$ (or mostly in $C_2$), then the MI indicates this, as shown in (2).
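The collection of context words and the per-word counts can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names are hypothetical. Based on the worked example in this section, `a` and `b` are read as the number of $C_1$ and $C_2$ instances whose context contains the word; reading `c` and `d` as the number of $C_1$ and $C_2$ instances that do not contain it is an assumption that is consistent with the example values.

```python
from collections import Counter


def context_window(tokens, target_index, n):
    """Return up to n tokens before and n tokens after the target position."""
    preceding = tokens[max(0, target_index - n):target_index]
    following = tokens[target_index + 1:target_index + 1 + n]
    return preceding + following


def contingency_counts(c1_contexts, c2_contexts):
    """Map each context word to (a, b, c, d).

    Assumed meaning (not stated explicitly in the text):
      a = C1 instances containing the word, b = C2 instances containing it,
      c = C1 instances without it,          d = C2 instances without it.
    """
    in_c1 = Counter(w for ctx in c1_contexts for w in set(ctx))
    in_c2 = Counter(w for ctx in c2_contexts for w in set(ctx))
    vocabulary = set(in_c1) | set(in_c2)
    return {
        w: (in_c1[w], in_c2[w],
            len(c1_contexts) - in_c1[w], len(c2_contexts) - in_c2[w])
        for w in vocabulary
    }


# Toy corpus: 5 contexts per class, arranged so that the counts
# match the values quoted for w_p and w_q in the worked example.
c1 = [["wp", "wq"], ["wp", "wq"], ["wp", "wq"], ["wp"], ["x"]]
c2 = [["wp", "wq"], ["wq"], ["y"], ["y"], ["y"]]
counts = contingency_counts(c1, c2)
print(counts["wp"], counts["wq"])  # (4, 1, 1, 4) (3, 2, 2, 3)
```

Counting each word at most once per instance is a design choice of this sketch; counting raw occurrences would give different values whenever a word repeats inside one window.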
Thus, MI can be used as a means to estimate the amount of information interaction between a context word and a class label, and so MI is used to select the context words with the highest discriminating capability between $C_1$ and $C_2$. For simplicity, and without loss of generality, we assume that we have two senses (two class labels). Moreover, following the same intuitive reasoning as mutual information, we define another method, M2, for selecting the words to be included as features in the feature vectors (3).

In the following example, assume that the target word $w_x$ has 10 instances already labeled with one of two senses, as shown in Table 1. Class $C_1$ contains the instances of $w_x$ with the first sense, while $C_2$ contains the instances of $w_x$ with the second sense. Each instance is shown with its context words within a certain window size. The target word $w_x$ is shown in boldface. In this example, $N = 10$ is the total number of training examples. The values of $(a, b, c, d)$ for $w_p$ are $(4, 1, 1, 4)$; that is, $w_p$ has 4 occurrences in $C_1$ and one occurrence in $C_2$, and so on. The values of $(a, b, c, d)$ for $w_q$ are $(3, 2, 2, 3)$. As we can see, $w_p$ is more strongly related to the class $C_1$ than $w_q$ is, and so it has more discriminating power than $w_q$; this is quantified by their MI values, which are 1.8 and 1.2 for $w_p$ and $w_q$, respectively.

The MI (or M2) value is then computed for all context words $w_i \in W$. The context words are ordered by their MI values, and the top $k$ words with the highest MI values are selected as features. In this research, we experimented with $k$ values of 100, 200, and 300. With $k = 100$, for example, each training example is represented by a vector of 100 entries such that the first entry represents the context word with the highest MI value, the second entry represents the context word with the second highest MI value, and so on.
Then, for a given training example, the feature vector entry is set to +MI if the corresponding feature (context word) occurs in that training example and to −MI otherwise. Table 2 shows the 10 context words with the highest MI values.

Table 1: An example of a training corpus of 10 instances of an ambiguous word $w_x$, where 5 instances are in the first sense, listed under class label $C_1$, and 5 instances are in the second sense, listed under class $C_2$. The context word $w_p$ has 4 and 1 occurrences in classes $C_1$ and $C_2$, respectively, while $w_q$ has 3 and 2 occurrences in $C_1$ and $C_2$, respectively.
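The ranking and top-$k$ selection step can be sketched as below. The scoring function is an assumption: it uses the ratio $N a / ((a+b)(a+c))$, which is the argument of the mutual-information estimate commonly used for feature selection in text categorization, and it should not be read as the exact definition in (2). With this assumed form, $w_q$ scores 1.2 as in the example, but $w_p$ scores 1.6 rather than the reported 1.8, so the definition used in this work may differ. The names `association_score` and `select_top_k` are hypothetical.

```python
def association_score(a, b, c, d):
    """Assumed score N*a / ((a+b)*(a+c)); stands in for the MI of equation (2)."""
    n = a + b + c + d  # total number of training examples
    denominator = (a + b) * (a + c)
    return n * a / denominator if denominator else 0.0


def select_top_k(counts, k):
    """Rank context words by score (ties broken alphabetically) and keep the top k."""
    scored = {word: association_score(*abcd) for word, abcd in counts.items()}
    ranked = sorted(scored, key=lambda word: (-scored[word], word))
    return [(word, scored[word]) for word in ranked[:k]]


example_counts = {"wp": (4, 1, 1, 4), "wq": (3, 2, 2, 3)}
for word, score in select_top_k(example_counts, k=2):
    print(word, round(score, 2))
```

Note that this assumed form measures association with $C_1$ only; a word concentrated in $C_2$ receives a low score, so a symmetric variant would be needed to capture both directions.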
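The ±MI encoding of a training or test instance can be sketched as follows. The function name and the toy feature list are illustrative assumptions; the two MI values are the ones quoted for $w_p$ and $w_q$ in the example.

```python
def encode_instance(context_words, selected_features):
    """Build the feature vector for one instance.

    selected_features is a list of (word, mi_value) pairs ordered by
    decreasing MI. Each entry is +MI if the word occurs in the context
    of the instance and -MI otherwise.
    """
    present = set(context_words)
    return [mi if word in present else -mi for word, mi in selected_features]


features = [("wp", 1.8), ("wq", 1.2)]
print(encode_instance(["wp", "other"], features))  # [1.8, -1.2]
```

Because the weights are fixed by the feature ranking, every instance of the same target word is mapped into the same $k$-dimensional space, which is what allows a single classifier to be trained per word.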
The Learning Phase. From the labeled training examples of the word, we build the feature vectors using the top context words selected by MI or M2 as features. After that, we use the support vector machine (SVM) [23] as the learner to train the classifier on the training vectors. SVM has been shown to be one of the most successful and efficient machine learning algorithms and is well founded theoretically and experimentally [7,17,18,23]. Applications of SVM abound; in particular, in NLP tasks such as text categorization, relation extraction, and named entity recognition, SVM has proved to be among the best performers. We use the SVMlight (http://svmlight.joachims.org/) implementation with the default parameters and with the Radial Basis Function (RBF) kernel.
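As an illustration of how such vectors could be handed to SVMlight, the sketch below serializes one labeled vector into the SVMlight input format (`<target> <index>:<value> ...`, with feature indices starting at 1 and in increasing order). Mapping the first sense to +1 and the second to −1, and the helper name `to_svmlight_line`, are assumptions of this sketch.

```python
def to_svmlight_line(label, vector):
    """Serialize one example as an SVMlight input line.

    label must be +1 or -1 (assumed mapping: first sense -> +1,
    second sense -> -1). Feature indices start at 1 and increase.
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    features = " ".join(
        f"{index}:{value:.6g}" for index, value in enumerate(vector, start=1)
    )
    return f"{label:+d} {features}"


print(to_svmlight_line(1, [1.8, -1.2]))   # +1 1:1.8 2:-1.2
print(to_svmlight_line(-1, [-1.8, 1.2]))  # -1 1:-1.8 2:1.2

# With the lines written to a file (file names here are hypothetical),
# an RBF-kernel model would be trained and applied from the shell with:
#   svm_learn -t 2 train.dat model
#   svm_classify test.dat model predictions
```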

The Scientific World Journal

The Disambiguation Step. In the testing step, we want to disambiguate an instance $w_q$ of the word $w$. We construct a feature vector $V_q$ for the instance $w_q$ in the same way as in the learning step. The classifier induced in the learning step is then employed to assign $w_q$ to one of the two senses.

Biomedical WSD (NLM-WSD)
Dataset. We used the benchmark dataset NLM-WSD for biomedical word sense disambiguation [24]. This dataset was created as a unified benchmark set of ambiguous medical terms that have been reviewed and disambiguated by reviewers from the field. Most of the previous work on biomedical WSD uses this dataset [1,2,4]. The NLM-WSD corpus contains 50 ambiguous terms with 100 instances for each term, for a total of 5000 examples. Each example is a Medline abstract containing one or more occurrences of the ambiguous word. The instances of these ambiguous terms were disambiguated by 11 annotators who assigned a sense to each instance [24]. The assigned senses are semantic types from UMLS. When the annotators did not assign any sense to an instance, that instance is tagged with "none". Only one term, "association", had all of its 100 instances annotated "none", and so it was dropped from the testing.
Text Preprocessing. On this benchmark corpus, we have carried out some text preprocessing steps.
(i) Converting all words to lowercase.
Moreover, unlike previous work, words with fewer than 3 or more than 50 characters are currently not ignored (unless dropped by the stopword removal step). Words with parentheses or square brackets are also not ignored, and part of speech is not used.
After the text preprocessing is completed, for each word we convert the instances into numeric feature vectors. Then, we use SVM for training and testing with 5-fold cross validation (5FCV), such that 80% of the instances are used for training and the remaining 20% are used for testing; this is repeated five times by changing the training and testing portions of the data. The accuracy is taken as the mean accuracy of the five folds, where the accuracy is computed as

\[ \text{Accuracy} = \frac{\text{no. of instances with correctly assigned senses}}{\text{total no. of tested instances}}. \]
We also use the baseline method which is the most frequent sense (mfs) for each word.
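The evaluation protocol and the mfs baseline can be sketched as follows. How the five folds were formed is not stated, so the interleaved split used here is an assumption, as is computing the baseline inside the same cross-validation loop; all function names are hypothetical.

```python
from collections import Counter


def five_fold_indices(n_instances, n_folds=5):
    """Yield (train, test) index lists; fold f holds out every instance i with i % n_folds == f."""
    indices = list(range(n_instances))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        train = [i for i in indices if i % n_folds != fold]
        yield train, test


def accuracy(gold, predicted):
    """Fraction of tested instances whose assigned sense matches the gold sense."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def mfs_baseline_accuracy(labels, n_folds=5):
    """Mean accuracy over folds when always predicting the most frequent training sense."""
    fold_scores = []
    for train, test in five_fold_indices(len(labels), n_folds):
        most_frequent = Counter(labels[i] for i in train).most_common(1)[0][0]
        gold = [labels[i] for i in test]
        fold_scores.append(accuracy(gold, [most_frequent] * len(gold)))
    return sum(fold_scores) / len(fold_scores)


# Toy word with 100 instances: 80 in the first sense, 20 in the second.
labels = ["s1"] * 80 + ["s2"] * 20
print(round(mfs_baseline_accuracy(labels), 3))  # 0.8
```

With 100 instances per word, each fold trains on 80 instances and tests on 20, matching the 80%/20% split described above.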
Experiments. Initially, we evaluated our WSD method with all 49 words (excluding association, as mentioned previously) such that a word is included in the evaluation only if it has at least two senses, each with at least two instances annotated with it. This led to a total of 31 words tested in this evaluation; 18 words were dropped because they do not have at least two instances annotated for each of two senses. For example, the word "depression" has two senses: mental or behavioral dysfunction and functional concept. Out of the 100 instances of depression, 85 instances are tagged with the first sense and the remaining 15 instances are tagged with "None" (i.e., no instances are tagged with the second sense), and so it was excluded from this evaluation. Likewise, the word "discharge" was not tested, as it has only one instance tagged with the first sense, 74 instances tagged with the second sense, and 25 instances tagged with "None". We used k = 200 and a window size of 5.
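The inclusion rule for words can be expressed as a short check. This is a sketch under the reading that instances tagged "None" do not count as a sense; the function name and the sense labels `s1` and `s2` are placeholders.

```python
from collections import Counter


def is_testable(sense_labels, min_senses=2, min_instances=2):
    """True if at least min_senses senses each have at least min_instances instances.

    Instances tagged "None" (in any letter case) are ignored.
    """
    counts = Counter(label for label in sense_labels if label.lower() != "none")
    qualifying = [sense for sense, n in counts.items() if n >= min_instances]
    return len(qualifying) >= min_senses


depression = ["s1"] * 85 + ["None"] * 15
discharge = ["s1"] * 1 + ["s2"] * 74 + ["None"] * 25
print(is_testable(depression), is_testable(discharge))  # False False
```

Both examples from the text are rejected: "depression" has only one annotated sense, and "discharge" has a single instance of its first sense.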
The accuracy results of this first evaluation (EV1) are shown in Table 4. The detailed results of this evaluation are included in Table 5.
In the second evaluation (EV2) and third evaluation (EV3), we changed the parameters and the word/feature selection formula. In EV2, we set k = 300 and kept the window size at 5. In EV3, we kept k = 300 and window = 5 and changed the word/feature selection formula to M2, defined in (3). Table 5 contains the results of EV2 and EV3. To judge the performance of our method and compare our results with similar techniques, we included several reported results from three recent publications from 2008 to 2010 [1,2,4] alongside our results in Table 6 under the same experimental settings.

Species Disambiguation.
In biomedical text, named entities, such as gene names, are used in the same way irrespective of the species of the entity. As a result, it is difficult to extract relevant medical information automatically from texts using information extraction systems. In biomedical named entity species disambiguation, for a given entity name, for example, c-myc, we want to disambiguate this entity name based on the species (e.g., human versus mouse) [9]. In one instance, c-myc might refer to a human gene, while in another instance it refers to a mouse gene.
For example, in Table 3, the biomedical entity name BCL-2 (a protein name) refers to a human protein in the first text (no. 1), while in the second one it refers to a mouse protein. We examined our system on this task of species disambiguation. We obtained the data from the project of Wang et al. [9]. From their data, we tested the biomedical entity names that occur in at least two species with at least 3 occurrences in each species. This enables us to use two instances for training and one for testing and to repeat this three times. If the entity has 5 or more occurrences in one species, we repeat five times using 5FCV as in Section 4.1. We extracted and tested our system on a total of 465 instances of entity names, with an average of 8 instances per species for each entity name. In the original dataset (gold standard), 90% of the terms have all their instances occurring in only one species [9] and so cannot be tested in our system, which requires that each term have instances in two or more species with at least 3 occurrences in each species. The results of Wang et al. are shown in Table 7, whereas the results of our proposed system are shown in Table 8 in terms of precision, recall, and F1.

Table 3, text no. 2: The BCL-2 family has various pairs of antagonist and agonist proteins that regulate apoptosis. Whether their function is interdependent is uncertain. Using a genetic approach to address this question, we utilized gain- and loss-of-function models of Bcl-2 and Bax and found that apoptosis and thymic hypoplasia characteristic of Bcl-2-deficient mice are largely absent in mice also deficient in Bax.
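The eligibility rule for entity names in the species task, and the choice between three and five train/test repetitions, can be sketched as below. The text says five repetitions are used when an entity has 5 or more occurrences "in one species"; this sketch takes that literally (any qualifying species), which is an assumption, and a stricter reading would require 5 occurrences in every species. The function name is hypothetical.

```python
from collections import Counter


def species_repetitions(species_labels, min_species=2, min_occurrences=3):
    """Return the number of train/test repetitions for one entity name.

    Returns 0 if the entity does not occur in at least min_species species
    with at least min_occurrences occurrences each (not testable).
    Otherwise returns 5 if some qualifying species has 5 or more
    occurrences (literal reading of the text), else 3.
    """
    counts = Counter(species_labels)
    qualifying = {s: n for s, n in counts.items() if n >= min_occurrences}
    if len(qualifying) < min_species:
        return 0
    return 5 if max(qualifying.values()) >= 5 else 3


print(species_repetitions(["human"] * 3 + ["mouse"] * 3))  # 3
print(species_repetitions(["human"] * 6 + ["mouse"] * 3))  # 5
print(species_repetitions(["human"] * 10))                 # 0
```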

Discussion and Conclusion
The main weakness of supervised and machine-learning-based methods for WSD is their dependency on annotated training text that includes manually disambiguated instances of the ambiguous word [2,17]. However, over time, the rapidly increasing volume of text and literature, together with new algorithms and techniques for text annotation and concept mapping, will alleviate this problem. Moreover, advances in ontology development and integration in the biomedical domain will further facilitate the process of automatic text annotation.
In this paper, we reported a machine learning approach for biomedical WSD. The approach was evaluated on a benchmark dataset, NLM-WSD, to facilitate comparison with the results of previous work. The average accuracy results of our method, compared to some recently reported results (Table 6), are promising and indicate that our method outperforms those recently reported methods. Table 6 contains the results for 11 methods: the baseline method (mfs), our method (last column), and 9 other methods from recent work published from 2008 to 2010 (from [1,2,4]). The average accuracy of our method is the highest (90.3%), and the closest one is NB (86.0%).
Our method also outperforms all 10 other methods on 12 of the 31 words, followed by NB, which outperforms the rest on 7 words.
Stevenson et al. [1] report extensive accuracy results of their method (which we call Stevenson-2008) along with four other methods, including Joshi-2005 and McInnes-2007, with various combinations of words from [1]. In Table 6, the results of the three methods

Our evaluation is done on 31 words (as explained in Section 3). We obtained the results of the other methods on these 31 words from the references shown in Table 6 to allow for direct comparison. The best result reported by Stevenson et al. is 87.8%, using all words with the VSM model; for McInnes, the best result is 85.3%, also on the whole set [1]. The best result of Stevenson-2008 for subsets was 85.1%, using a subset of 22 words defined by Stevenson et al. [1].
The results of the three methods (single, subset, full) in Table 6 are taken directly from Agirre et al. [2]. As shown in Table 6, the average accuracy of these three methods on the 31 words (68.8%, 59.7%, and 63.5%) is substantially lower than that of our method (90.3%), as is the average accuracy of their method on the whole set (65.9%, 63.0%, and 65.9%); we note that their method is unsupervised and does not require tagged instances [2]. In another work, Jimeno-Yepes and Aronson evaluate four unsupervised methods on the whole NLM-WSD set [4], as well as NB and combinations of the four methods. The accuracy of these methods on the whole set ranges from 58.3% to 88.3% (NB), and NB was found to be the best performer, followed by CombSW (76.3%) [4]. The average accuracy results of NB and the two combinations (NB, CombSW, and CombV) on our 31-word subset are 86%, 73.1%, and 72.1%, respectively, which are lower than our results; see Table 6.
When we applied our system to the species disambiguation task, the results were also encouraging, as shown in Table 8. The evaluation results of our method compare very well with those reported in [9], as shown in Table 7. From their results (Table 7), we notice that the best overall performance was obtained with the ML (machine learning) method, with precision, recall, and F1 values all equal to 82.69. Our results in Table 8 are not directly comparable with those in Table 7 because of the difference in the size of the test set. However, we can see that our method performs reasonably well in terms of precision, recall, and F1. The main strength of this method is in using MI values as weights encoded in the feature vectors. These weights enable the learner to induce quite reliable models for sense disambiguation. As the components of the vectors, +MI and −MI, represent the common information between context words and class labels, the induced learners are finely calibrated towards the disambiguation task.
All the results show that the technique is fairly successful and effective in the disambiguation task. Further research is therefore warranted to improve the performance of this technique. In future work, we plan to investigate the possibility of disambiguating entity names when all instances of an entity occur in one species. Currently, our method is supervised and requires annotated instances in both classes to be able to test new samples.