We propose a new approach for determining the appropriate sense of Arabic words. To this end, we propose an algorithm based on information retrieval measures to identify the context of use that is closest to the sentence containing the word to be disambiguated. The contexts of use represent sets of sentences that indicate a particular sense of the ambiguous word. These contexts are generated using the words that define the senses of the ambiguous words, the exact string-matching algorithm, and the corpus. We use measures employed in the domain of information retrieval, Harman, Croft, and Okapi, combined with the Lesk algorithm, to assign the correct sense among those proposed.
Human language is ambiguous; many words can have more than one sense, and the intended sense depends on the context of use. Word sense disambiguation (WSD) allows us to find the most appropriate sense of an ambiguous word. This work is a contribution to a general framework which aims at understanding the Arabic language.
We propose some steps [
This paper is structured as follows. We describe in
Most of the work related to word sense disambiguation has been applied to English, achieving a disambiguation rate of around 90%. Many approaches exist; they are classified according to the source of knowledge used to differentiate the senses.
These methods were introduced in the 1970s and are based on dictionaries, thesauri, and lexicons. Using these resources, they extract the information necessary to disambiguate words. Some of them [
Since the advent of statistical methods based on large text corpora, two principal orientations have appeared. Unsupervised methods: these methods are trained on a non-annotated corpus. They are divided into type-based discrimination [ Supervised and semi-supervised methods: these use an annotated training corpus to induce the appropriate classification models [
As mentioned before, the majority of the work related to WSD has been applied to English. However, there are some works on Arabic. We can cite the unsupervised approach of Bootstrapping Arabic Sense Tagging [
Principle of the proposed method.
Subsequently, we eliminate stop words from the original sentence, using the list of stop words defined in our database (see Section
The second step of the proposed method is to measure the similarity between the different contexts of use generated from the glosses and the current context. The context that obtains the highest score of similarity with the current context will represent the most probable sense of the ambiguous word. The Algorithm
CU: Context of use generated;
AW: ambiguous word;
(1) For each …
(2) Assign weight …
(3) For each …
(4) Lemmatizing (…)
(5) For each …
(6) Approximate String-Matching (…)
(7) For each …
(8) Load all the sentences that contain these occurrences to generate the context of use CU;
    Context-Matching (…)
(9) For each …
(10) For each CU containing AW {
…
(14) If the results given by each measure are different then …
     Else …
To maximize the probability of finding the context for each gloss, we propose generating the occurrences of the most significant words. To extract the most significant words, we eliminate the non-informative words (stop words) using a predefined list (this list contains 20000 words). Given that Arabic has flexional morphology, we use an algorithm to extract word roots and then a matching algorithm to find the occurrences of each root. An occurrence of a root is obtained by adding a prefix to the beginning of the word or a suffix at the end.
To extract the stems of Arabic words, we use the Al-Shalabi-Kanaan
The weights assigned to letters were determined through experiments on Arabic texts; for example, we assign the highest weight, 5, to the letters “ة ,ا” (“a, t”) because Arabic words frequently begin and end with these letters. The rank of a letter in a word depends on the length of the word and on whether the word contains an even or odd number of letters. The three letters with the lowest weight-rank products are selected.
In Table
Execution of the stemming algorithm to extract the root of the word “الحساب” “alhissab”.
Word | الحساب | | | | | |
---|---|---|---|---|---|---
Letters | ب | ا | س | ح | ل | ا
Weights | 0 | 5 | 1 | 0 | 1 | 5
Rank | 1.5 | 2.5 | 3.5 | 4 | 5 | 6
Multiplication | 0 | 15 | 3.5 | 0 | 5 | 30
Root | حسب | | | | | |
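The weight-and-rank scheme above can be sketched as follows. The weight table and rank sequence in this sketch are reconstructed from the worked example for “الحساب” and are assumptions, not the authors' full tables:

```python
# Sketch of the letter-weight stemming idea: multiply each letter's weight
# by its rank and keep the n letters with the smallest products.
# Weights and ranks below are reconstructed from the worked example
# for "الحساب" (an assumption, not the authors' complete tables).
def extract_root(word, weights, ranks, n=3):
    products = [(weights.get(ch, 1) * r, i)
                for i, (ch, r) in enumerate(zip(word, ranks))]
    # keep the n smallest products, then restore the original letter order
    keep = sorted(sorted(products)[:n], key=lambda t: t[1])
    return "".join(word[i] for _, i in keep)

word = "الحساب"                      # "alhissab"
weights = {"ا": 5, "ل": 1, "ح": 0, "س": 1, "ب": 0}
ranks = [6, 5, 4, 3.5, 2.5, 1.5]    # rank decreases toward the last letter
print(extract_root(word, weights, ranks))  # حسب
```

Running this on the example recovers the root حسب, as in the table above.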
Unlike English, Arabic has a rich derivational system, and this is one of the characteristics that make it ambiguous. Arabic words are based on roots, generally triliteral. We use the algorithm of approximate string matching [
It is based on two steps [
Begin
  (i) (i.a) Construct the matrix …
      Filling the matrix:
      (i.b) For …
              For …
  (ii) For …
       End
End
After that we use the step of back-tracking (see Algorithm
(iii) (iii.a) Select … such that …
      (iii.b) while (…)
                if … else
                  if … else …
                  end if
                end if
              end do;
(iv) end
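The two phases the listings outline, filling a distance matrix and then backtracking through it, correspond to the standard dynamic-programming formulation of edit distance. The following Python sketch is our own illustration of that formulation, not the paper's exact algorithm:

```python
def edit_distance_matrix(a, b):
    """Phase (i): fill the (len(a)+1) x (len(b)+1) distance matrix."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                    # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                    # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d

def backtrack(a, b, d):
    """Phase (iii): walk from d[m][n] back to d[0][0], recovering one alignment."""
    ops, i, j = [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append("match" if a[i - 1] == b[j - 1] else "substitute")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    return ops[::-1]

d = edit_distance_matrix("حسب", "الحساب")  # a root vs. one of its occurrences
print(d[-1][-1])  # 3: the occurrence adds three letters to the root
```

Matching a root against its occurrences then amounts to accepting candidates whose distance stays below a threshold.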
We use the corpus described in the experimental results to extract the sentences containing the words of the glosses and their occurrences. These texts represent the contexts of use. This algorithm is time-consuming; to speed it up, we generated a table in our knowledge base in which the occurrences of each root are recorded. So far this table contains a list of 7,349 roots with an average of seven occurrences per root.
We propose some measures that determine the degree of similarity between a sentence (containing an ambiguous word) and a document (that represents the context of use for a given sense of the ambiguous word). Let
To determine the appropriate sense of
Consider
Consider
Consider
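Of the three measures, Okapi is the most widely documented. The sketch below uses a standard Okapi BM25-style formula to score a context of use against the sentence's terms; it illustrates the general shape only, and the paper's exact Harman, Croft, and Okapi variants may differ:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a context of use) against the query (the sentence)."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    score = 0.0
    for t in set(query_terms):
        df = sum(1 for doc in corpus if t in doc)        # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # BM25 idf
        tf = doc_terms.count(t)                          # term frequency
        norm = 1 - b + b * len(doc_terms) / avgdl        # length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

# Toy contexts of use for two senses of an ambiguous word (illustrative only).
contexts = [["organ", "vision", "animal", "eye"],
            ["water", "spring", "ground", "flow"]]
sentence = ["water", "ground"]
scores = [bm25_score(sentence, c, contexts) for c in contexts]
print(scores.index(max(scores)))  # 1: the "spring of water" context wins
```

The context of use with the highest score designates the most probable sense.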
The Lesk algorithm, introduced in 1986, was derived and used in several studies by Pedersen and Bruce [
We adapted the simplified Lesk algorithm [
Begin
  Score …
  Sens …
  For all I …
    Sup …
    For all sup …
      if sup > score then …
End.
The choice of the description and context varies for each word tested by this algorithm.
The function context (
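A minimal sketch of the simplified Lesk scoring outlined above: count the overlap between each sense's description and the context, and keep the best-scoring sense. This is our own generic version, without the stemming and stop-word steps described earlier:

```python
def simplified_lesk(senses, context_words):
    """senses maps a sense label to its gloss words; returns the best sense."""
    ctx = set(context_words)
    best_sense, best_score = None, -1
    for sense, gloss in senses.items():
        score = len(ctx & set(gloss))  # words shared by gloss and context
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Illustrative glosses echoing the two senses of "عين" ("ayn") given below.
senses = {"eye": ["organ", "vision", "human", "animal"],
          "spring": ["water", "flow", "ground"]}
print(simplified_lesk(senses, ["water", "runs", "over", "the", "ground"]))
```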
The application of this algorithm allowed us to obtain a disambiguation rate of up to 76%.
To check the validity of the algorithm presented in the previous section, tests were conducted using some free tools. Works on English were evaluated using Senseval-1 or Senseval-2; in our work, however, we had to build our experimental data from a totally different set of resources. To measure the disambiguation rate, we use the most common evaluation technique, which selects a small sample of words and compares the results of the system with a human judge. We use the metric of precision
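Precision and recall reduce to simple ratios over counts. A minimal sketch follows; the counts 39/50/60 are hypothetical, chosen only to reproduce rates of the same order as those reported in the results (78% precision, 65% recall):

```python
def precision_recall(correct, attempted, total):
    """precision = correct / attempted; recall = correct / total test words."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    return precision, recall

# Hypothetical counts: 39 correct answers, 50 attempted, 60 test words in all.
p, r = precision_recall(correct=39, attempted=50, total=60)
print(p, r)  # 0.78 0.65
```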
We use the dictionary “Al-Mu’jam Al-Wasit”, a reference for Arabic lexicography. We constructed a database that contains the words of an electronic version of this dictionary and their glosses. Table
Description of the used dictionary.
Number of letters | Number of pages | Average number of glosses per word |
---|---|---|
29 | 1407 | 12 glosses/word
We give in what follows a sample of glosses for the word “عين” “ayn” given by the dictionary Al-Wasit.
First gloss
عضو الإبصار للإنسان وغيره من الحيوان
Translation
The organ of vision in humans and other animals.
Second gloss
يَنْبُوعُ الماء ينبُعُ من الأرض ويجري
Translation
A spring of water that gushes from the ground and flows.
In this work we choose to work on fine-grained senses. This choice makes our work more difficult and complex because it increases the number of senses to be considered.
We chose to work on texts dealing with multiple domains (sport, politics, religion, science, etc.). These texts are extracted from newspaper articles, which were recorded in the corpus of Al-Sulaiti and Atwell [
Characteristics of the collected corpus.
Measure | Value |
---|---|
Total size of the corpus |
Number of ambiguous words | 50 words |
Average number of synonyms of each ambiguous word | 4 |
Average number of the possible senses | 12 |
Average size of each context of use | 970 words, 130 sentences |
Average size of the text | 500 words |
These documents have the advantage of possessing an explicit structure that facilitates their presentation and their exploitation in different contexts to find relevant words more efficiently.
We have compiled a list of stop words, which have no influence on the meaning of the sentence. This list contains 20000 stop words. To build it, we collected from the Web pronouns, nouns, proper names, particles, verbal nouns, and some words considered insignificant by humans.
Fifty words were chosen. For each of these ambiguous words, we evaluate 20 examples per sense. This number may be judged insufficient due to the problems encountered during the experimentation, cited in what follows: the large number of glosses given by the dictionary for an ambiguous word; the problem of sentence segmentation, due to the ambiguity of the Arabic language [ and finding test samples that are not judged too different for the disambiguation process.
We measure the performance of our system using the metrics presented above, with and without the respective use of the stemming algorithm and the string-matching algorithm (see Table
Results obtained by the different measures before and after preprocessing.
Method | Without rooting | Without string-matching | Final rate | MFS |
---|---|---|---|---|
| 0.52 | 0.61 | 0.78 | 0.86
| 0.39 | 0.52 | 0.65 | 0.74
| 0.44 | 0.56 | 0.71 | 0.84
Table
To determine the appropriate size of the collected context of use for each sense, we evaluate the results given by our system while varying the size of that context of use (50, 100, and 150 words).
Figure
The
In the work of Yarowsky [
Results obtained for different window sizes.
The best similarity score is obtained using a window size of three words. The Croft measure was the best among those proposed.
Figure
Comparison of the similarity measures.
This paper has presented an unsupervised method to perform word sense disambiguation in Arabic. The algorithm is based on segmentation, elimination of stop words, stemming, and application of the approximate string-matching algorithm to the words of the glosses. We measure the similarity between the contexts of use corresponding to the glosses of the word to be disambiguated and the original sentence. The algorithm assigns a score to the most relevant sense of the ambiguous word. For a sample of fifty ambiguous Arabic words, chosen for their high number of senses out of context (the most ambiguous words), the proposed algorithm achieved a precision of 78% and a recall of 65%.
In future work, we propose to improve the correspondence between words and their glosses and to build a rule-based system to disambiguate Arabic words.