Word Sense Disambiguation for Chinese Based on Semantics Calculation

To use semantics more effectively in natural language processing, a word sense disambiguation (WSD) method for Chinese based on semantics calculation is proposed. WSD for a Chinese clause is achieved by solving a semantic model of the natural language; each step of the WSD process is discussed in detail, and the computational complexity of the process is analyzed. Finally, experiments were carried out to verify the effectiveness of the method.


Introduction
Currently, semantics is playing an increasingly important role in natural language processing. Scholars have made great progress in word sense disambiguation (WSD) research by analyzing semantic relations.
Based on the semantic relevancy calculated from HowNet, a WSD method was discussed [1]. A WSD algorithm that disambiguates polysemous words by semantic relatedness in WordNet was proposed [2]. A two-stage WSD method was researched using the semantic information in Wiki [3]. Using the distance between words in a graph-based model, a graph-based WSD method was studied [4]. Chinese sentences could be disambiguated based on HowNet in a question answering system [5]. A WSD algorithm based on the semantic relevancy in HowNet was researched [6]. A pruning algorithm for the semantic relevancy calculation model of natural language was studied [7]. WSD can be achieved by solving a model based on WordNet [8]. According to the semantic tree in WordNet, word senses could be disambiguated [9, 10].
Although this research has made considerable achievements, WSD results are still not accurate enough in practice. To solve the problem more effectively and accurately, a WSD algorithm for Chinese sentences is proposed. With this method, a Chinese clause can be disambiguated by analyzing semantic relevancy. Finally, we verified the effectiveness of the method through experiments.

The Basic Theory
2.1. The Semantic Relevancy Calculation Model. Suppose that each word W_i (except for the predicate words) in a sentence semantically describes another word W_j; the semantic relevancy between W_i and W_j can be represented by the correlation function Rel(W_i, W_j).
Suppose there are m kinds of parsing processes for the sentence; in the i-th parsing process P_i, V denotes the predicate words, S the subject words, and O the object words. The semantic relevancy of the sentence under P_i can be expressed by formula (1), as shown in Figure 1. In formula (1), n is the number of words in P_i (not including S, V, and O), and λ is the weight coefficient; generally, λ should be proportional to the length of the sentence, with λ > 1.
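Since formula (1) itself is not reproduced here, the aggregation it describes can be sketched as a weighted sum: the S-V and V-O core relations are weighted by λ, and the remaining semantically-modify relations count once. The relevancy table below is a toy stand-in for the HowNet-based Rel function, and the pair values are invented for illustration.

```python
# Toy pairwise relevancy table; in the paper these values would come from
# a HowNet-based semantic relevancy calculation.
REL = {("cat", "eats"): 0.9, ("eats", "fish"): 0.8, ("fresh", "fish"): 0.7}

def rel(a, b):
    """Symmetric lookup of the pairwise relevancy Rel(a, b)."""
    return REL.get((a, b), REL.get((b, a), 0.0))

def sentence_relevancy(modify_pairs, svo, lam=2.0):
    """Sketch of the formula-(1) aggregation for one parsing process P_i:
    modify_pairs lists (W_i, W_j) where W_i semantically describes W_j,
    svo is the (S, V, O) core, and lam (> 1) weights the core relations."""
    s, v, o = svo
    core = lam * (rel(s, v) + rel(v, o))
    rest = sum(rel(a, b) for a, b in modify_pairs)
    return core + rest
```

For the toy sentence "the cat eats fresh fish", the parsing process with S = cat, V = eats, O = fish and the modify relation fresh→fish scores 2·(0.9 + 0.8) + 0.7 = 4.1.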
2.2. The Basic Principle of Model Solution. The most reasonable parsing process is the one with the maximum semantic relevancy among all m kinds of parsing processes.

Figure 1: The m kinds of parsing processes for a sentence (labels in the figure: notional word; semantically-modify relation; the 1st, i-th, and m-th kinds of parsing process).
In the calculation process, purely grammatical function words are neglected.

2.3. The Basic Method to Solve the Model. According to the semantic structure, all Chinese sentences can be divided into two kinds: (i) simple sentences, which contain no subordinate clauses; (ii) complex sentences, which contain subordinate clauses.
In the process of solving the model, a simple clause is selected and reduced to a single word, and the resolution is repeated until the whole sentence becomes a simple sentence. During this resolution process, WSD can be carried out.
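The reduction loop above can be sketched with a toy representation in which subordinate clauses are marked by brackets; the bracket notation is an assumption for illustration only, since the paper does not specify how clause boundaries are detected.

```python
import re

def reduce_to_simple(sentence):
    """Repeatedly replace an innermost bracketed subordinate clause with a
    placeholder token until no subordinate clause remains; each extracted
    clause is where WSD would be performed in the paper's method."""
    resolved = []
    innermost = re.compile(r"\[([^\[\]]*)\]")  # bracket pair with no nesting
    while True:
        m = innermost.search(sentence)
        if m is None:
            return sentence, resolved
        resolved.append(m.group(1))  # hand this simple clause to WSD
        sentence = sentence[:m.start()] + "X" + sentence[m.end():]
```

For "A [B [C] D] E" the innermost clause "C" is resolved first, then "B X D", leaving the simple sentence "A X E".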

The Word Sense Disambiguation Process
Most words in a Chinese sentence are polysemous; the WSD process can be carried out in the following steps.

3.1. Get All the "V-Sequences" for a Sentence. If a word W in a sentence is polysemous and one of its senses may be a verb or an adjective, the word W is classified as a "V-Word." A "V-Word" is a word that may serve as the predicate word (V) in the sentence.
Select all the "V-Words" while the other words remain unchanged; then all the possible "V-sequences" for the sentence can be arranged. When a "V-Word" is arranged, no matter how many senses it has, it is treated as one of only two kinds: {V, a common word}. In theory, a sentence with n "V-Words" can be arranged into 2^n − 1 kinds of "V-sequences." As an example, Figure 2 shows all the "V-sequences" for a sentence with 3 "V-Words."
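The 2^n − 1 enumeration can be sketched directly: each "V-Word" is independently kept as a predicate or demoted to a common word, and the assignment with no predicate at all is excluded. The word lists below are illustrative placeholders.

```python
from itertools import product

def v_sequences(words, v_words):
    """Enumerate all 2**n - 1 "V-sequences" for a sentence with n "V-Words".
    Each V-Word is either tagged as a predicate ("V", word) or left as a
    common word; the all-common assignment is skipped because a clause
    needs at least one predicate."""
    v_positions = [i for i, w in enumerate(words) if w in v_words]
    seqs = []
    for choice in product((True, False), repeat=len(v_positions)):
        if not any(choice):
            continue  # at least one predicate is required
        seq = list(words)
        for pos, is_v in zip(v_positions, choice):
            if is_v:
                seq[pos] = ("V", words[pos])
        seqs.append(seq)
    return seqs
```

A sentence with 3 "V-Words" yields exactly 7 sequences, matching the 2^3 − 1 count illustrated in Figure 2.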

3.2. Get All the Simple Sentences for a "V-Sequence". Generally, a simple clause contains only one "V-Word," so it is easy to get all the simple sentences by the exhaustive method. As an example, Figure 3 shows all the simple sentences for a "V-Word" (V_i) in a V-sequence.
In Figure 3, the number of possible simple sentences for V_i is at most the product of the numbers of candidate left and right boundaries.

3.3. Get All the "SVO-Groups" for a Simple Sentence. Find all the words that might be subject words (S) or object words (O) by calculating the semantic relevancy: if the value of Rel(W, V) is greater than a threshold, the word W might be S or O. It is then easy to get all the "SVO-groups" for a simple sentence, as shown in Figure 4.
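The threshold test can be sketched as follows; the threshold value and the toy relevancy function are assumptions for illustration, and here words left of V are taken as subject candidates and words right of V as object candidates.

```python
def svo_candidates(words, v, rel, threshold=0.5):
    """Words whose relevancy with the predicate V exceeds the threshold are
    candidate subjects (left of V) or objects (right of V); pair every
    candidate S with every candidate O to form the "SVO-groups"."""
    vi = words.index(v)
    left = [w for w in words[:vi] if rel(w, v) > threshold]       # candidate S
    right = [w for w in words[vi + 1:] if rel(w, v) > threshold]  # candidate O
    return [(s, v, o) for s in left for o in right]
```

With a toy relevancy that rates only cat-eats and eats-fish highly, "the cat eats fresh fish" yields the single group (cat, eats, fish).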

3.4. Dividing a Simple Sentence into Segments. Generally, a sentence can be divided into several segments, as in Figure 5.
In Figure 5, L is a segment between S, V, and O; the other segments are the prepositive attributive, the postpositive attributive, and the adverbial.

3.5. Turning the Segments into Simple Semantic Units. A segment L between S, V, and O can be turned into several simple semantic units of semantic logic, as in Figure 6.
Any simple semantic unit has the following semantic features: (i) for any word W_i in the unit, the word W_j that W_i semantically describes lies in the same simple semantic unit; (ii) in the semantic analysis, a simple semantic unit can be treated as a whole, and its internal grammatical structure has no effect on the rest of the analysis.
3.6. WSD for Simple Semantic Units. Most words in natural language are polysemous, so a semantic description relation graph (SDRG) for a simple semantic unit can be created for WSD. In the SDRG, all the senses of a polysemous word form a "Generalized Vertex," and each sense is a vertex inside that "Generalized Vertex" set. Figure 7 has the following key features.

(i) Except for the final "Generalized Vertex," every "Generalized Vertex" describes exactly one goal, so the outdegree of any "Generalized Vertex" is 1.
(ii) Within each "Generalized Vertex," all the edges of a spanning tree must connect to the same vertex (only one sense can be selected).
(iii) An SDRG for each parsing method must be a spanning tree of the complete graph over all the "Generalized Vertices." Therefore, the best SDRG, corresponding to the best parsing method for the simple semantic unit, is the maximum spanning tree (MST) of the complete semantic description relation graph of all the "Generalized Vertices." The specific details were discussed in reference [11].
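Since selecting one vertex per "Generalized Vertex" fixes the spanning tree, the MST search can be sketched as a brute-force choice of one sense per word that maximizes the summed relevancy over the describe edges. This is a stand-in for the MST algorithm of reference [11], and the sense names and relevancy values in the test are invented for illustration.

```python
from itertools import product

def best_senses(senses_per_word, edges, rel):
    """Exhaustively pick one sense per word (one vertex per "Generalized
    Vertex") so that the total relevancy over the describe edges is maximal.
    senses_per_word: {word: [sense, ...]}; edges: [(describer, described)]."""
    words = list(senses_per_word)
    best, best_score = None, float("-inf")
    for combo in product(*(senses_per_word[w] for w in words)):
        sense = dict(zip(words, combo))
        score = sum(rel(sense[a], sense[b]) for a, b in edges)
        if score > best_score:
            best, best_score = sense, score
    return best, best_score
```

For a unit where "water" describes "bank", a relevancy function that favors the river sense of "bank" makes the brute-force search select it, which is exactly the disambiguation the MST performs.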

3.7. Get the Best Simple Clause Resolution Sequence. According to formula (1), we calculate the semantic relevancy of each simple clause and sum up the values. There are many different resolution sequences, so each resolution sequence must be searched and evaluated by the exhaustive method during the calculation. The resolution sequence with the best total semantic relevancy is the best parsing in semantics.
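The exhaustive search over resolution orders can be sketched with permutations; the scoring callback is an assumption standing in for the formula-(1) relevancy of a clause, which in the paper may depend on how earlier clauses were resolved.

```python
from itertools import permutations

def best_resolution_sequence(clauses, clause_relevancy):
    """Try every order of resolving the simple clauses and keep the order
    whose summed relevancy is largest (the exhaustive search of Step 7).
    clause_relevancy(clause, position) scores a clause resolved at a given
    position in the sequence."""
    best_seq, best_total = None, float("-inf")
    for seq in permutations(clauses):
        total = sum(clause_relevancy(c, i) for i, c in enumerate(seq))
        if total > best_total:
            best_seq, best_total = seq, total
    return best_seq, best_total
```

With q clauses this inspects q! orders, which matches the factorial bound discussed in the complexity analysis below.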
After Steps 1 to 7, the semantic model is solved and each polysemous word is disambiguated.


The Computational Complexity
The key difficulty is the computational complexity, because exhaustive methods are used in each step. Is the method too complex to compute? Suppose a sentence contains w words and n "V-Words," and each word has s senses on average; the time complexity of each step is analyzed as follows (Figure 9).
Step 5. A segment L might be turned into at most 3 simple semantic units.
Step 7. If there are q simple clauses, the number of simple clause resolution sequences is at most q!. On average, n is less than 5 and s is less than 5, and only the sentence length w can be large, so the time complexity is not too high for calculation in practice.

Experimental Results and Analysis
In the experiments, 200 Chinese sentences were selected, and HowNet was used as the lexical semantics library for calculating the semantic relevancy between two words (Windows XP; CPU: Xeon E5-2403, 2 GHz; memory: 8 GB).
From the experimental results (Table 1), we can see the following.
(i) The correct rates decrease with the length of the clause.
(ii) The computational complexity increases with the length of the clause.
(iii) The time of solving the semantic model is of the same order as s^5 * n^2 (s: the average number of senses per word; n: the number of "V-Words"); this means that the computational complexity is O(s^5 * n^2) in practice.
Using the same 200 Chinese sentences and the method in [1], we made some comparative experiments; the results are shown in Table 2.
In both theory and practice, the correct rates decrease with the length of the clause (Figure 8), but the method in [1] did not treat clauses of different lengths separately.

Summary
In this paper, a word sense disambiguation method for Chinese based on semantics calculation was researched; WSD is achieved by solving the semantic relevancy calculation model, and the relation between accuracy and time complexity was explored by experiments. However, the experimental data was limited and the accuracy was not high enough. These problems will be explored in future research.

Figure 3: All the simple sentences for V_i.

Figure 4: All the "SVO-groups" for a simple sentence.

Figure 7: The SDRG of a semantic unit.

Table 2: The correct rates of the comparative experiments.