To address the single-perspective solutions of existing sentence similarity algorithms, an algorithm for sentence semantic similarity based on syntactic structure is proposed. First, the sentence constituents are analyzed; then, on the basis of the syntactic structure, sentence similarity is converted into word similarity; word similarity is in turn converted into concept similarity through word sense disambiguation; and, finally, the semantic similarity comparison is carried out. More detailed comparison rules are also given for the modifier words in a sentence, which also contribute to its meaning. Under the same test conditions, the experiments show that the proposed algorithm agrees better with people's intuitive understanding and has higher accuracy.
Information retrieval has become an effective way for people to access resources, and the effectiveness of retrieval is the index that people are most concerned about. Earlier retrieval results were confined to the literal meaning of the request sentence that users input. With the development of semantic web technology and natural language processing technology, people began to pay more attention to the real intention behind the sentence that users input, that is, seeing the essence through the phenomenon and returning the most satisfactory search results to the users. The key to this process is the calculation of similarity. Sentence similarity computation has become an important research topic in the field of Chinese information processing and has a wide range of applications in information retrieval, text classification, question answering systems, machine translation, and so forth.
Sentence similarity computation generally consists of three levels: syntactic similarity, semantic similarity, and pragmatic similarity, in which the pragmatic similarity is the highest goal and it is quite difficult to realize at present. Adding semantic similarity computation on the premise of syntactic similarity can greatly improve the retrieval effect and meet the needs of people.
In this paper, with the help of the function of syntactic analysis and semantic role labeling of LTP platform (Language Technology Platform) of Harbin Institute of Technology [
Sentence similarity computation can be finally attributed to words similarity computation. This paper calculates the words similarity with the help of “HowNet” platform [
In “HowNet,” all words are described by one or several “concepts,” and each concept is described by a group of “sememes.” The sememe is used to describe the smallest meaningful unit of a “concept” and each sememe represents a different role. Each concept in HowNet is described with a record as shown in Box
NO.=037649
W_C=Da
G_C=verb
S_C=
E_C=
W_E=work out
G_E=verb
S_E=
E_E=
DEF={compile
RMK=
Listed in Box
For the two Chinese words
The calculation of concept similarity can be finally attributed to the calculation of sememe similarity. The calculation of sememe similarity mainly has three kinds of methods: one is proposed by Zhang and Li [
In the above formula, both
The words included in “HowNet” are divided into two categories: notional words and function words. In actual text, the similarity of notional words and function words is always zero; the similarity computation of notional concept is more complex because it is described in a semantic representation and readers can refer to the article of “HowNet lexical semantic similarity computation” by Liu and Li [
Thus, HowNet reduces the similarity problem of two words to the similarity problem of two concepts. The relationship between words and concepts matters here: if it is one-to-one, the HowNet concept can be taken directly as the meaning of the word (i.e., the word is equal to the concept); if it is one-to-many or many-to-many, the similarity of every pair of concepts of the two words has to be calculated one by one, and the maximum is taken.
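The "take the maximum" rule for the one-to-many and many-to-many cases can be sketched as follows. This is only an illustration: `word_similarity`, `toy_concept_similarity`, and the concept labels and scores in `TOY_SIM` are hypothetical stand-ins, not HowNet's actual concept similarity.

```python
from itertools import product
from typing import Callable, Sequence


def word_similarity(
    concepts_a: Sequence[str],
    concepts_b: Sequence[str],
    concept_similarity: Callable[[str, str], float],
) -> float:
    """Compare every concept of one word with every concept of the other
    and keep the best-matching pair (the one-to-many / many-to-many case)."""
    if not concepts_a or not concepts_b:
        return 0.0
    return max(concept_similarity(a, b) for a, b in product(concepts_a, concepts_b))


# Hypothetical concept similarities, made up for the example only.
TOY_SIM = {("buy", "purchase"): 0.9, ("hit", "purchase"): 0.1}


def toy_concept_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


# A word with two concepts against a word with one concept.
best = word_similarity(["hit", "buy"], ["purchase"], toy_concept_similarity)
print(best)
```

The number of concept comparisons grows with the product of the two concept counts, which is the cost that fixing the concept by disambiguation avoids.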
Because the second kind of relationship makes the algorithm more complex and the amount of calculation large, this paper improves on it.
HowNet considers only two isolated words when calculating word similarity. Once a word is placed in a sentence, however, its corresponding concept is actually determined, and the task becomes matching each word obtained by segmentation with its specific concept in HowNet through word sense disambiguation. The second kind of relationship then no longer arises when calculating word similarity in HowNet, and the algorithm becomes relatively simple. This paper performs multidimensional word sense disambiguation based on HowNet on the segmentation results of the LTP platform of Harbin Institute of Technology, maps each word to its corresponding concept, and finally calculates the similarity of the concepts. The algorithm of word disambiguation is as follows.
Segment the target sentence and tag its parts of speech (POS) with the help of LTP, and take all the notional words for calculation.
Take out one word for POS disambiguation. Determine the corresponding concept of the word in HowNet according to the POS tag from the word segmentation of LTP. If only one sense corresponds to the POS, then the concept of the word can be determined from the POS tag. If more than one sense corresponds to the POS, then go to Step
Example matching disambiguation: take out all the senses of the word with the same POS, put the collocations (including their parts of speech) before and after the word to be disambiguated into the calculation, and find the corresponding concept in HowNet. The specific steps are as follows: if the collocation before and after the word to be disambiguated coincides exactly with an example of one sense in HowNet, then its concept can be determined directly; if no example matches, then do a matching calculation according to the POS and the sense of the collocations before and after the word to be disambiguated, and identify the sense with the highest match degree as the concept of the word. For example, if the collocation to be disambiguated is “da cu (i.e., buy vinegar),” take “cu (i.e., vinegar)” and all the collocations of the verb senses of “da” into the matching calculation; it can then be confirmed that the “da” in “da cu” and the “da” in “da jiang you (i.e., buy soy)” have the same concept.
Repeat Step
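The disambiguation steps above can be sketched for a single word as follows. The `Sense` record, the toy lexicon for “da,” and `toy_match_degree` are hypothetical illustrations; the real sense inventory and match-degree calculation come from HowNet and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Sense:
    concept: str                        # hypothetical concept label
    pos: str                            # POS tag of this sense
    examples: frozenset = frozenset()   # collocates listed as usage examples


def disambiguate(
    word: str,
    pos: str,
    neighbours: Sequence[str],
    lexicon: Dict[str, List[Sense]],
    match_degree: Callable[[Sense, Sequence[str]], float],
) -> Optional[Sense]:
    # POS disambiguation: keep only the senses that agree with the tagger.
    candidates = [s for s in lexicon.get(word, []) if s.pos == pos]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Example matching: an exact hit on a listed collocate decides directly.
    for sense in candidates:
        if any(n in sense.examples for n in neighbours):
            return sense
    # Otherwise pick the sense with the highest match degree.
    return max(candidates, key=lambda s: match_degree(s, neighbours))


# Toy data, made up for the example only.
LEXICON = {
    "da": [
        Sense("buy", "v", frozenset({"jiang you"})),
        Sense("beat", "v", frozenset({"ren"})),
        Sense("dozen", "q"),
    ]
}
TOY_COLLOCATE_SIM = {("cu", "jiang you"): 0.8}


def toy_match_degree(sense: Sense, neighbours: Sequence[str]) -> float:
    scores = [TOY_COLLOCATE_SIM.get((n, e), 0.0)
              for n in neighbours for e in sense.examples]
    return max(scores, default=0.0)


chosen = disambiguate("da", "v", ["cu"], LEXICON, toy_match_degree)
print(chosen.concept)
```

In this toy run “cu” matches no listed example exactly, so the fallback match degree links “da cu” to the sense exemplified by “da jiang you.”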
Although HowNet provides eight kinds of relationships between sememes in its descriptions, it uses only hyponymy between sememes when calculating word similarity. So the range of word similarity value was specified in
check the converse set and the antonym set in HowNet, if the two sememes of their concepts have an antonymous or converse relationship (the word in italics), for example, tall: attribute → measurement → stature → short: attribute → measurement → stature →
check the converse set and the antonym set in HowNet, if the two hypernym sememes of their concepts have an antonymous or converse relationship (the word in italics), for example, agree: events → static → state → statemental → attitude → disagree: events → static → state → statemental → attitude →
the two sememes of their concepts appear in the sentiment words set of HowNet, such as
Thus the similarity of the two words is
Sentence similarity refers to the matching extent in semantics of two sentences which is a real number between the value of
If the two sentences are the same in semantics, its value is 1; if the two sentences in semantics are completely different, its value is 0. Considering the addition of the word relationship of antonymy and oppositeness, this paper expands the value of the sentence similarity to the real interval
At present, the algorithms of the calculation of sentence similarity mainly have three kinds: the method based on keyword information, the method based on semantic information, and the method based on syntactic structure information. Each kind of method with its own features solves the problem of sentence similarity from a different point of view. The specific algorithms at present are the following: string matching method based on keywords [
The proposal and practice of these methods have done much to promote research on Chinese sentence similarity, but each has its own disadvantages: the method based on keyword information considers only the surface information of the words, without deep understanding; the semantic dictionary, despite making up for the defects of the keyword method, is limited by the corpus itself and by unknown words; dependency analysis takes the dependency structure of the sentence into account, but the algorithm is complex and costly.
In this paper, a calculation method for sentence semantic similarity based on syntactic structure is put forward that integrates syntactic structure information, keyword semantic information, and other factors, in order to improve the accuracy of sentence similarity computation.
This paper holds that overall similarity is built on partial similarity. Breaking a complex whole into parts and obtaining the overall similarity from the similarity of the parts helps to solve the problem of sentence similarity computation, and the key to doing so is syntactic analysis.
A sentence is a language unit of complete meaning that consists of words or phrases according to a certain grammatical structure. The components of a sentence are called the sentence elements; thus words and phrases constitute the sentence elements. Since sentence elements are the units of constructing a sentence, according to the level and the structure characteristics of the sentence, the sentence elements are divided into six kinds: subject, predicate, object, attributive, adverbial, and complement. Syntactic analysis is to analyze the elements of a sentence and its structure relationship. The method of sentence elements analysis is a common method of syntax analysis, also known as the “central component analysis” [
The method of sentence elements analysis prescribes the correspondence between sentence elements and words, pays attention to the correspondence between sentence elements and parts of speech, and can directly reflect the logical relationship of the meanings that the sentence elements express.
The method of sentence elements analysis can quickly identify the trunk and branches of a sentence with a more complex structure and helps in understanding and mastering the sentence. Because the structure of the sentence becomes obvious after sentence elements analysis, it helps us to compute sentence similarity based on partial matching.
This paper uses the LTP platform of Harbin Institute of Technology as the sentence analysis tool to determine the center words and the sentence elements. LTP is a Chinese natural language processing service platform based on cloud computing technology. LTP has developed an XML-based representation of natural language processing results and on this basis provides a rich set of bottom-up, efficient, high-precision Chinese natural language processing modules, covering five core Chinese processing technologies including lexical, syntactic, and semantic analysis [
Syntactic analysis example of LTP.
In Figure
As introduced above, sentence elements analysis divides the sentence into three parts: main component, secondary component, and additional component. To make partial component matching easier, this paper instead divides the sentence, on the basis of the LTP analysis results, into three parts: subject component, predicate component, and object component. The word that LTP tags “HED” determines the predicate component; the subject component lies on its left and the object component on its right. From linguistic knowledge, any sentence is composed of key components (subject, predicate, object, etc.) and modifier components (attributive, adverbial, complement, etc.), and the effect of the key components on the sentence is significantly greater than that of the modifier components [
Sentence 1: I like her red and pink face.
Sentence 2: I like his naughty and lovely face.
If only the key components of the sentence are calculated, the component division (including part-of-speech tagging) can be obtained from the LTP results. The component division of sentence 1 is I/r like/v face/n ./wp. The component division of sentence 2 is I/r like/v face/n ./wp.
The analysis process is as shown in Figure
Analysis process of sentence 1 and sentence 2.
In the above sentences, “r” means pronoun, “v” means verb, “n” means noun, “a” means adjective, and “u” means auxiliary.
Thus, the similarity of the two sentences is “1”; that is, sentence 1 and sentence 2 would be judged exactly the same, although the two sentences are in fact different. Therefore, this paper proposes to take the modifier components properly into account in order to improve the accuracy of sentence similarity computation; some modifiers, however, cannot be calculated, such as the adverb in the following. Sentence 3: He/r jumps/v really/d high/a !/wp. Its component division is He/r jumps high/v !/wp.
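The way sentences 1 and 2 collapse onto the same key components can be illustrated with a small sketch. The dependency labels below (SBV, HED, VOB, ATT, LAD, COO) follow LTP's relation names, but the parses themselves are hand-written assumptions for the English glosses, not actual LTP output, and `split_components` and `key_words` are hypothetical helpers.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    relation: str  # dependency label, assigned by hand for this example


# Assumption: only the head, its subject, and its object count as key words.
KEY_RELATIONS = {"SBV", "HED", "VOB"}


def split_components(tokens: List[Token]) -> Dict[str, List[Token]]:
    """Split a sentence around the word tagged HED: subject component on
    the left, predicate at the head, object component on the right."""
    head = next(i for i, t in enumerate(tokens) if t.relation == "HED")
    return {
        "subject": tokens[:head],
        "predicate": [tokens[head]],
        "object": tokens[head + 1:],
    }


def key_words(tokens: List[Token]) -> List[str]:
    """Drop the modifiers and keep only the key components."""
    return [f"{t.word}/{t.pos}" for t in tokens if t.relation in KEY_RELATIONS]


SENTENCE_1 = [
    Token("I", "r", "SBV"), Token("like", "v", "HED"), Token("her", "r", "ATT"),
    Token("red", "a", "ATT"), Token("and", "c", "LAD"), Token("pink", "a", "COO"),
    Token("face", "n", "VOB"),
]
SENTENCE_2 = [
    Token("I", "r", "SBV"), Token("like", "v", "HED"), Token("his", "r", "ATT"),
    Token("naughty", "a", "ATT"), Token("and", "c", "LAD"), Token("lovely", "a", "COO"),
    Token("face", "n", "VOB"),
]

print(key_words(SENTENCE_1))                           # ['I/r', 'like/v', 'face/n']
print(key_words(SENTENCE_1) == key_words(SENTENCE_2))  # True
```

Both sentences reduce to the same three key words, which is why a comparison restricted to key components cannot tell them apart.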
An analysis, based on LTP, of the annotation results of the partially mature corpus (2003 Edition) of “People’s Daily” (January, 2000) showed that the subject and object parts of a sentence are mainly nouns or pronouns and the predicate part is mainly a verb or an adjective. The words in the attributive, adverbial, and complement parts cover more parts of speech, but those that contribute to semantic understanding are again nouns (specifically time nouns and space nouns in adverbials), pronouns, verbs, and adjectives (as Tables
Part of speech and tagging and abbreviation.
Part of speech | Tagging and abbreviation | Part of speech | Tagging and abbreviation |
---|---|---|---|
Noun | n | Pronoun | r |
Time noun | t | Adjective | a |
Space noun | s | Number | m |
Verb | v | Adverb | d |
Main part of speech of key words in the sentence components.
Key component | The part of speech of key words | Modifier component | The part of speech of key words |
---|---|---|---|
Subject | n, r | Attributive | n, a, v, r |
Predicate | v, a | Adverbial | a, v, s, t, d |
Object | n, r | Complement | v, a |
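The table above can be read as a lookup from sentence component to the POS tags whose words are compared. A minimal sketch, assuming words arrive as (word, tag) pairs; `KEY_POS` mirrors the table, while `contributing_words` and the sample input are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# POS tags of the key words in each sentence component (from the table).
KEY_POS: Dict[str, Set[str]] = {
    "subject": {"n", "r"},
    "predicate": {"v", "a"},
    "object": {"n", "r"},
    "attributive": {"n", "a", "v", "r"},
    "adverbial": {"a", "v", "s", "t", "d"},
    "complement": {"v", "a"},
}


def contributing_words(component: str, tagged: List[Tuple[str, str]]) -> List[str]:
    """Keep only the words whose POS tag is listed for this component."""
    allowed = KEY_POS[component]
    return [word for word, pos in tagged if pos in allowed]


# Hypothetical tagging of the attributive part of "her red and pink face".
attributive = [("her", "r"), ("red", "a"), ("and", "c"), ("pink", "a")]
print(contributing_words("attributive", attributive))  # ['her', 'red', 'pink']
```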
Considering various aspects, this paper puts forward the following calculation formula of sentence semantic similarity:
The value of The parameter The parameter The parameter The parameter Since a sentence is divided into three parts after syntactic analysis, so the value of
In addition, when the corresponding components of the sentence are compared, the modifiers may differ in structure and the words may differ in number (see Table
have the same relationship and more than one word: the center word of the component and the other words receive weights of 0.6 and 0.4, respectively, such as ADV (adverbial structure) in C5 and C6;
have the same relationship and more than one noncentral word: the POS of the noncentral words receives weight according to the proportion of noun, verb, adjective, and pronoun. If more than one word has the same POS, calculate their similarities separately and take the maximum;
have part of the same relationship: compare only the same relationship, such as ATT (attributive structure) in one component and ATT (attributive structure) and VOB (verb-object structure) in another component, and finally coordinate with the granularity coefficient, which is
have different relationships: calculate the similarity only for the parts with the same part of speech and finally coordinate with the granularity coefficient, which is
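The 0.6/0.4 split between the center word and the other words of a component can be sketched as below. Only the two weights come from the text; averaging the similarities of the other words, and falling back to the center word alone when there are no other words, are assumptions made for this illustration, and `component_similarity` is a hypothetical name.

```python
from typing import Sequence


def component_similarity(
    center_sim: float,
    other_sims: Sequence[float],
    center_weight: float = 0.6,
) -> float:
    """Weight the center-word similarity against the other words' similarity.

    Assumptions: the other words are averaged, and a component with only a
    center word is scored by that word alone.
    """
    if not other_sims:
        return center_sim
    others = sum(other_sims) / len(other_sims)
    return center_weight * center_sim + (1.0 - center_weight) * others


# Center words identical, the single modifier only half similar.
print(round(component_similarity(1.0, [0.5]), 4))
```

The granularity coefficient mentioned for the partially matching cases is not included here because its value is not given in the text.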
Part of experimental data of sentence similarity computation.
Test sentences | Method | Method | Method |
---|---|---|---|
C1: The capital of Beijing is beautiful. | 0.5714 | 0.5000 | 0.1276 |
C3: He knitted a long scarf. | 0.5000 | 0.6625 | 0.4274 |
C5: He went to work today. | 0.5714 | 0.5500 | −0.7720 |
C7: I enjoy comfortable life. | 0.0000 | 0.0000 | 0.0770 |
C9: I love to eat apples. | 0.4444 | 0.3600 | 0.1928 |
C11: How is she recently? | 1.0000 | 0.7750 | 0.1000 |
C13: He pushed his younger brother over. | 0.7500 | 0.5400 | 1.0000 |
Besides, if the two sentences contain the word “Ba” and the word “Bei,” respectively, and the similarity between the subject (object) of one sentence and the object (subject) of the other is greater than a threshold (0.5), then the similarity of the two sentences is 1.
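A minimal sketch of the “Ba”/“Bei” rule follows. It assumes that both crossed comparisons (subject against object and object against subject) must exceed the threshold, which is one reading of the rule; `ba_bei_similarity` and the tokenized example sentences are hypothetical.

```python
from typing import Sequence


def ba_bei_similarity(
    words_a: Sequence[str],
    words_b: Sequence[str],
    subj_a_obj_b: float,
    obj_a_subj_b: float,
    base_similarity: float,
    threshold: float = 0.5,
) -> float:
    """Return 1.0 when one sentence uses "Ba", the other uses "Bei", and the
    crossed subject/object similarities both exceed the threshold; otherwise
    fall back to the similarity computed by the general method."""
    crossed = ("Ba" in words_a and "Bei" in words_b) or (
        "Bei" in words_a and "Ba" in words_b
    )
    if crossed and subj_a_obj_b > threshold and obj_a_subj_b > threshold:
        return 1.0
    return base_similarity


# Hypothetical segmentations of an active "Ba" sentence and a passive "Bei" one.
active = ["he", "Ba", "brother", "push", "over"]
passive = ["brother", "Bei", "he", "push", "over"]
print(ba_bei_similarity(active, passive, 1.0, 1.0, base_similarity=0.54))  # 1.0
```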
Here is the first set of examples in the experiments, which shows the calculation process of sentence similarity:
syntactic analysis (as shown in Figure
component analysis (as shown in Figure
word disambiguation:
Beijing/n DEF = {place: PlaceSect = {capital}, belong = “China”, modifier = {ProperName
beautiful/a DEF = {beautiful}
capital/n DEF = {place: PlaceSect = {capital}, belong = {place: PlaceSect = {country}, domain = {politics
is/v DEF = {be}
Beijing/n DEF = {place: PlaceSect = {capital}, belong = “China”, modifier = {ProperName
sentence similarity computation:
Syntactic analysis result of LTP.
Component analysis result of LTP.
This paper selects three kinds of methods of sentence similarity to make comparisons.
Methods based on the keyword, such as the sentence similarity computing formula proposed by Peking University Institute of Computational Linguistics [
In methods based on the combination of word form and word order, the sentence similarity depends on the surface similarity and the word order similarity; see the work of He and Wang [
The similarity calculation method based on syntactic structure in this paper, please refer to formula (
Some experimental data of sentence similarity computation are shown in Table
It is seen from the results of the experiment in Table
Based on an analysis of existing sentence similarity algorithms, a sentence semantic similarity algorithm based on syntactic structure was proposed; adding semantic similarity computation on the premise of syntactic similarity can greatly improve the effectiveness of retrieval. First, the sentence components are analyzed with the help of the LTP platform of Harbin Institute of Technology; then semantic comparisons are made based on the syntactic structure of the sentence. Since the meaning of a sentence depends ultimately on the meaning of its words, in this paper word similarity is calculated after each word has been mapped to its concept through word disambiguation with the help of HowNet, thereby realizing semantic comprehension. Because the modifier words in the sentence components also make a certain contribution, more detailed rules are given to further distinguish the relations between sentences. The experimental results show that, under the same test conditions, the results of the proposed algorithm accord better with the actual situation and the calculation accuracy is clearly improved. The limitations of the LTP platform of Harbin Institute of Technology and of HowNet, such as the imperfect definition of word concepts in HowNet and the deviation between the POS tagging of the two platforms, affected the accuracy of sentence similarity computation to a certain extent. In addition, we will carry out deeper research on complex sentence structures to solve the problem of special sentence patterns such as “be” sentences.
The authors declare that there is no conflict of interests regarding the publication of this paper.
Thanks are due to National Natural Science Foundation of China (60973051) and Science and Technology Department of Henan Province Key Scientific and Technological Project (112102210375).