A Fuzzy Computing Model for Identifying Polarity of Chinese Sentiment Words

With the rapid growth of online user-generated content on the web, sentiment analysis has become a very active research topic in data mining and natural language processing. As the most important indicators of sentiment, sentiment words, which convey positive or negative polarity, are instrumental for sentiment analysis. However, most existing methods for identifying the polarity of sentiment words treat positive and negative polarity as a crisp (Cantor) set and pay no attention to the fuzziness of the polarity intensity of sentiment words. To improve performance, we propose a fuzzy computing model to identify the polarity of Chinese sentiment words. This paper makes three major contributions. Firstly, we propose a method to compute the polarity intensity of sentiment morphemes and sentiment words. Secondly, we construct a fuzzy sentiment classifier and propose two different methods to compute the parameter of the fuzzy classifier. Thirdly, we conduct extensive experiments on four sentiment word datasets and three review datasets, and the experimental results indicate that our model performs better than the state-of-the-art methods.


Introduction
With the advent of Web 2.0, user-generated content such as product reviews, status updates on social networking services, and microblogs is growing explosively on the Internet. The growing availability of subjective content has made extracting useful sentiment information from it a hot topic in natural language processing, web mining, and data mining [1]. Since the early 2000s, sentiment classification, as a special case of text classification, has attracted an increasing number of researchers and become a very active research area [2]. Sentiment words play a more important role in mining subjective content than in mining objective content [3]. Because it is a prerequisite of sentiment classification, identifying the polarity of sentiment words is a key issue.
At present, there are mainly two types of methods for identifying the polarity of English sentiment words: corpus-based methods and thesaurus-based methods. Both are also widely used for identifying the polarity of Chinese sentiment words. Both types follow three main steps. The first step computes the similarity between a sentiment word and positive reference words. The second step computes the similarity between the sentiment word and negative reference words. The third step compares the two similarities and assigns the word a crisp polarity based on the Cantor set [4].
The two existing types of methods simply divide sentiment words into two classes, positive or negative, without regard to the polarity intensity of sentiment words or the fuzziness of that intensity [5]. In fact, different sentiment words of the same polarity have different polarity intensities. For example, "laugh" has a larger intensity than "smile." To distinguish the polarity intensity of different sentiment words, researchers have proposed methods that identify the polarity of Chinese sentiment words based on Chinese morphemes [6][7][8][9][10]. These methods hypothesize that a word is a function of its component morphemes, and they can improve performance [8]. To a certain extent, they overcome the shortcoming that the polarity intensity of sentiment words is not considered in identifying the polarity of Chinese sentiment words, but they still do not consider the fuzziness of that intensity. Because of the fuzziness of natural language and of sentiment categories, we should adopt a fuzzy set rather than the Cantor set to describe the polarity of sentiment words [11].
In order to overcome the above shortcomings and to improve the accuracy as best as we can, we propose a fuzzy computing model to identify the polarity of Chinese sentiment words. Our model mainly includes two parts: one is calculating polarity intensity of sentiment morphemes and sentiment words; the other is constructing a fuzzy classifier and computing parameter of the fuzzy classifier. The contribution of this paper is mainly embodied in three aspects.
Firstly, based on the three existing Chinese sentiment lexicons, we constructed an unambiguous key sentiment lexicon and a key sentiment morpheme set. Then, we proposed a method to compute the sentiment intensity of sentiment morphemes and sentiment words using the constructed sentiment lexicon and sentiment morpheme set.
Secondly, considering the fuzziness of sentiment intensity, we constructed a fuzzy sentiment classifier and a corresponding classification function of the fuzzy classifier by virtue of fuzzy sets theory and the principle of maximum membership degree. In order to improve the performance, we further proposed two different methods to learn parameters of the fuzzy sentiment classifier.
Thirdly, we constructed four sentiment word datasets to demonstrate the performance of our model. We also showed that our model performs better than several state-of-the-art methods by applying it to sentiment classification on three review datasets.
This paper is organized as follows. We introduce related work in Section 2. Section 3 introduces the fuzzy computing model and its two key parts. In Section 4, we first build a key sentiment lexicon, a key sentiment morpheme set, and four sentiment word datasets and then conduct experiments to verify the performance of the fuzzy computing model. Finally, in Section 5, we summarize the paper, draw conclusions, and outline future research directions.

Related Work
Sentiment classification is a hot topic in natural language processing and web mining. There are a large number of research papers about sentiment classification since 2002 [1,2]. Existing methods are mainly divided into two categories: machine learning methods and semantic orientation aggregation [12]. The machine learning methods include many traditional text classification methods [13], such as naive Bayes [14], support vector machine [15], and neural networks [16]. The second strategy uses sentiment words to classify features into positive and negative categories and then aggregates the overall orientation of a document [17,18].
As a basic requirement of sentiment classification, identifying the polarity of sentiment words has been a research focus for many years. There are three main types of methods for identifying the polarity of Chinese sentiment words. The first is the thesaurus-based method, which computes the similarity between reference words and a given sentiment word by their distance in a thesaurus. The second is the corpus-based method, which computes that similarity by statistical methods over a corpus. The third is the morpheme-based method, which computes the polarity of a sentiment word by combining the polarities of its Chinese morphemes.
Thesaurus-based methods acquire sentiment words mainly through synonyms, antonyms, and hierarchies in a thesaurus such as WordNet or HowNet [19][20][21][22]. These methods use some seed sentiment words to bootstrap via synonym and antonym relations in a thesaurus. Kamps et al. computed the polarity of sentiment words according to the distance between sentiment words and reference seed words in WordNet [23,24]. Esuli and Sebastiani used the glosses of words in a thesaurus to generate feature vectors and computed the polarity of sentiment words with a supervised learning classifier [25,26]. Dragut et al. proposed a bootstrapping method that computes the sentiment polarity of words according to a set of inference rules [27].
The core of the corpus-based method is to calculate the similarity between reference words and sentiment words in a corpus. This approach implicitly assumes that a sentiment word has the same polarity as the reference words with which it co-occurs most often and the opposite polarity to the reference words with which it co-occurs least often. The polarity of sentiment words is assigned by computing co-occurrence in a corpus [28][29][30][31][32][33][34][35].
The most classic method is the pointwise mutual information (PMI) method proposed by Turney [30,31]. This method computes the polarity of a given sentiment word by subtracting the mutual information of its association with a set of negative sentiment words from the mutual information of its association with a set of positive sentiment words. The resulting mutual information values depend on the statistics of the given corpus.
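The subtraction described above can be illustrated with a minimal sketch. The function names, the smoothing constant `eps`, and the toy corpus counts are assumptions made for illustration only and are not taken from [30,31].

```python
import math

def pmi(cooc, count_a, count_b, total, eps=0.01):
    """Pointwise mutual information log2(P(a,b) / (P(a) P(b))), lightly smoothed."""
    p_ab = (cooc + eps) / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))

def so_pmi(word, pos_seeds, neg_seeds, counts, cooc, total):
    """Association with positive seeds minus association with negative seeds."""
    def assoc(seeds):
        return sum(
            pmi(cooc.get((word, s), 0), counts[word], counts[s], total)
            for s in seeds
        )
    return assoc(pos_seeds) - assoc(neg_seeds)

# Toy (hypothetical) corpus statistics.
counts = {"delightful": 50, "good": 400, "bad": 300}
cooc = {("delightful", "good"): 20, ("delightful", "bad"): 2}
score = so_pmi("delightful", ["good"], ["bad"], counts, cooc, total=10000)
polarity = "positive" if score > 0 else "negative"
```

A word whose score is above zero is labeled positive and one below zero negative, which is the crisp decision that the fuzzy model in this paper is meant to refine.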
Different from Turney, Yu and Hatzivassiloglou used more seed words and log-likelihood ratio to compute the similarity [32]. Kanayama and Nasukawa used a set of linguistic rules in intrasentence and intersentence to identify polarity of sentiment words from the corpus [34]. Huang et al. proposed an automatic construction of domain-specific sentiment lexicon [28]. Ghazi et al. used sentential context to identify the contextual-sensitive sentiment words [29].
Some researchers calculated polarity of sentiment words by combining corpus with thesaurus [36,37]. Xu et al. presented a method to capture polarity of sentiment words by using a graph-based algorithm and multiple resources [36]. Peng and Park computed polarity of sentiment words by constrained symmetric nonnegative matrix factorization [37]. This method finds out a set of candidate sentiment words by bootstrapping in dictionary and then uses a large corpus to assign sentiment polarity scores to each word.
Taking the characteristics of Chinese characters into consideration, some researchers proposed morpheme-based methods [6][7][8][9][10]. Building on Turney's work, Yuen et al. proposed a method that calculates the similarity between reference morphemes and sentiment words in a corpus to obtain the polarity of sentiment words [6]. Experimental results demonstrated better performance than Turney's method in identifying the polarity of Chinese sentiment words. Ku et al. proposed a bag-of-characters method, which computes the polarity intensity of sentiment words statistically from their morphemes and then compares that intensity with the single threshold 0 to identify the polarity of sentiment words [10]. Ku et al. also considered eight types of morphemes and built a machine learning classifier for Chinese word-level sentiment classification [8]. They showed that using word structural features can improve performance in word-level sentiment classification.
The three existing types of methods share a common hypothesis that the polarity of a sentiment word is crisp. However, some studies have shown that the polarity of sentiment words is fuzzy to some extent [5], so either-or methods are not well suited to identifying it. To this end, we propose a fuzzy computing model to identify the polarity of Chinese sentiment words.
Some research on fuzzy sets has been applied to sentiment classification, mainly at the document level and the sentence level [9,12]. For example, Wang et al. proposed a supervised ensemble learning method that predicts consumer sentiment using an online sequential extreme learning machine and intuitionistic fuzzy sets [12]. Fu and Wang proposed an unsupervised method using fuzzy sets for sentiment classification of Chinese sentences [9]. Unlike these methods, we focus on word-level sentiment classification and propose a fuzzy computing model, an unsupervised framework for identifying the polarity of Chinese sentiment words.

General Framework.
In existing methods for identifying the polarity of sentiment words, sentiment words are divided into two classes, positive or negative, by the Cantor set, and the fuzziness of their polarity intensity is not considered. To overcome this shortcoming and improve accuracy, we propose a fuzzy computing model (FCM) for identifying the polarity of Chinese sentiment words. The general framework of FCM is described in Figure 1.
The general framework of FCM consists of three sections: sentiment words datasets, a key sentiment lexicon (KSL) and a key sentiment morpheme set (KMS), and the central FCM. Sentiment words datasets are test datasets for verifying the performance of FCM when identifying the polarity of Chinese sentiment words. KSL and KMS are the basic thesauruses of FCM. KSL consists of a positive key sentiment words list (P KSL) and a negative key sentiment words list (N KSL). The central FCM is composed of two key parts.
The first part computes the polarity intensity pi(·) of each sentiment morpheme in KMS, of each sentiment word in KSL, and of each sentiment word in the sentiment word datasets. We compute the polarity intensity of a morpheme from the frequency with which it appears in P KSL and the frequency with which it appears in N KSL. With the morpheme intensities available, we split each sentiment word into morphemes and compute the word's polarity intensity from those of its morphemes.
The second part constructs the classification function of the fuzzy classifier, which takes the polarity intensity of a sentiment word as input, and computes the parameter of that function. Firstly, we define the fuzzy sets and their membership functions for the positive and negative categories. Secondly, based on the principle of maximum membership degree, we construct the classification function. Thirdly, we propose two different methods to determine the parameter, based on the average polarity intensity of sentiment words (APIOSW) in the different sentiment word datasets and the APIOSW in KSL. We now describe the two key components of FCM in detail.

Computing Polarity Intensity of Sentiment Morphemes and Sentiment Words.
Based on KSL and KMS, we calculate the polarity intensity pi(·) of each sentiment morpheme in KMS. With the morpheme intensities available, we compute pi(·) for each sentiment word in KSL and in the sentiment word datasets. There are three main steps in the whole computational process.
(2) Secondly, for each sentiment morpheme in KMS, we use its share of occurrences in P KSL and in N KSL to compute its positive polarity intensity, its negative polarity intensity, and its polarity intensity pi(·). The two counts involved are the frequency with which the morpheme appears in P KSL and the frequency with which it appears in N KSL.
(3) Thirdly, for each sentiment word in KSL and in the sentiment word datasets, we calculate pi(·) from the polarity intensities of its component morphemes. The computation uses number(·, ·), the number of morphemes included in the sentiment word.
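The computation can be sketched as follows. Because the exact formulas are not reproduced here, the sketch makes three loudly stated assumptions: each Chinese character is treated as one morpheme, the polarity intensity of a morpheme is its share of positive occurrences minus its share of negative occurrences, and the polarity intensity of a word is the mean intensity of its morphemes. All function names and the toy word lists are hypothetical.

```python
from collections import Counter

def morpheme_counts(words):
    """Count how often each character (treated as a morpheme) appears in a word list."""
    return Counter(ch for w in words for ch in w)

def morpheme_intensity(m, pos_counts, neg_counts):
    """ASSUMED form: positive share minus negative share, a value in [-1, 1]."""
    fp, fn = pos_counts.get(m, 0), neg_counts.get(m, 0)
    if fp + fn == 0:
        return 0.0  # unseen morpheme is treated as neutral (assumption)
    return (fp - fn) / (fp + fn)

def word_intensity(word, pos_counts, neg_counts):
    """ASSUMED form: mean intensity over the morphemes of the word."""
    if not word:
        return 0.0
    total = sum(morpheme_intensity(m, pos_counts, neg_counts) for m in word)
    return total / len(word)

p_ksl = ["美好", "美丽", "好"]  # toy positive list (hypothetical)
n_ksl = ["丑恶", "恶劣"]        # toy negative list (hypothetical)
pos_counts = morpheme_counts(p_ksl)
neg_counts = morpheme_counts(n_ksl)
```

Under these assumptions a word built only from morphemes seen in the positive list receives intensity 1, a word built only from morphemes seen in the negative list receives -1, and mixed words fall in between.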

Constructing Fuzzy Classifier and Computing Parameter of Fuzzy Classifier.
After obtaining the polarity intensities of the sentiment morphemes in KMS, the sentiment words in KSL, and the sentiment words in the datasets, we firstly define two membership functions of the fuzzy classifier for the positive and negative categories. Secondly, we build a classification function of the fuzzy classifier by the principle of maximum membership degree. Thirdly, we determine an optimum fixed parameter of the fuzzy classifier by experimenting on KSL. Fourthly, we compare the APIOSW of the different sentiment word datasets with the APIOSW of KSL to get a different optimum parameter of the fuzzy classifier for each sentiment word dataset.

Defining Membership Function of Fuzzy Classifier.
To identify the polarity of sentiment words with FCM, we choose the semitrapezoid distribution to define the membership functions of the positive category and the negative category in the fuzzy classifier in (4). The input is the polarity intensity pi(·) of a sentiment word, and two adjustable parameters decide the region and the shape of the membership functions.
To identify polarity, values must be set for these two parameters. In FCM we do not set them separately; instead we reduce them to a single parameter defined as their average, that is, half of their sum.
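A minimal sketch of semitrapezoid membership functions is given below. The parameter names `a` and `b`, the rising form for the positive category, and the choice of the negative membership as the complement of the positive one are assumptions; the exact form of (4) is not reproduced here.

```python
def mu_positive(x, a, b):
    """Rising semitrapezoid: 0 at or below a, linear between a and b, 1 at or above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def mu_negative(x, a, b):
    """Falling semitrapezoid, taken here as the complement of mu_positive (assumption)."""
    return 1.0 - mu_positive(x, a, b)
```

With `a = -0.2` and `b = 0.2`, a word of intensity 0 belongs to both categories with degree 0.5, which is the midpoint that the single combined parameter captures.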

Constructing Classification Function of Fuzzy Classifier.
After computing the polarity intensity of sentiment words, we determine their polarity from the membership functions in (4) according to the principle of maximum membership degree. This yields the classification function of the fuzzy classifier, which takes the polarity intensity pi(·) of a sentiment word as input. Polarity can then be determined simply by setting a value for the single parameter.
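The reduction from maximum membership to a single threshold can be checked with a short sketch. It assumes complementary semitrapezoid memberships with hypothetical parameters `a < b` and a threshold `k = (a + b) / 2`; these names, and the handling of exact ties, are illustrative choices rather than the paper's notation.

```python
def mu_positive(x, a, b):
    """Rising semitrapezoid membership for the positive category (assumed form)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def classify_by_membership(x, a, b):
    """Pick the category with the larger membership degree."""
    pos = mu_positive(x, a, b)
    neg = 1.0 - pos  # complementary negative membership (assumption)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "undecided"

def classify_by_threshold(x, k):
    """Equivalent one-parameter rule: compare the intensity with k."""
    if x > k:
        return "positive"
    if x < k:
        return "negative"
    return "undecided"

a, b = -0.1, 0.3
k = (a + b) / 2
samples = [-1.0, -0.1, 0.0, 0.2, 0.25, 0.3, 1.0]
agree = all(
    classify_by_membership(x, a, b) == classify_by_threshold(x, k) for x in samples
)
```

Since the positive membership exceeds the negative one exactly when it exceeds 0.5, the two rules agree away from the tie point, which is why only the midpoint needs to be tuned.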

Setting Parameter in Classification Function of the Fuzzy Classifier.
To determine the parameter in the classification function of the fuzzy classifier, we propose two different methods. One is the fixed parameter method, where the parameter is selected by experimenting on KSL; the other is the variable parameter method, where the parameter is selected by comparing the APIOSW of each sentiment word dataset with the APIOSW of KSL. The two methods are described as follows.
(1) Fixed Parameter Method. We experiment on KSL and take the best-performing parameter value as the fixed parameter in FCM. The experimental results and the specific fixed parameter are shown in Figure 2.
(2) Variable Parameter Method. Estimating the parameter from the experimental results on KSL yields only a parameter that is locally optimal for KSL. To approach the globally optimal parameter on different sentiment word datasets, we propose a variable parameter method, described as follows.
For each sentiment word dataset (SWD), which consists of a positive sentiment word list P SWD and a negative sentiment word list N SWD, we define its APIOSW in (6) in terms of the numbers of sentiment words in SWD and the polarity intensity pi(·) of each sentiment word in SWD. Similarly, we define the APIOSW of KSL in (7) in terms of the number of sentiment words in P KSL, the number of sentiment words in KSL, and the polarity intensity pi(·) of each sentiment word in KSL.
For each SWD, we calculate the difference in APIOSW between the SWD and KSL. Based on this difference and the fixed parameter obtained on KSL, we adjust the parameter to obtain a different optimum parameter for each SWD. The specific method is given in (8), where the fixed parameter is the one obtained through the fixed parameter method above.
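One possible reading of the variable parameter method is sketched below. The formulas (6) to (8) are not reproduced here, so both parts are assumptions: APIOSW is taken as the plain mean of the polarity intensities in a list, and the adjustment is taken as an additive shift of the fixed parameter by the APIOSW difference. The names `apiosw`, `variable_parameter`, and `k0` are hypothetical.

```python
def apiosw(intensities):
    """Average polarity intensity of a list of sentiment words (plain mean, ASSUMED)."""
    return sum(intensities) / len(intensities)

def variable_parameter(k0, dataset_intensities, ksl_intensities):
    """ASSUMED additive rule: shift the fixed parameter k0 by the APIOSW gap."""
    return k0 + (apiosw(dataset_intensities) - apiosw(ksl_intensities))
```

The intuition behind such a rule is that a dataset whose words are on average more positive than those in KSL needs a correspondingly higher threshold; the actual rule in (8) may differ in form or scaling.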

Performance Evaluation
To verify the performance of FCM experimentally, we firstly construct KSL and KMS. Secondly, we construct four sentiment word datasets as test datasets and choose the classification indicators precision, recall, F1 measure, and accuracy as metrics to evaluate the baseline methods and FCM. Thirdly, we experiment on KSL and compare the APIOSW of the different sentiment word datasets with the APIOSW of KSL to find the optimum parameter. Fourthly, we compare the performance of different methods and demonstrate the effectiveness of FCM. Fifthly, we discuss the influence of the parameter on the accuracy of FCM. Sixthly, to demonstrate the effect of our methods in a real task, we apply our methods and the baseline methods to sentiment classification of reviews. The experimental results support the validity of our method.

Constructing KSL and KMS.
We construct KSL from three Chinese sentiment lexicons: the Tsinghua University sentiment lexicon (TUSL), the National Taiwan University sentiment lexicon (NTUSL), and HowNet. In constructing KSL, we assume that the polarity of some sentiment words is ambiguous across different sentiment lexicons. We define TUSL as SL 1, NTUSL as SL 2, and HowNet as SL 3. Each lexicon SL consists of a positive list P SL and a negative list N SL, and a word is ambiguous when it appears in both a positive list and a negative list. Table 1 presents the number of such words for each pair of P SL and N SL. Table 1 shows that ambiguous words do exist, which confirms the assumption that the polarity of sentiment words is not always consistent across sentiment lexicons. We therefore delete these ambiguous words from P SL and N SL. Finally, we construct KSL by choosing the sentiment words that are contained in at least two sentiment lexicons and are unambiguous in polarity, as shown in (9), where the two lexicon indices each range over 1, 2, 3, P SL is the positive sentiment word list of a lexicon, and N SL is its negative sentiment word list. The number of sentiment words in each SL and in KSL is shown in Table 2.
For KSL, we delete words whose length is greater than two and then split the remaining words into morphemes. Finally, we put morphemes together to construct a KMS.
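The construction of KSL and KMS can be sketched with set operations. The sketch assumes that each lexicon is given as a pair of Python sets, that a word is ambiguous when it occurs in any positive list and any negative list, and that each character of a word of length at most two is one morpheme; the function names and the toy lexicons are hypothetical.

```python
from itertools import combinations

def build_ksl(lexicons):
    """lexicons: list of (positive_set, negative_set) pairs, one per source lexicon."""
    # Words listed as positive somewhere and negative somewhere are ambiguous.
    all_pos = set().union(*(p for p, _ in lexicons))
    all_neg = set().union(*(n for _, n in lexicons))
    ambiguous = all_pos & all_neg

    def in_at_least_two(sets):
        shared = set()
        for s1, s2 in combinations(sets, 2):
            shared |= s1 & s2
        return shared - ambiguous

    p_ksl = in_at_least_two([p for p, _ in lexicons])
    n_ksl = in_at_least_two([n for _, n in lexicons])
    return p_ksl, n_ksl

def build_kms(p_ksl, n_ksl):
    """Drop words longer than two characters and collect the remaining characters."""
    return {ch for w in p_ksl | n_ksl if len(w) <= 2 for ch in w}

# Toy (hypothetical) lexicons standing in for TUSL, NTUSL, and HowNet.
toy_lexicons = [
    ({"好", "美丽", "怪"}, {"坏"}),
    ({"好", "美丽"}, {"坏", "怪"}),
    ({"好"}, {"恶劣"}),
]
p_ksl, n_ksl = build_ksl(toy_lexicons)
kms = build_kms(p_ksl, n_ksl)
```

In the toy data the word "怪" is positive in one lexicon and negative in another, so it is excluded, while "恶劣" is dropped because it appears in only one lexicon.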

Experimental Setting.
To verify the performance of our model, we build four sentiment word datasets based on TUSL, NTUSL, and HowNet. To ensure that the polarity of the sentiment words in the four datasets is unambiguous, we delete the sentiment words whose polarity is ambiguous among the three sentiment lexicons. To ensure that the sentiment words in the four datasets are independent of those in KSL, we also delete the sentiment words in KSL from the three sentiment lexicons. The specific method is described as follows.
For each SL, which consists of P SL and N SL, we build the sentiment word datasets dataset1, dataset2, and dataset3. With the same method, we construct a much larger dataset4. Each sentiment word dataset SWD consists of a positive sentiment word list P SWD and a negative sentiment word list N SWD. The resulting four sentiment word datasets (http://203.91.121.76/Datasets/) are summarized in Table 3.
Since our task is identifying the polarity of sentiment words, we choose the classification indicators precision, recall, F1 measure, and accuracy as evaluation metrics. The counts from which these metrics are computed are defined in Table 4. To evaluate the overall performance of our model in identifying the polarity of Chinese sentiment words, we compare it with a thesaurus-based method and a morpheme-based method on four different sentiment word datasets. Our model and the baseline methods are as follows:
MBOT: the method based on thesaurus [27];
MBOM: the method based on morpheme [10];
FCMWFP: the fuzzy computing model with fixed parameter, described in Section 3.3.3;
FCMWVP: the fuzzy computing model with variable parameter, described in Section 3.3.3.
To further demonstrate the effect of our methods and highlight the contribution of our work, we design the following experiments. Firstly, for each sentiment word dataset, we construct four different sentiment lexicons, one per method, in which the polarities of the sentiment words are those assigned by that method. Secondly, we choose three Chinese review datasets provided by Songbo Tan (http://203.91.121.76/Datasets/). Each review dataset (RDS) consists of both positive reviews and negative reviews. The basic statistics of these three review datasets are summarized in Table 5. Thirdly, for each sentiment word dataset, we compare the sentiment classification results on the three review datasets obtained with the four different sentiment lexicons, which correspond to the four different methods. These sentiment lexicons are as follows:
SLMBOT: the sentiment lexicon corresponding to MBOT and a given sentiment word dataset;
SLMBOM: the sentiment lexicon corresponding to MBOM and a given sentiment word dataset;
SLFCMWFP: the sentiment lexicon corresponding to FCMWFP and a given sentiment word dataset;
SLFCMWVP: the sentiment lexicon corresponding to FCMWVP and a given sentiment word dataset.
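The four evaluation indicators named in this section (precision, recall, F1 measure, and accuracy) can be computed from gold and predicted labels as in the sketch below. It assumes the standard confusion-matrix definitions; the function name `evaluate`, the label strings, and the toy labels are illustrative, and the notation of Table 4 is not reproduced.

```python
def evaluate(gold, predicted, target="positive"):
    """Precision, recall, and F1 for the target class, plus overall accuracy."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    correct = sum(g == p for g, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy (hypothetical) labels for four sentiment words.
gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "negative", "negative"]
scores = evaluate(gold, pred)
```

Calling `evaluate(gold, pred, target="negative")` gives the corresponding figures for the negative class.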
We conduct extensive experiments in four sentiment word datasets and three review datasets to solve four problems.
(1) Discuss how to set parameter in classification function of fuzzy classifier.
(2) Study performance of our model in identifying polarity of Chinese sentiment words.
(3) Analyse effect of different parameter on accuracy of our model.
(4) Validate the effect of sentiment lexicons created by our methods in sentiment classification of documents.

Setting Parameter in Classification Function of Fuzzy Classifier.
In FCM, the parameter in the classification function of the fuzzy classifier needs to be set. We conducted experiments on KSL to find its optimal value. Figure 2 shows the performance of MBOM and FCMWFP for different parameter values.
From Figure 2, we can see that the performance of FCMWFP is best when the parameter is selected near 0.05. So we choose the value 0.005 in FCMWFP. After choosing the fixed parameter value, we calculate the APIOSW of the different datasets by (6) and (7), following Section 3.3.3. Finally, we compute the variable parameter by (8). Table 6 summarizes the APIOSW and the parameter value for KSL and the four sentiment word datasets.
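Selecting the fixed parameter by experiment can be sketched as a simple search over candidate thresholds. The sketch assumes that words with intensity above the threshold are labeled positive and those below it negative, and that accuracy on a labeled list is the selection criterion; the function names, candidate grid, and toy intensities are hypothetical.

```python
def accuracy_at(k, pos_intensities, neg_intensities):
    """Share of words classified correctly when the threshold is k."""
    correct = sum(x > k for x in pos_intensities)
    correct += sum(x < k for x in neg_intensities)
    return correct / (len(pos_intensities) + len(neg_intensities))

def best_fixed_parameter(pos_intensities, neg_intensities, candidates):
    """Return the candidate threshold with the highest accuracy."""
    return max(
        candidates,
        key=lambda k: accuracy_at(k, pos_intensities, neg_intensities),
    )

# Toy (hypothetical) intensities of labeled words.
pos = [0.9, 0.4, 0.2]
neg = [-0.8, 0.1, -0.3]
candidates = [-0.5, 0.0, 0.15, 0.5]
k_best = best_fixed_parameter(pos, neg, candidates)
```

In the toy data a threshold of 0 misclassifies the negative word with intensity 0.1, whereas a slightly positive threshold separates the two lists, which mirrors why a nonzero parameter can outperform the single threshold 0 used by MBOM.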

Performance of Different Methods in Identifying Polarity of Chinese Sentiment Words.
To verify the performance of FCM, we choose MBOT and MBOM as baselines. FCM consists of FCMWFP and FCMWVP. We compared FCM with MBOT and MBOM on four sentiment word datasets. Experimental results are shown in Table 7 and Figure 3.
From Table 7 and Figure 3, we can see that the accuracy of MBOT is slightly higher than that of MBOM. At the same time, FCMWVP performs better than FCMWFP, which supports our assumption that FCMWFP can only reach a local optimum, whereas FCMWVP can approach the global optimum.

Accuracy of FCM with Different Parameter Values.
To explore the effect of the parameter on the accuracy of FCM in identifying the polarity of Chinese sentiment words, we run experiments with different parameter values on the four sentiment word datasets. The results are shown in Figure 4.
From Figure 4, we can see that the accuracy of FCM varies with the parameter value. With a suitable parameter, FCM always achieves higher accuracy than MBOT and MBOM in identifying the polarity of Chinese sentiment words.

Performance of Different Sentiment Lexicons in Sentiment Classification of Chinese Reviews.
To further verify the feasibility of our methods, we applied four different sentiment lexicons to sentiment classification of Chinese reviews. For each sentiment word dataset, we compared the sentiment classification results on the three review datasets obtained with the four different sentiment lexicons. Experimental results are shown in Tables 8, 9, 10, and 11.
From Tables 8, 9, 10, and 11, we can see that the accuracies of SLFCMWVP and SLFCMWFP are higher than those of SLMBOM and SLMBOT in sentiment classification of Chinese reviews. The results indicate that methods which consider fuzzy sentiment are more effective than methods that consider only either-or sentiment.
At the same time, SLFCMWVP performed better than SLFCMWFP in sentiment classification of Chinese reviews, which indicates that our method based on a variable parameter is more effective than our method based on a fixed parameter.

Conclusion
In this paper, we propose a fuzzy computing model for identifying the polarity of Chinese sentiment words by combining the polarity intensity of Chinese morphemes with fuzzy set theory. Based on the assumption that a Chinese sentiment word is a function of its Chinese morphemes, we compute the polarity intensity of sentiment words from the known polarity intensity of morphemes. After studying three existing sentiment lexicons, we find that some sentiment words are fuzzy; that is, some sentiment words have different sentiment polarities in different lexicons. We define the polarity of sentiment words as a fuzzy set and identify it by the principle of maximum membership degree. To verify the performance of our model, we build four sentiment word datasets and compare our model with baseline methods on them. Experimental results indicate that our model performs better than the state-of-the-art methods.
Our methods suggest several possible research directions. Because sentiment polarity in natural language is fuzzy, sentiment analysis problems can be handled with fuzzy set theory. Our model demonstrates the effectiveness of fuzzy computing in sentiment word classification. Next, we plan to apply fuzzy set theory to sentence-level and document-level sentiment classification.