On the Feature Selection and Classification Based on Information Gain for Document Sentiment Analysis

Sentiment analysis of movie reviews has become a need of today's lifestyle. Unfortunately, an enormous number of features makes sentiment analysis slow and less sensitive. Finding the optimum feature selection and classification is still a challenge. To handle an enormous number of features and provide better sentiment classification, an information-based feature selection and classification are proposed. The proposed method removes more than 90% of unnecessary features, while the proposed classification scheme achieves 96% accuracy in sentiment classification. From the experimental results, it can be concluded that the combination of the proposed feature selection and classification achieves the best performance so far.


Introduction
One of the interesting challenges in text categorization is sentiment analysis, the study of subjective information about a specific object [1]. Sentiment analysis can be applied at various levels: document level, sentence level, and feature level.
Sentiment-based categorization of movie reviews is a document-level sentiment analysis. It treats a review as a set of independent words, ignoring the order of the words in the text. Every unique word and phrase can be used as a document feature. As a result, a massive number of features is constructed, which slows down processing and biases the classification task [2].
Not all features are necessary, however. Most features are irrelevant to the class label, whereas a good feature for classification is one that has maximum relevance to the output class.
Because feature selection is a crucial part of sentiment analysis, in this paper we propose an information gain based feature selection. In addition, we propose a classification scheme based on the dictionary constructed from the selected features.

Previous Work
There are two common approaches to sentiment analysis: machine learning methods and knowledge-based methods. Cambria [3] suggested combining both, using machine learning to compensate for the limitations of sentiment knowledge. However, this combination cannot be applied to movie reviews, because sentiment knowledge such as SenticNet is highly dependent on domain and context. For example, "funny" is positive for a comedy but negative for a horror movie [4].
Machine learning based sentiment analysis of movie reviews was initiated by Pang et al. [5]. Their work achieved 70%-80% accuracy, while the human baseline for sentiment analysis reaches only 70% accuracy. In 2014, Dos Santos and Gatti [6] used a deep learning method for sentence-level sentiment analysis that reached 70%-85% accuracy. Words and characters were used as sentiment features. Unfortunately, the massive number of constructed features resulted in long computation times.
To provide robust machine learning classification, a feature selection technique is required [7]. Some researchers focus on reducing the number of features [8]. Manurung [9] proposed a feature selection scheme, and Nicholls and Song [8] and O'Keefe and Koprinska [10] proposed a similar idea: selecting features based on the difference between the document frequency (DF) in the positive class and the DF in the negative class. This scheme was named Document Frequency Difference (DFD). DFD selects the features with the highest ratio of the difference between positive DF and negative DF to the total number of documents. It may therefore select features that have a high difference but are less relevant to the output class.
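The DFD score described above can be sketched in a few lines. This is only an illustration under stated assumptions: documents are given as lists of tokens, the difference is taken in absolute value, and the function name `dfd_scores` and the toy reviews are hypothetical, not taken from the cited works.

```python
def dfd_scores(pos_docs, neg_docs):
    """Score each word by |DF+ - DF-| / D (assumed form of the DFD score)."""
    total_docs = len(pos_docs) + len(neg_docs)
    df_pos, df_neg = {}, {}
    # Document frequency counts each word at most once per document.
    for doc in pos_docs:
        for word in set(doc):
            df_pos[word] = df_pos.get(word, 0) + 1
    for doc in neg_docs:
        for word in set(doc):
            df_neg[word] = df_neg.get(word, 0) + 1
    vocabulary = set(df_pos) | set(df_neg)
    return {
        word: abs(df_pos.get(word, 0) - df_neg.get(word, 0)) / total_docs
        for word in vocabulary
    }


# Toy data (hypothetical): two positive and two negative reviews.
pos = [["great", "film"], ["great", "plot"]]
neg = [["dull", "film"], ["dull", "plot"]]
scores = dfd_scores(pos, neg)
```

In this toy example, "great" and "dull" each occur in one class only and score highest, while "film" and "plot" occur equally often in both classes and score zero.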
Information theory based feature selection, such as information gain or mutual information, has also been proposed for sentiment analysis [11,12]. Going further, Abbasi et al. proposed a heuristic search procedure, named Entropy Weighted Genetic Algorithm (EWGA), that searches for an optimal feature subset based on information gain (IG) values [13]. EWGA searches for the optimal subfeatures using a Genetic Algorithm (GA) whose initial population is selected by an IG thresholding scheme. Compared to the other methods, EWGA is the most powerful feature selection so far: the features it selected achieved 88% classification accuracy. However, its computational cost is high.

Information Gain on Movie Review
Information gain measures how mixed up the features are [15]. In the sentiment analysis domain, information gain is used to measure the relevance of attribute $A$ to class $C$. The higher the value of mutual information between classes $C$ and attribute $A$, the higher the relevance between classes $C$ and attribute $A$:

$$I(C, A) = H(C) - H(C \mid A),$$

where $H(C) = -\sum_{c \in C} p(c) \log_2 p(c)$ is the entropy of the classes and $H(C \mid A)$ is the conditional entropy of the classes given the attribute. Since the Cornell movie review dataset has balanced classes, the probability of class $C$ is equal to 0.5 for both positive and negative. As a result, the entropy of the classes $H(C)$ is equal to 1.
Then the information gain can be formulated as

$$I(C, A) = 1 - H(C \mid A).$$

The minimum value of $I(C, A)$ occurs if and only if $H(C \mid A) = 1$, which means attribute $A$ and classes $C$ are not related at all. In contrast, we tend to choose an attribute $A$ that mostly appears in one class $C$, either positive or negative. In other words, the best features are the set of attributes that appear in only one class.
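As a rough numerical illustration of the balanced-class case, the sketch below computes one minus the conditional entropy of the class label for a single word. It assumes that the conditional entropy is estimated from the numbers of positive and negative documents containing the word; the exact estimator used in the paper is not given here, and the function names are hypothetical.

```python
import math


def conditional_entropy(n_pos, n_neg):
    """Entropy (base 2) of the class label among documents containing a word."""
    total = n_pos + n_neg
    if total == 0:
        # Assumption: a word that never occurs says nothing about a balanced class.
        return 1.0
    entropy = 0.0
    for count in (n_pos, n_neg):
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def information_gain(n_pos, n_neg):
    """Balanced two-class case: the class entropy is 1, so the gain is 1 - H."""
    return 1.0 - conditional_entropy(n_pos, n_neg)


evenly_mixed = information_gain(5, 5)   # word equally frequent in both classes
one_class = information_gain(10, 0)     # word seen in positive reviews only
mostly_pos = information_gain(9, 1)     # word seen mostly in positive reviews
```

A word spread evenly over both classes yields the minimum gain, and a word confined to one class yields the maximum, which matches the selection preference stated above.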

Sentiment Analysis Framework
This study uses polarity v.2.0 from the Cornell review datasets, a benchmark dataset for document-level sentiment analysis that consists of 1000 positive and 1000 negative processed reviews [14]. The dataset is split for tenfold cross-validation.
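A tenfold split of such a balanced dataset can be sketched as follows. The paper does not say how the folds were drawn, so the stratification by class, the fixed random seed, and the function name `stratified_folds` are assumptions made for illustration only.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign document indices to k folds, keeping the class balance per fold."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 1000 positive and 1000 negative reviews, as in the dataset described above.
labels = ["pos"] * 1000 + ["neg"] * 1000
folds = stratified_folds(labels)
```

Each fold then serves once as the testing set while the remaining nine folds form the training set.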
Similar to the dictionary construction phase, the classification phase also consists of preprocessing and feature construction. In contrast, it uses the constructed dictionary instead of selecting features and constructing another dictionary. The result of this phase is a sentiment-labeled movie review.

IG-DF Feature Selection.
Previous work on information gain [16] selects features that have high relevance to the output class. Those features commonly appear in the positive class only or in the negative class only. Unfortunately, such a feature may appear only a few times, since sentiment can be expressed in various ways. As a result, overfitting occurs because those features may not appear in the testing set.
On the other hand, DF thresholding [8,12] selects the features that appear most often in the training set. It may select features that always appear in both classes. Those features are unnecessary, since they cannot differentiate the class to which a review belongs.
In this study, we propose a combination of information gain and DF thresholding feature selection, named IGDFFS. IGDFFS selects features that have an IG score equal to 0.5, which means those features are highly related to one class only. This scheme succeeds in removing about 90% of unnecessary features (Algorithm 1).
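One way to combine an information gain criterion with a document frequency threshold is sketched below. It is not a reproduction of Algorithm 1. The sketch scores a word as one minus the class entropy among the documents that contain it, and the thresholds `min_ig` and `min_df`, the function names, and the toy reviews are all hypothetical, so the scale of the scores need not match the values reported in the paper.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def class_gain(n_pos, n_neg):
    """1 - H(class | word) for a balanced two-class problem (assumed estimator)."""
    total = n_pos + n_neg
    entropy = 0.0
    for count in (n_pos, n_neg):
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return 1.0 - entropy


def select_features(pos_docs, neg_docs, min_ig, min_df):
    """Keep words that are both frequent enough and class-specific enough."""
    df_pos = document_frequencies(pos_docs)
    df_neg = document_frequencies(neg_docs)
    selected = {}
    for word in set(df_pos) | set(df_neg):
        n_pos, n_neg = df_pos[word], df_neg[word]
        if n_pos + n_neg < min_df:
            continue  # too rare: unlikely to reappear in unseen reviews
        if class_gain(n_pos, n_neg) >= min_ig:
            selected[word] = "positive" if n_pos > n_neg else "negative"
    return selected


# Toy data (hypothetical).
pos = [["great", "film"], ["great", "plot"], ["superb", "film"]]
neg = [["dull", "film"], ["dull", "plot"], ["awful", "plot"]]
dictionary = select_features(pos, neg, min_ig=1.0, min_df=2)
```

Here "superb" and "awful" are class-specific but too rare, and "film" and "plot" are frequent but shared by both classes, so only "great" and "dull" enter the dictionary.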

Classification.
Entropy and information gain are commonly used in decision trees, where the selected feature with the highest information gain determines the class of the review. Based on this intuition, we categorize our vocabulary into positive features and negative features. A review is classified as positive if most of its features are positive, and as negative otherwise (Algorithm 2).
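The majority rule described above can be sketched as a small function. The sketch assumes that each distinct word of a review is counted once and that ties fall to the negative class; the function name `classify` and the example dictionaries are hypothetical.

```python
def classify(review_tokens, positive_features, negative_features):
    """Label a review by comparing its counts of positive and negative features."""
    words = set(review_tokens)  # assumption: each distinct word counts once
    n_pos = len(words & positive_features)
    n_neg = len(words & negative_features)
    # Ties (including reviews with no dictionary words) fall to "negative".
    return "positive" if n_pos > n_neg else "negative"


# Hypothetical dictionaries of selected features.
positive_features = {"great", "superb"}
negative_features = {"dull", "awful"}

label_a = classify(["a", "great", "superb", "but", "dull", "film"],
                   positive_features, negative_features)
label_b = classify(["dull", "and", "awful"], positive_features, negative_features)
```

Because the rule only counts dictionary hits, no model parameters are fitted at classification time.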

Results and Analysis
Figure 2 shows the performance of the previous feature selection (FFSA) [16] and the proposed feature selection (IGDFFS). The results show that IGDFFS selects better features.
The proposed method selects features that have high relevance to the output class and also high occurrence. As a result, the generated feature matrix has fewer zero values. In contrast, the previous method may succeed in selecting highly relevant features but tends to pick rare ones. A rare feature does not appear in other movie review documents in the training set and may not appear in the testing set. As a result, the generated feature matrix contains many zero values, and documents that contain none of the selected features are hard to classify.
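The effect of rare features on the feature matrix can be checked with a simple coverage measure: the fraction of documents that contain none of the selected features. The sketch below is only an illustration; the function name `uncovered_fraction` and the toy documents are hypothetical.

```python
def uncovered_fraction(docs, features):
    """Fraction of documents containing none of the selected features."""
    if not docs:
        return 0.0
    uncovered = sum(1 for doc in docs if not (set(doc) & features))
    return uncovered / len(docs)


# Toy testing set (hypothetical): two of the four reviews share no word
# with the selected features, so their feature rows would be all zeros.
test_docs = [["great", "film"], ["slow", "plot"], ["dull"], ["fine"]]
fraction = uncovered_fraction(test_docs, {"great", "dull"})
```

A high value on the testing set indicates that many reviews would be represented by all-zero rows and are therefore hard to classify.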
One objective of feature selection is to avoid overfitting. In this case, however, common machine learning techniques may still overfit, because the feature matrix of the testing set contains many more zero values than the feature matrix of the training set. Since the features shape the machine learning model, it is hard for the model to fit the feature matrix of the testing set.
Figure 3 summarizes the performance of the SVM, ANN, and IG classifiers. Unfortunately, SVM and ANN suffer from overfitting: their testing accuracy fails to reach 70%. Unlike ANN and SVM, IGC is quite stable in every condition and succeeds in avoiding overfitting. It can be concluded that IGC, the proposed classifier, performs better than the current classifiers.
The information gain value tells how mixed a feature is with respect to the class. IG reaches its highest value (0.5 in this case) when the feature belongs to one class only. This means that when the feature appears, we can be sure that the label is positive or negative. In this case, the IG value of the selected features reaches the maximum value (0.5) on average, so it can be used for automatic classification. The specialty of the proposed classification scheme is its independence from a mathematical model. Since the proposed classification method succeeds in avoiding overfitting, we can say that our method is better than the previous work.

Algorithm 2: IG-based classifier.
procedure IG-based-Classifier(input: {Sentiment Feature Vector: Vocabulary × Number of Document}, output: {Sentiment Label: positive or negative})
    for each document in featurevector do
        for each vocab in Vocabulary do
            if vocab is in positivefeatures then
                V_pos ← V_pos + 1
            else
                V_neg ← V_neg + 1
            end if
        end for
        if V_pos > V_neg then
            SentimentLabel ← SentimentLabel + positive
        else
            SentimentLabel ← SentimentLabel + negative
        end if
    end for
end procedure

Conclusion and Future Work
To provide a better sentiment analysis system, an improved information gain based feature selection and classification were proposed. The proposed feature selection selects features that have high information gain and high occurrence. As a result, it succeeds in providing features that most probably also appear in the testing set. The proposed classifier uses the positive and negative features obtained from the preceding IG calculation, so it takes less time than the previous classifiers (SVM, ANN, etc.).
The proposed feature selection, IGDFFS, combines information gain and document frequency. It selects subfeatures that satisfy two criteria: (1) high relevance to the output class and (2) high occurrence in the dataset. As a result, it constructs subfeatures that reach better classification performance.
Compared to current classifiers, the Information Gain Classifier (IGC) surpasses the recent highest accuracy, which belongs to EWGA (only 88.05%). It succeeds in avoiding overfitting in every condition, and its performance is quite stable in both training and testing.
In future work, we are considering grouping words based on their relevance to positive and negative reviews. Note that there are 171,476 words in current use and 47,156 obsolete words in English (based on the Oxford English Dictionary). The number of such groups would at least be smaller than the total number of words.