A Novel Feature Selection Technique for Text Classification Using Naïve Bayes

With the proliferation of unstructured data, text classification (or text categorization) has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. There are many classification algorithms available. Naïve Bayes remains one of the oldest and most popular classifiers: it is simple to implement and it requires less training data. However, the literature shows that naïve Bayes performs poorly compared to other classifiers in text classification. As a result, the naïve Bayes classifier becomes unusable in spite of the simplicity and intuitiveness of the model. In this paper, we propose a two-step feature selection method: a univariate feature selection step first reduces the search space, and a feature clustering step then selects relatively independent feature sets. We demonstrate the effectiveness of our method by a thorough evaluation and comparison over 13 datasets. The performance improvement thus achieved makes naïve Bayes comparable or superior to other classifiers. The proposed algorithm is shown to outperform traditional methods such as greedy search based wrapper and CFS.


Introduction
The proportion of unstructured data to structured data has been rising consistently in the last few decades [1,2]. To extract meaningful information from these large corpora of text data, both statistical/machine learning techniques and linguistic techniques are applied. Text classification has various applications in the form of email classification, sentiment analysis, language identification, authorship identification, influential blogger detection, and topic classification, to name a few. Classification is called supervised learning as it requires training data. The classification algorithm builds the necessary knowledge base from training data, and a new instance is then classified into one of the predefined categories based on this knowledge. As an example of a classification task, we may have data available on various characteristics of breast tumors where the tumors are classified as either benign or malignant. Now, given an unlabeled tumor, the classifier will map it as either benign or malignant. The classifier can be thought of as a function which maps an instance or an observation, based on its attribute values, to one of the predefined categories. Text classification is a part of classification, where the input is text in the form of documents, emails, tweets, blogs, and so forth. One of the problems with text classification is the much higher input size. More formally, given a set of document vectors $\{d_1, d_2, \ldots, d_N\}$ and their associated class labels $c(d_i) \in \{c_1, c_2, \ldots, c_l\}$, text classification is the problem of assigning the class label to an unlabeled document $d$.
Most classification algorithms require sufficient training data, which adds to the space complexity as well as the training time. So the capability of a classifier to give good performance on relatively little training data is critical. The naïve Bayes classifier is one such classifier, and it scores over the other classifiers in this respect. The naïve Bayes model is the simplest of all the classifiers in that it assumes that all the attributes are independent of each other in the context of the class [3][4][5][6][7]. It is mentioned in [8] that naïve Bayes is widely used because of its simplicity, though it is one of the classifiers noted to have poor accuracy in tasks like text categorization. The unique contributions of the paper are as follows.
2 International Scholarly Research Notices (i) We offer a simple and novel feature selection technique for improving naïve Bayes classifier for text classification, which makes it competitive with other standard classifiers.
(ii) Contrary to conventional feature selection methods, we employ feature clustering, which has a much lower computational complexity and an equally effective, if not more effective, outcome; a detailed comparison has been done.
(iii) Our approach employs the following steps: (a) Step 1: the chi-squared metric is used to select important words; (b) Step 2: the selected words are represented by their occurrence in various documents (simply by taking a transpose of the term document matrix); (c) Step 3: a simple clustering algorithm like $k$-means is applied to prune the feature space further, in contrast to conventional search-based methods, and one word/feature corresponding to each cluster is selected.
(iv) The superiority of our performance improvement has been shown to be statistically significant.
The organization of the paper is as follows. In Section 2, the theoretical foundation of naïve Bayes classifier is discussed. In Section 3, a brief overview of feature selection is provided. In Section 4, we present our algorithm with necessary illustration. In Section 5, we discuss experimental setup and in Section 6 the results of various studies and their analysis are presented. Section 7 contains the conclusion and future scope of work.

Naïve Bayes Classifier and Text Classification
Naïve Bayes is based on conditional probability. Following from Bayes' theorem, for a document $d$ and a class $c$, the posterior probability is given as

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}. \tag{1}$$

The most likely class (maximum a posteriori) is given by

$$c_{\mathrm{MAP}} = \arg\max_{c \in C} P(c \mid d) \tag{2}$$

$$= \arg\max_{c \in C} P(d \mid c)\,P(c) \tag{3}$$

$$= \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c), \tag{4}$$

where the document $d$ is represented by the features $x_1, x_2, \ldots, x_n$. (Typically the features correspond to words.) The naïve Bayes assumption is that all features are independent of each other given the class. This assumption transforms (4) as follows:

$$c_{\mathrm{NB}} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c). \tag{5}$$

In a previous work of the authors, naïve Bayes has been compared with a few other popular classifiers like support vector machine (SVM), decision tree, and nearest neighbor (kNN) on various text classification datasets [9]. Table 1 summarizes the findings. Naïve Bayes's performance was the worst among the classifiers. We argue that the reason for this less accurate performance is the assumption that all features are independent. The authors of [10] carry out an extensive empirical analysis of feature selection for text classification and observe SVM to be the superior classifier, which indirectly supports our claim of naïve Bayes's poor performance.
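As an illustration of the naïve Bayes decision rule, the following minimal Python sketch scores each class by the sum of the log prior and the log conditional probabilities of the words in a document, and returns the class with the highest score. Working in log space and using add-one (Laplace) smoothing with the parameter `alpha` are assumptions made for this sketch, as are the function names and the toy documents; none of them are taken from the experiments reported in this paper.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(x_i | c) from tokenized documents."""
    vocab = {word for doc in docs for word in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[c][word] = count in class c
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Assumption: add-alpha smoothing so unseen words do not zero the product.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return argmax_c [log P(c) + sum_i log P(x_i | c)]."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy, made-up training data for illustration only.
docs = [["cheap", "pills", "buy"], ["meeting", "agenda", "notes"],
        ["buy", "cheap", "watches"], ["project", "meeting", "today"]]
labels = ["spam", "ham", "spam", "ham"]
log_prior, log_likelihood = train_naive_bayes(docs, labels)
print(predict(["cheap", "watches", "today"], log_prior, log_likelihood))
```

Every word contributes an independent factor to the score, which is exactly where strongly correlated words get counted more than once.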
One of the popular methods to represent a document is the bag of words (BoW) or vector space model using a term document matrix, where each document is represented by the words present in it (the order of the words is ignored). The entries are usually weights obtained after some preliminary transformations rather than raw counts. One such weighting scheme uses both the term frequency and the inverse document frequency (tf-idf) [13], which balances the number of occurrences of a word in a particular document against the novelty of that word. In its commonly used form,

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}, \tag{6}$$

where $\mathrm{tf}(t, d)$ is the number of occurrences of term $t$ in document $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $N$ is the total number of documents. So each word is a feature of the documents, and the weights described by (6) are the values of that feature for a particular document. Using our proposed method, we want to modify (5) as follows:

$$c_{\mathrm{NB}} = \arg\max_{c \in C} P(c) \prod_{i=1}^{m} P(x'_i \mid c), \tag{7}$$

where $x'_1, \ldots, x'_m$ are the selected features, with $m \ll n$ (the original number of features), chosen in a manner such that they are less dependent on each other. A survey on improving Bayesian classifiers [14] lists (a) feature selection, (b) structure extension, (c) local learning, and (d) data expansion as the four principal methods for improving naïve Bayes. We focus on feature selection in our proposition. The attribute independence assumption can be overcome if we use a Bayesian network; however, learning an optimal Bayesian network is an NP-hard problem [15].
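A term document matrix with tf-idf weights can be sketched in a few lines of Python. The sketch assumes the common variant in which the weight of a term in a document is its raw count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term; other variants exist, and the exact one used in the experiments is not restated here. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows) with rows[j][i] = tf(t_i, d_j) * log(N / df(t_i))."""
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # Document frequency: number of documents in which each word appears.
    df = Counter(w for doc in docs for w in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        rows.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, rows


# Toy, made-up documents for illustration only.
docs = [["naive", "bayes", "text"],
        ["text", "text", "mining"],
        ["bayes", "network"]]
vocab, rows = tfidf_matrix(docs)
for doc_weights in rows:
    print([round(w, 3) for w in doc_weights])
```

Each row is one document and each column one word, so a column read top to bottom is the representation of a word by its occurrence across documents.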
In [16], the authors have proposed an improvement for naïve Bayes classification using a method named the auxiliary feature method. The idea is to find an auxiliary feature for each independent feature such that the auxiliary feature increases the separability of the class probabilities more than the current feature does. As the auxiliary feature needs to be determined for all features, this method has high computational complexity. In [8], the authors propose a novel method of improving naïve Bayes by multiplying each conditional probability by a factor, which can be represented by chi-squared or mutual information. Reference [17] proposes a word-distribution-based clustering using mutual information, which weighs the conditional probabilities by the mutual information content of the particular word with respect to the class. Our proposed method has an advantage in that, in the first step, we reduce the feature set using a simple univariate filter before applying clustering.

Feature Selection
Feature selection is one of the most important data preprocessing steps in data mining and knowledge engineering. Let us say we are interested in a task ($T$), which is finding employees prone to attrition. Each employee is represented by various attributes/features ($F$) like their age, designation, marital status, average working hours, average number of leaves taken, take-home salary, last ratings, last increments, number of awards received, number of hours spent in training, time from the last promotion, and so forth. In feature selection, the idea is to select the best few features ($F'$) from the above, so that we perform equivalently on the task ($T$) in terms of some evaluation measure ($M$). (Generally $|F'| \ll |F|$.) For a classification task, this can be a standard evaluation measure like classification accuracy or F-score; for clustering, it can be an internal measure like silhouette width or an external measure like purity.
Feature selection offers the following three advantages: (i) better model understandability and visualization: it might not be possible to reduce to a two-dimensional or a three-dimensional feature set, but even if we want to visualize combinations of two or three features, there will be far fewer combinations in the reduced feature space; (ii) generalization of the model and reduction of overfitting: as a result, better learning accuracy is achieved; (iii) efficiency in terms of time and space complexity, for both training and execution.
Feature selection approaches can be broadly classified as filter, wrapper, and embedded.
Filter Approach. This is the most generic of all the approaches and works irrespective of the data mining algorithm that is being used. It typically employs measures like correlation, entropy, mutual information, and so forth, which analyze general characteristics of the data to select an optimal feature set. It is much simpler and faster to build than embedded and wrapper approaches; as a result, it is more popular with both academics and industry practitioners. However, it is to be noted that wrapper and embedded methods often outperform filters in real data scenarios.
Embedded Approach. Feature selection is a part of the objective function of the algorithm itself. Examples are decision tree, LASSO, LARS, 1-norm support vector machine, and so forth.
Wrapper Approach. In this method, the wrapper is built considering the data mining algorithm as a black box. All combinations of the feature sets are tested exhaustively for the target data mining algorithm, and a measure like classification accuracy is typically used to select the best feature set. Because of this "brute force" approach, these methods tend to be computationally expensive.
In terms of output, a method can produce either a set of ranked features or an optimal subset of features. We can also classify the approaches as either univariate or multivariate. In the univariate class, all features are treated individually and ranked (some of the popular metrics are information gain, chi-square, and Pearson correlation coefficient). In the multivariate class, features are evaluated together as subsets.
Correlation feature selection (CFS) is a very popular example of such multivariate techniques [18]:

$$\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}},$$

where $\mathrm{Merit}_S$ indicates the worth of a feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ is the average correlation between the features and the target variable, and $\overline{r_{ff}}$ is the average intercorrelation between the features. The subset with the highest $\mathrm{Merit}_S$ is selected. Both filter and wrapper methods can employ various search strategies. As exhaustive search is computationally complex, variants like greedy (both sequential backward and forward) search, genetic search, hill climbing, and so forth are used for better computational efficiency. Our proposed algorithm based on feature clustering provides better computational complexity for the following reasons.
(1) It does not follow the wrapper method, so a large number of feature combinations do not need to be enumerated.
(3) We effectively consider both the univariate and multivariate nature of the data.
(4) There is no additional computation required as the term document matrix is invariably required for most of the text classification tasks.
Reference [20] proposes a clustering based feature selection, and we would like to highlight the following differences with the method we have proposed. Firstly, we employ a partition based clustering instead of a hierarchical one; secondly, even in case of clustering, we limit ourselves to a much pruned dataset as in the first step we are retaining only the most relevant ones. The authors use maximal information compression index (MICI) as defined in [19] to measure the similarity of the features which is also an additional computational step. For finding the prototype feature, average distance from all the features in the cluster is taken, where other simpler versions could have been applied.
In [19], the authors define a measure of linear dependency, the maximal information compression index ($\lambda_2$), as the smallest eigenvalue of the covariance matrix $\Sigma$ of the features. The value of $\lambda_2$ is zero when the features are linearly dependent and increases as the amount of dependency decreases. Here $\Lambda$ is the $(p \times p)$ diagonal matrix whose diagonal entries hold the eigenvalues of $\Sigma$, where $p$ is the number of features:

$$\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p).$$
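For a pair of features, the maximal information compression index can be computed in closed form, since the covariance matrix is only 2 x 2. The Python sketch below is an illustration under stated assumptions: it uses the population covariance (dividing by the number of observations), and the function name `mici` and the toy vectors are hypothetical.

```python
import math


def mici(x, y):
    """Smallest eigenvalue of the 2x2 covariance matrix of features x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Assumption: population (1/n) variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    trace = var_x + var_y
    det = var_x * var_y - cov_xy ** 2
    # Eigenvalues of a symmetric 2x2 matrix: (trace +/- sqrt(trace^2 - 4*det)) / 2.
    disc = max(trace ** 2 - 4.0 * det, 0.0)  # guard against tiny negative round-off
    return (trace - math.sqrt(disc)) / 2.0


# y is an exact multiple of x, so the two features are linearly dependent.
print(mici([1, 2, 3, 4], [2, 4, 6, 8]))
# These two features are uncorrelated, so the index is strictly positive.
print(mici([1, -1, 1, -1], [1, 1, -1, -1]))
```

The index is close to zero for the linearly dependent pair and grows as the dependency weakens, which is the behaviour described above.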
We have also added an empirical comparison between FS-CHICLUST and wrapper with greedy search and multivariate filter search using CFS in Table 9, in Section 6.

Our Proposition
Naïve Bayes is one of the simplest and hence one of the most widely used classifiers. However, it often does not produce results comparable with other classifiers because of the "naïve" assumption, that is, that attributes are independent of each other. The performance of naïve Bayes deteriorates further in the text classification domain because of the higher number of features. Our proposed method works on the term document matrix [13]. We first select the important words based on the chi-squared value, that is, we select only those words which have a value higher than a threshold. We have taken this threshold as "0" in our experimental study. The chi-squared statistic is detailed below.

Chi-Squared ($\chi^2$) Statistic.
Chi-squared is generally used to measure the lack of independence between $t$ and $c$ (where $t$ is for term and $c$ is for class or category) and is compared to the $\chi^2$ distribution with one degree of freedom. The expression for the $\chi^2$ statistic is defined as

$$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)},$$

where $N$ is the total number of documents, $A$ is the number of documents of class $c$ containing term $t$, $B$ is the number of documents containing $t$ that do not belong to class $c$, $C$ is the number of documents of class $c$ that do not contain $t$, and $D$ is the number of documents of other classes that do not contain $t$.
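The statistic can be computed directly from the four document counts of the term/class contingency table. The Python sketch below assumes the standard two-by-two form of the chi-squared statistic for term selection; the argument names `a`, `b`, `c`, and `d` are labels chosen for this illustration: documents of the class containing the term, documents of other classes containing the term, documents of the class without the term, and documents of other classes without the term.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic of a term/class 2x2 contingency table.

    a: documents of the class that contain the term
    b: documents of other classes that contain the term
    c: documents of the class that do not contain the term
    d: documents of other classes that do not contain the term
    """
    n = a + b + c + d  # total number of documents
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# A term that appears in every document of the class and nowhere else.
print(chi_squared(10, 0, 0, 10))
# A term spread evenly across classes carries no information about the class.
print(chi_squared(5, 5, 5, 5))
```

A value of zero means the term and the class look independent, which is why a threshold of zero already discards the least informative words.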
Next, we take the selected words and represent them by their occurrence in the term document matrix. So, if there are three documents $d_1$, $d_2$, and $d_3$ and four words $w_1$, $w_2$, $w_3$, and $w_4$, then the term document matrix is represented as shown in Table 2. We argue that the individual features, that is, the words, can be represented by their occurrence in the documents, so $w_1$ can be represented as a vector {1.1, 2.3, 1.1}. If, in this representation, two words have a small distance between them, then they are similar to each other. The weighting scheme is tf-idf as explained in Section 2.
Finally, we apply $k$-means clustering, which is one of the simplest and most popular clustering algorithms. One of the inputs $k$-means expects is the value of $k$, that is, the number of clusters.
The optimal number of clusters is one of the open questions in clustering [21]. For our present setup, we start with the square root of $n$ (the number of features in the reduced set of step 1, using chi-squared) as per [22] and proceed up to $n/2$. As indicated in [20], a feature clustering method may need a few iterations to come to an optimal or near optimal number of features, but these are far fewer than in a search based approach using a filter or wrapper method.
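The range of cluster counts to try can be enumerated as below. This is only a sketch: the text fixes the start (square root of the reduced feature count) and the end (half of it), while the step between candidates, the lower bound of two clusters, and the function name are assumptions made here.

```python
import math


def candidate_cluster_counts(n_features, step=1):
    """Candidate values of k from sqrt(n_features) up to n_features / 2."""
    start = max(2, round(math.sqrt(n_features)))  # assumption: at least 2 clusters
    stop = max(start, n_features // 2)
    return list(range(start, stop + 1, step))


# With 100 features left after the chi-squared step, try k = 10, 20, ..., 50.
print(candidate_cluster_counts(100, step=10))
```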
The algorithm is described below. It accepts three parameters and produces the reduced feature set as the output. Thresh is taken as "0" in the current case; it can also be determined from the data, for example as the 10th percentile of the chi-squared values.
Step 1. We apply the feature selection technique based on chi-squared on the entire term document matrix to compute the chi-squared (CH) value corresponding to each word.
Step 2. We select only those words that have a CH value greater than thresh.
Step 3. We form a new term document matrix which consists of only those important words as selected in Step 2.
Step 4. We transpose this new term document matrix, so that each row represents a word.
Step 5. We create "nc" clusters on the transposed matrix.
Step 6. We select the most representative word from each cluster, which is the one closest to the cluster center, and add these words one by one to the reduced feature set, so that its final size equals nc.
Step 7. To find the most representative word, the Euclidean norm between each point in a cluster and the cluster center is calculated. The point nearest to the center is selected.
Note on the datasets (Table 3): the datasets can be mostly found at [11,12]. Data for three classes have been used for Reuters.
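The steps of the algorithm can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions, not the implementation used in the experiments: the chi-squared value of each word (Step 1) is passed in as a precomputed list, the clustering is a basic Lloyd's $k$-means with random initial centers, and all function names, parameter names, and the toy matrix are hypothetical.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd's k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def fs_chiclust(tdm, vocab, chi_scores, thresh, nc, seed=0):
    """tdm: rows are documents, columns follow vocab; chi_scores: one per word."""
    # Step 2: keep only the words whose chi-squared value exceeds thresh.
    keep = [i for i, score in enumerate(chi_scores) if score > thresh]
    # Steps 3-4: reduced matrix, transposed so that each row is a word.
    word_vectors = [[row[i] for row in tdm] for i in keep]
    # Step 5: cluster the words.
    centers, labels = kmeans(word_vectors, nc, seed=seed)
    # Steps 6-7: from each cluster, select the word nearest to the center.
    selected = []
    for j in range(nc):
        members = [idx for idx, lab in enumerate(labels) if lab == j]
        if not members:
            continue
        best = min(members, key=lambda idx: euclidean(word_vectors[idx], centers[j]))
        selected.append(vocab[keep[best]])
    return selected


# Toy, made-up data: 4 documents, 5 words, hypothetical chi-squared values.
vocab = ["goal", "match", "vote", "election", "the"]
tdm = [[2.0, 1.8, 0.0, 0.0, 1.0],
       [1.9, 2.1, 0.0, 0.0, 1.0],
       [0.0, 0.0, 2.2, 2.0, 1.0],
       [0.0, 0.0, 1.8, 2.1, 1.0]]
chi_scores = [3.5, 3.1, 3.8, 3.3, 0.0]
print(fs_chiclust(tdm, vocab, chi_scores, thresh=0.0, nc=2))
```

In the toy data, "goal" and "match" occur in the same documents, as do "vote" and "election", so each pair collapses into one cluster and one representative word; "the" is dropped by the threshold before clustering.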

Experimental Setup
This section describes the setup of the experiment. It covers the datasets that are used, the preprocessing techniques that were applied, the software tools and packages that are used, and the hardware and software details of the machine on which the experiment was carried out.

Dataset Information.
The detailed information of the datasets used in our experimental setup has been summarized in Table 3.

5.2. Methodology. The basic steps followed for the experiment are described below for reproducibility of the results.
(V) The term document matrix is split into two subsets: 70% of the term document matrix is used for training, and the remaining 30% is used for testing classification accuracy [22].
(VI) The so-produced term document matrix is used for our experimental study.
(VII) We compare the results with other standard classifiers like decision tree (DT), SVM, and kNN.
(VIII) We also compare the execution time taken by FS-CHICLUST with other approaches like wrapper with greedy search and multivariate filter based search using CFS.
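The 70%/30% split of the term document matrix can be sketched as below. Whether the split in the experiments was random, stratified, or seeded is not stated, so the random shuffle, the fixed seed, and the function name are assumptions made for this illustration.

```python
import random


def train_test_split_rows(rows, labels, train_fraction=0.7, seed=42):
    """Shuffle document indices and split rows/labels into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # assumption: simple random split
    cut = int(round(train_fraction * len(rows)))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])


# Toy example: 10 documents with one feature each.
rows = [[float(i)] for i in range(10)]
labels = ["a" if i < 5 else "b" for i in range(10)]
x_train, y_train, x_test, y_test = train_test_split_rows(rows, labels)
print(len(x_train), len(x_test))
```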

Results and Analysis
6.1. Results. We have used classification accuracy, which is a measure of how well a document is classified into its appropriate class. It is simply the percentage of correctly classified documents, that is, 100 × (number of correctly classified documents) / (total number of documents). All the classification accuracies have been computed on the testing dataset. We present the following evaluation and comparison, respectively.
(i) Classification accuracy on the test dataset using (a) naïve Bayes, (b) chi-squared with naïve Bayes, and (c) FS-CHICLUST with naïve Bayes is computed. The result is summarized in Table 4 and Figure 1.
(ii) Using FS-CHICLUST, we can significantly reduce the feature space. The total number of features and the reduced number of features using (a) chi-squared and (b) FS-CHICLUST are displayed in Table 5.
In Table 6, we summarize % reduction of feature set and the % improvement of classification accuracy over all the datasets between simple naïve Bayes and FS-CHICLUST with naïve Bayes.
(iii) We compare the results of FS-CHICLUST with naïve Bayes against other classifiers like kNN, SVM, and decision tree (DT); FS-CHICLUST makes naïve Bayes (NB) comparable with other classifiers. The results are summarized in Table 7, and the classifier accuracy is also displayed as a line chart in Figure 2.
Comparing mean ranks, we see that our method has a better mean rank than the other four methods, and the mean ranks for all the methods are summarized in Table 8.
The $p$ value is very small, so the null hypothesis that the difference in ranks is not significant is rejected, and we can conclude that FS-CHICLUST has significantly better performance than the other classifiers.
(iv) We compare the execution time of FS-CHICLUST with other approaches like (a) wrapper with greedy search (forward) and (b) multivariate filter using CFS (using the best first search).
The results are shown in Tables 9(a) and 9(b), respectively.
So the difference is indeed significant.

Considerable Reduction in Feature Space.
On one hand, we have a significant improvement in terms of classification accuracy; on the other hand, we could further reduce the number of features beyond what univariate chi-square achieves. What we observe is that at a significance level of 0.05 there is a significant reduction in our proposed method, compared to the reduction achieved through chi-square alone (  (Table 8), and the nonparametric Friedman rank sum test corroborates the statistical significance.
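The mean ranks and the Friedman statistic used in this kind of comparison can be computed as in the Python sketch below. It is an illustration only: the accuracy values are made up and are not the results of this paper, rank 1 is given to the highest accuracy, ties share the average rank, and the statistic is the basic form $\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$ without a tie correction, where $N$ is the number of datasets, $k$ the number of methods, and $R_j$ the mean rank of method $j$. The function names are hypothetical.

```python
def average_ranks(scores):
    """Rank one dataset's scores, 1 = highest; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # average of the tied rank positions
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def friedman_statistic(score_table):
    """score_table[d][j] = accuracy of method j on dataset d."""
    n, k = len(score_table), len(score_table[0])
    rank_rows = [average_ranks(row) for row in score_table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2_f = 12.0 * n / (k * (k + 1)) * (
        sum(r ** 2 for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return mean_ranks, chi2_f


# Made-up accuracies for 3 methods on 4 datasets (illustration only).
toy_scores = [[0.91, 0.85, 0.80],
              [0.88, 0.84, 0.79],
              [0.93, 0.90, 0.86],
              [0.87, 0.82, 0.81]]
mean_ranks, chi2_f = friedman_statistic(toy_scores)
print(mean_ranks, chi2_f)
```

The statistic is compared against the chi-squared distribution with $k - 1$ degrees of freedom to obtain the $p$ value.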

Comparison with Other Feature Selection Methods.
We have compared the execution time and classification accuracy with the greedy forward search based wrapper method (Table 9(a)) and the CFS based multivariate filter method which employs best first search (Table 9(b)). Our proposed method gives much better results on both execution time and classification accuracy. Table 10 gives the $p$ values corresponding to the comparison with greedy based wrapper search and CFS. This shows that there is a significant difference between the results.

Conclusion
Our previous study and works of other authors show naïve Bayes to be an inferior classifier especially for text classification. We have proposed a novel two-step feature selection algorithm which can be used in conjunction with naïve Bayes to improve the performance. We have evaluated our algorithm FS-CHICLUST over thirteen datasets and did extensive comparisons with other classifiers and also with other feature selection methods like greedy based wrapper, CFS, and so forth. Below is the summary of our findings.
(i) FS-CHICLUST is successful in improving the performance of naïve Bayes. The improvement in performance is statistically significant.
(ii) FS-CHICLUST not only improves performance but also achieves it with a further reduced feature set. The reduction compared to univariate chi-square is statistically significant.
(iii) Naïve Bayes combined with FS-CHICLUST gives performance superior to other standard classifiers like SVM, decision tree, and kNN.
(iv) Naïve Bayes combined with FS-CHICLUST gives better classification accuracy and takes less execution time than other standard methods like greedy search based wrapper and CFS based filter approach.
So FS-CHICLUST will improve naïve Bayes's performance for text classification and make this simple-to-implement, intuitive classifier suitable for the task of text classification.
We have used $k$-means clustering, which is the simplest among the clustering algorithms, for feature clustering; we can extend this work by employing other advanced clustering techniques. The work can also be extended to further limit the choice of the number of clusters and to use other text representation schemes like topic clustering.