In the era of Big Data, with textual data growing rapidly, text classification has become one of the key techniques for handling and organizing text data, and feature selection is one of its most important steps. In order to choose a subset of the available features by eliminating features that are unnecessary for the classification task, a novel text categorization algorithm called chaos genetic feature selection optimization is proposed. The proposed algorithm selects near-optimal feature subsets, draws on both empirical and theoretical work in machine learning, and provides a general framework for text categorization. Experimental results show that the proposed algorithm simplifies the feature selection process effectively and obtains higher classification accuracy with a smaller feature set.
In the era of Big Data, with textual data growing rapidly, feature selection (FS), also called attribute selection, is very important for organizing such data. Feature selection is a key step in automatic text categorization and machine learning systems, which automatically assign documents to a set of predefined classes based on their textual content. Feature selection is commonly used to deal with a high-dimensional feature space; its main objective is to simplify a dataset by reducing its dimensionality and identifying the relevant underlying features. In practical machine learning applications, the number of features is usually very large, and many of them are irrelevant or interdependent, which easily leads to the following consequences. Firstly, the time consumed by feature analysis and model training grows as the number of features increases. Secondly, a large number of features easily leads to the “curse of dimensionality” and makes the model more complicated. Feature selection has been widely applied to various fields including text categorization [
Given a feature set
In order to choose a subset of the available features by eliminating features that are unnecessary for the categorization task, this paper combines FS methods with machine learning techniques and proposes a novel heuristic feature selection algorithm called chaos genetic feature selection optimization (CGFSO). Chaos is a universal phenomenon in many nonlinear systems; it exhibits sensitive dependence on initial conditions and contains infinitely many unstable periodic motions [
The rest of this paper is organized as follows. Section
In this section we focus our discussion on prior research on feature selection. Many researchers have made great contributions to feature selection, in both empirical and theoretical work, that bear directly on the text categorization problem.
In order to achieve minimum classification error, Kanan and Faez [
The genetic algorithm (GA) is a parallel heuristic search method and a popular technique for nonlinear optimization problems. Owing to these advantages, GA has been widely used as an effective tool for FS in text categorization. Zhu et al. [
In this section, we focus our discussion on algorithms that explicitly attempt to select an optimal feature subset. Obtaining an optimal feature subset is usually difficult and has been proven to be NP-hard. Therefore, many heuristic algorithms have been used to perform feature selection on the training data, including genetic algorithms, neural networks, and simulated annealing. To avoid the combinatorial search for an optimal subset of m features, one of the most popular feature selection approaches is the genetic algorithm, which generally provides a suboptimal solution.
Although GA has a powerful global search capability, in practical applications it is prone to premature convergence and has low search efficiency in the late evolutionary period [
COA is a novel approach of global optimization that has attracted widespread attention in recent years. In the COA, the well-known logistic map is normally described as follows:
The basic process of the chaos optimization algorithm generally includes two major steps. Firstly, define a chaotic sequence generator based on the logistic map; generate a sequence of chaotic points and map it to a sequence of design points in the original design space. Because COA depends very sensitively on its initial condition and parameter, chaotic sequences have been adopted in place of random sequences and have yielded good results in many applications. The objective function is then evaluated at the generated design points, and the point with the minimum objective value is taken as the current optimum. Secondly, after a certain number of iterations the current optimum is assumed to be close to the global optimum; it is treated as the reference point, a small chaotic perturbation is added, and the descent direction is explored along the coordinate axes in turn. The two steps are repeated until a specified convergence criterion is satisfied, at which point the global optimum is obtained. However, further numerical simulations showed that the method is effective only in a small design space.
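The first step above can be illustrated with a short sketch: a logistic-map chaotic sequence generator and a linear mapping of the chaotic variables into the design space. The initial value and design-space bounds used here are illustrative.

```python
import numpy as np

def logistic_map_sequence(x0, n, mu=4.0):
    """Generate a chaotic sequence x_{k+1} = mu * x_k * (1 - x_k).

    With mu = 4 and x0 in (0, 1) away from the periodic points,
    the iterates are chaotic and fill the interval (0, 1).
    """
    xs = np.empty(n)
    x = x0
    for i in range(n):
        x = mu * x * (1.0 - x)
        xs[i] = x
    return xs

def map_to_design_space(xs, lower, upper):
    """Map chaotic variables in (0, 1) to design points in [lower, upper]."""
    return lower + xs * (upper - lower)
```

The sensitive dependence on initial conditions means two sequences started from nearly identical values decorrelate after a few dozen iterations, which is what makes chaotic sequences a useful substitute for random ones.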
Generally, a text categorization system consists of several essential parts including feature extraction and feature selection [
In CGFSO algorithm, each individual in the population represents a candidate solution to the feature selection problem [
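A common encoding for this kind of candidate solution, assumed here for illustration, is a binary mask over the vocabulary: bit j set to 1 means feature j is selected.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_individual(n_features, p_select=0.5):
    """A candidate solution: bit j = 1 means feature j is selected."""
    return (rng.random(n_features) < p_select).astype(np.int8)

def selected_features(individual, vocabulary):
    """Decode a chromosome back to the feature names it keeps."""
    return [w for w, bit in zip(vocabulary, individual) if bit]
```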
The solution quality, in terms of classification accuracy, is evaluated by classifying the training data sets using the selected features. Classification accuracy and feature cost are the two key factors used to design the fitness function; test accuracy measures the number of examples that are correctly classified. Thus, an individual with high classification accuracy and low total feature cost receives a high fitness value, and an individual with a high fitness value has a high probability of being selected into the next generation. In other words, a solution that achieves higher accuracy with fewer features obtains a greater fitness value. The fitness function can therefore be defined as follows:
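Since the exact formula is not reproduced here, the following is a minimal sketch of such a fitness function, assuming a weighted linear trade-off between accuracy and normalised feature cost; the weights `w_acc` and `w_cost` are illustrative assumptions, not the paper's values.

```python
import numpy as np

def fitness(chromosome, accuracy, feature_costs, w_acc=0.8, w_cost=0.2):
    """Fitness of one individual (binary feature mask).

    accuracy      : classification accuracy obtained with the selected features
    feature_costs : per-feature cost vector
    Rewards high accuracy and penalises the (normalised) cost of the
    selected features, so fewer, cheaper features score higher.
    """
    mask = np.asarray(chromosome, dtype=bool)
    total_cost = feature_costs[mask].sum() / max(feature_costs.sum(), 1e-12)
    return w_acc * accuracy - w_cost * total_cost
```

Under this sketch, two individuals with the same accuracy are ranked by how few features they use, which matches the selection behaviour described above.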
The main steps of the CGFSO algorithm can be summarized as follows.
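As a hypothetical sketch of those steps, the loop below combines a standard binary GA (roulette-wheel selection, single-point crossover) with mutation driven by a logistic-map chaotic sequence instead of a uniform random source; all parameter values and operator choices are illustrative, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def chaotic(x):
    """One logistic-map step with mu = 4."""
    return 4.0 * x * (1.0 - x)

def cgfso(n_features, fitness_fn, pop_size=20, generations=50,
          p_cross=0.7, p_mut=0.2):
    """Sketch of a chaos-genetic feature selection loop over binary masks."""
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    chaos = rng.uniform(0.01, 0.99)          # chaotic state
    best, best_fit = None, -np.inf
    for _ in range(generations):
        fits = np.array([fitness_fn(ind) for ind in pop])
        if fits.max() > best_fit:
            best_fit = fits.max()
            best = pop[fits.argmax()].copy()
        # roulette-wheel selection (shift fitness to be positive)
        probs = fits - fits.min() + 1e-9
        probs /= probs.sum()
        idx = rng.choice(pop_size, size=pop_size, p=probs)
        pop = pop[idx]
        # single-point crossover on consecutive pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = rng.integers(1, n_features)
                pop[i, cut:], pop[i + 1, cut:] = \
                    pop[i + 1, cut:].copy(), pop[i, cut:].copy()
        # chaotic mutation: flip bit j when the chaotic variable < p_mut
        for i in range(pop_size):
            for j in range(n_features):
                chaos = chaotic(chaos)
                if chaos < p_mut:
                    pop[i, j] ^= 1
    return best, best_fit
```

For example, with a toy fitness that rewards matching a known target mask, the loop recovers that mask within a few generations.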
In this section, a series of simulation experiments was conducted to show the effectiveness and superiority of the CGFSO algorithm on text categorization problems. To provide a baseline for the accuracy of the classifiers, the Reuters collection was used in our experiments. We use Reuters-21567, with 5213 documents in the training set and 2016 documents in the test set, and adopt the top ten classes. The experimental platform is a Dell computer with a Xeon 3.06 GHz CPU (24P8122) and 2 GB of RAM. We implement the proposed CGFSO algorithm and two other FS methods, GA and SVM. The parameters of CGFSO and GA are set as follows: the population size is 100, the maximum number of generations is 500, the crossover probability is 0.7, and the mutation probability is 0.2. Since the experimental results depend on the populations randomly generated by the CGFSO and GA algorithms, we performed 20 simulations on each data set.
In most text categorization tasks, the performance of the feature selection technique is particularly important. Several measures, such as precision and recall, are often used to evaluate the performance of a feature selection algorithm. Precision is defined as the ratio of correctly predicted topic cases to the total predicted topic cases, and recall as the ratio of correctly predicted topic cases to the total actual topic cases. Precision and recall are defined as follows.
Assume that
Assume that
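The two definitions above can be computed directly from the sets of predicted and actual documents for a class; the helper below is a minimal sketch using document-id sets.

```python
def precision_recall(predicted, actual):
    """Per-class precision and recall from sets of document ids.

    precision = |predicted ∩ actual| / |predicted|
    recall    = |predicted ∩ actual| / |actual|
    """
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall
```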
To analyse the performance of the feature selection algorithms, we present the results obtained using the proposed approach. Figures
The precision of the three feature selection algorithms.
The recall of the three feature selection algorithms.
The fitness value of the three feature selection algorithms.
The precision of algorithms with different number of features.
The recall of algorithms with different number of features.
From the experimental results in Figure
From the experimental results in Figure
It can be seen from the experimental results that the CGFSO learning process effectively and efficiently reduces the complexity of the system at the feature selection stage.
In the era of Big Data, with textual data growing rapidly, text classification has become a key way to process and organize text data. To this end, we designed a new text classification algorithm based on the genetic algorithm and the chaos optimization algorithm. The experimental results show that CGFSO yields the best results of the three methods. The experiments also demonstrate that CGFSO yields better accuracy even on a large data set, since it achieved better performance with a lower number of features. In the future, we will design new heuristic feature selection algorithms, apply them to the text classification field, and conduct experiments with other kinds of datasets.
This work is partially supported by the National Science Foundation of China under Grant nos. 61370226 and 61272546.