Feature Selection with Neighborhood Entropy-Based Cooperative Game Theory

Feature selection plays an important role in machine learning and data mining. In recent years, various feature measures have been proposed to select significant features from high-dimensional datasets. However, most traditional feature selection methods ignore features that have strong classification ability as a group but are weak as individuals. To deal with this problem, we redefine the redundancy, interdependence, and independence of features by using neighborhood entropy. The neighborhood entropy-based feature contribution is then proposed under the framework of cooperative game theory. The evaluation criterion for a feature is formalized as the product of its contribution and a classical feature measure. Finally, the proposed method is tested on several UCI datasets. The results show that the neighborhood entropy-based cooperative game theory model (NECGT) yields better performance than the classical methods.


Introduction
With the development of information acquisition, more and more high-dimensional data need to be processed in real-world applications [1,2]. However, some of the features in large datasets are irrelevant or redundant, which leads to low efficiency and overfitting in classification algorithms. Identifying the most characterizing features [3][4][5] is critical to reducing the classification error and increasing the classifier's computation speed. Thus, feature selection, a common data preprocessing technique for classification algorithms, has attracted much attention in recent years [2].
To date, several information-theoretic selectors have been employed in feature selection, such as mutual information (MI) [6], rough set (RS) [1,7], and mRMR [2]. The main idea of these methods is to find the features that are significant for classification by calculating the significance of each individual feature. In [8,9], the authors pointed out that the classical methods ignore features which as a group have strong discriminatory power but are weak as individuals. As a result, those traditional methods are unable to deal with some practical problems, such as feature co-occurrences [10].
To address this problem, Guyon and Elisseeff [11] constructed an example to illustrate that two variables which are useless by themselves can be useful together. Consequently, evaluating the correlation between features is another important aspect besides estimating the classification ability of each individual feature. Cohen et al. proposed the concept of feature contribution within a feature subset to describe the correlation of features via cooperative game theory [12]. One drawback of that method is that the selected features generalize poorly to other classifiers, because they are tightly coupled with the specified learning algorithm. Sun et al. [8,9] used Shannon's entropy to define independent, redundant, and interdependent features. The interdependent features are then used for calculating the feature contribution under the framework of cooperative game theory. The resulting method [8,9] is universal in that it can be used in conjunction with many traditional feature selection methods. Unfortunately, Shannon's entropy is only suitable for nominal data, such as male or female and good or bad. If the attributes are numerical or set-valued, researchers generally adopt a discretization technique to transform the nonnominal data into nominal data, which inevitably causes a loss of information [13].

Computational Intelligence and Neuroscience
Likewise, numerical methods are ill-suited to categorical attributes: it is unreasonable to measure similarity or dissimilarity between categorical values with the Euclidean distance.
In summary, feature selection still needs improvement because of the following problems.
(1) Most traditional methods, such as MI, RS, and mRMR, ignore interdependent features that appear to have no direct effect on the decision.
(2) Although CoFS [9] can be used for mining the interdependent features, this kind of Shannon's entropy-based method inevitably loses information during discretization. The resulting distortion of the datasets can lead to computational deviation.
Hu et al. investigated neighborhood entropy to avoid data discretization [6], which is a breakthrough for this problem. Based on the theory of cooperative games and the concept of neighborhood entropy, the contributions of this paper are as follows: (1) we redefine the redundancy, interdependence, and independence of features by using neighborhood entropy to avoid discretization; (2) the neighborhood entropy-based feature contribution is presented to handle the feature selection problem under the framework of cooperative game theory; and (3) the proposed method is tested on the UCI datasets. The results show that the neighborhood entropy-based cooperative game theory model (NECGT) yields better performance than the classical methods.
The paper is organized as follows: in Section 2, some basic concepts about feature selection, neighborhood, and game theory are briefly reviewed. In Section 3, the NECGT model is investigated in detail. Section 4 shows the application of NECGT for feature evaluating and feature selection. Numeric experiments are reported in Section 5. Finally, Section 6 concludes the paper.

Preliminaries
In this section, the formalism of feature selection is presented. The common concepts about neighborhood entropy and cooperative game theory are introduced.

Feature Selection
Definition 1. Knowledge representation is realized via the information system (IS), which has a tabular form similar to a database. An information system is $\mathrm{IS} = (U, C, D)$, where $U = \{x_1, x_2, \ldots, x_n\}$ is a nonempty finite set of objects, $C$ is a nonempty finite set of conditional attributes, and $D$ is the decision attribute which represents the target classes. The goal of feature selection is to find the minimum subset of $C$ that optimizes the performance of the machine learning algorithm.

Neighborhood Entropy-Based Measurements.
Evaluating relevance between features (attributes, variables) is an important task in pattern recognition and machine learning.
Shannon's entropy and mutual information provide intuitive tools to measure the uncertainty of random variables and the information shared by two different features in discrete spaces. However, computing the relevance between numerical features with mutual information is limited by the loss of information in the process of discretization. In [6], the authors integrate the concept of neighborhood into Shannon's information theory and propose a new information measure, called neighborhood entropy.

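Definition 2. For all $x_i \in U$ and $\delta \ge 0$, we say $\delta(x_i)$ is a neighborhood of $x_i$ whose centre is $x_i$ and radius is $\delta$, where

$$\delta(x_i) = \{x_j \mid x_j \in U,\ \Delta(x_i, x_j) \le \delta\}.$$

Definition 3. Let $\mathrm{IS} = (U, C, D)$ be an information system, and let $\Delta$ be a given distance function. We say $(U, \Delta)$ is a neighborhood approximation space when the following conditions are met: [...] where [...] and $\mathrm{num}(x_i, x_j) = |x_i - x_j|$.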
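To make the neighborhood of a sample concrete, the following is a minimal sketch under stated assumptions: the distance $\Delta$ is taken to be the max-norm (Chebyshev) distance over the chosen features, which is an illustrative choice rather than the one prescribed by the paper, and the names `chebyshev` and `neighborhood` as well as the toy samples are hypothetical.

```python
def chebyshev(x, y, features):
    """Max-norm distance restricted to the given feature indices (assumed Delta)."""
    return max((abs(x[f] - y[f]) for f in features), default=0.0)


def neighborhood(samples, i, features, delta):
    """Indices j of all samples x_j with Delta(x_i, x_j) <= delta."""
    return [j for j, x in enumerate(samples)
            if chebyshev(samples[i], x, features) <= delta]


# Toy data: four samples described by two numerical features.
samples = [(0.10, 0.20), (0.15, 0.25), (0.80, 0.90), (0.12, 0.70)]

# The neighborhood of sample 0 shrinks when a second feature is added,
# because sample 3 is close on feature 0 but far on feature 1.
near_on_first = neighborhood(samples, 0, features=[0], delta=0.15)
near_on_both = neighborhood(samples, 0, features=[0, 1], delta=0.15)
print(near_on_first, near_on_both)
```

No discretization is needed here: the radius $\delta$ plays the role that bin boundaries would play in a Shannon entropy-based method.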
Entropy is a key measure for information. Since it is capable of quantifying the uncertainty of random variables and scaling the amount of information shared by them effectively, it has been widely used in many fields [6,14].
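Definition 5. Let $\mathrm{IS} = (U, C, D)$ be an information system, where $U = \{x_1, x_2, \ldots, x_n\}$ is described by the features $C$ and $D$, and let $B \subseteq C$ be a subset of attributes. The neighborhood of sample $x_i$ in the feature subspace $B$ is denoted by $\delta_B(x_i)$. Then the neighborhood uncertainty of the sample is defined as

$$\mathrm{NE}^{x_i}(B) = -\log \frac{|\delta_B(x_i)|}{n},$$

where $|\delta_B(x_i)|$ is the cardinality of the set $\delta_B(x_i)$.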
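The average uncertainty of the set of samples is computed as

$$\mathrm{NE}(B) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{|\delta_B(x_i)|}{n}.$$

Definition 6. Let $\mathrm{IS} = (U, C, D)$ be an information system, and let $R$ and $S$ be two subsets of attributes. The neighborhood of sample $x_i$ in the feature subspace $R \cup S$ is denoted by $\delta_{R \cup S}(x_i)$; then the joint neighborhood entropy $\mathrm{NE}(R, S)$ is computed as

$$\mathrm{NE}(R, S) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{|\delta_{R \cup S}(x_i)|}{n}.$$

Definition 7. Let $\mathrm{IS} = (U, C, D)$ be an information system, and let $R$ and $S$ be two subsets of attributes. The conditional neighborhood entropy of $R$ to $S$ is defined as

$$\mathrm{NE}(R \mid S) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{|\delta_{R \cup S}(x_i)|}{|\delta_S(x_i)|}.$$

Conditional entropy refers to the uncertainty of $R$ when $S$ is known. From this definition, if $R$ completely depends on $S$, then $\mathrm{NE}(R \mid S)$ is zero. This means that no more information is required to describe $R$ when $S$ is known. At the other extreme, $\mathrm{NE}(R \mid S) = \mathrm{NE}(R)$ denotes that knowing $S$ does nothing to reduce the uncertainty of $R$.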
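The entropy definitions above can be sketched in a few lines, assuming the form $\mathrm{NE}(B) = -\frac{1}{n}\sum_i \log(|\delta_B(x_i)|/n)$ from [6] and the identity $\mathrm{NE}(R \mid S) = \mathrm{NE}(R, S) - \mathrm{NE}(S)$ that follows from it. The max-norm distance, the base-2 logarithm, and the names `ne` and `conditional_ne` are assumptions made for illustration only.

```python
import math


def neighborhood_size(samples, i, features, delta):
    """|delta_B(x_i)| under an assumed max-norm distance on the features B."""
    xi = samples[i]
    return sum(
        1 for xj in samples
        if max((abs(xi[f] - xj[f]) for f in features), default=0.0) <= delta
    )


def ne(samples, features, delta):
    """Average neighborhood uncertainty NE(B); base-2 logarithm assumed."""
    n = len(samples)
    return -sum(
        math.log2(neighborhood_size(samples, i, features, delta) / n)
        for i in range(n)
    ) / n


def conditional_ne(samples, r, s, delta):
    """NE(R | S) computed as NE(R, S) - NE(S)."""
    joint = sorted(set(r) | set(s))
    return ne(samples, joint, delta) - ne(samples, s, delta)


delta = 0.15
# Feature 1 carries no information about feature 0.
unrelated = [(0.10, 0.10), (0.20, 0.90), (0.80, 0.10), (0.90, 0.90)]
# Feature 1 is a copy of feature 0.
dependent = [(0.10, 0.10), (0.20, 0.20), (0.80, 0.80), (0.90, 0.90)]

print(ne(unrelated, [0], delta), conditional_ne(unrelated, [0], [1], delta))
print(conditional_ne(dependent, [0], [1], delta))
```

On the first toy dataset the conditional entropy equals the unconditional one, and on the second it drops to zero, which matches the two extreme cases described in the text.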
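To quantify how much information is shared by two feature subsets $R$ and $S$, a concept termed neighborhood mutual information $\mathrm{NMI}(R; S)$ is described as follows.

Definition 8. Let $\mathrm{IS} = (U, C, D)$ be an information system, and let $R$ and $S$ be two subsets of attributes. The neighborhood mutual information of $R$ and $S$ is defined as

$$\mathrm{NMI}(R; S) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{|\delta_R(x_i)| \cdot |\delta_S(x_i)|}{n\,|\delta_{R \cup S}(x_i)|}.$$

Remark 9. From this definition, the neighborhood mutual information becomes a measure of the relevance of features. The value of $\mathrm{NMI}(R; S)$ is high if $R$ and $S$ are closely related to each other, whereas $\mathrm{NMI}(R; S) = 0$ denotes that the two are totally unrelated. As is well known, mutual information is widely applied in evaluating the significance of features when $R \subseteq C$ and $S = D$. It then reflects how much information is shared by the subset of conditional features $R$ and the decision feature $D$.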
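As an illustration of neighborhood mutual information as a relevance measure, the sketch below evaluates the neighborhood-count form of $\mathrm{NMI}(R; S)$ from [6] directly on a toy dataset with two numerical features and a class label. The max-norm distance, the base-2 logarithm, the function names, and the data are assumptions made for this example.

```python
import math


def size(samples, i, features, delta):
    """Neighborhood cardinality of sample i under an assumed max-norm distance."""
    xi = samples[i]
    return sum(
        1 for xj in samples
        if max((abs(xi[f] - xj[f]) for f in features), default=0.0) <= delta
    )


def nmi(samples, r, s, delta):
    """NMI(R; S) = -(1/n) * sum_i log2(|d_R| * |d_S| / (n * |d_{R u S}|))."""
    n = len(samples)
    joint = sorted(set(r) | set(s))
    total = 0.0
    for i in range(n):
        ratio = (size(samples, i, r, delta) * size(samples, i, s, delta)
                 / (n * size(samples, i, joint, delta)))
        total += math.log2(ratio)
    return -total / n


# Columns: feature 0 (separates the classes), feature 1 (noise), class label.
data = [(0.10, 0.10, 0), (0.20, 0.90, 0), (0.80, 0.10, 1), (0.90, 0.90, 1)]
delta = 0.15

relevant = nmi(data, [0], [2], delta)
irrelevant = nmi(data, [1], [2], delta)
print(relevant, irrelevant)
```

The informative feature shares one bit with the class label, while the noise feature shares none, and no discretization of the numerical columns was required.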
Definition 10. Let $\mathrm{IS} = (U, C, D)$ be an information system, and let $R$, $S$, and $T$ be three subsets of attributes. The conditional mutual information of $R$ and $S$ given $T$ is defined as

$$\mathrm{NMI}(R; S \mid T) = \mathrm{NE}(R \mid T) - \mathrm{NE}(R \mid S, T).$$

The conditional mutual information represents the quantity of information shared by $R$ and $S$ when $T$ is known. That is to say, $\mathrm{NMI}(R; S \mid T) > 0$ implies that $S$ brings information about $R$ which is not already contained in $T$.
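As a worked step, assume that the conditional neighborhood mutual information is the entropy difference $\mathrm{NE}(R \mid T) - \mathrm{NE}(R \mid S \cup T)$ and that $\mathrm{NE}(R \mid S) = \mathrm{NE}(R, S) - \mathrm{NE}(S)$. Under these assumptions it expands into joint neighborhood entropies only, which is the form most convenient for computation:

```latex
\begin{aligned}
\mathrm{NMI}(R; S \mid T)
  &= \mathrm{NE}(R \mid T) - \mathrm{NE}(R \mid S \cup T) \\
  &= \bigl[\mathrm{NE}(R, T) - \mathrm{NE}(T)\bigr]
     - \bigl[\mathrm{NE}(R, S, T) - \mathrm{NE}(S, T)\bigr] \\
  &= \mathrm{NE}(R, T) + \mathrm{NE}(S, T) - \mathrm{NE}(T) - \mathrm{NE}(R, S, T).
\end{aligned}
```

Each term on the last line requires only the neighborhood sizes in one feature subspace, so the four terms can share the same distance computations.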

Cooperative Game Theory.
Cooperative game theory introduces the concept of coalitional games in which a set of players are associated with a real function that denotes the payoff achieved by different subcoalitions in a game.

Definition 11. A cooperative game is defined by a pair $(N, v)$, where $N = \{1, \ldots, n\}$ is the set of all players and $v(S)$, for every $S \subseteq N$, is a real number associating a worth with the coalition $S$.
Game theory further pursues the question of representing the contribution of each player to the game by constructing a worth function $\phi(v)$, which assigns a real value to each player. The values correspond to the contribution of the players in achieving a high payoff. The Banzhaf value [15], proposed by Banzhaf, yields a unique outcome in cooperative games and measures the contribution of players in the game [8,9]. It is based on counting, for each player, the number of coalitions for which the player is crucial to winning [9]. In our study, we use this measure to evaluate the contribution of features.

Neighborhood Entropy-Based Cooperative Game Theory Model
Conventional feature selection algorithms tend to select features which have high relevance to the target class and low redundancy among the selected features. The major disadvantage of these algorithms is that they ignore the dependencies between the candidate feature and the unselected features. For example, mRMR [2] introduced the "Min-Redundancy" criterion to eliminate redundant features. However, the authors in [8,9] pointed out that this is likely to disregard intrinsically interdependent groups which as a group have strong discriminatory power but are weak as individuals. The main reason is that features which have been labeled "redundant" are in reality interdependent with the selected feature subset [9]. In this work, neighborhood entropy-based measures are adapted to distinguish the relationships of redundancy, interdependence, and independence between features. We then use the Banzhaf value to compute the contribution of each feature. Finally, a universal framework to evaluate the significance of features is investigated.

Neighborhood Entropy-Based Redundancy, Interdependence, and Independence Analysis for Features

Redundancy
Definition 12. A conditional feature $f_i$ is said to be redundant with $f_j$ if the relevance between $f_i$ and the target class $D$ is reduced under the condition of $f_j$. The formulation is as follows:

$$\mathrm{NMI}(f_i; D \mid f_j) < \mathrm{NMI}(f_i; D).$$

Redundancy means that there is redundant information shared between $f_i$ and the target class $D$ when $f_j$ is known.

Interdependence
Definition 13. Suppose $f_i$ and $f_j$ are interdependent on each other; then the relevance between $f_i$ and the target class $D$ is increased when conditioned on $f_j$. Thus, two features $f_i$ and $f_j$ are interdependent on each other if the following form is satisfied:

$$\mathrm{NMI}(f_i; D \mid f_j) > \mathrm{NMI}(f_i; D).$$

In line with the explanation of redundancy, interdependence means that the amount of information shared between $f_i$ and the target class $D$ increases when $f_j$ is known. In other words, the impact of each of the two features on the classification performance can be neither ignored nor replaced.

Independence
Definition 14. If two features $f_i$ and $f_j$ are completely independent, then the relevance between the target class $D$ and either one of them is not changed when the other is given as a condition. That is,

$$\mathrm{NMI}(f_i; D \mid f_j) = \mathrm{NMI}(f_i; D).$$

Based on the definitions above, it can be concluded that some interdependent features, which seem unimportant to the decision, should be considered in the selection process. We will discuss this problem in detail in the next section.
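The three relationships can be illustrated with a small sketch that compares the conditional relevance $\mathrm{NMI}(f_i; D \mid f_j)$ with the unconditional relevance $\mathrm{NMI}(f_i; D)$. It assumes the entropy-difference forms of the neighborhood measures, a max-norm distance, and base-2 logarithms; the function names, the tolerance `tol`, and the toy datasets (including an XOR example in the spirit of [11]) are hypothetical.

```python
import math


def ne(samples, features, delta):
    """Neighborhood entropy NE(B) under an assumed max-norm distance."""
    n = len(samples)
    total = 0.0
    for xi in samples:
        count = sum(
            1 for xj in samples
            if max((abs(xi[f] - xj[f]) for f in features), default=0.0) <= delta
        )
        total += math.log2(count / n)
    return -total / n


def nmi(samples, r, s, delta):
    """NMI(R; S) = NE(R) + NE(S) - NE(R, S)."""
    return ne(samples, r, delta) + ne(samples, s, delta) - ne(samples, r + s, delta)


def cond_nmi(samples, r, s, t, delta):
    """NMI(R; S | T) = NE(R, T) + NE(S, T) - NE(T) - NE(R, S, T)."""
    return (ne(samples, r + t, delta) + ne(samples, s + t, delta)
            - ne(samples, t, delta) - ne(samples, r + s + t, delta))


def relationship(samples, fi, fj, d, delta, tol=1e-9):
    """Classify the pair (f_i, f_j) with respect to the decision column d."""
    plain = nmi(samples, [fi], [d], delta)
    conditioned = cond_nmi(samples, [fi], [d], [fj], delta)
    if conditioned > plain + tol:
        return "interdependent"
    if conditioned < plain - tol:
        return "redundant"
    return "independent"


delta = 0.15
xor_data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]    # D = f0 XOR f1
copy_data = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1)]   # f1 duplicates f0
const_data = [(0, 5, 0), (0, 5, 0), (1, 5, 1), (1, 5, 1)]  # f1 is constant

print(relationship(xor_data, 0, 1, 2, delta))
print(relationship(copy_data, 0, 1, 2, delta))
print(relationship(const_data, 0, 1, 2, delta))
```

In the XOR dataset each feature alone shares no information with the class, yet conditioning on the other feature raises its relevance, which is exactly the situation that individual-feature rankings miss.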

Feature Evaluation Framework Based on Cooperative Game Theory.
In [15], Banzhaf proposed that a winning coalition is one for which $v(S) = 1$ and a losing coalition is one for which $v(S) = 0$. Each coalition $S \cup \{i\}$ that wins when $S$ loses is called a swing for player $i$ [8,9]. That is, $\Delta_i(S) = v(S \cup \{i\}) - v(S) = 1$. It means that the membership of player $i$ in the coalition is crucial to the coalition winning. In other words, the greater the number of swings for player $i$, the more important the player $i$.
Then the Banzhaf value is defined as

$$\phi_i(v) = \frac{1}{2^{n-1}} \sum_{S \subseteq N \setminus \{i\}} \Delta_i(S),$$

where $\Delta_i(S) = v(S \cup \{i\}) - v(S)$. The Banzhaf value can be interpreted as the average contribution of player $i$ alone to all coalitions [16].
The Banzhaf value measures the distribution of power among the players in the voting game, which can be transformed into the arena of feature selection [8,9]. In the feature selection game, every feature can be regarded as a player. Thus, the Banzhaf value can be used to estimate the contribution of each feature.
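To illustrate how swings are counted, the following minimal sketch computes the Banzhaf value by enumerating every coalition of the other players and averaging the marginal contributions over the $2^{n-1}$ coalitions. The weighted voting game used here (weights 3, 2, 1 and a quota of 4) is a hypothetical example, not one taken from the paper.

```python
from itertools import combinations


def banzhaf(players, v):
    """Banzhaf value of each player: mean of v(S + {i}) - v(S) over all S without i."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        swings = 0
        for size in range(len(others) + 1):
            for coalition in combinations(others, size):
                members = set(coalition)
                swings += v(members | {i}) - v(members)
        values[i] = swings / 2 ** (n - 1)
    return values


# Hypothetical voting game: a coalition wins when its total weight reaches the quota.
weights = {"A": 3, "B": 2, "C": 1}
quota = 4


def v(coalition):
    return 1 if sum(weights[p] for p in coalition) >= quota else 0


power = banzhaf(list(weights), v)
print(power)  # {'A': 0.75, 'B': 0.25, 'C': 0.25}
```

Player A is a swing in three of the four coalitions of the other players, whereas B and C are each a swing in only one, even though B has twice the weight of C. Exhaustive enumeration is only practical for a handful of players, which motivates bounding the coalition size when the players are features.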
From the definition of interdependence, it is easy to see that the optimal feature subset is the one in which all the features are relevant to the target class and interdependent on each other [8]. Given a candidate coalition $K$, the feature $f_i$ ($f_i \notin K$) is to be estimated. Let $\mathrm{ID}(f_i)$ be the number of features in $K$ which fall into an interdependent relationship with the feature $f_i$. The contribution of the feature $f_i$ to the coalition $K$ can be redefined as follows:

$$\Delta_i(K) = \begin{cases} 1, & \mathrm{NMI}(K; D \mid f_i) > \mathrm{NMI}(K; D) \text{ and } \mathrm{ID}(f_i) \ge |K|/2, \\ 0, & \text{otherwise,} \end{cases}$$

which means that the feature $f_i$ is crucial to win the coalition $K$ only if it both increases the relevance of the unitary subset $K$ on the target class and is interdependent with at least half of the members. Furthermore, we can get the average contribution of player $f_i$ to all coalitions according to the Banzhaf value. The definition of $\Delta_i(K)$ is similar to formula (11) in [8]. However, we use the neighborhood conditional mutual information $\mathrm{NMI}(K; D \mid f_i)$ rather than Shannon's conditional mutual information used in [8]. The neighborhood entropy-based method thereby avoids discretization of the samples.
The traditional feature selection methods are based on feature measures [2,6,13], such as mutual information (MI), rough set (RS), and mRMR. These measures are usually used to evaluate the significance of features. Strictly speaking, this type of significance can only be called the significance for decision (SIGFD). In the framework of neighborhood entropy-based cooperative game theory (NECGT), the contribution of a feature to the coalitions is another important aspect. Here, we give the neighborhood entropy-based feature measure as a product, according to [9]:

$$\text{NECGT-SIGFD}(f_i) = \mathrm{SIGFD}(f_i) \times \phi_v(f_i),$$

where $\mathrm{SIGFD}(f_i)$ can be any of the traditional feature measures and $\phi_v(f_i)$ is the Banzhaf value.
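The swing rule described in this paragraph can be written as a small indicator function. This is a sketch of one reading of the prose, not the authors' exact formula: it assumes that "increases the relevance" means the conditional relevance of the coalition exceeds its unconditional relevance, and that "at least half of the members" is an inclusive bound. The function name, its parameters, and the numbers are hypothetical, and the relevance values are taken as already computed.

```python
def swing(cond_relevance, relevance, interdependent_count, coalition_size):
    """Return 1 if the candidate feature is a swing for the coalition, else 0.

    cond_relevance:       relevance of the coalition to the class given the feature
    relevance:            relevance of the coalition to the class on its own
    interdependent_count: coalition members interdependent with the feature (ID)
    coalition_size:       number of features in the coalition
    """
    raises_relevance = cond_relevance > relevance          # assumed strict comparison
    enough_partners = interdependent_count >= coalition_size / 2
    return 1 if raises_relevance and enough_partners else 0


# Hypothetical relevance values for a coalition of three features.
print(swing(0.8, 0.5, 2, 3))  # raises relevance, 2 of 3 partners
print(swing(0.8, 0.5, 1, 3))  # raises relevance, but only 1 of 3 partners
print(swing(0.3, 0.5, 3, 3))  # plenty of partners, but relevance drops
```

Both conditions must hold at once, so a feature that interacts with only a minority of the coalition is not counted as a swing even when it raises the coalition's relevance.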

Feature Selection Algorithm with NECGT
Before giving the algorithm of feature selection, details of the feature contribution evaluation method based on the Banzhaf value are presented in Algorithm 1.
The input is an information system $\mathrm{IS} = (U, C, D)$, where $U$ is a nonempty finite set of objects, $C$ is a nonempty finite set of conditional attributes, and $D$ is the decision attribute which represents the target classes. The output of this evaluation framework is a vector $\Omega$ in which each element $\Omega(i)$ represents the Banzhaf value $\phi_v(f_i)$ of feature $f_i$.
Algorithm 1: [...] (5) Calculate the value of $\Delta_i(K)$; (6) End for; (7) Calculate the Banzhaf value $\phi_v(f_i)$.
In fact, it is impractical to get the optimal subset of features from $2^n - 1$ candidates through exhaustive search, where $n$ is the number of features. A greedy search guided by some heuristic is usually more efficient than a plain brute-force exhaustive search. A forward search algorithm for feature selection with NECGT is written as shown in Algorithm 2.
In the forward greedy search, one starts with an empty set of attributes and keeps adding features to the subset of selected attributes one by one. Each selected attribute maximizes the significance of the current subset. The selection procedure terminates once the number of selected features exceeds the user-specified threshold. Without loss of generality, a general significance to decision $\mathrm{SIGFD}(f_i)$ is obtained by employing any classical criterion, such as MI or mRMR.
It is worth noting that the calculation of the Banzhaf value requires summing over all possible subsets of features, which greatly increases the computational complexity of Algorithm 1. In fact, it is infeasible to consider all coalitions of features, especially large coalitions. In [9], the authors observed that the number of features correlated with a certain feature is much smaller than the total number of features in real datasets. Thus, we use a limit value $\omega$ as a bound on the coalition size. The Banzhaf value can be redefined as

$$\phi_i(v) = \frac{1}{|\Pi_\omega|} \sum_{S \in \Pi_\omega} \Delta_i(S),$$

where $\Pi_\omega$ is the set of subsets of the feature set $C \setminus \{f_i\}$ whose size is limited by $\omega$. The usage of bounded sets coupled with the method for the Banzhaf value estimation yields an efficient and robust way to estimate the contribution of a feature to the task of feature selection. In our study, $\omega$ is set to $\sqrt{n}$ according to [17], where $n$ is the number of features.
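The forward greedy search can be sketched as follows. This is an illustrative reading rather than a transcription of Algorithm 2: it assumes that each candidate is scored by the product of a significance-to-decision measure and a precomputed Banzhaf value, and that the search stops after `k` features. The names `forward_select`, `sigfd`, and `banzhaf`, and all scores, are hypothetical.

```python
def forward_select(features, sigfd, banzhaf, k):
    """Greedily pick up to k features, scoring each by sigfd(f, selected) * banzhaf[f]."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: sigfd(f, selected) * banzhaf[f])
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: f2 is the strongest individual, f1 the best team player.
relevance = {"f1": 0.2, "f2": 0.6, "f3": 0.5}
contribution = {"f1": 0.9, "f2": 0.1, "f3": 0.5}


def static_sigfd(feature, selected):
    """A stand-in measure that ignores the selected set (MI-like ranking)."""
    return relevance[feature]


order = forward_select(relevance, static_sigfd, contribution, k=2)
print(order)  # ['f3', 'f1']
```

Passing the currently selected subset to `sigfd` leaves room for measures such as mRMR whose score depends on what has already been chosen. In this toy run the individually strongest feature `f2` is not selected because its contribution to coalitions is low.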
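A bounded estimate of this kind can be sketched by enumerating only coalitions up to a size limit and averaging the swing indicator over them. The sketch makes several assumptions that the text does not fix: the empty coalition is excluded, the limit is taken as the integer part of the square root of the number of features, and the swing function is a hypothetical stand-in in which a feature is a swing whenever its partner feature belongs to the coalition.

```python
import math
from itertools import combinations


def bounded_banzhaf(features, swing, omega):
    """Average swing of each feature over coalitions of the others with size 1..omega."""
    values = {}
    for f in features:
        others = [g for g in features if g != f]
        coalitions = [
            frozenset(c)
            for size in range(1, omega + 1)
            for c in combinations(others, size)
        ]
        values[f] = sum(swing(f, k) for k in coalitions) / len(coalitions)
    return values


# Hypothetical interaction structure: features a and b only work together.
partner = {"a": "b", "b": "a"}


def toy_swing(feature, coalition):
    return 1 if partner.get(feature) in coalition else 0


features = ["a", "b", "c", "d"]
omega = math.isqrt(len(features))  # integer part of sqrt(n), an assumed rounding
contribution = bounded_banzhaf(features, toy_swing, omega)
print(contribution)
```

With four features and a limit of 2, each feature is evaluated on six coalitions instead of eight, and the saving grows quickly with the number of features because the number of unrestricted coalitions doubles with every added feature.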

Experiments
In this section, we evaluate the proposed NECGT model through a series of experiments. In this study, feature selection is improved in the following two respects.
(1) The concept of neighborhood entropy-based feature contribution is proposed to avoid the process of discretization which would bring loss of information.
(2) The neighborhood entropy-based feature contribution is used for feature selection. It can enhance the feature selection ability.
Hence, we design two experiments to verify the two points above. To assess the effectiveness of NECGT, we employ two popular feature selectors, RS [13] and mRMR [2], for evaluating the significance to decision (SIGFD). This experiment is called NECGT-SIGFD versus SIGFD. In addition, we choose CoFS [9] as the benchmark. This Shannon's entropy-based method can also be used to compute feature contribution, but it must adopt a discretization technique to preprocess the data. This experiment is called NECGT versus CoFS, where the SIGFD is MI [6].
The experiments were run in a 10-fold cross validation mode. The parameters of the linear SVM and RBF-SVM are taken as the default values (using the MATLAB toolkit osu svm3.00). Reference [1] explains that the result is optimal if the threshold $\delta$ is set between 0.1 and 0.2. In our experiments, the threshold $\delta$ is set to 0.15.

Experiment 1: NECGT-SIGFD versus SIGFD.
First of all, we give an example to show the difference between NECGT-SIGFD and SIGFD in detail. mRMR is employed as the metric of significance to decision (SIGFD). All samples in Glass are used in this test, and the learning algorithm RBF-SVM (RSVM) is chosen to evaluate the selected feature subsets. The order in which the features are added to the feature space is shown in Table 2. There are 9 features in Glass altogether. The main difference lies in the 3rd to 7th selected features, where the order selected by mRMR is 2, 3, 8, 1, 6 and the order selected by NECGT-mRMR is 3, 1, 6, 2, 8. mRMR regards features 1 and 6 as individually unimportant to the decision. However, these two features get high contribution scores, as shown in Figure 1. Meanwhile, the contributions of features 2 and 8 are relatively low. It can be inferred that features 1 and 6 are more competitive than features 2 and 8 within a feature group. We then compare the classification accuracy of the feature subsets to verify this inference. In Figure 2, the number $k$ on the $x$-axis refers to the first $k$ features in the order selected by each selector (as shown in Table 2), and the $y$-axis represents the performance of the classifier on the first $k$ features. The classification accuracy on the raw data is 74.4%. No matter how mRMR groups the features, it cannot surpass 74.4%. In contrast, with NECGT-mRMR the feature subset 4, 7, 3, 1, and 6 reaches 77.19% because of the high contribution of features 1 and 6. The theory of cooperative games emphasizes the joint action of the features. Features 1 and 6 greatly enhance the discriminatory power of the attribute group 4, 7, and 3, although they are weak as individuals. Conversely, competent individual features, such as 2 and 8, do not necessarily perform well from the viewpoint of cooperative game theory.
For further comparison, the effectiveness of NECGT-SIGFD is measured by the classification performance on other datasets besides Glass. We build classification models with the selected features and test their classification performance. mRMR and RS are chosen as SIGFD.
Sun et al. [9] proposed the concept of an "acceptable" number of selected features to verify the effectiveness of a feature selection algorithm. The "acceptable" number means that about a third of the original features are retained for a dataset. This convention is also used in our study. The features selected by the different algorithms are presented in Table 3, listed in the order in which they are added to the feature space. We compare the raw data, NECGT-mRMR, mRMR, NECGT-RS, and RS in Tables 4, 5, and 6, where the learning algorithms are CART, linear SVM, and RBF SVM, respectively. The last column of Tables 4-6 records the average efficiency value of these feature selection models. The feature subsets selected by NECGT-SIGFD and SIGFD are different. As explained for the Glass dataset, NECGT-SIGFD prefers a feature that is likely to bring better overall performance for the feature coalition over a feature with a strong individual effect. The win/tie/loss count of NECGT-SIGFD against SIGFD is 20/5/5. Table 7 summarizes this comparison and shows that the NECGT-SIGFD model yields better performance than the SIGFD model in most cases. It means that the feature subsets selected by NECGT-SIGFD have strong discriminatory power as a group. Meanwhile, we also find that SIGFD yields better performance than NECGT-SIGFD in a small number of cases. Data samples are unpredictable in a real-world environment, and interdependent features may not exist at all in some datasets. It is therefore difficult to guarantee that our method is always effective. Nonetheless, the method offers an effective way to retain as many useful interdependent features and groups as possible.

Experiment 2: NECGT versus CoFS.
Both NECGT and CoFS [9] can be used to estimate the contribution of features. In our study, the feature contribution measure is defined based on neighborhood entropy, whereas the CoFS method is based on Shannon's entropy. This is the contribution of our research compared to CoFS in [9]. It is important to note that we calculate the feature contribution under the same cooperative game theory framework as [9]. The datasets are discretized for CoFS, whereas NECGT deals with the original samples directly. Discretization can lead to computational deviation in the contribution evaluation. In contrast, the neighborhood entropy-based method NECGT handles the issue of information loss well by avoiding discretization. An experiment on Lymphography and Wpbc is given to demonstrate the validity of the proposed method by comparing NECGT with CoFS. MI [6] is employed as the metric of significance to decision (SIGFD) in this experiment. Figure 3 shows the feature contributions evaluated by NECGT and CoFS. Table 8 shows the order in which the attributes are added to the feature space. The performance of the first $k$ features is shown in Figure 4.
Some features of Lymphography get slightly different contribution scores because of the discretization, which leads to differences in the feature order. Figure 4(a) shows that NECGT performs much better than CoFS except at the third iteration. NECGT-MI achieves the highest classification accuracy, 82.14%, where the selected feature subset is 13, 14, 2, 15, 3, and 16. The key reason is that some information in Lymphography is lost by CoFS. Hu et al. [1] pointed out that at least two categories of structure are lost in discretization: the neighborhood structure and the order structure in real spaces. For example, in real spaces we know the distances between samples and can therefore tell how close the samples are to each other. Conversely, it is unreasonable to measure similarity or dissimilarity between categorical attributes with the Euclidean distance, as numerical methods do. Therefore, it can be concluded that discretization can lead to computational deviation in the contribution evaluation, even if there is very little loss of information.
Next, we consider Wpbc. CoFS rates the contribution of every feature except feature 2 as zero. It can be concluded that discretization seriously distorts Wpbc. As a consequence, apart from feature 2, CoFS-MI has to select the features in ascending order of feature number. Clearly, CoFS is inapplicable when the dataset is sensitive to discretization. In contrast, NECGT behaves normally in the experiment. This shows that the nondiscretization method NECGT is suitable for more datasets.
For further comparison between NECGT and CoFS, we test the classification performance and running time on different datasets besides Lymphography and Wpbc.
As in Experiment 1, the selected feature subsets and the classification accuracy are displayed in Tables 9 and 10. NECGT-MI yields better performance than CoFS-MI in most cases. It can be concluded that NECGT is more efficient. Compared with CoFS, NECGT is also less time consuming. The main reason is that the discretization step required by CoFS itself takes time to complete, on the order of a fraction of a second. On the other hand, we can see that the computation of the feature contribution accounts for a considerable proportion of the running time of the NECGT-MI (or CoFS-MI) model.

Discussion about Some Open-Ended Questions.
The two experiments show the validity of the method in our study. NECGT indeed enhances the classification ability of the attribute subsets. Nonetheless, some open-ended questions remain if NECGT is applied to a real environment.
(1) The proposed method performs differently with different classifiers on the same dataset. Consequently, a suitable classifier must also be chosen for each application field. This issue is one of the most important challenges in the application of artificial intelligence.
(2) NECGT improves the traditional methods, but at the cost of considerable additional running time. In practical applications, it is necessary to consider which is more important: high classification accuracy or fast computation.
(3) The validity of our model has been verified preliminarily on the UCI datasets. However, large-scale datasets exist in real environments. Consequently, a version of NECGT-SIGFD under a distributed framework [18] is worth exploring in future work.

Conclusion and Future Work
Feature selection is an important preprocessing step in pattern recognition and machine learning. Traditional information-theoretic selectors tend to ignore features which as a group have strong discriminatory power but are weak as individuals. To overcome this disadvantage, we introduce a neighborhood entropy-based cooperative game theory framework to evaluate the contribution of each feature. The contribution of a feature is considered as another important factor in calculating its significance. Experimental results on UCI datasets show that the proposed method works well and outperforms traditional feature selectors in most cases. Although CoFS can also be used for estimating the contribution of features by using Shannon's entropy, its major defect is that Shannon's entropy is only suitable for nominal data. Consequently, we redefine the redundancy, interdependence, and independence of features by using neighborhood entropy to avoid the loss of information caused by discretization. Experimental results show that NECGT performs better than CoFS in most cases. Future work could move along three directions. First, many other entropy models, such as kernel entropy and fuzzy entropy, can also be used to calculate the relevance of features. How to evaluate the interaction of these entropy models is an important issue. Second, we will continue to develop the game-theoretic feature selection model by adopting an approximate Shapley value estimation technique [12]. Third, the application of our model in real environments is necessary. The version of NECGT-SIGFD under a distributed framework requires further attention.