A Mixed Feature Selection Method Considering Interaction

Feature interaction has gained considerable attention recently. However, many feature selection methods that consider interaction are designed only for categorical features. This paper proposes a mixed feature selection algorithm based on neighborhood rough sets that can be used to search for interacting features. Feature relevance, feature redundancy, and feature interaction are defined in the framework of neighborhood rough sets; a neighborhood interaction weight factor reflecting whether a feature is redundant or interactive is proposed; and a neighborhood interaction weight based feature selection algorithm (NIWFS) is put forward. To evaluate the performance of the proposed algorithm, we compare NIWFS with three other feature selection algorithms, including INTERACT, NRS, and NMI, in terms of classification accuracy and the number of selected features with C4.5 and IB1. The results on ten real-world datasets indicate that NIWFS not only deals with mixed datasets directly, but also reduces the dimensionality of the feature space while achieving the highest average accuracies.


Introduction
Feature selection plays an important role in pattern recognition and machine learning, and it has drawn the attention of many researchers from various fields. The task aims to select the essential features that allow us to discern between patterns belonging to different classes. The benefits of feature selection have been widely recognized: for example, improving predictive accuracy, facilitating data visualization, reducing storage requirements, and reducing training time [1].
Many feature selection methods have been proposed to remove as many irrelevant and redundant features as possible [2][3][4][5][6][7][8], such as Relief and its variant Relief-F [9,10], correlation-based feature selection (CFS) [11], mutual information based feature selection (MIFS) [12], fast correlation-based filter (FCBF) [13], and minimum-redundancy maximum-relevance (MRMR) [14]. However, apart from the identification of irrelevant and redundant features, an important but usually neglected issue is feature interaction [15]. Interacting features are those that appear to be irrelevant or weakly relevant to the class individually, but which, when combined with other features, may be highly correlated with the class. A typical example is the XOR problem: there are two features and a class label, which is zero if both features have the same value and one otherwise. Obviously, neither feature carries any information about the class individually; however, the two features determine the class completely when combined. In many classification problems, a feature that is completely useless by itself can sometimes provide a significant performance improvement when taken together with others. If we only consider relevance and redundancy but ignore interaction, some salient features may be missed.
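The XOR example above can be checked numerically. The following sketch (using a plain empirical estimate of Shannon entropy; all function names are illustrative, not from the paper) confirms that each feature alone carries zero mutual information about the class, while the pair carries one full bit:

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy of an empirical distribution over hashable items.
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_info(x, y):
    # MI(X; Y) = H(X) + H(Y) - H(X, Y), estimated from co-occurrence counts.
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# The XOR truth table: each feature alone is uninformative,
# but together the two features determine the class exactly.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
c  = [0, 1, 1, 0]  # c = f1 XOR f2

print(mutual_info(f1, c))                 # 0.0 -> individually irrelevant
print(mutual_info(f2, c))                 # 0.0
print(mutual_info(list(zip(f1, f2)), c))  # 1.0 -> jointly fully informative
```

A selector that scores features one at a time would discard both f1 and f2 here, which is exactly the failure mode the paper addresses.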
Some wrapper methods are able to deal with feature interaction to some extent, but these methods require a model to test each feature subset, and the process is usually time-consuming, especially for computationally expensive models. Furthermore, wrapper methods are very sensitive to the specific classification algorithm, and the performance of the model does not necessarily reflect the actual predictive ability of the selected feature subset. Therefore, it is a challenge to filter out the irrelevant and redundant features while reserving only a small number of interactive features. Feature interaction increasingly arouses the attention of researchers. Zhao and Liu [16] propose to search for interacting features using consistency contribution to measure feature relevance. Recently, Wang et al. [17] put forward a propositional FOIL rule based algorithm, FRFS. The algorithm involves two steps: (i) redundant feature exclusion and interactive feature reservation and (ii) irrelevant feature identification. The experimental results demonstrate the effectiveness of the FRFS algorithm.
Although the work mentioned above has pointed out the existence and effectiveness of feature interaction, the state-of-the-art feature selection techniques for searching interacting features are designed only for categorical datasets. In the real world, however, data come in a mixed format in the majority of cases [18]. Discretizing numerical features usually brings information loss because the degrees of membership of values to discretized values are not considered [19].
Rough set theory [20][21][22][23], introduced by Pawlak, is a well-known mathematical approach for addressing vague and uncertain data with no additional information. It has attracted the attention of many researchers who have studied its theory and applications during the last decades. Rough set theory can be used to find a subset of informative features that preserves the discernibility of the original features. Therefore, it has been playing an important role in feature selection [24][25][26][27][28][29]. However, classical rough set theory can only deal with nominal feature values. Since numerical feature values are more common in the real world, the crisp rough set theory encounters a challenge. Therefore, new models such as fuzzy rough sets and neighborhood rough sets are usually considered as extensions of the classical theory. These extended models can be used to deal with mixed numerical and categorical data within a uniform framework [30][31][32]. For example, Jensen and Shen [33,34] generalized the dependency function defined in classical rough sets based on the positive region to the fuzzy case and presented a rough-fuzzy feature selection algorithm. Hu et al. [31,35] substituted neighborhood relations for the classical equivalence relation and introduced a neighborhood rough set model to address data with mixed features. A neighborhood rough set based heterogeneous feature subset selection method (NRS) [31], which utilizes the neighborhood dependency to evaluate the significance of a subset of heterogeneous features, was proposed. Owing to the robustness of mutual information to noise and transformations, Hu et al. also generalized Shannon's information entropy to neighborhood information entropy and proposed a neighborhood mutual information based feature selection method (NMI) [35].
Inspired by the fact that neighborhood granules can characterize numerical features, in this paper we analyze relevance, redundancy, and interaction in the framework of neighborhood rough sets. Since redundant features have a negative influence and interacting features a positive influence on prediction, a neighborhood interaction weight factor is introduced to measure the redundancy and interaction of candidate features. We adjust the relevance measure between a feature and the class by the neighborhood interaction weight factor and rank the candidate features with the adjusted relevance measure. Finally, we propose a neighborhood interaction weight based feature selection algorithm (NIWFS). To verify its performance, the proposed method is compared with three state-of-the-art feature selection methods (INTERACT, NRS, and NMI) on a series of benchmark datasets. Experimental results show that our proposed method can be applied directly to datasets with mixed categorical and numerical features and outperforms the other selectors.
The remainder of this paper is structured as follows: Section 2 reviews some basic concepts related to neighborhood rough sets and neighborhood entropy-based information measures; Section 3 provides our definitions of relevant feature, redundant feature, and interactive feature based on neighborhood interaction gain; Section 4 puts forward a neighborhood interaction weight based feature selection algorithm; Section 5 presents the experimental results and analysis to evaluate the effectiveness of the proposed method; and Section 6 lays out our conclusions.

Preliminaries
In this section, we briefly introduce some basic concepts and notations of neighborhood rough set model and some neighborhood entropy-based information measures.

Neighborhood Rough Set Model.
The notion of an information system provides a convenient basis for the representation of objects in terms of their attributes (also called features). An information system is a quadruple IS = (U, A, V, f), where U is a nonempty finite set of objects called the universe, A is a nonempty finite set of attributes, V = ⋃_{a∈A} V_a where V_a is the value domain of attribute a, and f : U × A → V is an information function that associates a unique value of each attribute with every object belonging to U, such that for any x ∈ U and a ∈ A, f(x, a) ∈ V_a. A decision table IS = (U, A ∪ {d}, V, f) is a special case of an information system, where the attributes in A are called condition attributes and d is a designated attribute called the decision attribute.
Definition 1 (see [36]). A neighborhood information system is a quintuple NIS = (U, A, V, f, δ), where U is a nonempty finite set of objects called the universe; A is a nonempty finite set of attributes; V is the union of attribute domains such that V = ⋃_{a∈A} V_a; for any a ∈ A, there exists a mapping U → V_a, where V_a is the set of values of a; and δ is a neighborhood parameter. More specifically, NIS = (U, A ∪ {d}, V, f, δ) is called a neighborhood decision system, where A is a set of condition attributes and d is a decision attribute.
In classical rough sets, the objects with the same feature values are pooled into a set, called an equivalence class. These objects are expected to belong to the same class; otherwise, they are inconsistent. However, it is unfeasible to compute equivalence classes with numerical features because the probability of objects sharing the same numerical value is very small [35]. Therefore, the equivalence class is substituted by the neighborhood class: given B ⊆ A, a distance function Δ_B, and x_i ∈ U, the neighborhood class of x_i is δ_B(x_i) = {x_j ∈ U | Δ_B(x_i, x_j) ≤ δ}.
Theorem 3 (monotonicity [31]). Let NIS = (U, A, V, f, δ) be a neighborhood information system. For B1, B2 ⊆ A with B1 ⊆ B2 and x_i ∈ U, one has δ_{B2}(x_i) ⊆ δ_{B1}(x_i). The monotonicity is very important for constructing a greedy forward or backward search algorithm [2]. It guarantees that the addition of a new attribute to an existing subset does not decrease the relevance between the subset and the decision attribute.

Some Neighborhood Entropy-Based Information Measures.
Shannon's information theory, first introduced in 1948 [37], provides a way to measure the information of random variables, and entropy is a measure of the uncertainty of random variables [38]. In this section, some neighborhood entropy-based information measures, which generalize Shannon's entropy, are defined in a neighborhood system.
Definition 4 (see [35]). Let NIS = (U, A, V, f, δ) be a neighborhood information system, B ⊆ A, and x_i ∈ U. Then the neighborhood entropy is defined as

NH_δ(B) = −(1/|U|) Σ_{i=1}^{|U|} log( |δ_B(x_i)| / |U| ),

where δ_B(x_i) denotes the δ-neighborhood of x_i with respect to B. Obviously, when attribute subset B can distinguish any two objects, the neighborhood entropy is the largest; when attribute subset B cannot distinguish any two objects, the neighborhood entropy is zero.
Theorem 5 indicates that neighborhood entropy equals Shannon's entropy if the attributes are discrete. That is, neighborhood entropy is a natural generalization of Shannon's entropy.
Definition 8 (see [35]). Let NIS = (U, A, V, f, δ) be a neighborhood information system, B, C ⊆ A, and x_i ∈ U. Then the neighborhood mutual information of B and C is defined as

NMI_δ(B; C) = −(1/|U|) Σ_{i=1}^{|U|} log( |δ_B(x_i)| · |δ_C(x_i)| / (|U| · |δ_{B∪C}(x_i)|) ).

The neighborhood mutual information NMI_δ(B; C) describes the information shared by B and C. It is usually used to measure the relevance between numerical or nominal variables.
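As an illustration, both measures can be estimated directly from a numerical data matrix. The sketch below is an assumption-laden reading of Definitions 4 and 8 (Euclidean neighborhoods, logarithm base 2, and the joint neighborhood computed on the concatenated feature columns); it is not the authors' code:

```python
import numpy as np

def neighborhoods(X, delta):
    # delta-neighborhood membership under Euclidean distance:
    # row i marks { x_j : ||x_i - x_j|| <= delta }.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return d <= delta  # boolean matrix of shape (n, n)

def neighborhood_entropy(X, delta):
    # NH_delta(B) = -(1/|U|) * sum_i log2(|N_B(x_i)| / |U|)
    n = X.shape[0]
    sizes = neighborhoods(X, delta).sum(axis=1)
    return -np.mean(np.log2(sizes / n))

def neighborhood_mutual_info(XB, XC, delta):
    # NMI_delta(B; C) = -(1/|U|) * sum_i log2(
    #     |N_B(x_i)| * |N_C(x_i)| / (|U| * |N_{B∪C}(x_i)|) )
    n = XB.shape[0]
    nb = neighborhoods(XB, delta).sum(axis=1)
    nc = neighborhoods(XC, delta).sum(axis=1)
    nbc = neighborhoods(np.hstack([XB, XC]), delta).sum(axis=1)
    return -np.mean(np.log2(nb * nc / (n * nbc)))

# Four well-separated points: every neighborhood is a singleton,
# so the neighborhood entropy equals log2(|U|) = 2 bits.
X = np.array([[0.0], [10.0], [20.0], [30.0]])
print(neighborhood_entropy(X, delta=1.0))  # 2.0
```

When every object's neighborhood is the whole universe (delta large), the entropy drops to zero, matching the remark after Definition 4.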

Relevance, Redundancy, and Interaction
Concepts such as relevance, redundancy, and interaction of features have been used frequently in the study of feature selection. However, a quantitative formalism for mixed data has not been available to date. In this section, we redefine the relevant feature, redundant feature, and interactive feature using neighborhood information measures.
In the discrete case, mutual information has frequently been used to evaluate the strength of the relevance between a feature f_i and the class C. In this situation, the features are evaluated individually. However, some features influence the class variable as a group rather than individually. A well-known illustration of this phenomenon is the XOR problem, as shown in Table 1.
One can see that f_1 and f_2 have null relevance individually; that is, MI(f_1; C) = 0 and MI(f_2; C) = 0. That is to say, feature f_1 (or f_2) would be considered irrelevant in terms of mutual information. However, when we combine f_1 and f_2, the maximal relevance is obtained; that is, MI(f_1, f_2; C) = H(C) = 1. This indicates that feature f_1 (or f_2) is strongly relevant to the class C when considered jointly. Moreover, mutual information is difficult to compute when the features are continuous. Therefore, it is necessary to redefine the relevant feature in terms of neighborhood mutual information. Previous work mostly focuses on the definitions of relevant features and redundant features [39]; interactive features are often ignored. To judge whether there exists interaction or redundancy between features, we introduce the neighborhood interaction gain by using neighborhood information measures.
Definition 13 (neighborhood interaction gain). Let F be a full set of features, f_i, f_j ∈ F, and let C be the class; then the neighborhood interaction gain is defined as

NIG_δ(f_i, f_j) = NMI_δ(f_i, f_j; C) − NMI_δ(f_i; C) − NMI_δ(f_j; C).

By Theorem 11, it can equivalently be expressed in terms of conditional neighborhood mutual information. Neighborhood interaction gain can be interpreted as the change in the dependence between feature f_i (or f_j) and the class C caused by introducing the context f_j (or f_i). When the neighborhood interaction gain is negative, the context decreases the amount of dependence; when it is positive, the context increases the amount of dependence; and when it is zero, the context does not affect the dependence between feature f_i (or f_j) and the class C.
If the neighborhood interaction gain is positive, we benefit from a synergy between the features f_i and f_j [40,41]. In other words, the addition of feature f_j has a positive influence on predicting C with f_i. A well-known example of such synergy is the XOR problem. If the neighborhood interaction gain is negative, we suffer diminishing returns from several features providing overlapping, redundant information. In fact, the neighborhood interaction gain is the amount of information gained (or lost) in transmission by controlling one feature when the other feature is already known [42]. Based on this, we give new definitions of redundant and interactive features in the following.

Definition 14 (redundant feature). Let F be a full set of features and f_i, f_j ∈ F. Feature f_i is said to be redundant with feature f_j if and only if

NIG_δ(f_i, f_j) ≤ 0.

According to Definition 14, NIG_δ(f_i, f_j) ≤ 0 suggests a redundancy between f_i and f_j; in other words, they both provide in part the same information about the class C. Therefore, the inequality implies that f_i is a redundant feature when feature f_j is given.
Definition 15 (interactive feature). Let F be a full set of features and f_i, f_j ∈ F. Feature f_i is said to be interactive with feature f_j if and only if

NIG_δ(f_i, f_j) > 0.

According to Definition 15, NIG_δ(f_i, f_j) > 0 indicates a synergy between features f_i and f_j; that is, they yield more information together than could be expected from the sum of NMI_δ(f_i; C) and NMI_δ(f_j; C). In other words, the absence of either feature will decrease the ability to predict the class C.
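A small numerical experiment illustrates how the sign of the neighborhood interaction gain separates the two cases. The sketch below uses synthetic data and the same assumed neighborhood estimates as above (Euclidean neighborhoods, log base 2); function names and parameters are illustrative, not from the paper:

```python
import numpy as np

def nmi(XB, XC, delta):
    # Neighborhood mutual information (a sketch of Definition 8); the joint
    # neighborhood is computed on the concatenated feature columns.
    n = XB.shape[0]
    def sizes(X):
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        return (d <= delta).sum(axis=1)
    return -np.mean(np.log2(sizes(XB) * sizes(XC)
                            / (n * sizes(np.hstack([XB, XC])))))

def interaction_gain(fi, fj, c, delta=0.15):
    # NIG_delta(f_i, f_j) = NMI(f_i, f_j; C) - NMI(f_i; C) - NMI(f_j; C):
    # positive -> synergy (interaction), negative -> overlap (redundancy).
    return (nmi(np.hstack([fi, fj]), c, delta)
            - nmi(fi, c, delta) - nmi(fj, c, delta))

rng = np.random.default_rng(0)
f1 = rng.random((200, 1))  # synthetic features scaled to [0, 1]
f2 = rng.random((200, 1))

# XOR-like class: determined only by f1 and f2 jointly -> positive gain.
c_xor = ((f1 > 0.5) ^ (f2 > 0.5)).astype(float)
print(interaction_gain(f1, f2, c_xor) > 0)         # expected: True (synergy)

# Class determined by f1 alone; a duplicate of f1 -> negative gain.
c_lin = (f1 > 0.5).astype(float)
print(interaction_gain(f1, f1.copy(), c_lin) < 0)  # expected: True (redundancy)
```

Ranking by individual relevance alone would treat f1 and f2 identically in both scenarios; the sign of the gain is what tells the two situations apart.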

Proposed Feature Selection Algorithm
In this section, we define the neighborhood interaction weight factor based on the neighborhood interaction gain and then move on to present our proposed feature subset selection algorithm.

Neighborhood Interaction Weight Factor.
One can see that the introduction of feature f_j affects the dependence between the feature f_i and the class C. A positive neighborhood interaction gain means that we cannot depict their relationship without considering both of them at once, and the addition of the other feature increases the amount of dependence. That is to say, the introduction of feature f_j has a positive influence on predicting the class variable C; correspondingly, we should increase the weight of feature f_i. A negative neighborhood interaction gain means that the introduction of the new feature inhibits the amount of dependence. That is to say, the introduction of feature f_j has a negative influence on predicting the class variable C; correspondingly, we should decrease the weight of feature f_i. Therefore, we can define the neighborhood interaction weight factor based on the neighborhood interaction gain. It makes it possible to analyze relationships between features and to guide feature selection and construction.
Definition 16 (neighborhood interaction weight factor). The neighborhood interaction weight factor NIW_δ(f_i, f_j) of feature f_i with respect to feature f_j is obtained by normalizing the neighborhood interaction gain, so that redundancy yields a weight below 1 and interaction a weight above 1. Ignoring interaction results in the loss of some valuable features in the process of feature selection. To solve this problem, we first compute the neighborhood mutual information between a feature and the class and then adjust it through the interaction weight factor, which reflects whether a feature is redundant or interactive. The candidate features are then ranked with the adjusted relevance measure. The corresponding descriptive pseudocode is shown in Algorithm 1.
Features can be selected by different search strategies. For the sake of efficiency, we use the sequential forward search technique in this paper. A predefined threshold K is used to terminate the procedure, and δ is a neighborhood parameter. For a dataset T with original feature set F = {f_1, f_2, . . ., f_n} and the class C, we rank the features in descending order according to the adjusted relevance measure and then select the first K features, where K has been specified in advance.
NIWFS is a feature ranking algorithm. Firstly, we initialize the parameters, which consist of the selected feature subset and the weight of each feature, and employ the neighborhood mutual information NMI_δ(f_i; C) as the measure of relevance. Secondly, candidate features are weighted through the neighborhood interaction weight factor NIW_δ(f_i, f_j), and the original relevance NMI_δ(f_i; C) is redressed by multiplying it by the weight w(f_i). The feature f_i with the largest adjusted relevance is selected and moved from the feature set F to the subset S. This process repeats until K features have been selected. According to Theorem 18, the weight of a redundant feature lies in the range [0, 1], so its adjusted relevance decreases when multiplied by the weight w(f_i); according to Theorem 19, the weight of an interactive feature lies in the range [1, 2], so its adjusted relevance increases when multiplied by the weight w(f_i). Therefore, the adjusted relevance measure reflects whether a feature is redundant or interactive.
To determine the threshold K, we may use a specific classifier and select the subset of features producing the highest accuracy. We now analyze the time complexity of NIWFS for a dataset with n features. First of all, we need to calculate the neighborhood mutual information between the n features and the class, and the time complexity is O(n). Assuming that k features have been selected, we compute the adjusted relevance measure between the n − k remaining features and the class; the computational complexity is Σ_{k=1}^{K}(n − k + 1). Besides, the time complexity of calculating the updated weights is Σ_{k=2}^{K}(n − k + 1). Therefore, the total complexity of NIWFS is O(Kn), the same as that of NMI. In the worst case, when all features are selected, the total complexity is O(n²); however, in most cases, K ≪ n.

Experiments
In this section, we empirically evaluate the performance of our proposed algorithm and present experimental results comparing it with three other types of feature subset selection algorithms on ten real-world datasets.

Experiment Setup.
To verify the effectiveness of our method, ten datasets were downloaded from the UCI machine learning repository [43]. The description of the datasets is presented in Table 2. Among the 10 datasets, three are completely discrete, three are completely continuous, and the other four are heterogeneous. The sizes of the datasets vary from 32 to 8124, the numbers of candidate features vary from 13 to 279, and the numbers of classes vary from 2 to 19.
All the continuous features are transformed to the interval [0, 1] in preprocessing, while the discrete features are coded with a sequence of integers. The 2-norm (Euclidean distance) is used to compute distances. The neighborhood parameter δ is set to 0 for categorical features. According to observations made by Hu et al. [31], the threshold δ should take a value in [0.1, 0.2] for numerical features; in the following, we set the neighborhood parameter to 0.15, as suggested by Hu et al. [35]. As INTERACT cannot deal with numerical features directly, we employ the MDL discretization method to transform the numerical features into discrete ones [44]. For datasets with missing values, we replace all missing values of nominal and numerical features with the modes and means from the training data [45].
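The preprocessing described above can be sketched as follows (helper names are illustrative; mapping a constant column to zero is our assumption, since the min-max formula is undefined there):

```python
import numpy as np

def minmax_scale(col):
    # Map a numerical column onto [0, 1]; a constant column maps to all zeros.
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo) if hi > lo else np.zeros_like(col)

def encode_nominal(col):
    # Code a nominal column with consecutive integers 0, 1, 2, ...
    # Distinct values then sit at distance >= 1, so with delta < 1 a
    # neighborhood contains exactly the samples sharing the same value.
    _, codes = np.unique(col, return_inverse=True)
    return codes.astype(float)

print(minmax_scale(np.array([2.0, 4.0, 6.0])))    # [0.  0.5 1. ]
print(encode_nominal(np.array(["b", "a", "b"])))  # [1. 0. 1.]
```

Integer coding of nominal values is what makes a single neighborhood definition work for both feature types: for δ in [0.1, 0.2], a categorical neighborhood degenerates to the classical equivalence class.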
Three representative feature selection algorithms are selected for comparison with NIWFS. To evaluate the performance of NIWFS in terms of handling feature interaction, INTERACT [16], an algorithm specifically proposed to address feature interaction, is selected as one benchmark. Moreover, we also compare NIWFS with NRS and NMI, which can handle mixed datasets directly. NRS evaluates features with a function called dependency, the ratio of consistent samples over all learning samples; NMI employs the neighborhood mutual information to select relevant features based on the criterion of maximal dependency.
We use a specific classifier to select the top K features producing the highest accuracy. Two representative classification algorithms based on different hypotheses are employed to test the performance of the selected features: the tree-based C4.5 [46] and the instance-based IB1 [47]. The whole classification process is conducted in WEKA release 3.6.9 with default parameter settings, and the classification accuracy is obtained by 10-fold cross-validation. Tables 3 and 4 show the classification accuracies and the numbers of selected features obtained with the original features (Unselect) and with INTERACT, NRS, NMI, and NIWFS under the different classifiers. A bold value is the largest among the four feature selection algorithms. The row Avg. shows the averages of the accuracies and of the numbers of selected features for the different learning algorithms. In addition, a paired two-tailed t-test between the accuracies of NIWFS and the other selectors has been performed, and the number of datasets with higher (or equal, or lower) accuracy with respect to NIWFS is reported as WTL (win/tie/loss). The symbols "v" and "*", respectively, identify statistically significant (at the 0.05 level) wins or losses against our proposed method.

Experimental Results and Analysis
As we can see in Tables 3 and 4, all of the feature selection algorithms remove a large number of candidate features effectively. Our proposed NIWFS algorithm obtains the best average accuracies for all the classification algorithms. For instance, with the IB1 learning algorithm, the average accuracy is 84.7% for NIWFS, while the accuracies of INTERACT, NRS, and NMI are 78.7%, 81.4%, and 81.0%, respectively; that is, their average accuracies are lower than that of NIWFS by 7.6%, 4.1%, and 4.6% in relative terms. Comparing the accuracies of the four feature selection algorithms, we find that NIWFS exhibits the highest classification accuracy.
The results obtained by the NIWFS method are better than or at least equal to those obtained by the INTERACT, NRS, and NMI methods from the win/tie/loss point of view. For example, the numbers of cases in which NIWFS achieves significantly higher classification accuracy than INTERACT, NRS, and NMI are seven, seven, and eight out of ten, respectively, with the IB1 classifier.
The average numbers of features selected by NIWFS are 10.6 and 12.6 with C4.5 and IB1, respectively. Notice that the average number of features selected by NIWFS with the C4.5 classifier, 10.6, is the smallest among these methods. In general, our proposed algorithm achieves better results than the other three feature selection algorithms.
From the experimental results, we also find that INTERACT has the lowest average classification accuracy among the four feature selection algorithms. There are two main reasons for this. Firstly, INTERACT can only deal with nominal features, so some valuable information may be lost in the process of discretization. Secondly, INTERACT does not use a wrapper. The reason why NIWFS wins over NMI and NRS is that NIWFS considers not only the relevance between a single feature and the class, but also the redundancy and interaction with other features, which are expressed by the interaction weight factor. Therefore, NIWFS performs better when there is feature interaction in the dataset.

To further compare the effectiveness of NIWFS with NMI and NRS, which can deal with mixed data directly, we add features for learning one by one in the order in which the features are selected. In the experiments, three representative datasets are chosen: arrhythmia, movement libras, and synthetic control. To reduce the bias of a feature assessment based on a specific classifier, we calculate the average classification accuracies of the classifiers for NIWFS, NMI, and NRS. The comparison results are shown in Figures 1, 2, and 3. The number k on the x-axis refers to the first k features in the order selected by the different methods; the y-axis represents the average classification accuracy of the first k features.
The results in Figures 1-3 show that the best average accuracy of the classifiers with NIWFS is higher than with NMI and NRS. On the arrhythmia dataset, the curves of NIWFS are much higher than those of NRS and higher than those of NMI in the range of 10-30 features; NIWFS achieves 67.15% classification accuracy with 24 features, which is higher than NMI by 3.95% and higher than NRS by 4.68%. On the movement libras dataset, the curves of NIWFS are higher than those of NMI and NRS in the range of 13-30 features; NIWFS produces its best accuracy, 76.81% with 17 features, which is about 3% higher than NMI and about 1% higher than NRS. For the synthetic control dataset, the curves of NIWFS are much higher than those of NMI and NRS in the range of 9-20 features; the highest accuracy achieved by NIWFS is 94.33% with 16 selected features, while the highest accuracies are 91.92% with 7 selected features for NMI and 91.17% with 8 selected features for NRS. This demonstrates that selecting very few features is not necessarily a good feature selection result: some interactive features may be lost in the process of removing redundancy.
We also find that the curves of NIWFS for the first few features are lower than those of NMI and NRS in some cases. The main reason is that NIWFS does not select the first few features with the maximal relevance to the class, because their weights are reduced by the redundancy analysis.

Conclusions and Future Work
The main goal of feature selection is to find a feature subset that is small in size but high in prediction accuracy. Feature interaction exists in many applications, and it is a challenging task to find interactive features. In this paper, we present an interactive feature searching algorithm based on neighborhood information measures. First, new definitions of redundant and interactive features are given in the framework of neighborhood rough sets. Then we propose the neighborhood interaction weight factor, which reflects whether a feature is redundant or interactive, and, based on it, we present our feature selection method. The method is compared with three other feature selection methods in terms of the number of selected features and the accuracies of two classifiers (C4.5 and IB1) on ten public real-world datasets. The experimental results show that NIWFS not only deals with mixed datasets directly, but also removes a large number of features while achieving the best average classification accuracies. However, our method is relatively time-consuming, mainly because the computation of the neighborhood mutual information involves distance calculations. In future work, we plan to improve the efficiency of NIWFS.

Figure 2: Average classification accuracy versus different numbers of selected features on the movement libras dataset.

Figure 3: Average classification accuracy versus different numbers of selected features on the synthetic control dataset.

Table 4: Number of selected features and classification accuracy (%) with different algorithms (IB1).