Feature Selection in Decision Systems: A Mean-Variance Approach

Uncertainty measures are an important tool for characterizing the degree of uncertainty and have been extensively applied in pattern recognition and data clustering. Because traditional uncertainty measures are unstable, the mean-variance measure (MVM) is used to perform feature selection, since it can effectively suppress disturbances and noise. A novel evaluation function based on MVM is designed, and the forward greedy search algorithm (FGSA) with the proposed evaluation function is used to perform feature selection. Experimental analysis shows the validity and effectiveness of MVM.

Uncertainty measures are an important tool for characterizing the degree of uncertainty in rough set theory. They have been extensively applied in pattern recognition and data clustering. However, as this paper shows, classical uncertainty measures are sensitive to disturbances and noise. Therefore, a novel uncertainty measure, called the mean-variance measure (MVM), was proposed in [11] to characterize the degree of uncertainty of rough sets. Since it takes full account of the information in the boundary region, MVM is more robust and effective than classical uncertainty measures in suppressing disturbances and noise.
As an important application of rough sets in artificial intelligence and machine learning, feature selection (or attribute reduction) in information systems has drawn wide attention, because excessive features or attributes usually confuse learning algorithms, significantly slow down learning, and increase the risk that learned classifiers overfit the training data [4,12].
Unfortunately, it has been proved that finding all reducts or finding an optimal reduct (a reduct with the least number of attributes) is an NP-hard problem [13]. Many researchers therefore devote themselves to finding a reduct efficiently by optimization techniques. The forward greedy search algorithm (FGSA), also called the hill-climbing algorithm or greedy algorithm, is one such technique for finding one reduct quickly and has been extensively investigated [14][15][16]. A key ingredient of FGSA is an evaluation function that measures the importance of each feature or attribute in a database. The evaluation function induced by the classical uncertainty measure, that is, Pawlak's roughness or dependency, has been successfully applied in rough set based feature selection [17,18]. Along with the development of rough sets, attribute reduction has been studied extensively in the past decade, including attribute reduction based on fuzzy rough sets [19][20][21][22][23], neighborhood rough sets [24], cross-entropy [25], tolerance rough sets [26], cost [27,28], dynamic attribute reduction [29,30], extended rough sets [31], cover rough sets [14,32], covering generalized rough sets [33], and variable precision rough sets [34]. Nevertheless, the classical uncertainty measure is not robust and may fluctuate considerably under minor disturbances: even a small change in an information system may produce an unpredictable fluctuation of this measure.
The mean value and variance in probability theory, which can be used to analyze data precisely, have been widely discussed in portfolio optimization and portfolio selection [35][36][37][38]. They serve as a criterion for judging whether a group of data is robust and stable. For example, suppose two shooters obtain the same score (mean value). If one has to be chosen to take part in a tournament, which one should be chosen? Apparently, the one with the smaller variance in scores should be chosen. In this paper, the notions of mean value and variance are introduced into information systems as a criterion for evaluating the degree of uncertainty. The resulting uncertainty measure is called the mean-variance measure (MVM). MVM first calculates the mean value for every object and then takes the variances of all objects into account. The effect of data disturbances in decision systems on MVM is reduced, since a tiny alteration of values does not result in a large change of variance.
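The shooter example can be made concrete with a few lines of Python. This is only an illustrative sketch: the two score lists are made-up numbers, chosen so that both shooters have the same mean, and the population variance from the standard `statistics` module stands in for "variance".

```python
from statistics import mean, pvariance

# Hypothetical scores for two shooters (invented for illustration only).
shooter_a = [9, 9, 9, 9, 9]
shooter_b = [10, 8, 10, 7, 10]

mean_a, mean_b = mean(shooter_a), mean(shooter_b)
var_a, var_b = pvariance(shooter_a), pvariance(shooter_b)

# The means are equal, so the variance is what separates the two shooters.
more_stable = "A" if var_a < var_b else "B"
```

Both means equal 9, while the variances are 0 and 1.6, so the criterion prefers shooter A.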
Based on MVM, an evaluation function for decision systems, called D-MVM, is further designed. The designed evaluation function takes full account of the information in both the positive region and the boundary region.
This paper is organized as follows. Some elementary concepts on rough sets and MVM are reviewed in Section 2. Section 3 investigates feature selection in decision systems by MVM. Experimental results and analysis are given in Section 4, and Section 5 concludes this paper.

Preliminaries
2.1. Rough Sets. This section briefly outlines some basic notions on rough sets.
Definition 1. An information system is a pair $S = (U, \mathrm{AT})$ satisfying (1) $U$ is a nonempty finite set of objects; (2) $\mathrm{AT}$ is a nonempty finite set of attributes; (3) for every $a \in \mathrm{AT}$, there is a mapping $a : U \to V_a$, where $V_a$ is the set of values of $a$.
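Definition 2. Given an information system $S = (U, \mathrm{AT})$ and $A \subseteq \mathrm{AT}$, an indiscernibility relation on $U$ is defined by
$$R_A = \{(x, y) \in U \times U \mid a(x) = a(y) \text{ for all } a \in A\}.$$
Obviously, $R_A$ is an equivalence relation induced by the attribute set $A$. $[x]_{R_A} = \{y \in U \mid (x, y) \in R_A\}$ is referred to as the equivalence class of $x$ with respect to $R_A$. The partition of $U$ induced by the equivalence relation $R_A$ is denoted by
$$U/R_A = \{X_1, X_2, \ldots, X_m\},$$
where $X_i$ is an equivalence class of $R_A$ in $U$, $i = 1, 2, \ldots, m$. For short, $U/R_A$ and $[x]_{R_A}$ are denoted by $U/A$ and $[x]_A$, respectively, when no ambiguity arises in this paper.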
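Definition 3. Given an information system $S = (U, \mathrm{AT})$, $A \subseteq \mathrm{AT}$, and $X \subseteq U$, the lower approximation and the upper approximation of $X$ with respect to $A$ are defined, respectively, by
$$\underline{A}(X) = \{x \in U \mid [x]_A \subseteq X\}, \qquad \overline{A}(X) = \{x \in U \mid [x]_A \cap X \neq \emptyset\}.$$
Definition 4. Given an information system $S = (U, \mathrm{AT})$, a partial ordering relation $\preceq$ on the family $\{U/A \mid A \subseteq \mathrm{AT}\}$ is defined as
$$U/A \preceq U/B \iff \text{for every } X_i \in U/A \text{ there exists } Y_j \in U/B \text{ such that } X_i \subseteq Y_j.$$
$B$ is said to be coarser than $A$, or $A$ is finer than $B$, if $U/A \preceq U/B$. $A$ is said to be strictly finer than $B$, denoted by $U/A \prec U/B$, if $U/A \preceq U/B$ but $U/A \neq U/B$.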
From Proposition 5, the more attributes an information system contains, the finer the corresponding partition is. Therefore, $U/\mathrm{AT}$ is the finest among the partitions induced by all subsets of $\mathrm{AT}$.
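The notions of partition, lower approximation, and upper approximation can be illustrated with a short Python sketch. The toy table, the attribute names `a` and `b`, and the helper names `partition` and `lower_upper` are all assumptions made for this illustration; they do not come from the experiments in this paper.

```python
from collections import defaultdict

def partition(table, attrs):
    """Group objects whose values agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())

def lower_upper(table, attrs, target):
    """Return (lower approximation, upper approximation) of target."""
    lower, upper = set(), set()
    for block in partition(table, attrs):
        if block <= target:      # block entirely inside the target set
            lower |= block
        if block & target:       # block overlaps the target set
            upper |= block
    return lower, upper

# Hypothetical information system: six objects, two attributes.
table = {
    1: {"a": 0, "b": 0},
    2: {"a": 0, "b": 0},
    3: {"a": 0, "b": 1},
    4: {"a": 1, "b": 1},
    5: {"a": 1, "b": 1},
    6: {"a": 1, "b": 0},
}
X = {1, 2, 3, 4}

coarse = lower_upper(table, ["a"], X)
fine = lower_upper(table, ["a", "b"], X)
```

Adding attribute `b` refines the partition from two blocks to four and shrinks the upper approximation of `X` from all six objects to five, in line with the remark that more attributes give a finer partition.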
The classical uncertainty measure is defined as follows.
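Definition 6. Given an information system $S = (U, \mathrm{AT})$ or an incomplete information system $S = (U, \mathrm{AT})$, $A \subseteq \mathrm{AT}$, and $X \subseteq U$, the roughness of $X$ is defined as
$$\rho_A(X) = 1 - \frac{|\underline{A}(X)|}{|\overline{A}(X)|}.$$
The quantity $\rho_A(X)$ characterizes the uncertainty degree of $X$ with respect to $A$. When $\rho_A(X) = 0$, $X$ is said to be definable; otherwise, it is said to be rough.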
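As a small worked illustration of the classical roughness (one minus the ratio of the sizes of the lower and upper approximations), the sketch below computes it for a partition given directly as a list of blocks. The function name `roughness`, the toy partitions, and the convention of returning 0 for an empty target set are assumptions of this sketch.

```python
from fractions import Fraction

def roughness(blocks, target):
    """Pawlak roughness: 1 - |lower approximation| / |upper approximation|."""
    lower = sum(len(b) for b in blocks if b <= target)
    upper = sum(len(b) for b in blocks if b & target)
    if upper == 0:
        # Empty target set: treated as definable here (a convention of this sketch).
        return Fraction(0)
    return 1 - Fraction(lower, upper)

# Two hypothetical partitions of the objects 1..6, the second finer than the first.
coarse_blocks = [{1, 2, 3}, {4, 5, 6}]
fine_blocks = [{1, 2}, {3}, {4, 5}, {6}]
X = {1, 2, 3, 4}

rough_coarse = roughness(coarse_blocks, X)
rough_fine = roughness(fine_blocks, X)
```

For this toy data the roughness is 1/2 under the coarse partition and 2/5 under the fine one, while a set such as {1, 2, 3}, which is a union of blocks, has roughness 0.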
When $\mathrm{AT}$ is divided into two nonempty sets $C$ and $D$ such that $C \cap D = \emptyset$, then $S = (U, \mathrm{AT})$, denoted by $S = (U, C \cup D)$, is called a decision system, $C$ is called the conditional attribute set, and $D$ is called the decision attribute set.
Definition 7. Given a decision system $S = (U, C \cup D)$, the dependency degree of $D$ on $C$ is defined by
$$\gamma_C(D) = \frac{|\mathrm{POS}_C(D)|}{|U|},$$
where $\mathrm{POS}_C(D) = \bigcup_{X \in U/D} \underline{C}(X)$ is the positive region of $D$ with respect to $C$ and $|\ast|$ denotes the cardinality of $\ast$.
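The dependency degree can be sketched in the same style. The positive region is collected as the union of those condition blocks that fall entirely inside one decision class. The function name `dependency` and the toy partitions below are hypothetical.

```python
from fractions import Fraction

def dependency(cond_blocks, dec_blocks, n_objects):
    """Size of the positive region divided by the number of objects."""
    positive = sum(
        len(block)
        for block in cond_blocks
        if any(block <= dec_class for dec_class in dec_blocks)
    )
    return Fraction(positive, n_objects)

# Hypothetical decision classes and two condition partitions over objects 1..6.
dec_blocks = [{1, 2, 3, 4}, {5, 6}]
coarse_blocks = [{1, 2, 3}, {4, 5, 6}]
fine_blocks = [{1, 2}, {3}, {4, 5}, {6}]

gamma_coarse = dependency(coarse_blocks, dec_blocks, 6)
gamma_fine = dependency(fine_blocks, dec_blocks, 6)
```

Here the dependency rises from 1/2 to 2/3 when the partition is refined, because only the block {4, 5} still straddles both decision classes.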
Attribute reduction in decision systems is defined as follows.

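2.2. A Novel Uncertainty Measure of Rough Sets. Given an information system $S = (U, \mathrm{AT})$ and $X \subseteq U$, the characteristic function of $X$ on $U$ can be denoted by
$$X(x) = \begin{cases} 1, & x \in X, \\ 0, & x \notin X, \end{cases}$$
where $x \in U$.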
In rough set theory, objects in the same equivalence class cannot be distinguished from each other, since they have the same characteristics. However, in the boundary region of a rough set, objects in the same class have different characteristic values. In this case, the mean value over the objects in a class is used to characterize each object.
As mentioned above, when an object $x$ is not in $X$, its mean value is nonzero if and only if its equivalence class has a nonempty intersection with $X$; when $x$ is in $X$, its mean value is 1 if and only if its equivalence class is contained in $X$. From Definition 10, it is easy to verify that the mean value $\bar{X}_A(x)$ is the inclusion degree $D(X/[x]_A) = |[x]_A \cap X| / |[x]_A|$ of $[x]_A$ being included in $X$.
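Read as an inclusion degree, the mean value of an object is the fraction of its equivalence class that lies in the target set. The following minimal sketch assumes that reading; the function name `mean_value` and the toy partition are hypothetical.

```python
from fractions import Fraction

def mean_value(blocks, target, x):
    """Average of the 0/1 membership in target over the block containing x."""
    block = next(b for b in blocks if x in b)
    return Fraction(len(block & target), len(block))

# Hypothetical partition of the objects 1..6 and a target set.
blocks = [{1, 2}, {3}, {4, 5}, {6}]
X = {1, 2, 3, 4}

means = {x: mean_value(blocks, X, x) for x in range(1, 7)}
```

Objects 1, 2, and 3 (positive region) get mean value 1, object 6 (negative region) gets 0, and objects 4 and 5, which share a boundary block, both get 1/2 although only object 4 belongs to the target set.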
Proposition 11. Given an information system $S = (U, \mathrm{AT})$, $X \subseteq U$, $A \subseteq \mathrm{AT}$, and $x \in U$, the following conclusions hold:
Note that $\bar{X}_A(x) = X(x)$ when $x$ is in the positive region or the negative region. It is obvious that $\bar{X}_A(x) \neq X(x)$ only when $x$ is in the boundary region.
Definition 12. Given an information system $S = (U, \mathrm{AT})$, $A \subseteq \mathrm{AT}$, and $X \subseteq U$, the mean-variance uncertainty measure (MVM) of $X$ with respect to $A$, denoted by $\mathrm{MVM}_A(X)$, is defined as

It is clear that
Assume $\mathrm{MVM}_A(X) = 0$ when $X = \emptyset$ or $A = \emptyset$. From Definition 12 one can see that only objects in the boundary region of $X$ contribute to the value of $\mathrm{MVM}_A(X)$. In this sense, $\mathrm{MVM}_A(X)$ takes full account of the information in the boundary region. Therefore, it is a proper measure to evaluate the uncertainty of $X$.
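A numerical sketch may help to see why only boundary objects contribute. Since the defining formula is not reproduced in this text, the code below assumes one natural reading: the average, over all objects, of the squared difference between an object's 0/1 membership in the target set and the mean membership of its equivalence class. That form, its normalization by the number of objects, the function name `mvm`, and the toy partitions are all assumptions, not the authors' stated definition.

```python
from fractions import Fraction

def mvm(blocks, target):
    """Assumed form: mean squared gap between membership and class mean."""
    n = sum(len(b) for b in blocks)
    total = Fraction(0)
    for b in blocks:
        inside = len(b & target)
        p = Fraction(inside, len(b))              # class mean of the membership
        # Members contribute (1 - p)^2 each, non-members contribute p^2 each.
        total += inside * (1 - p) ** 2 + (len(b) - inside) * p ** 2
    return total / n

# Hypothetical partitions of the objects 1..6 and a target set.
coarse_blocks = [{1, 2, 3}, {4, 5, 6}]
fine_blocks = [{1, 2}, {3}, {4, 5}, {6}]
X = {1, 2, 3, 4}

mvm_coarse = mvm(coarse_blocks, X)
mvm_fine = mvm(fine_blocks, X)
```

Under this assumed form, blocks with class mean 0 or 1 add nothing, so the value is driven entirely by the boundary region; for the toy data it drops from 1/9 to 1/12 when the partition is refined, and it is 0 for a definable set.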
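Definition 13. Given an information system $S = (U, \mathrm{AT})$, $A, B \subseteq \mathrm{AT}$, and $X, Y \subseteq U$, (1) $X$ is said to be $A$-definable if $\mathrm{MVM}_A(X) = 0$; (2) $X$ is said to be $A$-rough if $\mathrm{MVM}_A(X) \neq 0$; (3) $X$ is said to be coarser with respect to $A$ than $Y$ with respect to $B$ if $\mathrm{MVM}_B(Y) < \mathrm{MVM}_A(X)$, in which case $Y$ is called finer with respect to $B$ than $X$ with respect to $A$.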
Next, we investigate properties of $\mathrm{MVM}_A(X)$ and show its effectiveness in evaluating the uncertainty of a set in information systems.

Feature Selection in Decision Systems
In this section, the proposed uncertainty measure is further investigated to perform feature selection in decision systems.
Definition 15. Given a decision system $S = (U, C \cup D)$ and $B \subseteq C$, the MVM of the decision attribute set $D$ with respect to the conditional attribute subset $B$, called D-MVM or the evaluation function and denoted by $\mathrm{DMVM}(D, B)$, is defined by
where $r$ is the number of decision classes induced by the decision attribute set $D$, $\mathrm{MVM}_B(D_i)$, $i = 1, 2, \ldots, r$, reflects the uncertainty measure of each decision class, and $\mathrm{DMVM}(D, B)$ describes the integrated uncertainty degree of the blocks $D_1, D_2, \ldots, D_r$.
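To make the idea of an integrated measure over decision classes concrete, the sketch below aggregates a per-class uncertainty value over all decision classes. Two points are assumptions rather than the definition used in this paper: the per-class measure is taken as the mean squared gap between membership and class mean, and the aggregation is a plain sum. The helper names and the toy decision table are hypothetical; the value 0 for the empty attribute set follows the convention stated in the text.

```python
from collections import defaultdict
from fractions import Fraction

def partition(rows, attrs):
    """Blocks of objects that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for obj, row in rows.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())

def mvm(blocks, target):
    """Assumed per-class measure: mean squared gap to the class mean."""
    n = sum(len(b) for b in blocks)
    total = Fraction(0)
    for b in blocks:
        inside = len(b & target)
        p = Fraction(inside, len(b))
        total += inside * (1 - p) ** 2 + (len(b) - inside) * p ** 2
    return total / n

def d_mvm(rows, cond_attrs, dec_attr):
    """Assumed aggregation: sum of the per-class measures; 0 for no attributes."""
    if not cond_attrs:
        return Fraction(0)
    blocks = partition(rows, cond_attrs)
    classes = partition(rows, [dec_attr])
    return sum((mvm(blocks, d) for d in classes), Fraction(0))

# Hypothetical decision table: conditions a, b and decision d.
rows = {
    1: {"a": 0, "b": 0, "d": "yes"},
    2: {"a": 0, "b": 0, "d": "yes"},
    3: {"a": 0, "b": 1, "d": "yes"},
    4: {"a": 1, "b": 1, "d": "yes"},
    5: {"a": 1, "b": 1, "d": "no"},
    6: {"a": 1, "b": 0, "d": "no"},
}

value_a = d_mvm(rows, ["a"], "d")
value_ab = d_mvm(rows, ["a", "b"], "d")
```

With these assumptions the value falls from 2/9 for `a` alone to 1/6 for `a` and `b` together, so a smaller value indicates less uncertainty about the decision.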
In the following, some properties of $\mathrm{DMVM}(D, B)$ are studied.
Definition 17. Given a decision system $S = (U, C \cup D)$ and $B \subseteq C$, $B$ is independent if
By D-MVM, a relative reduct can be defined as follows.
A relative reduct is a minimal subset which has the same discriminating power as the raw decision system.
Definition 19 (significance based on D-MVM). Given a decision system $S = (U, C \cup D)$, $B \subseteq C$, and a feature $a \in C - B$, the significance of $a$ is defined as
$$\mathrm{Sig}(a, B, D) = \mathrm{DMVM}(D, B \cup \{a\}) - \mathrm{DMVM}(D, B).$$
Notice that if $B$ is an empty set, $\mathrm{DMVM}(D, B) = 0$, and $\mathrm{Sig}(a, B, D)$ is a nonnegative real number; otherwise, $\mathrm{Sig}(a, B, D) \leq 0$.
With the proposed evaluation function, a forward greedy search algorithm for feature selection can be designed as follows.
In the first iteration, we start with the empty set, for which $\mathrm{DMVM}(D, \emptyset) = 0$. The quantity $\mathrm{Sig}(a, B, D)$ is nonpositive in every iteration except the first one. In each iteration all remaining features are evaluated, and the one with the minimal significance is chosen. The algorithm stops when adding any of the remaining features to the selected feature set no longer brings a change larger than the threshold $\varepsilon$ in Algorithm 1, where $\varepsilon$ controls the precision of the algorithm.
FGSA-MVM searches for a subset of conditional attributes with minimal positive D-MVM. Consider step 7 of FGSA-MVM. In the first iteration, we choose the minimal significance because $\mathrm{Sig}(a, \emptyset, D)$ is a positive number. In the remaining iterations, we also select the minimal significance, which gives the largest step length, since $\mathrm{Sig}(a, B, D)$ is nonpositive for any $B \neq \emptyset$.
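The search described above can be sketched as follows. This is a simplified illustration, not a transcription of Algorithm 1: the evaluation function is assumed to be the sum over decision classes of the mean squared gap between class membership and class mean, the significance of a feature is taken as the change in that function when the feature is added, and the stopping rule (stop once the best remaining feature lowers the function by at most `eps`, but always accept the first feature) is one reading of the text. All function names and the toy decision table are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def partition(rows, attrs):
    """Blocks of objects that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for obj, row in rows.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())

def mvm(blocks, target):
    """Assumed per-class measure: mean squared gap to the class mean."""
    n = sum(len(b) for b in blocks)
    total = Fraction(0)
    for b in blocks:
        inside = len(b & target)
        p = Fraction(inside, len(b))
        total += inside * (1 - p) ** 2 + (len(b) - inside) * p ** 2
    return total / n

def d_mvm(rows, cond_attrs, dec_attr):
    """Assumed evaluation function; 0 for the empty attribute set."""
    if not cond_attrs:
        return Fraction(0)
    blocks = partition(rows, cond_attrs)
    return sum((mvm(blocks, d) for d in partition(rows, [dec_attr])), Fraction(0))

def fgsa_mvm(rows, cond_attrs, dec_attr, eps=Fraction(0)):
    """Forward greedy search: repeatedly add the feature of minimal significance."""
    selected, remaining = [], list(cond_attrs)
    current = d_mvm(rows, selected, dec_attr)
    while remaining:
        # Significance of each candidate: change of the evaluation function.
        sig = {a: d_mvm(rows, selected + [a], dec_attr) - current for a in remaining}
        best = min(remaining, key=sig.get)
        # After the first feature, stop when the improvement is at most eps.
        if selected and -sig[best] <= eps:
            break
        selected.append(best)
        remaining.remove(best)
        current += sig[best]
    return selected

# Hypothetical decision table; attribute c is constant and hence uninformative.
rows = {
    1: {"a": 0, "b": 0, "c": 0, "d": "yes"},
    2: {"a": 0, "b": 0, "c": 0, "d": "yes"},
    3: {"a": 0, "b": 1, "c": 0, "d": "yes"},
    4: {"a": 1, "b": 1, "c": 0, "d": "yes"},
    5: {"a": 1, "b": 1, "c": 0, "d": "no"},
    6: {"a": 1, "b": 0, "c": 0, "d": "no"},
}

reduct = fgsa_mvm(rows, ["a", "b", "c"], "d")
reduct_loose = fgsa_mvm(rows, ["a", "b", "c"], "d", eps=Fraction(1, 10))
```

On this toy table the search picks `a` first (smallest evaluation value), then `b` (significance -1/18), and stops before `c`, whose significance is 0. With the looser threshold 1/10 it stops after `a`, which shows how the threshold trades precision for a shorter feature list.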

Experiments and Analysis
To test the validity of the proposed method for feature selection, comparative experiments on the efficiency and convergence of the proposed algorithm were carried out against two of the most important methods: feature selection based on dependency [39] and on mutual information (MI) [40].
As shown in Table 1, four standard data sets from the machine learning data repository of the University of California, Irvine, CA, USA [41], are employed in our experiments.
CART and RBF support vector machine (SVM) learning algorithms are used to test the classification performance on the raw feature sets and on the selected feature sets. As a widely used technique for evaluating classification performance in machine learning, 10-fold cross-validation [42] is carried out in our experiments by dividing the samples into 10 subsets. In each round, nine of them are used as the training set, and the remaining one is used as the test set. After 10 rounds, the average value and variation are computed as the final classification performance.
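The cross-validation protocol can be outlined in plain Python. The sketch below only handles the splitting and the aggregation of scores; the callback `fit_and_score`, which would train a classifier such as CART or an RBF-SVM on the training indices and return its accuracy on the test indices, is a hypothetical placeholder. Reporting the population variance as the "variation" is also an assumption.

```python
import random
from statistics import mean, pvariance

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Run k rounds; each fold serves once as the test set."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return mean(scores), pvariance(scores)
```

A call such as `cross_validate(len(samples), my_scorer)` returns the average score and its variance over the 10 rounds, where `samples` and `my_scorer` are supplied by the user.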
Classification performances are evaluated by CART in Table 2 and by RBF-SVM in Table 3. "Bold" marks the highest classification performance among those obtained. From the experiments one can see that the proposed measure performs best, yielding both the smallest average number of selected features in reducts and the highest classification performance.
In the remainder of this section, we turn to the convergence of the proposed method. Figure 2 shows how the evaluation functions change with the number of selected features. The significance of the selected features is calculated based on dependency, on MI, and on MVM, respectively, and the four data sets are used to show the convergence of the different techniques. The orders in which features are selected for the four data sets under the different evaluation functions are shown in Table 4; the sequences of selected features differ, even though the number of selected features in the optimal reducts may be the same. On the whole, significance degrees based on dependency and MI increase, while significance based on MVM decreases. With MVM, all four evaluation functions decrease quickly at the beginning of the selection process. The evaluation function of the credit data then decreases slowly, a different pattern of behavior from the other three data sets. In this case, feature selection algorithms may stop very early if a threshold is specified to stop the search. Convergence and good classification performance are observed in the results.

Conclusion
This contribution studied feature selection based on MVM in decision information systems, which is one of the most important applications of rough set theory. A novel approach to feature selection was proposed by introducing an evaluation function based on MVM. Theoretical analysis and experimental results show that the proposed method outperforms the methods based on dependency and on MI, not only in the number of selected features but also in classification precision.

Figure 2: Comparison of convergence among three methods.

Algorithm 1: Forward greedy search algorithm for feature selection based on mean-variance in decision systems (FGSA-MVM).
Input: $(U, C \cup D, \ldots)$,

Table 2: Comparison of classification performance of reducts based on different uncertainty measures with CART.

Table 3: Comparison of classification performance of reducts based on different uncertainty measures with SVM.

Table 4: Comparison of selected features by different uncertainty measures.