A Novel Approach of Rough Conditional Entropy-Based Attribute Selection for Incomplete Decision System

Pawlak's classical rough set theory has been applied to analyzing ordinary information systems and decision systems. However, few studies have addressed the attribute selection problem in incomplete decision systems because of its complexity. It is therefore necessary to investigate effective algorithms for this issue. In this paper, a new rough conditional entropy-based uncertainty measure is introduced to evaluate the significance of subsets of attributes in incomplete decision systems. Furthermore, some important properties of rough conditional entropy are derived, and three attribute selection approaches for incomplete decision systems are constructed: an exhaustive search strategy approach, a heuristic search strategy approach, and a probabilistic search strategy approach. Moreover, several experiments on real-life incomplete data sets are conducted to assess the efficiency of the proposed approaches. The experimental results indicate that two of these approaches give satisfactory performance for attribute selection in incomplete decision systems.


Introduction
Rough set theory, proposed by Pawlak [1][2][3], is an extension of set theory for the study of intelligent systems characterized by uncertain, imprecise, incomplete, and inconsistent data. It has been proven to be an innovative and efficient mathematical tool, compared with traditional data processing strategies such as PCA, neural networks, and SVM [4][5][6][7]. Unlike those methods, rough set theory allows the knowledge discovery process to be driven automatically by the data themselves, without any dependence on prior knowledge. By using the concepts of upper and lower approximations in the rough set model, knowledge hidden in information systems can be discovered and expressed in the form of decision rules. Rough set methodology presents a novel paradigm for dealing with uncertainty, and it has been successfully applied to feature selection [8], rule extraction [9,10], uncertainty reasoning [11][12][13], decision evaluation [14], granular computing [15,16], and so on.
It is well known that Pawlak's classical rough set model can only tackle problems in complete information systems [17]. Nevertheless, because of measurement errors, the imprecision arising from limited data acquisition methods, and other factors, empty values, which stand for information that is inaccessible for the moment, are inevitable in real databases. In other words, incomplete information systems with missing values often arise in practical knowledge acquisition. So far, two main approaches have been proposed to cope with incomplete information systems. One is the indirect approach, which transforms an incomplete information system into a complete one by eliminating objects with missing values or by filling in missing values with processed data. The other is the direct approach, which extends basic notions of the classical rough set models [18,19]. In the past decade, various extensions of rough set models have been proposed for different requirements, such as variable precision rough set models [20,21], rough set models based on tolerance relations [22,23], fuzzy rough set models [24], and weighted attribute values rough set models [25].
In many fields such as data mining, machine learning, and pattern recognition, data sets containing huge numbers of attributes are often encountered. In such cases, feature selection, also known as attribute reduction, is necessary. It is well known that irrelevant and redundant input attributes not only complicate the problem but also degrade solution accuracy [26,27]. As a significant step in data preprocessing, the main objective of attribute selection is to determine a minimal attribute subset, also called a reduct, from a problem domain that retains relatively high accuracy when replacing the original attributes.
In comparison with the study of attribute selection in complete information systems, scant effort has been made to develop tolerance relation-based methods of attribute selection for incomplete information systems [28][29][30][31], and even less work has been done on attribute selection for incomplete decision systems. As an effective method for attribute selection, rough set theory can preserve the meaning of the attributes. The essence of the rough set approach to attribute selection is to find a subset of attributes that can predict the decision concepts as well as the original attribute set does. So far, several widespread approaches have been considered for developing classical rough set theory: the tolerance-based rough set model [32], the covering rough set model [33,34], and the dominance-based rough set model [35,36]. However, they still have some inherent disadvantages and are not well suited to attribute selection in incomplete decision systems.
As is well known, finding the minimal reduct in an incomplete decision system is an NP-hard problem [37]. The general method for solving such problems is to adopt a heuristic search, which depends on measurements associated with the attributes [38]. However, little investigation has so far addressed the issue of measuring the uncertainty of knowledge in tolerance relation-based rough set models for incomplete decision systems. Hence, a further study of uncertainty measures applicable to evaluating the roughness and accuracy of a set in an incomplete decision system is of both theoretical and practical importance.
The main aim of this paper is to construct an effective uncertainty measure for evaluating the roughness and accuracy of knowledge, and thereby obtain heuristic attribute selection algorithms for incomplete decision systems. The rest of this paper is organized as follows. In Section 2, we briefly review some fundamental concepts concerning the main subject of this paper. Section 3 introduces entropy-based uncertainty measures and presents several attribute selection algorithms for incomplete decision systems. Experimental comparisons and results are illustrated and analyzed in Section 4. Finally, conclusions are presented in Section 5.

Preliminaries
Classical rough set theory was originated by Pawlak to deal with imprecise or vague concepts, and in the last decade many generalized rough set models have been proposed and developed [39]. In this section, we introduce only the notions used in this paper.
Pawlak's classical rough set model is defined first. An information system in rough set theory is a pair (U, C), where U = {x_1, ..., x_n} denotes a nonempty finite set of objects, called the universe of discourse, and C = {a_1, ..., a_m} denotes a finite set of condition attributes. With every attribute a ∈ C we associate a set V_a of its values, called the domain of a. Then a data pattern (a_1(x), ..., a_m(x)) is determined by an object x ∈ U and the attributes in C. A decision table, abbreviated as DT, is a special system of the form (U, C ∪ {d}), where d ∉ C denotes the decision attribute; let V_d denote the domain of the decision mapping d(x). In pattern classification applications, the attribute set is simply the feature set, and the universe of discourse may represent a training pattern set or a sign set of training pattern sets.
Let R be an equivalence relation on U. For a target set X ⊆ U, the sets R_*(X) = {x ∈ U | [x]_R ⊆ X} and R^*(X) = {x ∈ U | [x]_R ∩ X ≠ ∅} are called the lower and upper approximations of X, respectively. From these, the positive region POS_R(X) = R_*(X), the negative region NEG_R(X) = U − R^*(X), the boundary region BND_R(X) = R^*(X) − R_*(X), and the approximation measure α_R(X) = |R_*(X)| / |R^*(X)| are introduced, where X ≠ ∅. The lower approximation, which is equivalent to the positive region, is the complete set of objects in the universe that can be unambiguously classified as belonging to the target set X. In contrast, the upper approximation is the complete set of objects that are possibly members of X; in other words, these objects cannot be positively classified as belonging to the complement U − X. Furthermore, the negative region contains the objects that can be definitely ruled out as members of X. Finally, the approximation measure α_R(X) is intended to capture the degree of completeness of our knowledge about X.
In many cases, precise values of particular attributes in an information system are not known, being either missing or known only partially. Such a system is called an incomplete information system and is still denoted, without confusion, by a pair (U, C). In an incomplete information system, a missing value a(x) may be represented by the set of all possible values of the corresponding attribute, that is, a(x) = V_a. Moreover, if a(x) is known partially, for instance that a(x) cannot be v for some v ∈ V_a, then a(x) is specified as V_a − {v}. In what follows, we consider only incomplete information systems with missing values, and the special symbol "*" is used to indicate that the specific value of an attribute is missing. Let IIS = (U, C) be an incomplete information system. Each subset of attributes B ⊆ C determines a similarity relation SIM(B) = {(x, y) ∈ U × U | ∀a ∈ B, a(x) = a(y) or a(x) = * or a(y) = *}. The incomplete information system can also be described by a set-valued information system with a(x) = V_a whenever a(x) = *; in that case the similarity relation can be equivalently defined as SIM(B) = {(x, y) ∈ U × U | ∀a ∈ B, a(x) ∩ a(y) ≠ ∅}. Under the similarity relation SIM(B), two objects are considered possibly indiscernible in terms of the values of the attributes in B. A similarity relation is reflexive and symmetric but not necessarily transitive, so it is a tolerance relation.
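As a concrete illustration, the following Python sketch computes tolerance classes under SIM(B). The dict-based record layout and the attribute names are illustrative assumptions, with "*" marking a missing value as in the text.

```python
# Tolerance classes under the similarity relation SIM(B) for an
# incomplete information system; "*" marks a missing value.
MISSING = "*"

def similar(x, y, B):
    """(x, y) in SIM(B): x and y agree, or are indistinguishable
    because of a missing value, on every attribute in B."""
    return all(x[a] == y[a] or x[a] == MISSING or y[a] == MISSING
               for a in B)

def tolerance_class(U, x, B):
    """S_B(x): all objects possibly indiscernible from x w.r.t. B."""
    return [y for y in U if similar(x, y, B)]

# A toy universe of three objects over two condition attributes.
U = [
    {"a1": 1, "a2": MISSING},
    {"a1": 1, "a2": 0},
    {"a1": MISSING, "a2": 1},
]
print(len(tolerance_class(U, U[0], ["a1", "a2"])))  # 3
print(len(tolerance_class(U, U[1], ["a1", "a2"])))  # 2
```

The first object is similar to both of the others, while the second and third differ on a2; this shows why SIM(B) is reflexive and symmetric but not transitive.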

Rough Entropy-Based Uncertainty Measures
In this section, the concept of rough entropy is introduced to measure the uncertainty or roughness of knowledge in incomplete information systems. Then, some rough entropy-based uncertainty measures are presented for incomplete information systems and incomplete decision systems. Important properties of these uncertainty measures are derived, and the relationships among them are discussed as well.
For a given information system, we need to assess its uncertainty or roughness with respect to a target object or a target decision. A form of uncertainty measure called rough entropy has been used in rough sets, rough relational databases, and information systems to calculate the roughness of knowledge. The following definition describes rough entropy in incomplete information systems.
Let IIS = (U, C) be an incomplete information system with P, Q ⊆ C. If there exists a one-to-one onto function h mapping the family of tolerance classes induced by P to that induced by Q such that each class and its image have the same cardinality, that is, the two families are size isomorphic, then the rough entropies of P and Q are equal. Therefore, the rough entropy of knowledge is invariant with respect to different sets of tolerance classes that are size isomorphic.
Let IDS = (U, C ∪ {d}) denote an incomplete decision system with * ∈ V_a. For any subset of condition attributes B ⊆ C, a tolerance relation T_B, a generalized form of the former similarity relation, can be defined as T_B = {(x, y) ∈ U × U | ∀a ∈ B, a(x) = a(y) or a(x) = * or a(y) = *}. Then the tolerance class of an object x with respect to an attribute set B is T_B(x) = {y ∈ U | (x, y) ∈ T_B}. Obviously, the relation T_B is reflexive and symmetric, but may not be transitive.

Conditional Entropy Measure for Incomplete Decision System

The rough entropy of knowledge in IIS and IDS has been discussed above. In this subsection, we introduce a new form of conditional entropy and the mutual information based on the tolerance relation to measure the uncertainty of knowledge in incomplete decision systems, and then deduce some important properties, from which we get EN(d | B′) ≠ EN(d | B); combining parts (1) and (2) finally proves that the theorem holds.
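The displayed formula for the rough conditional entropy is not reproduced above, so the sketch below implements one plausible tolerance-based form, namely the average negative log-proportion of each object's tolerance class that shares its decision value, purely to illustrate the key behavior: EN(d | B) = 0 exactly when the attributes in B determine the decision. The record layout and this particular formula are assumptions, not the paper's exact definition.

```python
import math

MISSING = "*"

def tolerance_class(U, x, B):
    """S_B(x) under the similarity relation SIM(B)."""
    return [y for y in U
            if all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in B)]

def conditional_entropy(U, B, d):
    """An illustrative tolerance-based EN(d | B): averages -log2 of the
    fraction of each object's tolerance class sharing its decision."""
    total = 0.0
    for x in U:
        S = tolerance_class(U, x, B)
        agree = sum(1 for y in S if y[d] == x[d])
        total -= math.log2(agree / len(S))
    return total / len(U)

U = [
    {"a1": 1, "a2": MISSING, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 2, "a2": 0, "d": "no"},
]
print(conditional_entropy(U, ["a1", "a2"], "d"))        # 0.0
print(round(conditional_entropy(U, ["a2"], "d"), 3))    # 0.918
```

In a consistent system the entropy of the full condition set is 0, so the equality EN(d | B) = EN(d | C) can serve as a stop criterion for attribute selection.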

Attribute Selection Approaches Based on Rough Conditional Entropy for IDS
Two important steps are contained in the procedure of attribute selection: evaluating a candidate attribute subset and searching through the attribute space, both aimed at finding the most significant attributes, that is, a relative reduct. Therefore, we use the conditional entropy discussed in Section 3 to evaluate attribute subsets, and a measurement is defined as follows.
Definition 7. Given a consistent incomplete decision system IDS = (U, C ∪ {d}), let B ⊆ C and a ∈ C − B. Then the significance of attribute a relative to B is defined as Sig(a, B) = EN(d | B) − EN(d | B ∪ {a}). This definition describes the increment of discernibility power relative to the decision caused by involving attribute a.

It implies that the larger the difference between EN(d | B) and EN(d | B ∪ {a}) is, the more significant the specific condition attribute a is for the condition attribute subset B. Thus, it can be used as a new measurement for attribute selection in incomplete decision systems. According to this new measurement, three attribute selection approaches based on different search strategies are proposed in the following subsections.
4.1. Breadth-First: Exhaustive Search. Breadth-first is one of the earliest feature (or attribute) selection algorithms in the machine learning area. It begins with an empty attribute set and carries out the search process with a breadth-first strategy until it finds a minimal subset that satisfies the stop criterion. Since the Breadth-first algorithm adopts an exhaustive search strategy, it can guarantee an optimal solution [41]. We present the Breadth-first approach for attribute selection in incomplete decision systems in Algorithm 1.
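Algorithm 1 itself is not reproduced here; the following sketch captures the exhaustive breadth-first strategy it describes. The subset evaluator is an illustrative tolerance-based conditional entropy, and the dict-based data layout with "*" for missing values is an assumption.

```python
import math
from itertools import combinations

MISSING = "*"

def conditional_entropy(U, B, d):
    """An illustrative tolerance-based EN(d | B); "*" marks missing values."""
    total = 0.0
    for x in U:
        S = [y for y in U
             if all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in B)]
        agree = sum(1 for y in S if y[d] == x[d])
        total -= math.log2(agree / len(S))
    return total / len(U)

def breadth_first_reduct(U, C, d):
    """Enumerate subsets smallest-first; the first subset matching the
    entropy of the full condition set C is a minimal relative reduct."""
    target = conditional_entropy(U, C, d)
    for k in range(1, len(C) + 1):
        for B in combinations(C, k):
            if math.isclose(conditional_entropy(U, list(B), d), target):
                return set(B)
    return set(C)

U = [
    {"a1": 1, "a2": 5, "a3": MISSING, "d": 0},
    {"a1": 2, "a2": 5, "a3": 7, "d": 1},
    {"a1": 2, "a2": 6, "a3": 7, "d": 1},
]
print(breadth_first_reduct(U, ["a1", "a2", "a3"], "d"))  # {'a1'}
```

The smallest-first enumeration guarantees that the returned subset has minimal size, at the cost of examining up to 2^|C| subsets.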
Example 8. Given the incomplete decision system shown in Table 2, according to the previous definition of rough conditional entropy we can compute the conditional entropy values with respect to the different condition attribute subsets, working through subsets of increasing size. Once a subset B of three condition attributes satisfies EN(d | B) = EN(d | C) = 0, we obtain B as the desired selected attribute set, which is a relative reduct of the original condition attribute set C. The search procedure ends at this step, and it is not necessary to calculate the entropy value of the remaining combination of three attributes. The detailed search procedure is illustrated in Figure 1.

4.2. Depth-First: Heuristic Search. We can also use heuristic (greedy) search to find an attribute reduct. At the very beginning, the candidate attribute subset is empty. Then, at each step, the attribute that maximizes the significance measure is added to the selected attribute subset, until the stop criterion is satisfied. The Depth-first algorithm is fast, close to optimal, and deterministic [42]. We present the Depth-first approach for incomplete decision systems in Algorithm 2.
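The greedy strategy of Algorithm 2 can be sketched as follows, again using an illustrative tolerance-based conditional entropy as the evaluator; the formula and the data layout (dicts with "*" for missing values) are assumptions rather than the paper's exact implementation.

```python
import math

MISSING = "*"

def conditional_entropy(U, B, d):
    """An illustrative tolerance-based EN(d | B); "*" marks missing values."""
    total = 0.0
    for x in U:
        S = [y for y in U
             if all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in B)]
        agree = sum(1 for y in S if y[d] == x[d])
        total -= math.log2(agree / len(S))
    return total / len(U)

def depth_first_reduct(U, C, d):
    """Greedy forward selection: repeatedly add the attribute maximizing
    the significance Sig(a, B) = EN(d | B) - EN(d | B + {a}), i.e. the
    one minimizing the new entropy, until EN(d | B) = EN(d | C)."""
    target = conditional_entropy(U, C, d)
    B = []
    while not math.isclose(conditional_entropy(U, B, d), target):
        rest = [a for a in C if a not in B]
        B.append(min(rest, key=lambda a: conditional_entropy(U, B + [a], d)))
    return B

U = [
    {"a1": 1, "a2": 5, "a3": MISSING, "d": 0},
    {"a1": 2, "a2": 5, "a3": 7, "d": 1},
    {"a1": 2, "a2": 6, "a3": 7, "d": 1},
]
print(depth_first_reduct(U, ["a1", "a2", "a3"], "d"))  # ['a1']
```

Ties are broken by attribute order, and the greedy result is not guaranteed to be minimal, which matches the "close to optimal" characterization above.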
Example 9. Given the incomplete decision system shown in Table 2, we first compute the conditional entropy of each single condition attribute. Since one attribute yields the minimum conditional entropy value, we choose it as the first selected attribute. If its conditional entropy still differs from EN(d | C), we need to add more attributes to SelectAttr. On the basis of the heuristic search rule, we then only need to test and compare the conditional entropy values of the subsets that contain the attributes already selected. The detailed search procedure is illustrated in Figure 2.

4.3. LVF: Probabilistic Search. The Las Vegas algorithm is a newer approach to attribute subset selection that makes probabilistic choices of subsets in search of an optimal set. Las Vegas Filter, abbreviated as LVF, is a probabilistic algorithm in which the probabilities of generating any subset are equal [43].
In this paper, we use the investigated conditional entropy as LVF's evaluation measurement. The algorithm generates attribute subsets randomly with equal probability and records the minimal-size attribute subset satisfying the stop criterion within a maximum number of tries (MaxTries). LVF is fast and efficient at reducing the number of candidate features in the early stages and can produce optimal solutions if the computing resources permit. The LVF approach for attribute selection in incomplete decision systems is shown in Algorithm 3.
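Algorithm 3 can likewise be sketched as follows. The uniform random subset generation, the MaxTries bound, and the stop criterion follow the description above; the entropy form, the seeding, and the data layout are illustrative assumptions.

```python
import math
import random

MISSING = "*"

def conditional_entropy(U, B, d):
    """An illustrative tolerance-based EN(d | B); "*" marks missing values."""
    total = 0.0
    for x in U:
        S = [y for y in U
             if all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in B)]
        agree = sum(1 for y in S if y[d] == x[d])
        total -= math.log2(agree / len(S))
    return total / len(U)

def lvf_reduct(U, C, d, max_tries=200, seed=1):
    """LVF sketch: draw attribute subsets at random and keep the smallest
    one whose conditional entropy matches that of the full set C."""
    rng = random.Random(seed)
    target = conditional_entropy(U, C, d)
    best = list(C)
    for _ in range(max_tries):
        B = rng.sample(C, rng.randint(1, len(best)))
        if len(B) < len(best) and math.isclose(
                conditional_entropy(U, B, d), target):
            best = B
    return best

U = [
    {"a1": 1, "a2": 5, "a3": MISSING, "d": 0},
    {"a1": 2, "a2": 5, "a3": 7, "d": 1},
    {"a1": 2, "a2": 6, "a3": 7, "d": 1},
]
print(sorted(lvf_reduct(U, ["a1", "a2", "a3"], "d")))
```

Because the result depends on the random draws, only the stop criterion (entropy equal to that of the full set) is guaranteed, not minimality; larger MaxTries values raise the chance of finding a smaller subset, matching the trade-off described in the experiments.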

Experiments
In this section, the performances of the attribute selection algorithms given in the previous section are demonstrated and compared. Several real-life incomplete data sets from the UCI Repository of Machine Learning Databases at the University of California are used in our experiments. These experiments are performed on a personal computer with Windows 7, an Intel(R) Core(TM) i3 CPU at 2.13 GHz, and 4 GB RAM. The objective of these experiments is to evaluate the effectiveness and efficiency of the proposed algorithms. The summary and statistics of the experimental data sets are shown in Table 3 and Figure 3, respectively. Since some incomplete data sets contain continuous condition attribute values, we apply a discretization preprocessing step to turn these continuous values into discrete ones before carrying out attribute selection. The aim of this step is to compress the data and reduce the time consumption of the subsequent attribute selection. The running time of each algorithm is the average CPU time, expressed in seconds.
In the Breadth-first algorithm, we terminate the program when it runs beyond 5000 seconds. Moreover, we set the parameter MaxTries in the LVF algorithm to different values for different incomplete data sets, according to their sizes. The running time and the size of the attribute selection results depend strongly on the choice of MaxTries: as MaxTries grows, the running time of the LVF approach increases linearly and the size of the selected attribute subset decreases. Both quantities for each attribute selection approach are shown in Table 4. We can easily see that the Breadth-first approach takes much more time, even more than 5000 seconds, to obtain an attribute reduction, compared with the other two approaches; three of the six data sets are too large in scale to process within the time limit. Furthermore, Table 4 also shows that the Depth-first approach tends to select fewer attributes than the LVF approach; in other words, the size of the attribute subset selected by the Depth-first approach tends to be smaller than that selected by the LVF approach, and the Depth-first approach consumes less time than the other two approaches in most instances. The time complexities of the Breadth-first, Depth-first, and LVF approaches for attribute selection are approximately O(2^|C| · |U|^2), O(|C|^2 · |U|^2), and O(MaxTries · |C| · |U|^2), respectively, where MaxTries, |C|, and |U| denote the MaxTries parameter in the LVF approach, the total number of condition attributes, and the number of instances in the incomplete data sets, respectively. The relationships between the incomplete data sets and the number of selected attributes and the running time of attribute selection are illustrated in Figures 4 and 5.
The final part of our experiments compares and evaluates the efficiency of the proposed algorithms in practical classification tasks. For each of the six data sets in Table 5, we employ the SVM-RBF classifier, which is one of the most frequently used classifiers. We also apply the 10-fold cross-validation method to estimate the classification accuracy with respect to the reducts generated by the proposed algorithms. In each fold, the redundant attributes are first removed from the current training set according to the proposed algorithms; then the test set is classified using the rules generated from the training set. The final classification accuracies are shown in Table 5. It can be seen that, by using the attribute selection algorithms, the classification accuracies for the incomplete data sets are all raised to different degrees, compared with the classification accuracy on the original full attribute set. It can also be noticed that the Depth-first algorithm exhibits the highest classification accuracy on each incomplete data set. Therefore, the experimental results demonstrate that the proposed algorithms are effective for attribute selection tasks in application domains.

Conclusion
In this paper, a rough conditional entropy-based attribute selection approach is proposed to evaluate the significance of condition attributes and find a minimal reduct in incomplete decision systems. Based on this measure, three types of attribute selection approaches are constructed: the exhaustive search approach Breadth-first, the heuristic search approach Depth-first, and the probabilistic search approach LVF. To evaluate the effectiveness of the introduced approaches, experiments on several real-life incomplete data sets are conducted. The experimental results suggest that the Depth-first and LVF approaches are practical for attribute selection for the classification of high-dimensional data with thousands of condition attributes, and that they can efficiently enhance classification accuracy with the predominant attributes. However, exhaustively examining all combinations of condition attributes to find the optimal one is an NP-hard problem, so complex incomplete decision systems with hundreds of thousands of condition attributes still cannot be handled easily by our approaches. Therefore, to reduce the time consumption of attribute selection for large data sets, more scalable approaches such as parallel heuristic algorithms are desirable for large-scale incomplete decision systems. This issue will be investigated in the future.

Definition 2. Let IDS = (U, C ∪ {d}) with * ∈ V_a be an incomplete decision system, and let ∂_B : U → P(V_d) be the generalized decision function, where ∂_B(x) = {d(y) | y ∈ T_B(x)}. If |∂_B(x)| = 1 for every x ∈ U, then the incomplete decision system is consistent, which implies that it is deterministic and definite, where | · | denotes the number of elements of a set.
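Definition 2 can be checked mechanically; the sketch below computes the generalized decision function over tolerance classes and tests consistency. The record layout, with "*" marking missing values, is an illustrative assumption.

```python
MISSING = "*"

def generalized_decision(U, x, B, d):
    """∂_B(x): the set of decision values in x's tolerance class T_B(x)."""
    return {y[d] for y in U
            if all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in B)}

def is_consistent(U, B, d):
    """The IDS is consistent iff |∂_B(x)| = 1 for every object x."""
    return all(len(generalized_decision(U, x, B, d)) == 1 for x in U)

U = [
    {"a1": 1, "a2": MISSING, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 2, "a2": 0, "d": "no"},
]
print(is_consistent(U, ["a1", "a2"], "d"))  # True
print(is_consistent(U, ["a2"], "d"))        # False
```

The second call is inconsistent because the missing value on a2 places objects with different decisions in the same tolerance class.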

Figure 1: The flow chart of the Breadth-first algorithm.

Figure 2: The flow chart of the Depth-first algorithm.

Figure 3: The statistical result of the experimental data sets.

Figure 4: Number of selected attributes versus data set.

Figure 5: Running time of attribute selection versus data set.
An equivalence relation R satisfies reflexivity, symmetry, and transitivity. Relation R generates a partition U/R = IND(R) = {[x]_R | x ∈ U} on U, where IND(R) denotes the family of equivalence classes (also called indiscernibility classes) generated by the equivalence relation R; these are also called the elementary sets of R in rough set theory. Let ∅ denote the empty set. For any X ⊆ U, we can describe X by the elementary sets of R and the two sets R_*(X) = {x ∈ U | [x]_R ⊆ X} and R^*(X) = {x ∈ U | [x]_R ∩ X ≠ ∅}.

Table 3: The summary of the experimental data sets.

Table 4: Size of selected attribute subsets and running time.

Table 5: Performance of attribute selection algorithms with the SVM-RBF classifier.