Positive Macroscopic Approximation for Fast Attribute Reduction

Attribute reduction is one of the challenging problems facing the effective application of computational intelligence technology for artificial intelligence. Its task is to eliminate dispensable attributes and search for a feature subset that possesses the same classification capacity as the original attribute set. To accomplish efficient attribute reduction, many heuristic search algorithms have been developed. Most of them are based on the model in which the approximation of all the target concepts associated with a decision system is divided into approximations of single target concepts, each represented by a pair of definable concepts known as the lower and upper approximations. This paper proposes a novel model called macroscopic approximation, which considers all the target concepts as an indivisible whole to be approximated by the rough set boundary region derived from inconsistent tolerance blocks, together with an efficient approximation framework called positive macroscopic approximation (PMA), which addresses macroscopic approximations with respect to a series of attribute subsets. Based on PMA, a fast heuristic search algorithm for attribute reduction in incomplete decision systems is designed; it achieves markedly better computational efficiency than other available algorithms, as the experimental results also demonstrate.


Introduction
Rough set theory (RST) [1] is a powerful mathematical tool for dealing with imprecision, uncertainty, and vagueness. As an extension of traditional set theory supporting approximation in decision making, RST provides a well-established model in which the approximation of an indefinable target concept is represented by a pair of definable concepts known as lower and upper approximations. In recent years, more and more attention has been paid to RST, and its successful applications cover a variety of fields such as artificial intelligence, machine learning, and knowledge discovery [2][3][4][5][6].
Attribute reduction is one of the key topics in RST, viewed as the strongest and most important result that distinguishes RST from other theories [7]. Its task is to eliminate reducible or dispensable attributes and search for a feature subset with the same classification capacity as the original attribute set. Attribute reduction is widely used as a preprocessing stage prior to classification of decision systems, making analysis algorithms more efficient and learned classifiers more compact.
In attribute reduction, we encounter four general search strategies. The most intuitive one is the exhaustive search, which checks all the possible candidate subsets and retrieves those that satisfy the given criteria. The exhaustive search incurs high time complexity, and finding a minimal reduct has been proved to be NP-hard [8]. The second strategy, the incomplete search, applies mapping and pruning techniques to minimization. It is achieved by mapping pertinent elements to a structured model and pruning useless branches in the search space [9,10]. Similar to the exhaustive search, the incomplete search finds the minimal reduct at the expense of great computational effort. The third strategy conducts a random search using techniques such as genetic algorithms [11], ant colony optimization [12], and particle swarm optimization [13]. The random search provides a robust solution but is also computationally very expensive. The fourth and most practical strategy discovers feature subsets by the heuristic search, where attributes with high quality are preferred as heuristics according to an evaluation function [14][15][16][17][18]. The heuristic search can find an optimal or suboptimal reduct with acceptable computational complexity, and it plays an important role in the attribute reduction community.
The positive region-based reduction takes the change of rough set positive region caused by the addition of an attribute as the significance of the attribute. Attributes with the highest significance are selected as heuristics to guide the search process. One of the classic instances is the quick reduct algorithm proposed by Chouchoulas and Shen [19], which can pick the best path to a reduct from the whole search space and has received many improved versions [20][21][22]. By using decomposition and sorting techniques to calculate the positive region, Meng and Shi [23] put forward a fast positive region-based algorithm for feature selection from incomplete decision systems. Qian et al. [24,25] constructed an efficient accelerator for heuristic search using a series of positive regions to approximate a given target decision on the gradually reduced universe. It can be incorporated into heuristic attribute reduction algorithms and makes the modified versions capable of greatly reducing computational time.
Unlike the positive region-based reduction, the combination-based reduction considers the positive region as well as other available information such as rule support and boundary region. An attribute is evaluated by a combined measure generated from the positive region and the additional information. With consideration of the overall quality of the potential set of rules, Zhang and Yao [27] introduced rule support into the evaluation function and proposed a support-oriented algorithm called parameterized average support heuristic (PASH), which selects features causing high average support of rules over all decision classes. Parthaláin and Shen [28] used a distance metric to qualify the objects in the boundary region with regard to their proximity to the lower approximation and presented the distance metric-assisted tolerance rough set attribute reduction algorithm, which employs a new evaluation measure created by combining the distance metric and the dependency degree.
Different from the above two categories, the entropy-based reduction derives the evaluation functions from the information view, such as combination entropy and rough entropy. Qian and Liang [32] presented the concept of combination entropy for describing the uncertainty of information systems and used its conditional entropy to select a feature subset. Sun et al. [33] utilized rough entropy-based uncertainty measures to evaluate the roughness and accuracy of knowledge and then constructed a heuristic search algorithm with low computational complexity for attribute reduction in incomplete decision systems.
These investigations have offered interesting insights into attribute reduction. When dealing with large incomplete data, however, they still suffer from computational inefficiency. A more efficient and feasible attribute reduction approach is therefore desirable, and this paper intends to provide such a solution.
One can observe that most heuristic attribute reduction algorithms are based on the model in which the approximation of all the target concepts from a decision system is divided into approximations of single target concepts, each represented by lower and upper approximations. Little work has hitherto taken the approximation problem into account at the macroscopic level. In this paper, we propose a novel model called macroscopic approximation, which considers all the target concepts of a decision system as an indivisible whole to be approximated by the rough set boundary region derived from inconsistent blocks, together with an efficient approximation framework called positive macroscopic approximation (PMA), which addresses macroscopic approximations with respect to a series of attribute subsets. Based on PMA, a fast heuristic attribute reduction algorithm for incomplete decision systems is designed; it achieves markedly better computational efficiency than other available algorithms, as the experimental results also demonstrate.
The remainder of this paper is organized as follows. In Section 2, we review some basic concepts of RST and outline the quick reduct algorithm. Section 3 investigates macroscopic approximation and positive macroscopic approximation. In Section 4, a fast heuristic attribute reduction algorithm based on positive macroscopic approximation is devised and illustrated by a worked example. In Section 5, experiments are conducted to validate the time efficiency of the proposed algorithm. Finally, we give a concise conclusion in Section 6.

Preliminaries
In this section, we briefly recall some basic concepts, such as incomplete decision system, tolerance relation, tolerance block, tolerance class, positive region, and boundary region, together with the quick reduct algorithm, needed in the following sections.

Basic Concepts.
A decision system is an information system with a distinction between decision attributes and condition attributes. It is generally formulated by a data table where columns are referred to as attributes and rows as objects of interest. If there exist objects that contain missing data, the decision system is incomplete; otherwise, it is complete.
An incomplete decision system is a tuple $IDS = (U, A)$, where $U$, called the universe of discourse, is a nonempty finite object set and $A$ is an attribute set that consists of a condition attribute subset $C$ and a decision attribute subset $D$. For any $a \in A$, there is a mapping $a: U \to V_a$, where $V_a$ is the value domain of $a$. $V_C = \{V_a \mid a \in C\} \cup \{*\}$ (where "$*$" stands for missing values) and $V_D = \{V_d \mid d \in D\}$ are the value domains of $C$ and $D$, respectively. For convenience, the tuple $IDS = (U, A)$ is usually denoted as $IDS = (U, C \cup D)$.
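Let $IDS = (U, C \cup D)$ be an incomplete decision system. For any subset $P \subseteq C$, $P$ determines a binary relation, denoted by $\mathrm{SIM}(P)$, which is defined as
$$\mathrm{SIM}(P) = \{(x, y) \in U \times U \mid \forall a \in P,\ a(x) = a(y) \lor a(x) = * \lor a(y) = *\}. \tag{1}$$
The tolerance class $S_P(x)$ of an object $x \in U$ describes the maximal set of objects which are possibly indistinguishable from $x$ with respect to $P$. There is a relationship between the tolerance class and the maximal tolerance block, shown as follows [34]:
$$S_P(x) = \bigcup \{X \mid X \in \mu_P(x)\}, \tag{3}$$
where $\mu_P(x)$ is the family of maximal $P$-tolerance blocks containing $x$. Then, it is easy to prove that
$$S_P(x) = \bigcup \{X \mid X \in \xi_P(x)\}, \tag{4}$$
where $\xi_P(x)$ is the family of $P$-tolerance blocks containing $x$.
Consider a partition $\pi_D = \{D_i \mid i = 1, 2, \ldots, m\}$ of $U$ determined by $D$. $\pi_D$ is the family of all the decision classes derived from the decision system. Each decision class can be viewed as a target concept approximated by a pair of precise concepts which are known as lower and upper approximations. The dual approximations of a target concept $D_i$ are defined, respectively, as
$$\underline{P}(D_i) = \{x \in U \mid S_P(x) \subseteq D_i\}, \qquad \overline{P}(D_i) = \{x \in U \mid S_P(x) \cap D_i \neq \emptyset\}. \tag{5}$$
The lower approximation of $D_i$ is regarded as the maximal $P$-definable set contained in $D_i$, whereas the upper approximation of $D_i$ is considered as the minimal $P$-definable set containing $D_i$. If $\underline{P}(D_i) = \overline{P}(D_i)$, then $D_i$ is a $P$-exact set; otherwise, it is a $P$-rough set. By the dual approximations, the universe of the decision system is partitioned into two mutually exclusive crisp regions: positive region and boundary region, defined, respectively, as
$$\mathrm{POS}_P(\pi_D) = \bigcup_{D_i \in \pi_D} \underline{P}(D_i), \qquad \mathrm{BND}_P(\pi_D) = \bigcup_{D_i \in \pi_D} \overline{P}(D_i) - \bigcup_{D_i \in \pi_D} \underline{P}(D_i). \tag{6}$$
It can be perceived that the positive region is the collection of objects which are classified without any ambiguity into the target concepts using the $P$-tolerance relation, while the boundary region is, in a sense, the undeterminable area of the universe, where none of the objects are classified with certainty into the target concepts as far as $P$ is concerned.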
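As a rough illustration of the definitions above, the following Python sketch computes tolerance classes and the positive and boundary regions by brute force. The table `TABLE`, the object names, and all function names are hypothetical and chosen only for this example; they are not the paper's data or its algorithms.

```python
MISSING = "*"  # marker for a missing condition value, as in the text

# Hypothetical toy table: object -> (condition values, decision value).
TABLE = {
    "x1": (("1", "0"), "yes"),
    "x2": (("1", "*"), "yes"),
    "x3": (("0", "1"), "no"),
    "x4": (("*", "1"), "yes"),
    "x5": (("0", "0"), "no"),
}

def tolerant(x, y, attrs):
    """True if x and y agree on every attribute in attrs, up to missing values."""
    vx, vy = TABLE[x][0], TABLE[y][0]
    return all(vx[a] == vy[a] or MISSING in (vx[a], vy[a]) for a in attrs)

def tolerance_class(x, attrs):
    """All objects that cannot be told apart from x using attrs."""
    return {y for y in TABLE if tolerant(x, y, attrs)}

def decision_classes():
    """Partition of the universe induced by the decision value."""
    classes = {}
    for x, (_, decision) in TABLE.items():
        classes.setdefault(decision, set()).add(x)
    return list(classes.values())

def positive_region(attrs):
    """Union of the lower approximations of all decision classes."""
    pos = set()
    for target in decision_classes():
        pos |= {x for x in TABLE if tolerance_class(x, attrs) <= target}
    return pos

def boundary_region(attrs):
    """Objects whose tolerance class is not contained in any decision class."""
    return set(TABLE) - positive_region(attrs)

print(sorted(boundary_region((0, 1))))
```

With both attributes, `x3` and `x4` fall into the boundary region because each of them tolerates an object carrying a different decision value.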

Rough Set Attribute Reduction.
In the RST environment, the dependency degree of the decision attribute set $D$ on a condition attribute set $P$ ($P \subseteq C$) is definable in terms of the positive region:
$$\gamma_P(D) = \frac{|\mathrm{POS}_P(\pi_D)|}{|U|}, \tag{7}$$
where $|\mathrm{POS}_P(\pi_D)|$ and $|U|$ are the cardinalities of $\mathrm{POS}_P(\pi_D)$ and $U$, respectively. If $\gamma_P(D) = 1$, we say that $D$ totally depends on $P$. If $0 < \gamma_P(D) < 1$, we say that $D$ partially depends on $P$. If $\gamma_P(D) = 0$, we say that $D$ is completely independent of $P$. RST describes the variation of the dependency degree caused by the addition of an attribute $a$ to $P$ as the significance of $a$ such that
$$\mathrm{sig}(a, P, D) = \gamma_{P \cup \{a\}}(D) - \gamma_P(D). \tag{8}$$
The bigger the significance value is, the more informative the attribute is. Accordingly, the quick reduct algorithm [19], regarded as a classical rough set attribute reduction algorithm, is constructed by iteratively adding the attribute with the highest significance to an attribute pool, which begins as an empty set, until the dependency value of the pool equals that of the whole condition attribute set. This process can be outlined by Algorithm 1.
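The greedy loop described above can be sketched in a few lines of Python. This is a minimal brute-force illustration under stated assumptions, not the authors' Algorithm 1: the data in `ROWS` and `DECISIONS` are hypothetical, attributes are referred to by column index, and ties are broken in favour of the lowest index.

```python
MISSING = "*"

# Hypothetical toy data: one tuple of condition values per object,
# and a parallel list of decision values.
ROWS = [
    ("1", "0", "a"),
    ("1", "*", "a"),
    ("0", "1", "b"),
    ("*", "1", "a"),
    ("0", "0", "b"),
    ("1", "1", "b"),
]
DECISIONS = ["yes", "yes", "no", "yes", "no", "yes"]

def tolerant(i, j, attrs):
    """True if objects i and j agree on attrs, up to missing values."""
    return all(
        ROWS[i][a] == ROWS[j][a] or MISSING in (ROWS[i][a], ROWS[j][a])
        for a in attrs
    )

def dependency(attrs):
    """|POS| / |U|: an object is positive when every object in its
    tolerance class shares its decision value."""
    n = len(ROWS)
    positive = sum(
        all(DECISIONS[j] == DECISIONS[i] for j in range(n) if tolerant(i, j, attrs))
        for i in range(n)
    )
    return positive / n

def quick_reduct():
    """Greedily add the attribute with the highest significance until the
    pool reaches the dependency degree of the full attribute set."""
    all_attrs = list(range(len(ROWS[0])))
    target = dependency(all_attrs)
    pool = []
    while dependency(pool) < target:
        remaining = [a for a in all_attrs if a not in pool]
        best = max(remaining, key=lambda a: dependency(pool + [a]) - dependency(pool))
        pool.append(best)
    return pool

print(quick_reduct())
```

Every call to `dependency` rescans all pairs of objects, which is exactly the cost that the boundary-region approach of the following sections tries to avoid.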

Positive Macroscopic Approximation
It is well known that approximation is one of the core ideas of RST. A target concept is approximated by the lower and upper approximations. Likewise, a decision system can be viewed as a super concept to be approximated by the lower and upper super approximations, where the family of all the target concepts from the decision system is considered as the super concept, the positive region (the complement of the boundary region of the decision system) acts as the lower super approximation, and the universe of the decision system serves as the upper super approximation. If the lower super approximation equals the upper super approximation, the decision system is consistent or exact; otherwise, it is inconsistent or rough. From this observation, it is easily understood that the approximation of a decision system can be represented by either the positive region or the boundary region.
RST offers a feasible way to obtain the approximation of a decision system by means of that of a single target concept represented by lower and upper approximations, as stated by (5) and (6). For convenience, this model is called microcosmic approximation. As opposed to microcosmic approximation, this section introduces a novel model called macroscopic approximation, where the approximation of a decision system is achieved by regarding all the target concepts as an inseparable entity to be approximated by the boundary region. Furthermore, we explore positive macroscopic approximation (PMA), which considers macroscopic approximations with respect to a series of attribute sets.

Macroscopic Approximation.
Macroscopic approximation is an alternative way to arrive at the approximation of a decision system by the boundary region. Because all the target concepts associated with a decision system are considered as a whole, the lower and upper approximations are unavailable, so a new route to macroscopic approximation is needed. An inconsistent tolerance block (IT-block), that is, a tolerance block whose objects carry more than one decision value, is capable of offering such a feasible solution; a tolerance block whose objects share a single decision value is a consistent tolerance block (CT-block).
Proof. This proof is done by contradiction. For any $x \in \mathrm{BND}_P(\pi_D)$, suppose that there does not exist any IT-block containing $x$. That is to say, any $X \in \xi_P(x)$ is a CT-block, which implies that there must exist some decision value $v_0$ ($v_0 \in V_D$) such that $d(X) = v_0$; since every such block contains $x$, this value is the same for all of them. $v_0$ corresponds to a certain decision class $D_0$ ($D_0 \in \pi_D$) containing all the objects whose decision values are equal to $v_0$, and then $X \subseteq D_0$ and $\bigcup_{X \in \xi_P(x)} X \subseteq D_0$. So, $S_P(x) = \bigcup_{X \in \xi_P(x)} X \subseteq D_0$. This means that $x \in \mathrm{POS}_P(\pi_D)$, which contradicts $x \in \mathrm{BND}_P(\pi_D)$. Hence, for any $x \in \mathrm{BND}_P(\pi_D)$, there must exist at least one IT-block containing $x$. The lemma holds.
Theorem 4 reveals the relationship that the boundary region is the union of IT-blocks, that is, $\mathrm{BND}_P(\pi_D) = \bigcup \{X \mid X \in \xi^{IT}_P\}$, which is in essence the materialization of macroscopic approximation by integrally approximating all the target concepts using the boundary region derived from IT-blocks. Unlike the microcosmic approximation stated by (6), the macroscopic approximation is directly constructed on the elementary members of the approximation space rather than on the lower and upper approximations, making the calculation of the approximation of the decision system more efficient.
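The relationship between the boundary region and IT-blocks can be checked numerically on a small example. The sketch below enumerates every tolerance block by brute force, which is feasible only for tiny data, and compares the union of the inconsistent ones with the boundary region obtained from tolerance classes. The data and all names are hypothetical and serve only as an illustration.

```python
from itertools import combinations

MISSING = "*"

# Hypothetical toy data, not taken from the paper.
ROWS = [
    ("1", "0", "a"),
    ("1", "*", "a"),
    ("0", "1", "b"),
    ("*", "1", "a"),
    ("0", "0", "b"),
    ("1", "1", "b"),
]
DECISIONS = ["yes", "yes", "no", "yes", "no", "yes"]

def tolerant(i, j, attrs):
    """True if objects i and j agree on attrs, up to missing values."""
    return all(
        ROWS[i][a] == ROWS[j][a] or MISSING in (ROWS[i][a], ROWS[j][a])
        for a in attrs
    )

def boundary_by_classes(attrs):
    """Objects whose tolerance class contains a different decision value."""
    n = len(ROWS)
    return {
        i for i in range(n)
        if any(tolerant(i, j, attrs) and DECISIONS[j] != DECISIONS[i] for j in range(n))
    }

def tolerance_blocks(attrs):
    """Yield every non-empty set of pairwise tolerant objects (brute force)."""
    n = len(ROWS)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if all(tolerant(i, j, attrs) for i, j in combinations(subset, 2)):
                yield set(subset)

def boundary_by_it_blocks(attrs):
    """Union of all blocks whose objects carry more than one decision value."""
    boundary = set()
    for block in tolerance_blocks(attrs):
        if len({DECISIONS[i] for i in block}) > 1:  # inconsistent block
            boundary |= block
    return boundary

for attrs in [(0,), (1,), (2,), (0, 1), (0, 2), (0, 1, 2)]:
    assert boundary_by_classes(attrs) == boundary_by_it_blocks(attrs)
```

The loop at the end compares both computations for several attribute subsets of the toy table; the two results coincide in each case.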

PMA.
For a given decision system, we can build a series of attribute subsets according to the following rule. The first set has only one attribute selected from the whole attribute set, the union of the first set and another attribute chosen from the remaining attributes is regarded as the second set, and so on. The newly generated attribute subsets form an ascending sequence, called a positive sequence. If a sequence is organized in descending order, it is called a converse sequence. Assume that attribute subsets $P$ and $Q$ are two arbitrary elements of a positive sequence. If $P$ contains $Q$, we say that the tolerance relation determined by $P$ is finer than that determined by $Q$. Conversely, if $P$ is contained in $Q$, we say that the tolerance relation determined by $P$ is coarser than that determined by $Q$. Accordingly, a positive sequence determines a train of tolerance relations stretching from coarse to fine.
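A short Python sketch makes the coarse-to-fine behaviour concrete: along a positive sequence, the tolerance class of any object can only shrink. The data, the attribute order, and the function names are hypothetical and chosen only for illustration.

```python
MISSING = "*"

# Hypothetical toy data: one tuple of condition values per object.
ROWS = [
    ("1", "0", "a"),
    ("1", "*", "a"),
    ("0", "1", "b"),
    ("*", "1", "a"),
    ("0", "0", "b"),
    ("1", "1", "b"),
]

def tolerance_class(i, attrs):
    """Indices of the objects that are tolerant with object i on attrs."""
    return {
        j for j in range(len(ROWS))
        if all(
            ROWS[i][a] == ROWS[j][a] or MISSING in (ROWS[i][a], ROWS[j][a])
            for a in attrs
        )
    }

def positive_sequence(order):
    """Nested attribute subsets built by adding one attribute at a time."""
    return [tuple(order[:k]) for k in range(1, len(order) + 1)]

SEQUENCE = positive_sequence([0, 1, 2])
# Tolerance classes of the fourth object (index 3) along the sequence.
CLASSES = [tolerance_class(3, attrs) for attrs in SEQUENCE]

for attrs, cls in zip(SEQUENCE, CLASSES):
    print(attrs, sorted(cls))
```

Each printed class is a subset of the previous one, which is what "finer" means for the tolerance relations determined by the sequence.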
In this subsection, we explore positive macroscopic approximation (PMA), which addresses macroscopic approximations with respect to a positive sequence. To construct an efficient PMA, the following relevant definitions and lemmas are needed.
This lemma shows that any subblock derived from a CT-block is also a CT-block. In other words, inconsistent tolerance subblocks are derivable only from IT-blocks.
Proof. Assume that $P = \{a_1, a_2, \ldots, a_n\}$. The result follows from (9) and (10).
PMA is in fact a sequence of boundary regions, each of which denotes the approximation of the decision system with respect to some attribute subset. Since the tolerance relations determined by the positive sequence run from coarse to fine, it is easily proved that the corresponding boundary regions stretch from broad to narrow. In other words, PMA is the sequence of gradually reduced boundary regions. When the positive sequence ends with the whole condition attribute set $C$, PMA portrays in detail the evolution of the boundary region becoming narrower and narrower until it reaches the boundary region with respect to $C$.
PMA provides an efficient approximation framework where a decision system is consecutively approximated by the boundary region derived from IT-blocks according to a positive sequence. This mechanism can be visualized in Figure 1.
PMA considers the universe $U$ as a root IT-block and evaluates the boundary regions by repeatedly dividing the IT-blocks into smaller ones along the positive sequence. For the attribute set $P_1$, the attribute $a_1$ works on the root IT-block and then outputs the consistent family $\xi^{CT}_{P_1}$ and the inconsistent family $\xi^{IT}_{P_1}$. The former is pruned, while the latter is used to produce the boundary region $\mathrm{BND}_{P_1}(\pi_D)$ and is passed on to the next step. In general, for the attribute set $P_i$, the attribute $a_i$ works on $\xi^{IT}_{P_{i-1}}$ and outputs $\xi^{IT}_{P_i}$ along with $\mathrm{BND}_{P_i}(\pi_D)$ induced by $\xi^{IT}_{P_i}$. The detailed steps of this process are shown in Algorithm 2. PMA starts with an empty sequence; boundary regions with respect to a positive sequence are added incrementally. This process continues until all the attributes are traversed. In each loop, the boundary region is derived from the corresponding IT-blocks, which are obtained by operating on their predecessors with a single attribute.
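The block-splitting process described above can be sketched as follows. This is an illustrative reading rather than the authors' Algorithm 2: it assumes that a block is split by one attribute into one sub-block per known value, with objects whose value is missing copied into every sub-block. The data and all names are hypothetical.

```python
MISSING = "*"

# Hypothetical toy data, not taken from the paper.
ROWS = [
    ("1", "0", "a"),
    ("1", "*", "a"),
    ("0", "1", "b"),
    ("*", "1", "a"),
    ("0", "0", "b"),
    ("1", "1", "b"),
]
DECISIONS = ["yes", "yes", "no", "yes", "no", "yes"]

def split_block(block, attr):
    """Split a tolerance block by one attribute (assumed rule): one sub-block
    per known value, each also holding the objects whose value is missing."""
    known = sorted({ROWS[i][attr] for i in block} - {MISSING})
    if not known:
        return [block]  # the attribute is missing everywhere: nothing to split
    return [
        frozenset(i for i in block if ROWS[i][attr] in (value, MISSING))
        for value in known
    ]

def is_inconsistent(block):
    """An IT-block carries more than one decision value."""
    return len({DECISIONS[i] for i in block}) > 1

def pma(order):
    """Boundary regions along the positive sequence induced by `order`."""
    universe = frozenset(range(len(ROWS)))
    it_blocks = [universe] if is_inconsistent(universe) else []
    boundaries = []
    for attr in order:
        children = {sub for block in it_blocks for sub in split_block(block, attr)}
        it_blocks = [b for b in children if is_inconsistent(b)]  # prune CT-blocks
        boundaries.append(set().union(*it_blocks))
    return boundaries

print(pma([0, 1, 2]))
```

Only the inconsistent blocks are carried from one step to the next, so the amount of data touched at each step shrinks together with the boundary region.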
There are several highlights decorating PMA. First, the inconsistent family with respect to some attribute subsets
For $P_2$, the attribute $a_2$ is used to operate on $\xi^{IT}_{P_1}$ for $\xi^{IT}_{P_2}$, and the corresponding results are generated. The results with respect to $P_3$ and $P_4$ are produced similarly, which yields PMA with respect to $P_4$. It is clear that $\mathrm{BND}_{P_1}(\pi_D) \supseteq \mathrm{BND}_{P_2}(\pi_D) \supseteq \mathrm{BND}_{P_3}(\pi_D) \supseteq \mathrm{BND}_{P_4}(\pi_D)$, which confirms the fact that the boundary region of PMA is gradually reduced in accordance with the positive sequence and finally reaches the minimal boundary region ($\mathrm{BND}_C(\pi_D) = \{x_4, x_5, x_6\}$).

PMA-Based Attribute Reduction
As mentioned previously, PMA offers a sequence of boundary regions in descending order. If each selected attribute is informative enough to narrow the boundary region markedly, the boundary region with respect to some attribute subset soon coincides with the boundary region with respect to the set of all the attributes. In other words, a reduct is an attribute subset with the minimal number of attributes that creates the same approximation of the decision system as the original attribute set. Following this observation, we design a heuristic attribute reduction algorithm based on PMA, called PMA-AR.
Before elaborating PMA-AR, we give a redefinition of the attribute dependency. In (7), the dependency degree is definable in terms of the positive region. Since the positive region and the boundary region are complementary within the universe of a decision system, the dependency degree can also be defined in terms of the boundary region.
Definition 12. Let $IDS = (U, C \cup D)$, $P \subseteq C$, and $a \in C - P$. The dependency degree of decision attribute set $D$ to condition attribute set $P$ is defined as
$$\gamma_P(D) = \frac{|U| - |\mathrm{BND}_P(\pi_D)|}{|U|}.$$
Then, the significance of the attribute $a$ is also equivalently redefined as
$$\mathrm{sig}(a, P, D) = \frac{|\mathrm{BND}_P(\pi_D)| - |\mathrm{BND}_{P \cup \{a\}}(\pi_D)|}{|U|}.$$
The significance expresses how the boundary region will be affected if the attribute $a$ is added to the set $P$. In general, an attribute with a maximal significance value is preferentially selected to guide the search process.
Inputs: $U$: the universe; $C = \{a_1, a_2, \ldots, a_m\}$: a condition attribute set; $D = \{d\}$: a decision attribute set.
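The equivalence between the positive region-based and the boundary region-based formulations follows in one step from the complementarity of the two regions. The derivation below is a short sketch; the symbol names $\gamma_P(D)$ and $\mathrm{sig}(a, P, D)$ are assumptions of this sketch.

```latex
\begin{align*}
|\mathrm{POS}_P(\pi_D)| + |\mathrm{BND}_P(\pi_D)| &= |U|, \\
\gamma_P(D) = \frac{|\mathrm{POS}_P(\pi_D)|}{|U|}
  &= 1 - \frac{|\mathrm{BND}_P(\pi_D)|}{|U|}, \\
\mathrm{sig}(a, P, D) = \gamma_{P \cup \{a\}}(D) - \gamma_P(D)
  &= \frac{|\mathrm{BND}_P(\pi_D)| - |\mathrm{BND}_{P \cup \{a\}}(\pi_D)|}{|U|}.
\end{align*}
```

Maximizing the significance of a candidate attribute is therefore the same as minimizing the size of the boundary region that remains after adding it.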
PMA-AR is in essence an extension of the quick reduct algorithm indicated previously. It marries PMA and the boundary region-based significance. The former provides an efficient way to compute the boundary region, and the latter acts as a router to determine the optimal search path. This combination allows PMA-AR to locate a reduct efficiently. Algorithm 3 gives the detailed description of PMA-AR.
PMA-AR is constructed on PMA and employs the boundary region-based significance as the evaluation function to determine the positive sequence. A new attribute subset is obtained as follows. Each of the unselected attributes is used to work on the IT-blocks generated by the current attribute subset, and the corresponding boundary region is then derived from the resulting IT-blocks. By evaluating the boundary region-based significance, the attribute with the biggest significance value is selected and added to the current attribute subset, which creates the expected attribute subset. This process continues until the boundary region with respect to the newly generated attribute subset equals the boundary region with respect to the set of all the attributes; equivalently, the dependency degree of the decision attribute set to the newly generated attribute subset is equal to that of the decision attribute set to the original attribute set. In this way, PMA-AR seeks a short positive sequence along which the boundary region shrinks from maximal to minimal as fast as possible. The reduct uncovered at the end uses few attributes to describe the approximation of the decision system with respect to the set of all the attributes.
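The selection loop just described can be sketched in Python as follows. This is a hedged illustration, not the authors' Algorithm 3: the splitting rule for missing values, the tie-breaking by lowest attribute index, the data, and all function names are assumptions made for this example.

```python
MISSING = "*"

# Hypothetical toy data, not taken from the paper.
ROWS = [
    ("1", "0", "a"),
    ("1", "*", "a"),
    ("0", "1", "b"),
    ("*", "1", "a"),
    ("0", "0", "b"),
    ("1", "1", "b"),
]
DECISIONS = ["yes", "yes", "no", "yes", "no", "yes"]

def refine(it_blocks, attr):
    """Split every IT-block by one attribute and keep only the inconsistent
    children; objects with a missing value join every child (assumed rule)."""
    children = set()
    for block in it_blocks:
        known = sorted({ROWS[i][attr] for i in block} - {MISSING})
        if not known:
            children.add(block)  # attribute missing everywhere: block unchanged
        for value in known:
            children.add(frozenset(i for i in block if ROWS[i][attr] in (value, MISSING)))
    return [b for b in children if len({DECISIONS[i] for i in b}) > 1]

def boundary_size(it_blocks):
    """Size of the boundary region, taken as the union of the IT-blocks."""
    return len(set().union(*it_blocks))

def pma_ar():
    """Greedy search: repeatedly add the attribute that leaves the smallest
    boundary region, until the boundary of the full attribute set is reached."""
    attrs = list(range(len(ROWS[0])))
    universe = frozenset(range(len(ROWS)))
    root = [universe] if len(set(DECISIONS)) > 1 else []
    full = root
    for a in attrs:
        full = refine(full, a)
    target = boundary_size(full)
    selected, it_blocks = [], root
    while boundary_size(it_blocks) > target:
        remaining = [a for a in attrs if a not in selected]
        trials = {a: refine(it_blocks, a) for a in remaining}
        best = min(remaining, key=lambda a: boundary_size(trials[a]))
        selected.append(best)
        it_blocks = trials[best]
    return selected

print(pma_ar())
```

Because boundary regions are nested along a positive sequence, comparing their sizes is enough to decide when the current boundary equals that of the full attribute set. Each candidate attribute is evaluated only on the surviving IT-blocks rather than on the whole universe.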
It is evident that the attribute $a_3$ with the maximal significance value is available and can be used to create the attribute subset $P_1 = \{a_3\}$. Then, $\xi^{IT}_{P_1} = \{\{x_1, x_2, x_4, x_5, x_6\}\}$ and $\mathrm{BND}_{P_1}(\pi_D) = \{x_1, x_2, x_4, x_5, x_6\}$. Now, $\xi^{IT}_{P_1}$ is regarded as the father IT-blocks that breed son IT-blocks when one of the remaining attributes is added to $P_1$. From the newly generated IT-blocks, the boundary region is derivable. By evaluating the boundary region-based significance, the next expected attribute can be selected from the set of remaining attributes $\{a_1, a_2, a_4\}$. Among them, the attribute $a_4$ has the maximal significance value, so it is selected and used to generate another attribute subset $P_2 = \{a_3, a_4\}$. Then, $\xi^{IT}_{P_2} = \{\{x_4, x_5\}, \{x_5, x_6\}\}$ and $\mathrm{BND}_{P_2}(\pi_D) = \{x_4, x_5, x_6\}$, which equals the boundary region with respect to the set of all the condition attributes. As a result, the reduct of the decision system is $\{a_3, a_4\}$.

Experimental Evaluation
In the following, we carry out several experiments on a personal computer with Windows XP, a 2.53 GHz CPU, and 2.0 GB of memory so as to evaluate PMA-AR in terms of the number of selected features and running time.
There are many heuristic search algorithms for attribute reduction in incomplete decision systems [17, 20, 23-25, 28, 33], of which three state-of-the-art algorithms are appropriate for comparison with PMA-AR. They are the positive region-based algorithm (PRA) [23], the distance metric-assisted algorithm (DMA) [28], and the rough entropy-based algorithm (REA) [33], chosen as the representatives of positive region-based reduction, combination-based reduction, and entropy-based reduction, respectively.
Our experiments employ eight publicly accessible datasets from the UCI repository of machine learning databases [35]. Each of them is a discrete dataset with only one decision attribute. Since PMA-AR is designed to deal with incomplete data, five complete datasets, namely balance scale weight and distance, tic-tac-toe end game, car evaluation, chess end game, and nursery, are all turned into incomplete ones by randomly replacing some known attribute values with missing ones. In addition, an identifier attribute of standardized audiology is removed. The characteristics of these datasets are described in Table 2.
The experiments are performed by applying the four algorithms (PRA, DMA, REA, and PMA-AR) to the eight datasets shown in Table 2. The resulting numbers of selected features and the running times in seconds are shown in Tables 3 and 4, respectively.
From Table 3, it can be observed that the number of features selected by PMA-AR is the same as that by PRA but not less than those by DMA and REA. On the whole, the numbers selected by the four algorithms are close to each other. This indicates that the four algorithms perform similarly in terms of reduct size, though DMA and REA perform a little better than PMA-AR and PRA.
On the other hand, Table 4 shows that for each dataset, DMA needs the most time, PRA and REA take the second and third place, respectively, and PMA-AR needs the least. Moreover, the running time of DMA, PRA, and REA increases much more rapidly than that of PMA-AR. The differences can be illustrated by plotting the ratios of the running times of DMA, PRA, and REA to that of PMA-AR, as shown in Figure 2.
From Figure 2, we can find that the curve corresponding to DMA/PMA-AR increases most rapidly, and the curve corresponding to PRA/PMA-AR increases slightly more rapidly than that corresponding to REA/PMA-AR. The experimental result coincides with the theoretical analysis that the time complexity of DMA is not less than $O(|C|^2 |U|^2)$, which is the highest among all four algorithms; the second one is PRA, of which the time complexity is $O(|C|^2 |U| \log |U|)$. Following PRA, REA has a time complexity of not more than $O(|C|^2 |U| \log |U|)$, and the most efficient one is PMA-AR, with a time complexity much less than $O(|C|^2 |U|)$. This confirms that PMA-AR achieves the best performance in terms of time efficiency.
One can also observe that although each curve tends to increase with the size of the datasets, it is not strictly monotonic; that is, the curves fluctuate significantly. This can be seen from the case that the ratio of DMA to PMA-AR on Dataset 2 is higher than that on Dataset 3. The main reason is that the datasets differ in the number of attributes and, more importantly, in the number of selected features. For example, Dataset 2 has 69 attributes, of which 21 features are selected, while Dataset 3 has 16 attributes, of which 8 features are selected. Furthermore, the curves also indicate that when the number of attributes is far less than that of objects, the running time mainly relies on the latter. This supports the conclusion that PMA-AR is more suitable for feature selection from large data than the other three algorithms because its running time is proportional to $|U|$.

Conclusions
Attribute reduction in incomplete decision systems has been a hotspot of rough set-based data analysis. To efficiently obtain a feature subset, many heuristic attribute reduction algorithms have been studied. Unfortunately, these algorithms are still computationally costly for large data. This paper has developed PMA-AR, which finds a feature subset with markedly better computational efficiency.
Unlike other algorithms characterized by microcosmic approximation, PMA-AR adopts a novel model called macroscopic approximation, which considers all the target concepts of a decision system as an indivisible whole to be approximated by the rough set boundary region derived from IT-blocks. Constructed on PMA, which serves as an accelerator for the calculation of the boundary region, PMA-AR is capable of efficiently identifying a reduct by using the boundary region-based significance as the evaluation function. Both theoretical analysis and experimental results demonstrate that PMA-AR outperforms the other available algorithms with regard to time efficiency.
A subset $X \subseteq U$ whose objects are pairwise related by $\mathrm{SIM}(P)$ is called a $P$-tolerance block [34], and $\xi_P$ is the family of all the $P$-tolerance blocks on $U$, called a $P$-approximation space. A $P$-tolerance block depicts the collection of objects which are possibly indiscernible from each other with respect to $P$. If there does not exist another $P$-tolerance block $Y$ such that $X \subset Y$, then $X$ is called a maximal $P$-tolerance block [34]. For any object $x \in U$, the $P$-tolerance relation determines the tolerance class of $x$, denoted by $S_P(x)$, that is,
$$S_P(x) = \{y \in U \mid (x, y) \in \mathrm{SIM}(P)\}. \tag{2}$$
An IT-block describes a set of $P$-indiscernible objects with diverse class labels, implying that a group of indistinguishable objects diverge in decision making, whereas a CT-block depicts a collection of $P$-indiscernible objects with the same class label, indicating that a group of $P$-indiscernible objects share the same decision. Accordingly, the members of $\xi_P$ are classified into two mutually exclusive crisp subfamilies. One is the consistent family, denoted by $\xi^{CT}_P$, collecting all the CT-blocks from $\xi_P$ such that $\xi^{CT}_P = \{X \in \xi_P \mid |d(X)| = 1\}$, where $d(X)$ denotes the set of decision values taken by the objects of $X$. The other is the inconsistent family, denoted by $\xi^{IT}_P$, gathering all the IT-blocks from $\xi_P$ such that $\xi^{IT}_P = \{X \in \xi_P \mid |d(X)| > 1\}$. Obviously, $\xi^{CT}_P \cup \xi^{IT}_P = \xi_P$ and $\xi^{CT}_P \cap \xi^{IT}_P = \emptyset$.
It is worth noting the distinction between the boundary region and the IT-block. The former consists of objects whose tolerance classes cannot be entirely contained in target concepts, while the latter is an entity that overlaps two or more target concepts. The following lemmas are used to investigate the relationship between them.
Let $IDS = (U, C \cup D)$ and $P \subseteq C$. For any $x \in \mathrm{BND}_P(\pi_D)$, there must exist at least one IT-block containing $x$.

Table 3: Number of features selected by four algorithms.

Table 4: Running time of four algorithms.