The divide and conquer method is a typical granular computing method that employs multiple levels of abstraction and granulation. Although some results based on the divide and conquer method have been achieved in rough set theory, systematic methods for knowledge reduction based on it are still absent. In this paper, knowledge reduction approaches based on the divide and conquer method are presented under the equivalence relation and under the tolerance relation, respectively. After that, a systematic approach, named the abstract process for knowledge reduction based on the divide and conquer method in rough set theory, is proposed. Based on this approach, two algorithms for knowledge reduction, including an algorithm for attribute reduction and an algorithm for attribute value reduction, are presented. Experimental evaluations are performed on UCI data sets and the KDDCUP99 data set. The experimental results illustrate that the proposed approaches process large data sets efficiently with good recognition rates, compared with KNN, SVM, C4.5, Naive Bayes, and CART.

In the search for new paradigms of computing, there is a recent surge of interest, under the name of granular computing [

Rough set (RS) [

The divide and conquer method is a simple granular computing method. When algorithms are designed with the divide and conquer method, the decision table can be divided recursively into many sub-decision tables in the attribute space. That is to say, an original large data set can be divided into many small ones. If the small ones can be processed one by one, instead of processing the original large one as a whole, much time can be saved. Thus, it may be an effective way to process large data sets. The divide and conquer method consists of three vital stages.

Divide the original big problem into many independent sub-problems with the same structure.

Conquer the sub-problems recursively.

Merge the solutions of the sub-problems into the solution of the original problem.
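The three stages above can be sketched as a generic recursive template. This is an illustrative sketch only; the function and parameter names below are not the paper's notation. Records are divided by their value on one attribute, sub-tables are conquered recursively, and the partial solutions are merged:

```python
# Generic divide-and-conquer over a table, dividing in attribute space.
# `solve_leaf` handles a sub-table that can no longer be divided;
# `merge` combines the solutions of the sub-problems.
def divide_and_conquer(records, attrs, solve_leaf, merge):
    # Base case: no attribute left to divide on -- solve directly.
    if not attrs:
        return solve_leaf(records)
    # Divide: group records by their value on the first attribute.
    groups = {}
    for rec in records:
        groups.setdefault(rec[attrs[0]], []).append(rec)
    # Conquer each sub-table recursively, then merge the sub-solutions.
    return merge(divide_and_conquer(g, attrs[1:], solve_leaf, merge)
                 for g in groups.values())
```

For instance, with `solve_leaf=len` and `merge=sum`, the template simply counts the records, visiting each leaf sub-table once.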

So far, some good results for knowledge reduction based on divide and conquer method have been achieved, such as the computation of the attribute core and the computation of attribute reduction under given attribute order [

However, a systematic method for knowledge reduction based on the divide and conquer method is still absent, especially concerning “how to keep invariance between the solution of the original problem and the solutions of the sub-problems.” This makes it difficult to design highly efficient algorithms for knowledge reduction based on the divide and conquer method. Therefore, it is urgent to discuss knowledge reduction methods based on the divide and conquer method systematically and comprehensively.

The contributions of this work are as follows. (1) Some principles for “keeping invariance between the solution of the original problem and the solutions of the sub-problems” are concluded. Then, the abstract process for knowledge reduction based on the divide and conquer method in rough set theory is presented, which is helpful for designing highly efficient algorithms based on the divide and conquer method. (2) Fast approaches for knowledge reduction based on the divide and conquer method, including an algorithm for attribute reduction and an algorithm for attribute value reduction, are proposed. Experimental evaluations show that the presented methods are efficient.

The remainder of this paper is organized as follows. The basic theory and methods dealing with the application of rough set theory in data mining are presented in Section

Rough set theory was introduced by Pawlak as a tool for concept approximation under uncertainty. Basically, the idea is to approximate a concept by three description sets, namely, the lower approximation, the upper approximation, and the boundary region. The approximation process begins by partitioning a given set of objects into equivalence classes called blocks, where the objects in each block are indiscernible from each other relative to their attribute values. The approximation and boundary region sets are derived from the blocks of a partition of the available objects. The boundary region is constituted by the difference between the upper approximation and the lower approximation and provides a basis for measuring the “roughness” of an approximation. Central to the philosophy of the rough set approach to concept approximation is the minimization of the boundary region [
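As a minimal illustration of these notions (the data structures below are hypothetical, not the paper's formal notation), the lower approximation, upper approximation, and boundary region of a target concept X can be computed directly from the blocks of the partition:

```python
from collections import defaultdict

# Sketch: approximate a concept X (a set of object ids) by the partition
# that the chosen attributes induce on the universe of objects.
def approximations(universe, attrs, X):
    # Partition objects into blocks by their indiscernibility signature.
    blocks = defaultdict(set)
    for obj_id, obj in universe.items():
        blocks[tuple(obj[a] for a in attrs)].add(obj_id)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= X:       # block certainly inside the concept
            lower |= block
        if block & X:        # block possibly inside the concept
            upper |= block
    # Boundary region = upper approximation minus lower approximation.
    return lower, upper, upper - lower
```

A non-empty boundary region signals that the attributes cannot describe the concept exactly, which is what the “roughness” of the approximation measures.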

For the convenience of description, some basic notions of decision tables are introduced here at first.

A decision table is defined as

Given a decision table

Given a decision table

Given a decision table

Given a decision table

Given a decision table

Given a decision table

Given a decision table

In research on rough set theory, the divide and conquer method is an effective way to design highly efficient algorithms. It can be used to compute the equivalence classes, the positive region, and the attribute core of a decision table (see Propositions

Given a decision table

Proposition

Given a decision table

Given a decision table

Propositions

Obviously, Propositions

The discernibility matrix introduced by Skowron is a useful tool for designing algorithms in rough set theory. However, due to the high complexity of algorithms based on explicitly computing the discernibility matrices, the efficiency of such algorithms needs to be improved. In the literature, some useful methods have been proposed (see [

“How to keep invariance between the solution of the original problem and the solutions of the sub-problems” is a key problem. We conclude some principles for computing the positive region, attribute core, attribute reduction, and value reduction (see Propositions

Although the decision tree-based methods and our approaches both belong to the divide and conquer method, our approaches cost more on “conquer” and “merge” while costing less on “divide,” compared with the decision tree-based methods. Furthermore, our approaches need not construct a tree, which may save space.

The existing heuristic ones in [

Given a decision table

for all

Given a decision table

(Note: If

First, prove

For all

If

If

Thus,

Similarly, we can prove

Therefore, Proposition

Given a decision table

According to Proposition

Propositions

In the course of attribute value reduction, the tolerance relation is often adopted because some attribute values on the condition attributes are deleted. Thus, the tolerance relation may be needed in attribute value reduction. A method is introduced by Kryszkiewicz and Rybinski [

Given an incomplete decision table

The tolerance class
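A minimal sketch of this tolerance-based grouping follows, assuming `'*'` marks a missing value (the identifiers are illustrative, not the paper's notation). Two objects tolerate each other if, on every attribute, their values agree or at least one of them is missing:

```python
MISSING = '*'   # marker for a missing (or deleted) attribute value

def tolerant(x, y, attrs):
    # Tolerance: agreement or a missing value on every attribute.
    return all(x[a] == y[a] or x[a] == MISSING or y[a] == MISSING
               for a in attrs)

def tolerance_class(obj, universe, attrs):
    # All objects the given object tolerates. The relation is reflexive
    # and symmetric but, unlike an equivalence relation, not transitive,
    # so tolerance classes may overlap rather than partition the universe.
    return {oid for oid, other in universe.items()
            if tolerant(obj, other, attrs)}
```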

Given a decision table

Given a decision table

Given a decision table

Given a decision table

in

in

Given a decision table

Propositions

According to the divide and conquer method under equivalence relation and tolerance relation, the abstract process for knowledge reduction in rough set theory based on the divide and conquer method APFKRDAC(

Input: The problem

Output: The solution

Determine a similarity relation between different objects, such as an equivalence relation or a tolerance relation. Generally, reflexivity and symmetry of the relation may be necessary.

Determine the order

Design a judgment criterion

Design a decompose function

Design a boolean function

Design a computation function

Design a computation function

(4.1) IF

goto Step 5.

(4.2) (Divide)

According to the decomposing order, divide

(4.3) (Conquer sub-problems recursively)

FOR

END FOR.

Where,

(4.4) (Merge the solutions of sub-problems)

If necessary, optimize the solution

RETURN

Now, let us give an example for computing the positive region of decision table to explain Algorithm

Input: The problem

Output: The positive region

equivalence relation.

(3.1) Design a judgment criterion

On attribute

IF

ELSE

END IF

(3.2) Design a decompose function

According to

(3.3) Design a boolean function

Let

IF

END IF

(3.4) Design a computation function

For arbitrary sub-decision table

IF

ELSE

END IF

(3.5) Design a computation function

(4.1) IF

(4.2) (Divide)

According to the order

decision tables

(4.3) (Conquer sub-problems recursively)

FOR

END FOR

Where,

(4.4) (Merge the solutions of sub-problems)

RETURN

Given a decision table

Let us denote by

Divide

Conquer

Divide

Conquer

Conquer

Merge the solutions of

Similarly, we can conquer

Conquer

Conquer

Merge the solutions of

The course of computing positive region of
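The course of this example can be sketched as a recursive computation of the positive region (a simplified illustration under assumed data structures, not the paper's exact algorithm). The table is divided on one condition attribute at a time; a fully divided sub-table contributes its records to the positive region only if they all share one decision value:

```python
# records: list of (record_id, {attribute: value}) pairs.
def positive_region(records, cond_attrs, decision):
    # Leaf: the records are indiscernible on all condition attributes.
    if not cond_attrs:
        decisions = {rec[decision] for _, rec in records}
        # Consistent leaf: every record is certainly classified.
        return {rid for rid, _ in records} if len(decisions) == 1 else set()
    # Divide on the first attribute, conquer the sub-tables recursively,
    # and merge their positive regions by set union.
    groups = {}
    for rid, rec in records:
        groups.setdefault(rec[cond_attrs[0]], []).append((rid, rec))
    return set().union(*(positive_region(g, cond_attrs[1:], decision)
                         for g in groups.values()))
```

Because each record lands in exactly one sub-table at every level, the merged solution equals the positive region of the whole table, which is the invariance the abstract process requires.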

Knowledge reduction is a key problem in rough set theory. When the divide and conquer method is used to design algorithms for knowledge reduction, good results may be obtained. However, implementing knowledge reduction based on the divide and conquer method is quite complex, even though it is only a simple granular computing method. Here, we discuss fast algorithms for knowledge reduction based on the divide and conquer method.

In the course of attribute reduction, the divide and conquer method is used to compute the equivalence classes, the positive region, and the non-empty label attribute set, and to delete the elements of the discernibility matrix. Due to the complexity of attribute reduction, the following algorithm is not presented as Algorithm

According to Step 2 of Algorithm

In 2001, an algorithm for attribute reduction based on a given attribute order was proposed by Jue Wang and Ju Wang [

Given a decision table

For

According to Definition

Given a decision table

(Necessity) according to Lemma

(Sufficiency):

for all

If

If

That is,

if

Therefore, Proposition

According to the algorithm in [

NonEmptyLabelAttr

//

According to Propositions

IF

return;

END IF

Let NonEmptyLabel

Denote non-empty label attribute: NonEmptyLabel[

END IF

NonEmptyLabelAttr

END FOR.

END Function 1.

Using the above recursive function, an algorithm for computing the non-empty label attribute set of a decision table is developed.

Computation of The Non-empty Label Attribute Set

Input: A decision table

Output: The non-empty label attribute set

FOR

END FOR

NonEmptyLabelAttr

FOR

IF

END FOR

RETURN

Suppose

Obviously, Algorithm

Computation of Attribute Reduction Based on Divide and Conquer Method

Input: A decision table

Output: Attribute reduction

Compute the positive region

Compute the non-empty label attribute set

//Suppose

IF

ELSE

Generate a new attribute order:

Compute new non-empty label attribute set

GOTO Step 4.

END IF

Suppose

In Algorithm
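The invariance idea behind attribute reduction, removing attributes only while the positive region stays unchanged, can be sketched as follows. This is a generic reverse-elimination sketch under assumed data structures, not the paper's order-based algorithm:

```python
from collections import defaultdict

# Helper: positive region of the table under the given attributes
# (records that belong to a decision-consistent block).
def pos_region(records, attrs, decision):
    groups = defaultdict(list)
    for rid, rec in records:
        groups[tuple(rec[a] for a in attrs)].append((rid, rec[decision]))
    return {rid for members in groups.values()
            if len({d for _, d in members}) == 1
            for rid, _ in members}

def attribute_reduction(records, cond_attrs, decision):
    target = pos_region(records, cond_attrs, decision)
    reduct = list(cond_attrs)
    for a in cond_attrs:
        trial = [x for x in reduct if x != a]
        # Drop the attribute only if the positive region is preserved --
        # the invariance condition that attribute reduction must keep.
        if trial and pos_region(records, trial, decision) == target:
            reduct = trial
    return reduct
```

The order in which attributes are tried determines which reduct is found, which is why the paper's algorithm works under a given attribute order.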

Given a decision table

Given a decision table

Given a decision table

According to Algorithm

DRAVDAC

//Denote by array CoreValueAttribute

//The values of array CoreValueAttribute

IF there is contradiction on

return;

END IF

Divide

Denote by array

Where,

FOR

END FOR

FOR

FOR

IF

END FOR

IF

END FOR

END Function 2.

Using Function 2, we present an algorithm for value reduction based on divide and conquer method (see Algorithm

An Algorithm for Value Reduction Based on Divide and Conquer Method

Input:

Output: The certain rule set

According to Algorithm

Assume the order for dividing decision table be

Compute the non-empty label attribute set

Let

FOR i =

FOR

CoreValueAttribute

END FOR.

Invoke Function 2: DRAVDAC

Update

END FOR.

FOR

IF

Construct a rule

END IF

END FOR

RETURN

Suppose

In Step 4, let the number of non-empty label attribute set be

Suppose the data obey the uniform distribution. The time complexity of Algorithm

The space complexity of Algorithm
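The idea of attribute value reduction can be sketched as follows (a simplified illustration with hypothetical names, not the paper's DRAVDAC function): for a single record, a condition value may be blanked only if no object with a different decision becomes tolerant of the reduced rule, so the rule remains certain:

```python
MISSING = '*'   # a blanked (reduced) attribute value acts as a wildcard

def matches(rule_vals, rec, attrs):
    # The reduced rule covers a record if every non-blanked value agrees.
    return all(rule_vals[a] == MISSING or rule_vals[a] == rec[a]
               for a in attrs)

def value_reduce(record, universe, attrs, decision):
    vals = {a: record[a] for a in attrs}
    for a in attrs:
        saved, vals[a] = vals[a], MISSING
        # A clash appears if the weakened rule now covers an object
        # carrying a different decision value.
        clash = any(matches(vals, other, attrs) and
                    other[decision] != record[decision]
                    for other in universe)
        if clash:
            vals[a] = saved   # this value is needed to keep the rule certain
    return vals
```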

In order to test the efficiency of knowledge reduction based on the divide and conquer method, some experiments have been performed on a personal computer. The experiments are described as follows.

In this experiment, some experimental evaluations are done to present the efficiency and recognition results of Algorithms

The test course is as follows. First, 11 UCI data sets (Zoo, Iris, Wine, Machine, Glass, Voting, Wdbc, Balance-scale, Breast, Crx, and Tic-tac-toe) are used. Second, our methods: the algorithm for discretization [

From Table

Specifications of 11 UCI data sets.

Data sets | Number of records | Number of attributes | Number of decision values |
---|---|---|---|

Zoo | 101 | 17 | 7 |

Iris | 150 | 4 | 3 |

Wine | 178 | 13 | 3 |

Machine | 209 | 7 | 8 |

Glass | 214 | 9 | 7 |

Voting | 435 | 16 | 2 |

Wdbc | 569 | 28 | 2 |

Balance scale | 625 | 4 | 5 |

Breast | 684 | 9 | 2 |

Crx | 690 | 15 | 2 |

Tic-tac-toe | 958 | 9 | 2 |

Recognition results on UCI data sets.

In order to test the efficiency of our methods for processing large data sets, some experiments are done on KDDCUP99 data sets with 4898432 records, 41 condition attributes, and 23 decision classifications (

The recognition rates of 6 algorithms on UCI data sets.

Data sets | KNN | SVM | C4.5 | Naive Bayes | CART | Our methods |
---|---|---|---|---|---|---|

Zoo | 0.9208 | 0.9603 | 0.8713 | 0.9702 | 0.9109 | 0.9208 |

Iris | 0.9600 | 0.9467 | 0.9600 | 0.9533 | 0.9533 | 0.9467 |

Wine | 0.9719 | 0.4887 | 0.8483 | 0.9775 | 0.8652 | 0.9607 |

Machine | 0.8461 | 0.6490 | 0.8173 | 0.8173 | 0.8173 | 0.8894 |

Glass | 0.7009 | 0.6963 | 0.5935 | 0.4953 | 0.7009 | 0.7149 |

Voting | 0.9149 | 0.9333 | 0.9517 | 0.9425 | 0.9563 | 0.9333 |

Wdbc | 0.9490 | 0.6274 | 0.9367 | 0.9349 | 0.9332 | 0.9139 |

Balance scale | 0.9024 | 0.952 | 0.7488 | 0.9088 | 0.7920 | 0.7712 |

Breast | 0.9570 | 0.3462 | 0.9399 | 0.9613 | 0.9399 | 0.9341 |

Crx | 0.8710 | 0.5565 | 0.8232 | 0.7522 | 0.8406 | 0.7754 |

Tic-tac-toe | 0.6952 | 1.0000 | 0.7328 | 0.7150 | 0.8779 | 0.9624 |

| ||||||

Average | 0.8808 | 0.7415 | 0.8385 | 0.8571 | 0.8716 | 0.8839 |

First, the experiments are done on 10 data sets (≤10^{4} records) from the original KDDCUP99 data sets. The experimental evaluations are shown in Tables

The recognition rates of 6 algorithms on KDDCUP99 data sets (≤10^{4} records).

Number of records | KNN | SVM | C4.5 | Naive Bayes | CART | Our methods |
---|---|---|---|---|---|---|

1000 | 0.9950 | 0.9550 | 0.9800 | 0.9800 | 0.9870 | 0.9950 |

2000 | 0.9980 | 0.9700 | 0.9935 | 0.9925 | 0.9920 | 0.9935 |

3000 | 0.9977 | 0.9850 | 0.9947 | 0.9753 | 0.9950 | 0.9947 |

4000 | 0.9967 | 0.9750 | 0.9940 | 0.9483 | 0.9935 | 0.9940 |

5000 | 0.9984 | 0.9770 | 0.9956 | 0.9760 | 0.9958 | 0.9954 |

6000 | 0.9986 | 0.9791 | 0.9968 | 0.9553 | 0.9965 | 0.9975 |

7000 | 0.9980 | 0.9850 | 0.9958 | 0.9905 | 0.9951 | 0.9956 |

8000 | 0.9985 | 0.9831 | 0.9958 | 0.9555 | 0.9956 | 0.9965 |

9000 | 0.9983 | 0.9877 | 0.9970 | 0.9690 | 0.9955 | 0.9964 |

10000 | 0.9986 | 0.9880 | 0.9971 | 0.9449 | 0.9968 | 0.9971 |

The training time of 6 algorithms on KDDCUP99 data sets (≤10^{4} records).

Number of records | KNN | SVM | C4.5 | Naive Bayes | CART | Our methods |
---|---|---|---|---|---|---|

1000 | 3 | 87 | 50 | 9 | 120 | 143 |

2000 | 3 | 259 | 47 | 12 | 173 | 78 |

3000 | 2 | 631 | 18 | 16 | 280 | 130 |

4000 | 0 | 1071 | 125 | 22 | 384 | 192 |

5000 | 0 | 1731 | 173 | 31 | 543 | 226 |

6000 | 0 | 2467 | 223 | 41 | 672 | 311 |

7000 | 0 | 3617 | 281 | 129 | 861 | 541 |

8000 | 2 | 4591 | 329 | 70 | 1155 | 822 |

9000 | 2 | 6171 | 364 | 178 | 1262 | 1261 |

10000 | 3 | 7496 | 378 | 96.7 | 1629 | 1415 |

The test time of 6 algorithms on KDDCUP99 data sets (≤10^{4} records).

Number of records | KNN | SVM | C4.5 | Naive Bayes | CART | Our methods |
---|---|---|---|---|---|---|

1000 | 41 | 6 | 0 | 25 | 2 | 0 |

2000 | 122 | 9 | 0 | 34 | 0 | 0 |

3000 | 279 | 18 | 0 | 50 | 0 | 0 |

4000 | 495 | 40 | 0 | 67 | 0 | 0 |

5000 | 782 | 59 | 2 | 81 | 0 | 0 |

6000 | 1197 | 89 | 0 | 94 | 0 | 2 |

7000 | 1805 | 131 | 2 | 125 | 0 | 0 |

8000 | 2289 | 154 | 2 | 249 | 2 | 0 |

9000 | 2880 | 207 | 0 | 178 | 0 | 0 |

10000 | 3588 | 271 | 2 | 162 | 0 | 0 |

From Tables

Second, the experiments are done on 10 data sets (≤10^{5} records) from the original KDDCUP99 data sets. The experimental evaluations are shown in Tables

The recognition rates of 5 algorithms on KDDCUP99 data sets (≤10^{5} records).

Number of records | KNN | C4.5 | Naive Bayes | CART | Our methods |
---|---|---|---|---|---|

10000 | 0.9985 | 0.9969 | 0.9804 | 0.9970 | 0.9973 |

20000 | 0.9987 | 0.9979 | 0.9490 | 0.9980 | 0.9980 |

30000 | 0.9990 | 0.9985 | 0.9560 | 0.9987 | 0.9987 |

40000 | 0.9989 | 0.9988 | 0.9365 | 0.9987 | 0.9989 |

50000 | 0.9991 | 0.9989 | 0.9613 | 0.9989 | 0.9989 |

60000 | 0.9992 | 0.9989 | 0.9627 | 0.9990 | 0.9989 |

70000 | 0.9992 | 0.9992 | 0.9438 | 0.9990 | 0.9990 |

80000 | 0.9992 | 0.9992 | 0.9249 | 0.9992 | 0.9992 |

90000 | 0.9992 | 0.9993 | 0.9097 | 0.9991 | 0.9992 |

100000 | 0.9992 | 0.9992 | 0.9118 | 0.9991 | 0.9992 |

The running time of 5 algorithms on KDDCUP99 data sets (≤10^{5} records).

Number of records | KNN | C4.5 | Naive Bayes | CART | Our methods | |||||
---|---|---|---|---|---|---|---|---|---|---|

Tr | Te | Tr | Te | Tr | Te | Tr | Te | Tr | Te | |

10000 | 2 | 3622 | 412 | 0 | 100 | 164 | 1632 | 0 | 1496 | 3 |

20000 | 3 | 15045 | 1225 | 0 | 281 | 321 | 4984 | 0 | 4367 | 0 |

30000 | 6 | 34592 | 2225 | 2 | 474 | 557 | 8964 | 0 | 8165 | 0 |

40000 | 9 | 60845 | 3582 | 5 | 704 | 774 | 15833 | 2 | 12815 | 5 |

50000 | 13 | 95023 | 5527 | 2 | 930 | 1045 | 19140 | 2 | 20003 | 6 |

60000 | 17 | 139299 | 7920 | 6 | 1167 | 1148 | 28805 | 2 | 30450 | 9 |

70000 | 22 | 196827 | 10084 | 9 | 1422 | 1438 | 31774 | 3 | 34909 | 2 |

80000 | 28 | 248624 | 12069 | 5 | 1688 | 2073 | 40056 | 3 | 39407 | 9 |

90000 | 27 | 310826 | 14461 | 10 | 1959 | 1716 | 44023 | 3 | 47688 | 6 |

100000 | 33 | 386018 | 16673 | 13 | 2185 | 2044 | 54288 | 3 | 49394 | 8 |

From Tables

Third, the experiments are done on 10 data sets (≤10^{6} records) from the original KDDCUP99 data sets. The experimental evaluations are shown in Table

The experimental results of 3 algorithms on KDDCUP99 data sets (≤10^{6} records).

Number of records | C4.5 | CART | Our methods | ||||||
---|---|---|---|---|---|---|---|---|---|

RRate | Tr | Te | RRate | Tr | Te | RRate | Tr | Te | |

100000 | 0.9992 | 18730 | 31 | 0.9991 | 56208 | 8 | 0.9992 | 51692 | 24 |

200000 | 0.9997 | 52971 | 32 | 0.9994 | 113297 | 8 | 0.9995 | 113496 | 16 |

300000 | 0.9997 | 78118 | 47 | 0.9997 | 182997 | 0 | 0.9997 | 178769 | 31 |

400000 | 0.9998 | 111026 | 86 | 0.9997 | 327640 | 23 | 0.9997 | 313085 | 40 |

500000 | 0.9997 | 163746 | 94 | 0.9998 | 391179 | 16 | 0.9997 | 410342 | 55 |

600000 | 0.9998 | 218152 | 110 | 0.9996 | 446004 | 16 | 0.9997 | 538169 | 55 |

700000 | 0.9998 | 226879 | 125 | 0.9997 | 610749 | 24 | 0.9998 | 719630 | 78 |

800000 | 0.9999 | 387911 | 148 | 0.9999 | 1015165 | 32 | 0.9999 | 1143149 | 109 |

900000 | 0.9998 | 304466 | 195 | 0.9999 | 1595899 | 382 | 0.9998 | 1512619 | 133 |

1000000 | 0.9999 | 303403 | 203 | 0.9997 | 1583910 | 367 | 0.9998 | 1590494 | 140 |

Fourth, the experiments are done on 10 data sets (<

The experimental results of 3 algorithms on KDDCUP99 data sets (<5 × 10^{6} records).

Number of records | C4.5 | CART | Our methods | ||||||
---|---|---|---|---|---|---|---|---|---|

RRate | Tr | Te | RRate | Tr | Te | RRate | Tr | Te | |

489843 | 0.9998 | 158 | 0.124 | 0.9997 | 386 | 0.031 | 0.9998 | 499 | 0.063 |

979686 | 0.9999 | 330 | 0.187 | 0.9998 | 1517 | 0.312 | 0.9998 | 1472 | 0.141 |

1469529 | 0.9999 | 706 | 0.312 | 0.9999 | 5851 | 5.554 | 0.9999 | 4026 | 0.156 |

1959372 | 0.9999 | 883 | 0.499 | — | — | — | 0.9999 | 6650 | 0.421 |

2449216 | 0.9999 | 1143 | 0.624 | — | — | — | 0.9999 | 11446 | 0.476 |

2939059 | 0.9999 | 1176 | 0.655 | — | — | — | 0.9999 | 14023 | 0.578 |

3428902 | 0.9999 | 1849 | 0.827 | — | — | — | 0.9999 | 35361 | 0.606 |

3918745 | 0.9999 | 2043 | 0.923 | — | — | — | — | — | — |

4408588 | 0.9999 | 2567 | 1.106 | — | — | — | — | — | — |

4898432 | 0.9999 | 2916 | 1.217 | — | — | — | — | — | — |

From Tables

Now, we will give some conclusions for our approaches compared with KNN, SVM, C4.5, Naive Bayes, and CART, according to the LOOCV experimental results on the UCI data sets and the 10-fold cross-validation experimental results on the KDDCUP99 data sets.

Compared with KNN, SVM, and Naive Bayes, the LOOCV recognition results of our methods on the UCI data sets are better than those of KNN, SVM, and Naive Bayes. Furthermore, our methods have higher efficiency on the KDDCUP99 data sets than KNN, SVM, and Naive Bayes, while also achieving good recognition results.

Compared with CART, the LOOCV recognition results of our methods on the UCI data sets are close to those of CART. However, our methods can process larger data sets than CART on the KDDCUP99 data sets, while both have good recognition results.

Compared with C4.5, the LOOCV recognition results of our methods on the UCI data sets are better than those of C4.5. Furthermore, the test time of our methods on the KDDCUP99 data sets is less than that of C4.5, while C4.5 can process larger data sets than our methods. After analyzing these two methods, we find that our methods are more complex than C4.5 because of the required discretization (C4.5 can process decision tables with continuous values directly, whereas discretization is necessary for our methods). As a coin has two sides, this additional learning contributes to better rule sets; thus, our methods need less test time than C4.5.

Therefore, the knowledge reduction approaches based on the divide and conquer method are efficient for processing large data sets, although they need to be improved further in the future.

In this paper, the abstract process of knowledge reduction based on the divide and conquer method is concluded, which originates from the approach under the equivalence relation and the one under the tolerance relation. Furthermore, an example of computing the positive region of a decision table is introduced. After that, two algorithms for knowledge reduction based on the divide and conquer method, including an algorithm for attribute reduction and an algorithm for attribute value reduction, are presented. According to the experimental evaluations, the proposed algorithms are efficient for knowledge reduction on the UCI data sets and the KDDCUP99 data set. Therefore, the divide and conquer method is an efficient and suitable method for knowledge reduction algorithms in rough set theory. With this efficiency, widespread industrial application of rough set theory may become possible.

This work is supported by the National Natural Science Foundation of China (NSFC) under Grants no. 61073146, no. 61272060, no. 61203308, and no. 41201378, Scientific and Technological Cooperation Projects between China and Poland Government, under Grant no. 34-5, Natural Science Foundation Project of CQ CSTC under Grant no. cstc2012jjA1649, and Doctor Foundation of Chongqing University of Posts and Telecommunications under Grant no. A2012-08.