
Parallel attribute reduction is one of the most important topics in current research on rough set theory. Although some parallel algorithms have been well documented, most of them still face challenges in effectively dealing with complex heterogeneous data that include both categorical and numerical attributes. To address this problem, a novel attribute reduction algorithm based on neighborhood multigranulation rough sets was developed to process massive heterogeneous data in parallel. A MapReduce-based parallelization method for attribute reduction was proposed in the framework of neighborhood multigranulation rough sets. To improve reduction efficiency, hashing Map/Reduce functions were designed to speed up the positive region calculation. On this basis, a quick parallel attribute reduction algorithm using MapReduce was developed. The effectiveness and superiority of this parallel algorithm were demonstrated by theoretical analysis and comparison experiments.

With the rapid development of information technology, especially in the aspects of sensing, communication, network, and calculation, the amount of accumulated data in many fields is increasing at striking speeds. The inestimable value in big data has become a common understanding in academia and industry [

In order to solve this problem, some scholars have proposed parallel algorithms for high-dimensional or large-scale data. Based on the divide and conquer strategy, Xiao et al. [

To break the limit of the equivalence relation, Lin [

To the best of our knowledge, it is still a challenging task to perform parallel attribute reduction on complex and massive data. In particular, the existing algorithms cannot effectively deal in parallel with complex heterogeneous data, which include both categorical and numerical attributes, from the perspective of multiple granular computing. Given the prevalence of heterogeneous datasets in real-life applications, it is therefore necessary to investigate effective parallel approaches to this issue. For the purpose of parallelizing the traditional attribute reduction algorithm for complex heterogeneous data, the neighborhood multigranulation rough set model is considered in this paper, and the parallelization points of hashing, positive region calculation, and boundary object pruning are analyzed based on the MapReduce mechanism. Thereafter, a fast parallel attribute reduction algorithm is developed. The effectiveness and superiority of this parallel algorithm are demonstrated by theoretical analysis and comparison experiments.

Different from available algorithms, the contribution of this paper is twofold: (1) motivated by the aforementioned MapReduce technology, hash algorithm, and neighborhood multigranulation rough set model, parallelization methods of multiple granular spaces and hashing Map/Reduce functions for heterogeneous data are brought to light; (2) a neighborhood multigranulation rough set model-based parallel attribute reduction algorithm using MapReduce, which has never been done before, is proposed.

The paper is organized as follows. Section

In this section, the 1-type and 2-type neighborhood multigranulation rough sets and the MapReduce programming model are briefly described.

The neighborhood rough set model uses a neighborhood relation in place of the equivalence relation, so it can directly process numerical and heterogeneous data. To further process heterogeneous data from the perspective of multiple granular spaces and multiple levels of granularity, neighborhood rough set theory has been extended from a single attribute subset to multiple attribute subsets. Two types of neighborhood multigranulation rough set models have been developed [
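Handling heterogeneous data hinges on a distance function that treats both attribute types at once. The following is a minimal Python sketch of a HEOM-style mixed distance; the function and index names are hypothetical, and this is not necessarily the exact metric used in the cited models.

```python
def mixed_distance(x, y, numeric_idx, categorical_idx):
    # Numerical attributes (assumed normalized to [0, 1]): absolute difference.
    # Categorical attributes: 0/1 overlap mismatch.
    d = 0.0
    for i in numeric_idx:
        d += abs(x[i] - y[i])
    for i in categorical_idx:
        d += 0.0 if x[i] == y[i] else 1.0
    return d

# A sample mixing two numerical attributes with one categorical attribute
x = (0.10, 0.20, "red")
y = (0.13, 0.22, "blue")
print(round(mixed_distance(x, y, numeric_idx=[0, 1], categorical_idx=[2]), 2))  # → 1.05
```

A single scalar distance of this kind is what allows one neighborhood relation to cover categorical and numerical attributes simultaneously.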

Let

Let

Given a decision system,

In contrast to the 1-type neighborhood multigranulation rough sets, in which just a single neighborhood relation is used, multiple neighborhood relations are fully considered in the 2-type neighborhood multigranulation rough sets, denoted 2-type NMGRS by Qian et al. [

Given a decision system

Given a decision system

Given a decision system

Given the attribute subset

Given a decision system

MapReduce is a parallel processing framework that breaks a large task down into many small tasks that are independent of one another and differ from the original task only in size. The MapReduce parallel programming model divides the computational process into two main stages: the Map stage and the Reduce stage.

In the MapReduce model, the whole dataset is divided into splits in their natural order, which are then passed to the Map stage. Data in the MapReduce programming model are represented as <key, value> pairs, and the two stages can be described as follows:

Map: <k1, v1> → list(<k2, v2>)

Reduce: <k2, list(v2)> → list(<k3, v3>)

Here, <k1, v1> denotes an input record; the Map function transforms each record into intermediate <k2, v2> pairs, and the Reduce function merges all intermediate values sharing the same key k2 into the final output pairs.
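The two stages can be simulated in a few lines of Python. The word-count pairing below is only the classic illustration of the <key, value> flow, not part of the reduction algorithm itself.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-memory simulation of the MapReduce flow: map each record
    to <key, value> pairs, shuffle by key, then reduce each group."""
    shuffle = defaultdict(list)
    for rec in records:                      # Map stage
        for key, value in map_fn(rec):
            shuffle[key].append(value)
    return {key: reduce_fn(key, values)      # Reduce stage
            for key, values in shuffle.items()}

# Word count as the classic <key, value> example
lines = ["map reduce map", "reduce"]
counts = run_mapreduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda w, ones: sum(ones),
)
print(counts)  # → {'map': 2, 'reduce': 2}
```

In the Hadoop version the shuffle is performed by the framework between the two stages; the in-memory dictionary here merely stands in for it.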

Aiming at numerical or heterogeneous data, many attribute reduction algorithms based on neighborhood multigranulation rough sets have been developed. However, it remains a challenging task to parallelize these attribute reduction algorithms for massive heterogeneous data. Motivated by the works of Qian et al. [

To parallelize the attribute reduction algorithm based on the neighborhood multigranulation rough set model, the MapReduce model was adopted. The key point is thus how to design the Map and Reduce functions so that neighborhood classes and positive regions can be obtained quickly. The work of Yong et al. [

Thus, the Map and Reduce functions for hash buckets calculation are designed as follows.

Input: a data split S_i

Output: <key_HM, value_HM> pairs

for each sample x_i in the split S_i do

let h_i be the hash value of x_i

// x_0 is a special sample in universe; the hash value of x_i is computed from x_0

let key_HM = h_i and value_HM = x_i

output <key_HM, value_HM>

end for

Input: the <key_HM, value_HM> pairs sorted by key

Output: <key_HR, value_HR> pairs, one per hash bucket

for each <key_HM, value_HM> pair do

if key_HM equals the current key_HR then

append value_HM to value_HR

else

if the current bucket B_k is not empty then

output <key_HR, value_HR>

start a new bucket B_k with key_HM and value_HM

end if

end if

end for

output the last <key_HR, value_HR> //output with multi-file; a file named after a hash value is a hash bucket
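Under the assumption, suggested by the role of the reference sample x_0 above, that a sample's hash value is its distance to x_0 bucketed by the neighborhood radius δ, the two functions can be sketched as an in-memory Python simulation (hypothetical names; not the authors' Hadoop/Java implementation):

```python
import math
from collections import defaultdict

def hash_map(samples, x0, delta, dist):
    """Map stage sketch: emit <hash bucket id, sample>. The bucket id
    floors the distance to the reference sample x0 by the radius delta,
    so samples within delta of each other land in the same or an
    adjacent bucket (this keying rule is an assumption)."""
    for x in samples:
        yield math.floor(dist(x, x0) / delta), x

def hash_reduce(pairs):
    """Reduce stage sketch: collect samples sharing a hash value into one
    bucket (one output file per bucket in the Hadoop version)."""
    buckets = defaultdict(list)
    for key, x in pairs:
        buckets[key].append(x)
    return dict(buckets)

dist = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))  # Manhattan distance
samples = [(0.10, 0.20), (0.13, 0.22), (0.14, 0.23), (0.90, 0.80)]
buckets = hash_reduce(hash_map(samples, x0=(0.0, 0.0), delta=0.5, dist=dist))
print(sorted(buckets))  # → [0, 3]
```

Because neighborhood candidates of a sample can only lie in its own or adjacent buckets, the subsequent neighborhood search no longer needs to scan the whole universe.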

We take the decision table shown in Table

Decision table.

Sample | a1 | a2 | a3 | a4 | d
---|---|---|---|---|---

1 | 0.10 | 0.20 | 0.61 | 0.20 | Yes

2 | 0.13 | 0.22 | 0.56 | 0.10 | Yes

3 | 0.14 | 0.23 | 0.40 | 0.31 | No

4 | 0.16 | 0.41 | 0.30 | 0.16 | No

According to Definition

The Map process: the Map tasks compute the hash value of every sample in their splits and emit the corresponding <key_HM, value_HM> pairs.

The Reduce process: the Reduce tasks group the received pairs by hash value and output one <key_HR, value_HR> pair, i.e., one hash bucket, per distinct hash value.

After Algorithms

Next, we calculated the positive regions under the current subset. According to Definition

As to the neighborhood calculation of a single condition attribute subset, according to the work in literature [

Map and Reduce functions for neighborhood calculation by a single condition attribute subset are designed as follows.

Input: a hash bucket of samples

Output: <key_M, value_M> pairs

// δ is the given neighborhood radius

for each sample x_i in the bucket do

let n_i = {x_i} be the neighborhood of x_i and d_i its decision value

for each subsequent sample y taken in hash order do

if the distance between x_i and y is not larger than δ then

add y to n_i

else

break // the samples are ordered by hash value, so no later sample can fall into the neighborhood

end if

end for

output <key_M, value_M> = <x_i, (n_i, d_i)>

end for

Input: the <key_M, value_M> pairs sorted by key

Output: <key_R, value_R> pairs

// let key_R and value_R hold the current sample and its merged neighborhood information

for each <key_M, value_M> pair do

if key_M equals the current key_R then

merge value_M into value_R

else

if the current group G_k is not empty then

output <key_R, value_R>

start a new group G_k with key_M and value_M

end if

end if

end for

output the last <key_R, value_R>

The positive region of the whole universe can then be obtained by gathering the decision-consistent samples output by all the Reduce tasks, while the remaining samples form the boundary region.
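To make the positive/boundary split concrete, here is a single-machine, single-granulation Python sketch on toy data; it is illustrative only, whereas the paper's multigranulation version distributes this work across Reduce tasks.

```python
def positive_and_boundary(samples, delta, dist):
    """Split a universe of (features, decision) samples into the positive
    region (delta-neighborhood is decision-consistent) and the boundary
    region (neighborhood contains conflicting decisions)."""
    positive, boundary = [], []
    for feats, dec in samples:
        nbhd = [d for f, d in samples if dist(feats, f) <= delta]
        if all(d == dec for d in nbhd):
            positive.append((feats, dec))
        else:
            boundary.append((feats, dec))
    return positive, boundary

dist = lambda x, y: max(abs(a - b) for a, b in zip(x, y))  # Chebyshev distance
table = [((0.10,), "Yes"), ((0.12,), "No"), ((0.50,), "No"), ((0.52,), "No")]
pos, bnd = positive_and_boundary(table, delta=0.05, dist=dist)
print(len(pos), len(bnd))  # → 2 2
```

The first two samples lie in each other's neighborhood but disagree on the decision, so both fall into the boundary region; the last two agree and enter the positive region.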

We continue using the decision table in Table

The Map process: for every sample in its hash bucket, each Map task computes the δ-neighborhood and emits the corresponding <key_M, value_M> pairs.

The Reduce process: the Reduce tasks merge the pairs sharing the same key and output the <key_R, value_R> pairs that describe each sample's complete neighborhood and decision information.

The above operation results show that the positive and boundary regions of the current universe can be acquired by Algorithms

According to the monotonicity proof in the work of Ma et al. [

The Map and Reduce functions for updating positive regions are designed as follows.

Input: a sample x_i together with its neighborhood decision information

Output: <key_UM, value_UM> pairs

// flag_i marks whether x_i belongs to the positive region or to the boundary region

for each sample x_i do

if the neighborhood of x_i is decision-consistent then set flag_i = positive

else set flag_i = boundary

end if

output <key_UM, value_UM> = <flag_i, x_i>

end for

Input: the <key_UM, value_UM> pairs

Output: the updated boundary sample file

// let key_UR and value_UR denote the output key and the collected boundary samples

if the flag of the received group equals boundary then

add the received samples to value_UR

output <key_UR, value_UR> // write the remaining boundary samples for the next round

end if

Here, we are taking Table

It can be seen from Example

The Map process: each Map task flags every sample as belonging to the positive region or to the boundary region and emits the corresponding <key_UM, value_UM> pairs.

The Reduce process: the Reduce task collects the boundary samples and writes them to the updated <key_UR, value_UR> output file.

After the above operation, the boundary region updating was finished, and the actual storage situation is shown as follows:

2 0.13 0.22 0.56 0.10 Yes

The above results show that the subsequent attribute reduction can be processed directly on the basis of the whole dataset without any extra splitting.
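The updating step can be mimicked with a small Python sketch that drops the samples which entered the positive region, leaving only sample 2 as in the storage listing above; the dictionary is a hypothetical stand-in for the boundary file on HDFS.

```python
def update_boundary(universe, positive_ids):
    # Keep only samples still in the boundary region; by monotonicity,
    # positive-region samples never leave the positive region and need
    # not be scanned in later reduction rounds.
    return {i: row for i, row in universe.items() if i not in positive_ids}

universe = {1: "0.10 0.20 0.61 0.20 Yes", 2: "0.13 0.22 0.56 0.10 Yes",
            3: "0.14 0.23 0.40 0.31 No", 4: "0.16 0.41 0.30 0.16 No"}
remaining = update_boundary(universe, positive_ids={1, 3, 4})
print(remaining)  # → {2: '0.13 0.22 0.56 0.10 Yes'}
```

Each round therefore scans a strictly smaller universe, which is what makes the pruning pay off on massive data.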

On the basis of parallel algorithms given in Section

Step 1: initialize the reduction set red as empty and take the whole universe as the initial boundary sample set.

Step 2: while the boundary sample set is not empty and candidate attributes remain, repeat the following steps:

Step 2.1: for each candidate condition attribute a_k, calculate the corresponding positive region in parallel;

Step 2.2: compare the positive regions, select the attribute a_k with the largest positive region, and add a_k to the current reduction;

Step 2.3: update the boundary sample set.

Step 3: output the reduction red.
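The outer loop of Steps 1-3 amounts to forward greedy selection. The Python sketch below shows only that control flow, with a toy positive-region oracle standing in for the parallel MapReduce jobs (names and values hypothetical):

```python
def greedy_reduction(attrs, positive_size):
    """Forward greedy driver mirroring Steps 1-3: repeatedly add the
    candidate attribute whose positive region under red ∪ {a} is largest,
    stopping when no candidate enlarges the positive region."""
    red, remaining = [], list(attrs)
    current = positive_size(red)
    while remaining:
        best = max(remaining, key=lambda a: positive_size(red + [a]))
        gain = positive_size(red + [best]) - current
        if gain <= 0:
            break  # Step 2 stop condition: the boundary can shrink no further
        red.append(best)
        remaining.remove(best)
        current += gain
    return red

# Toy oracle standing in for the parallel positive-region jobs
sizes = {(): 0, ("a3",): 3, ("a3", "a1"): 4}
oracle = lambda red: sizes.get(tuple(red), 0)
print(greedy_reduction(["a1", "a2", "a3"], oracle))  # → ['a3', 'a1']
```

In the real algorithm, each `positive_size` call corresponds to one round of the parallel hashing and positive-region jobs over the current boundary set.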

It is assumed that the neighborhood decision information system has

In this section, we conducted numerical experiments to assess the efficiency of our proposed algorithm. The experiments were implemented on a PC cluster of nine nodes, where one was set as the master node and the rest were configured as slave nodes. Each node was equipped with an Intel Core i5-2400M CPU (four cores in all, each 3.1 GHz) and 4 GB of RAM, running Ubuntu 14.0, Hadoop 2.6.0, and Java 1.6.20. All algorithms were coded in Java.

To illustrate the efficiency of our proposed PARA_NMG algorithm, a representative parallel attribute reduction algorithm based on the positive region, denoted PAAR_PR, proposed in literature [

To test the efficiency of the above two algorithms on different types of data, the experiments were carried out on the real datasets

Datasets.

Dataset | Record number | Attribute number | Class number | |
---|---|---|---|---|

1 | DS1 | 30,700,000 | 35 | 19 |

2 | DS2 | 2,458,285 | 68 | 3 |

3 | DS3 | 5,000,000 | 18 | 2 |

4 | DS4 | 3,850,505 | 52 | 2 |

5 | DS5 | 1,025,010 | 11 | 10 |

6 | DS6 | 4,898,421 | 41 | 23 |

For the neighborhood rough set model, it is important to select a proper neighborhood radius when calculating neighborhood classes. According to the work of Hu et al. [

Reduction results of the two algorithms.

Dataset | PAAR_PR | PARA_NMG |
---|---|---|

DS1 | 18, 26, 29, 1, 22, 3, 6, 10, 7, 9, 7, 17 | 18, 26, 29, 1, 22, 3, 6, 10, 7, 9, 4, 16 |

DS2 | 53, 25, 50, 14, 66 | 32, 67, 53, 35, 14 |

DS3 | / | 1, 17, 6 |

DS4 | / | 7, 19, 57, 2, 61, 5 |

DS5 | / | 4, 8, 2, 10 |

DS6 | 12, 6, 5, 3, 1, 23, 33, 36, 2, 24, 40 | 12, 6, 5, 3, 1, 23, 33, 36, 8, 11, 29, 34, 39, 27 |

It can be seen from Table

In addition, for datasets DS1, DS2, and DS6, although attribute reduction results were obtained by both algorithms, there were slight differences between the selected attribute subsets. To further analyze the two algorithms’ effects on reduction results from the perspective of classification accuracy, seven well-known classifiers, namely, sequential minimal optimization (SMO), naive Bayes, the naive Bayesian model (NBM), the logistic regression model (LRM), locally weighted learning (LWL), J48, and MultiClassClassifier, were selected to test the classification accuracy associated with the different attribute reduction subsets. The test results are shown in Table

Classification accuracies with two algorithms’ reduction results.

Classifier | Dataset | PAAR_PR | PARA_NMG |
---|---|---|---|

SMO | DS1 | 82.1037 | 86.4802 |

DS2 | 84.5962 | 87.6572 | |

DS6 | 91.5151 | 97.9569 | |

Naive Bayes | DS1 | 75.4451 | 84.7014 |

DS2 | 87.4395 | 93.1425 | |

DS6 | 67.0616 | 75.6389 | |

NBM | DS1 | 57.4696 | 77.7349 |

DS2 | 87.6570 | 89.6472 | |

DS6 | 97.1868 | 97.9569 | |

Logistic | DS1 | 83.6540 | 88.0672 |

DS2 | 87.7754 | 90.7764 | |

DS6 | 99.9873 | 99.9898 | |

LWL | DS1 | 91.7377 | 96.5833 |

DS2 | 70.9932 | 80.4851 | |

DS6 | 67.0616 | 67.0616 | |

J48 | DS1 | 90.5104 | 97.9103 |

DS2 | 82 | 82 | |

DS6 | 97.1868 | 98.1967 | |

MultiClassClassifier | DS1 | 81.6033 | 82.8716 |

DS2 | 95.1595 | 98.5344 | |

DS6 | 96.1977 | 99.9781 |

We can see from Table

To illustrate the influence of the number of computer nodes on the two algorithms’ computational time, the experiments were implemented on clusters with different numbers of nodes. The average running times of the two algorithms were recorded and are shown in Table

Average computational time of two algorithms (

Dataset | Algorithm | Number of nodes | |||||||
---|---|---|---|---|---|---|---|---|---|

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ||

DS1 | PARA_NMG | 14,958 | 8499 | 6420 | 4953 | 4132 | 3596 | 3317 | 2968 |

PAAR_PR | 1724 | 894 | 607 | 492 | 418 | 364 | 312 | 272 | |

DS2 | PARA_NMG | 55,423 | 28,717 | 19,515 | 17,820 | 14,031 | 11,996 | 10,226 | 8854 |

PAAR_PR | 8732 | 4961 | 3731 | 2960 | 2591 | 2140 | 2003 | 1815 | |

DS3 | PARA_NMG | 64,035 | 34,992 | 24,820 | 19,583 | 16,504 | 14,927 | 13,481 | 12,338 |

DS4 | PARA_NMG | 16,006 | 8559 | 6428 | 4694 | 3942 | 3572 | 3163 | 2663 |

DS5 | PARA_NMG | 13,226 | 6961 | 4863 | 4032 | 3157 | 2738 | 2458 | 2249 |

DS6 | PARA_NMG | 21,371 | 11,264 | 7714 | 6333 | 5304 | 4847 | 3961 | 3533 |

PAAR_PR | 2091 | 1384 | 1124 | 987 | 816 | 723 | 651 | 624 |

As can be seen from Table

In fact, besides computational time, speedup is another important performance index for evaluating the efficiency of a parallel algorithm, which is defined as follows:
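Although the formula itself was lost in extraction, the standard definition of speedup for a run on p nodes is presumably intended:

```latex
\mathrm{Speedup}(p) = \frac{T_1}{T_p}
```

where T_1 is the running time on a single node and T_p the running time on p nodes; ideal (linear) speedup corresponds to Speedup(p) = p.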

The speedup of the two algorithms was tested with different numbers of nodes. To be more intuitive, the average speedup of the two algorithms on each dataset with different numbers of computer nodes is presented in Figure

Average speedup on different number of nodes.

As shown in Figure

Attribute reduction is one of the important research issues in rough set theory. In the current big data era, traditional attribute reduction algorithms face great challenges in dealing with massive data. Most existing parallel algorithms have seldom taken granular computing into consideration, especially when dealing with complex heterogeneous data that include both categorical and numerical attributes. To address these issues, a quick parallel attribute reduction algorithm for heterogeneous data, using MapReduce in the framework of neighborhood multigranulation rough sets, was developed in this paper. The hash function was introduced into the Map and Reduce stages to speed up the positive region calculation. The effectiveness and superiority of the developed algorithm were verified by comparative analysis.

However, only static data was considered in this paper; datasets in real-world applications often vary dynamically over time. How to parallelize incremental attribute reduction algorithms in the framework of neighborhood multigranulation rough sets is a focus of future research.

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

This work was supported by the National Natural Science Foundation of China (61833011, 61403184, and 61533010), the major program of the Natural Science Foundation of Jiangsu Province Education Commission, China (17KJA120001), the National Key Research and Development Program of China (2017YFD0401001), and the Six Talent Peaks Project in Jiangsu Province, China (XNY-038).
