Parallel processing has become a mainstream approach to improving computer performance. Based on rough set theory and the divide-and-conquer idea of knowledge reduction, this paper proposes a classification method that supports parallel attribute reduction. The method makes the relative positive regions, which must be calculated repeatedly, independent of one another, so that these calculations can be carried out in parallel; attribute reduction can therefore be parallelized on top of this classification method. Finally, the proposed algorithm and the traditional algorithm are analyzed and compared experimentally. The results show that the proposed method has a clear advantage in time efficiency, which demonstrates that it improves the processing efficiency of attribute reduction and makes it better suited to massive data sets.

With the rapid development of network and storage technology, and especially of cloud computing and big data, many complex problems have emerged that involve not only a large amount of computation but also data at massive scale, namely, so-called massive data processing. The explosive growth of data of all kinds has pushed human society into the era of big data. Big data exhibits characteristics such as high dimensionality, strong dynamics, and randomness, which reflect its inherent uncertainty [

Rough set theory was proposed by Polish mathematician Pawlak [

Attribute reduction, also known as dimensionality reduction or feature selection, originates in machine learning and is one of the core research topics of rough sets. Its purpose is to delete the attributes in a data set that are irrelevant to classification and thereby improve the performance of knowledge discovery. Attribute reduction is now widely used in pattern recognition and data mining, and attribute reduction in decision tables has been studied extensively. Since the attributes in a decision table are not equally important, the attribute set contains a number of redundant attributes; these have no effect on the decision results but reduce the efficiency of decision making, so they should be removed. The main idea of attribute reduction is therefore to eliminate redundant attributes while keeping the classification ability of the existing knowledge unchanged, which reduces the size of the data set and improves the efficiency of knowledge discovery: reducing the dimensionality of the data through an attribute reduction algorithm shortens training time and improves the quality of decision making. Rough set attribute reduction algorithms can handle not only small-scale data but also large data effectively.

Parallel computing, in which multiple computing resources are used to solve a computational problem simultaneously, is an effective means of improving the computing speed and processing capacity of a computer system, and it is well suited to solving large-scale problems with complex processing [

Based on the idea of divide-and-conquer [

For the convenience of writing and description, some basic concepts and definitions of rough set theory are listed below.

Decision table. Let

Indiscernibility relation. Let

Upper and lower approximation sets. For a decision table, for each subset of attributes

Positive region. Let

Relative reduction. Let

Core attribute. Let

Necessary attributes. Let

Next, the related concepts of parallel computing are given.

A parallel computer is a computer with multiple processors capable of parallel processing.

Parallel processing is an efficient form of information processing that emphasizes concurrent operations on data elements that belong to one or more processes that solve a single problem.

A parallel algorithm is a collection of simultaneous processes that interact and coordinate to solve a given problem.

Data parallelism means that the data are divided into several blocks and mapped to different processors, each of which runs the same program on its assigned block. If the overhead associated with parallelism is ignored, the processing speed increases by
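The data-parallel scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the paper's MPI implementation: the data are split into blocks, each worker runs the same routine on its block, and the partial results are combined. The chunking scheme and the stand-in `process_chunk` function are assumptions for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for the per-block work each processor performs; here a sum.
    return sum(chunk)

data = list(range(100))
p = 4                                    # number of "processors"
chunks = [data[i::p] for i in range(p)]  # map data blocks to workers
with ThreadPoolExecutor(max_workers=p) as pool:
    partials = list(pool.map(process_chunk, chunks))
print(sum(partials))  # 4950, identical to the serial result
```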

Acceleration coefficient.
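Assuming the paper follows the standard definition, the acceleration (speedup) coefficient is the ratio of serial to parallel running time: with \(T_s\) the running time of the serial algorithm and \(T_p\) the running time on \(p\) processors,

\[ S_p = \frac{T_s}{T_p}, \qquad S_p \le p, \]

where \(S_p = p\) is the ideal linear speedup obtained when parallel overhead is ignored.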

The design idea of divide-and-conquer is to decompose a big problem that is difficult to solve directly into smaller subproblems of the same form and solve them separately. The subproblems generated by the divide-and-conquer method are often smaller instances of the original problem, and the decomposition continues until each subproblem is small enough to be solved directly. Literature [

The divide-and-conquer method for finding the core attributes is based on the algorithm for finding the positive domain. In the division of the domain, the decision table
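The recursive division scheme can be sketched as follows. This is a generic illustration of divide-and-conquer positive-region computation under the usual rough-set definitions, not the cited algorithm verbatim: the current block of rows is partitioned on the first attribute and the remaining attributes are handled recursively; a block with no attributes left lies in the positive region iff its decision values agree.

```python
def pos_dc(rows, attrs, d=-1):
    """Divide-and-conquer positive region: partition on the first
    attribute, recurse on the rest; a leaf block is in the positive
    region iff all its decision values are identical."""
    if not attrs:
        return list(rows) if len({r[d] for r in rows}) == 1 else []
    parts = {}
    for r in rows:
        parts.setdefault(r[attrs[0]], []).append(r)
    pos = []
    for block in parts.values():
        pos += pos_dc(block, attrs[1:], d)
    return pos

# Toy decision table: rows 0 and 3 agree on both condition attributes
# but disagree on the decision, so they fall outside the positive region.
table = [(0, 0, 'no'), (0, 1, 'yes'), (1, 1, 'yes'), (0, 0, 'yes')]
print(len(pos_dc(table, [0, 1])))  # 2
```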

The divide-and-conquer method for finding core attributes relies on recursion. Converting the recursion to a nonrecursive form is difficult in program design, there is no concurrency within the divide-and-conquer process, and parallel attribute reduction is therefore hard to achieve. To overcome this limitation, we improved the method and designed a classification method that supports parallelism.

Finding the positive domain of a decision table is in fact a classification process: all instances in the decision table are classified according to a given attribute set, and the classification results are then processed as follows. Within each category, the decision attributes of all instances are compared; if they are all the same, the instances of that category are added to the positive domain. How quickly the classification can be carried out therefore determines the performance of the algorithm.
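The classification process just described can be sketched in Python. This is a minimal illustration of the idea, not the paper's index-array implementation: instances are grouped by their values on the chosen attributes in one pass, and a group contributes to the positive domain only when its decision values are all identical.

```python
from collections import defaultdict

def positive_region(table, attrs, d=-1):
    """Classification-based positive region: group instances by their
    values on `attrs`, then keep every group whose decision values
    all agree."""
    groups = defaultdict(list)
    for i, row in enumerate(table):
        groups[tuple(row[a] for a in attrs)].append(i)
    pos = []
    for members in groups.values():
        if len({table[i][d] for i in members}) == 1:  # consistent class
            pos += members
    return sorted(pos)

table = [(0, 0, 'no'), (0, 1, 'yes'), (1, 1, 'yes'), (0, 0, 'yes')]
print(positive_region(table, [0, 1]))  # [1, 2]; rows 0 and 3 conflict
```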

Let

Input: Decision table

Output:

Step 1: let

Step 2: Initialize the attribute index array

Step 2.1 for each

Step 2.2 for each

Step 2.3 for each

Step 2.4

Step 2.5

Step 3: for each

Step 4: if

Step 5:

Step 6: return

In the process of finding the positive domain by the classification method, a preallocated structure array is used in consideration of space and time complexity; its size is

According to the definition of core attribute, let

Therefore, in the process of finding the core attributes, we first need to find the positive domain of all attributes; the purpose of this is to guard against inconsistencies in the decision table. We then find the positive domain, which is named

Step 1: find the positive domain of all attributes

Step 2:

Step 3:

In the process of finding the positive domain based on classification, for each attribute, we use the index method to traverse all categories
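The serial core-finding procedure above can be sketched directly from the definition of a core attribute. This is an illustrative Python sketch, not the paper's index-based implementation: an attribute belongs to the core iff removing it shrinks the positive region of the full attribute set.

```python
from collections import defaultdict

def pos_size(table, attrs, d=-1):
    """Number of instances in the positive region of `attrs`."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in attrs)].append(row[d])
    return sum(len(ds) for ds in groups.values() if len(set(ds)) == 1)

def core_attributes(table, attrs, d=-1):
    """a is a core attribute iff dropping it shrinks the positive
    region of the full attribute set."""
    full = pos_size(table, attrs, d)
    return [a for a in attrs
            if pos_size(table, [b for b in attrs if b != a], d) != full]

table = [(0, 0, 0, 'no'), (0, 1, 0, 'yes'),
         (1, 1, 0, 'yes'), (1, 0, 1, 'no')]
print(core_attributes(table, [0, 1, 2]))  # [1]
```

On this toy table, only attribute 1 separates the conflicting rows 0 and 1, so it alone forms the core.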

The core of concurrent computing is to seek concurrency. We improve the traditional serial attribute reduction algorithm for decision tables and divide the attribute reduction process into three parts: the core attribute calculation stage, the attribute expansion stage, and the attribute compression stage. All three parts repeatedly calculate the relative positive domain of an attribute set, and these calculations are relatively independent, so all three stages exhibit good concurrency and can be implemented by parallel algorithms.

Although the divide-and-conquer method divides the original problem into small independent problems, its recursive structure and dynamic partitioning cannot be realized well by a parallel algorithm. The classification algorithm, by contrast, calculates the relative positive domain of each attribute independently, so it can be realized by parallel computing. In the attribute expansion and attribute compression stages of the decision table, the relative positive domain of the attribute set must be calculated repeatedly regardless of whether divide-and-conquer or classification is used, so this part can also be realized by parallel computation.
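The independence of the per-attribute positive-domain calculations can be demonstrated with a small sketch. The paper uses MPI processes; here a thread pool stands in for them purely for illustration, since each check only reads the shared table and depends on no other check.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def pos_size(table, attrs, d=-1):
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in attrs)].append(row[d])
    return sum(len(ds) for ds in groups.values() if len(set(ds)) == 1)

def parallel_core(table, attrs, d=-1, workers=3):
    """Run each attribute's core check in its own task, mirroring how
    each MPI process in the paper handles its assigned attributes."""
    full = pos_size(table, attrs, d)
    def is_core(a):
        return pos_size(table, [b for b in attrs if b != a], d) != full
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(is_core, attrs))
    return [a for a, f in zip(attrs, flags) if f]

table = [(0, 0, 0, 'no'), (0, 1, 0, 'yes'),
         (1, 1, 0, 'yes'), (1, 0, 1, 'no')]
print(parallel_core(table, [0, 1, 2]))  # [1]
```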

According to the concurrency in the attribute reduction process, the following parallel attribute reduction algorithm can be obtained and the specific implementation is shown in Algorithm

Input: decision table

Output: A relative attribute reduction

Stage 1: Find core attributes in parallel

Step 1: Each process assigns attributes as

Step 2: Each process computes the core attribute

Step 3: Each process exchanges the core attribute with each other, and obtains the final core attribute.

Stage 2: Prejudge whether the core attribute is the result of reduction

Stage 3: Attribute expansion stage

Step 1: Calculate the attributes to be added

Step 2: Each process is assigned attributes according to

Step 3: Each process computes the best attributes to be added

Step 4: Each process sends the results of the calculations in Step 3 to the main process.

Step 5: The main process receives calculation results of each process and calculates the attribute that is best to be added.

Then the main process distributes the results,

Step 6: Each process accepts the calculation results of the main process and updates the reduction results.

Step 7: If

else goto Stage 3, Step 2;

Stage 4: Attribute compression stage

Step 1: Calculate the attributes that need to be checked,

Step 2: Each process checks whether

else send -1 to the main process.

Step 3: The main process receives the calculation results of the subprocesses, selects one attribute to compress, and distributes the results.

Step 4: Each process updates the compression results

Step 5:

else goto Stage 4, Step 2.
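Stripped of the interprocess communication, the four stages above amount to a greedy reduction loop. The following serial Python sketch illustrates that logic under the usual rough-set definitions; the data structures and helper names are assumptions, and the MPI distribution of work is omitted.

```python
from collections import defaultdict

def pos_size(table, attrs, d=-1):
    """Number of instances in the positive region of `attrs`."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in attrs)].append(row[d])
    return sum(len(ds) for ds in groups.values() if len(set(ds)) == 1)

def attribute_reduction(table, cond, d=-1):
    full = pos_size(table, cond, d)
    # Stage 1: core attributes (dropping one shrinks the positive region).
    red = [a for a in cond
           if pos_size(table, [b for b in cond if b != a], d) != full]
    # Stage 3: expansion - greedily add the attribute that enlarges the
    # positive region most, until the reduct preserves it fully.
    while pos_size(table, red, d) != full:
        best = max((a for a in cond if a not in red),
                   key=lambda a: pos_size(table, red + [a], d))
        red.append(best)
    # Stage 4: compression - drop any attribute that became unnecessary.
    for a in list(red):
        if pos_size(table, [b for b in red if b != a], d) == full:
            red.remove(a)
    return red

table = [(0, 0, 0, 'no'), (0, 1, 0, 'yes'),
         (1, 1, 0, 'yes'), (1, 0, 1, 'no')]
print(attribute_reduction(table, [0, 1, 2]))  # [1]
```

On this toy table the core alone already preserves the positive region, so the expansion and compression stages are no-ops; on larger tables they do the bulk of the work.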

Let

The time complexity of the parallel attribute reduction algorithm based on MPI is related to the number of parallel processes. In this paper, we assume that the number of parallel processes is

In order to verify the effectiveness of the parallel attribute reduction algorithm proposed in this paper, we conducted multiple sets of comparative experiments. The data set is the KDDCUP99 intrusion detection data [

The test is divided into two stages. The first stage tests the nonparallel performance of the original divide-and-conquer algorithm and of the classification-based attribute reduction algorithm proposed in this paper, as well as the difference between them. We first randomly select 10%, 20%, …, 100% of the 1 million records to generate new data sets, then test the performance of the two algorithms in computing the positive domain, the attribute core, and the attribute reduction on the different data sets, and draw conclusions through comparative analysis. The second stage tests the performance of the two algorithms in a parallel environment: the data sets from the first stage are reused, the performance of the two algorithms in computing the positive domain, the attribute core, and the attribute reduction is tested on each data set, and conclusions are again drawn through comparative analysis.

The experiments were carried out on five computers using MPICH.NT.1.2.5 and the VS 2008 development tools in a Windows environment. Each system was configured with Windows XP, a Pentium 4 CPU with a clock frequency of 2.93 GHz, and 1 GB of memory.

Nonparallel testing mainly measures the performance of the two algorithms and analyzes the reasons for the difference between them. As described above, subsets of the 1 million records are selected in 10% increments. In order to reduce unnecessary errors, each algorithm was run 5 times on each data set, and the average running time was taken as the running time for that data set. Table

Attribute reduction results of the two algorithms in nonparallel testing.

Data set size (records) | Divide-and-conquer running time (s) | Classification running time (s) | Reduction result |
---|---|---|---|
97968 | 35.424784 | 40.115938 | 9/41 |
195936 | 64.141747 | 74.01378 | 15/41 |
293904 | 72.875936 | 90.239001 | 19/41 |
391872 | 89.184722 | 113.409269 | 19/41 |
489840 | 105.541708 | 136.741713 | 19/41 |
587808 | 121.805281 | 159.759949 | 19/41 |
685776 | 139.719959 | 183.720615 | 19/41 |
783744 | 159.584528 | 212.715617 | 20/41 |
881712 | 176.411368 | 263.181288 | 20/41 |
979680 | 260.774013 | 329.194148 | 21/41 |

Running time of the two algorithms in nonparallel testing on different data sets.

It can be observed from Table

In parallel testing, in order to maintain consistency with the nonparallel tests, the same data sets are used; the algorithms run over the MPI communication protocol on PCs of identical performance, and as before, the average running time over 5 runs is reported. Table

Attribute reduction results of the two algorithms in parallel testing.

Data set size (records) | Divide-and-conquer running time (s) | Classification running time (s) | Reduction result |
---|---|---|---|
97968 | 35.166193 | 26.079651 | 9/41 |
195936 | 65.103365 | 45.734677 | 15/41 |
293904 | 80.939468 | 62.347393 | 19/41 |
391872 | 99.040613 | 77.326405 | 19/41 |
489840 | 117.440574 | 87.380716 | 19/41 |
587808 | 135.404997 | 101.736253 | 19/41 |
685776 | 153.235049 | 117.03368 | 19/41 |
783744 | 173.126772 | 133.629863 | 20/41 |
881712 | 198.306259 | 148.357779 | 20/41 |
979680 | 256.82359 | 207.596716 | 21/41 |

Running time of the two algorithms in parallel testing on different data sets.

Comparison of the two algorithms in nonparallel and parallel testing.

It can be clearly observed from Table

In this paper, starting from the goal of improving the efficiency of rough set knowledge acquisition algorithms and combining the ideas of divide-and-conquer, classification, and parallelism, a parallel attribute reduction algorithm based on rough sets is proposed. By improving the divide-and-conquer method, which does not support parallelism, into a classification method that does, the processing efficiency of attribute reduction in a parallel environment can be greatly improved. Experimental results show that the algorithm proposed in this paper is well suited to attribute reduction on massive data sets.

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

This work was supported in part by the Science and Technology Key Project of Henan Province under Grant no. 202102210370.