
At present, the explosive growth of data and mass-scale storage have brought clustering research many problems, such as high computational complexity and insufficient computing power. Distributed computing platforms dynamically configure large numbers of virtual computing resources through load balancing, effectively breaking through the bottlenecks of time and energy consumption, and show unique advantages in massive data mining. This paper studies the parallel

Data consumption and services have become the mainstream of today’s information age [

However, data analysis and knowledge discovery face greater challenges. Data mining usually uses algorithms to uncover the deep meaning hidden beneath the explicit features of massive data [

Whether in traditional data mining or in data analysis in a big data environment, clustering, as a basic process of automatically categorizing unknown data, can be used both in the data preprocessing stage and in the data-mining process itself. However, in the big data environment, cluster analysis faces many challenges. Some of these challenges are inherent to the clustering algorithm, while others are caused by the complex data environment. These challenges bring new difficulties to cluster analysis in the big data environment, including the ability to handle diversified data types, ultra-high-dimensional data, and unevenly distributed data; the iterative execution efficiency of clustering algorithms; and the scalability of the algorithms and the evaluation model for clustering quality, among many other issues. The

In summary, in a big data environment, data are massive, sparse, and high-dimensional. Moreover, big data processing platforms based on distributed systems provide abundant computing and storage resources for processing massive information. How to use the computing power of a distributed parallel computing framework to improve the mining quality of traditional data mining algorithms and to provide more timely data analysis services in a complex big data environment has become an urgent problem. In this paper, in view of many problems in

With the development of the Internet in recent years, the information that people can access has increased exponentially. How to obtain knowledge from massive amounts of information is one of the current research focuses in computer and information science. As an important branch of data mining, clustering has gradually attracted widespread attention in recent years. Compared with other data mining methods, clustering has the advantage of requiring no prior knowledge: knowledge can be obtained from the natural distribution of the data [

Cluster analysis can divide the data set into several clusters [

In recent years, based on the big data platform, there has been a lot of research work to implement the traditional data mining algorithm in parallel on the distributed platform and optimize the algorithm according to actual needs. Therefore, in response to the problem of selecting the initial center, Kumar et al. [

However,

In summary,

MapReduce is a parallel programming model. MapReduce programs are often used to process massive amounts of data in parallel. The design idea is divide and conquer: the processing of a single flood-like data source is converted into the simultaneous processing of multiple small data sources [

Operation flow chart of MapReduce.

There are four entities at the top of the whole model. The client is mainly responsible for submitting jobs to the MapReduce framework. The JobTracker is solely responsible for scheduling the job's execution. The TaskTracker is responsible for processing the input splits and executing specific tasks. The Hadoop Distributed File System (HDFS) provides the actual storage service and shares the resources required by the job with all nodes.
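The divide-and-conquer flow described above can be illustrated with a minimal in-memory sketch (plain Python, not the Hadoop API; the function names here are illustrative, not from the paper):

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Each mapper emits (key, value) pairs from its own input split.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # The framework groups intermediate values by key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each reducer aggregates all values that share one key.
    return {key: sum(values) for key, values in grouped.items()}

splits = ["big data mining", "data clustering on big data"]
pairs = list(chain.from_iterable(map_phase(s) for s in splits))
counts = reduce_phase(shuffle(pairs))
```

Each split is processed independently in the map phase, which is what lets the framework spread the work over many TaskTrackers.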

As a typical algorithm for calculating clusters,

To facilitate the description, this paper introduces a symbol

The center point can be defined using the following equation:

$$u_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i,$$

where $n_j$ refers to the number of samples in the same class $C_j$.

The convergence flag can be computed using the following formula:

$$E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - u_j \rVert^2,$$

where iteration stops when the change in $E$ between rounds falls below a given threshold.
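The two quantities above can be sketched in a few lines (plain Python; `centroid` and `sse` are illustrative names, not from the paper):

```python
def centroid(cluster):
    # u_j: the component-wise mean of the n_j points in one cluster.
    n = len(cluster)
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / n for d in range(dim))

def sse(clusters, centers):
    # E: the sum of squared Euclidean distances from every point to its
    # cluster's center, used as the convergence flag between iterations.
    return sum(
        sum((x - u) ** 2 for x, u in zip(point, center))
        for cluster, center in zip(clusters, centers)
        for point in cluster
    )

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
centers = [centroid(c) for c in clusters]
```

Running one iteration amounts to recomputing `centers` and checking how much `sse(clusters, centers)` changed.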

Due to the influence of the initial values and of outliers, the results are not stable from run to run.

It is easy to converge to a local optimal solution.

The number of clusters needs to be preset.

The cluster center U does not necessarily belong to the data set.

In order to solve these problems, we have improved

Traditional

At present, there are two kinds of random sampling methods: traversal sampling and byte offset sampling. The characteristic of traversal sampling is that the original data are still scanned one by one during sampling, without any other processing, which is time-consuming, especially when the data set is large; the sample is random, but the amount of work is still huge. Therefore, this method cannot be used for the data in this article. Byte offset sampling can process data in large batches, but the algorithm is not efficient.

To obtain higher efficiency, we propose a parallel random sampling method based on the above methods. Because the method operates on parallel units, it is more efficient and less time-consuming. The sampling procedure is as follows:

First, assign a value to every data item and, at the same time, process all items uniformly into a keyed format.

Then, sort the keyed data from largest to smallest.

Finally, select the smallest data after sorting as the initial cluster center points. The calculation formula is as follows:
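The three steps above can be sketched as follows (plain Python, with a loop over partitions standing in for the parallel units; the random key-assignment scheme is an assumption, since the paper's exact formula is not shown here):

```python
import random

def assign_keys(partition, seed):
    # Step 1: give every record a random key, in (key, record) format.
    rng = random.Random(seed)
    return [(rng.random(), record) for record in partition]

def sample_centers(partitions, k, seed=0):
    # Each partition is keyed independently (on MapReduce these would
    # run on separate units); results are merged and sorted by key.
    keyed = []
    for i, part in enumerate(partitions):
        keyed.extend(assign_keys(part, seed + i))
    keyed.sort(key=lambda kv: kv[0], reverse=True)  # Step 2: descending
    # Step 3: take the k smallest keys (the tail) as initial centers.
    return [record for _, record in keyed[-k:]]

data = [[float(i), float(i)] for i in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]
centers = sample_centers(partitions, k=3)
```

Because the keys are uniform random numbers, the k records with the smallest keys form a uniform random sample of the whole data set.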

Traditional

The map function in MapReduce is used to map the data. First, the independence of the data is exploited to map the data to different Reducer units in keyed form. Then, parallel clustering computation is carried out in these separate units.

In this way, the independence between data items can be exploited effectively, and at the same time the clustering can be accelerated by processing on multiple units simultaneously.
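A sketch of this keyed mapping (plain Python; in the real system the framework routes each keyed record to the Reducer responsible for its cluster):

```python
def nearest_center(point, centers):
    # Squared Euclidean distance decides which cluster a point joins.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda j: dist2(point, centers[j]))

def map_to_reducers(points, centers):
    # Mapper: emit (cluster_id, point); the framework would then ship
    # each key's records to its own Reducer unit for parallel processing.
    buckets = {j: [] for j in range(len(centers))}
    for p in points:
        buckets[nearest_center(p, centers)].append(p)
    return buckets

centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 9.0), (0.5, 0.0)]
buckets = map_to_reducers(points, centers)
```

Each bucket corresponds to one Reducer's input, so the per-cluster work proceeds on independent units.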

Generally speaking, data can be mapped by the Mapper process to the Reducer of its respective cluster according to distance. In order to adaptively obtain the corresponding Reducer for each cluster, we set the value of the parallelism as

Reducer parallelization structure.

First, each cluster has its own Reducer, so we execute the parallelization strategy over all of the data. Then, we take all of the chosen data for processing and set the center point of the cluster. Next, the Euclidean distance between each data point and the current center point is computed. Finally, the point with the minimum sum of squares is selected as the new center point.

We use the characteristics of MapReduce to optimize the calculation of the minimum Euclidean distance: the comparison function computes an ordering over the keys, so intermediate results can be sorted by key. This sort mechanism not only simplifies the problem faced by the algorithm but also exploits the computing power of the distributed cluster and speeds up the selection of the minimum Euclidean distance sum (Figure
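The key-sorted selection can be sketched like this (plain Python; a sorted list stands in for MapReduce's key-sort mechanism, and the names are illustrative):

```python
def sum_sq_dist(candidate, members):
    # Sum of squared Euclidean distances from one candidate to all members.
    return sum(
        sum((x - y) ** 2 for x, y in zip(candidate, m)) for m in members
    )

def select_center(members):
    # Reducer: emit (distance_sum, candidate) pairs and let the sort
    # mechanism order them by key; the first key after sorting is the
    # minimum, so the new center falls out without an explicit min-scan.
    keyed = sorted((sum_sq_dist(c, members), c) for c in members)
    return keyed[0][1]

cluster = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
new_center = select_center(cluster)
```

Because the candidates are all cluster members, the new center always belongs to the data set, unlike the mean-based update.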

Sort mechanism of MapReduce.

Structure diagram of our method.

To imitate the real environment, a total of 6 PCs are used.

Hardware configuration: AMD Athlon (TM) X4 with 3.10 GHz CPU, 4 GB memory, and 500 GB disk space.

Software environment: Linux operating system is CentOS, and JAVA, ZooKeeper, Hadoop, and dBase are also used.

To test performance, a data set is used, and the modified data set is divided into 5 groups. Each group has 50 samples, each with 4 attributes. The experiment generated a total of 5 parts. Details of the random data sets can be seen in Table

The data set we used.

Data sets | Size (MB) | Items | Dimension | Cluster center points
---|---|---|---|---
Data set 1 | 0.32 | 9,600 | 4 | 5
Data set 2 | 112 | 9,600,000 | 4 | 5
Data set 3 | 401 | 28,800,000 | 4 | 5
Data set 4 | 1,421 | 67,100,000 | 4 | 5
Data set 5 | 3,267 | 173,560,000 | 4 | 5

The effectiveness of our method is verified by comparing the convergence between the traditional

Convergence speed comparison.

To verify the accuracy of the primitive

Accuracy rate comparison experiment.

Recall rate comparison experiment.

Accuracy and recall remain higher than those of the comparison method as the size of the data continues to expand across the different data sets. Primitive

In order to verify the advantages of MapReduce distributed clusters over Spark in iterative computing, this experiment was performed on MapReduce and Spark clusters. The primitive

In Figure

Time cost comparison experiment.

Our algorithm uses a parallel structure, so we use the speedup ratio to verify its real-time performance. Its calculation formula is as follows:

$$S_p = \frac{T_1}{T_p},$$

where $T_1$ is the running time on a single node and $T_p$ is the running time on $p$ nodes.
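Assuming the standard definition of speedup (single-node running time divided by p-node running time), the computation is straightforward (the timings below are hypothetical, not the paper's measurements):

```python
def speedup(t_single, t_parallel):
    # S_p = T_1 / T_p: how many times faster the p-node run completes.
    return t_single / t_parallel

# Hypothetical wall-clock seconds keyed by node count.
times = {1: 120.0, 2: 65.0, 4: 36.0}
ratios = {p: speedup(times[1], t) for p, t in times.items()}
```

A ratio close to p at p nodes would indicate near-linear scaling.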

Speedup test.

The speedup ratio increases as the amount of data grows. This indicates that our method improves efficiency and can be applied to big data sets.

The random selection of the center point by

Although our method can deal with large-scale data, it still has problems when dealing with high-dimensional data sets. Therefore, our next research plan is to further improve our algorithm so that it can adapt to high-dimensional data sets.

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.