Design of Distributed Human Resource Management System of Spark Framework Based on Fuzzy Clustering

The construction of a human resource management system is a key part of enterprise management and control. A complete human resource management system is conducive to the long-term development of enterprises. To improve the current state of enterprise human resource management, this paper proposes a distributed human resource management system based on the Spark framework. To address the disadvantages of the traditional k-means algorithm in processing massive data, such as low computational efficiency and high time complexity, an improved k-means algorithm based on the Spark computing framework is proposed. By using the spatial location relationship between data points and the cluster centers, redundant calculation is reduced and the system's ability to process massive data is improved. Combined with the actual situation of the enterprise, the architecture of the human resource management system is designed with Java EE to support human-computer interaction. The proposed system supports user management, employee information, attendance, evaluation, performance, salary, personnel change, and other business management. The experimental results demonstrate that the system can effectively reduce the time complexity of calculation and improve system efficiency.


Introduction
With the continuous improvement of social and economic development, many enterprises have begun to expand their teams. The rapid development of cloud computing, big data, and Internet technology is also moving society into an intelligent information era. Many enterprises develop distributed systems for all kinds of business, such as collaborative office [1], document management [2], financial management [3], and other systems [4]. Talent is an important cornerstone of enterprise development. It is therefore of great significance to design a human resource management system that improves the working efficiency of talent through reasonable human resource management [5]. To remove the obstacles that traditional human resource management places on the development and expansion of enterprises, and to provide an efficient enterprise human resource development platform, it is necessary to design a complete human resource management system [6]. In short, the construction of a human resource management system is a key component of enterprise management and control. It needs to be fully combined with the enterprise's other systems, and its starting point is the business form and long-term development strategy of the company [7]. At the same time, the design must make full use of the existing human resources of the enterprise. Based on an analysis of the performance requirements, the system design needs to consider the business process, data process, data dictionary, use case constraints, etc.
The k-means algorithm is an unsupervised learning algorithm and has become one of the most widely used clustering algorithms. With the rapid development of open-source distributed computing frameworks, clustering algorithms based on distributed computing platforms can effectively solve the problem of memory overflow in single-machine mode [8,9]. This direction has become a research hotspot. For the problem of parallelizing algorithms on large data sets, many scholars have optimized and implemented the algorithm under the distributed framework of MapReduce [10,11]. By contrast, research on optimization methods under the Spark framework is relatively scarce.
To address the high time complexity of the k-means algorithm [12], literature [13] introduced the Canopy algorithm and the maximum and minimum distance method on the Spark platform. The convergence speed of the algorithm was improved, but the large amount of redundant computation was not fundamentally solved. Literature [14] proposed a clustering algorithm based on the distance triangle inequality in a cloud computing framework. However, the optimization strategy using the triangle inequality principle needs to save the upper and lower bound information of each data point, which is difficult to implement fully in the Spark framework. Literature [15] compared the operating efficiency of the k-means algorithm based on MapReduce and Spark. Its experimental results show that the Spark framework runs more efficiently for algorithms that need repeated iteration.
The innovations and contributions of this paper are listed below.
(1) In order to solve the problem of the large amount of redundant calculation in the k-means algorithm, this paper introduces the spatial location relationship between grid cells and clustering centers.
(2) Because the triangle inequality optimization strategy is also considered in this paper, the redundant distance calculation is greatly reduced.
The improved algorithm is implemented in parallel under the Spark framework, which improves the processing power for large data sets. Finally, the effectiveness of the proposed algorithm is verified by experimental analysis.
The structure of this paper is as follows. The distributed k-means optimization algorithm based on the Spark framework is described in the next section. The proposed system is presented in Section 3. Section 4 focuses on the experiment and analysis. Section 5 is the conclusion.

Distributed k-Means Optimization Algorithms Based on Spark Framework

2.1. Spark Distributed Framework.
Spark is a commonly used distributed computing platform that can effectively process massive data analysis. It is a distributed computing framework built on the Resilient Distributed Dataset (RDD) abstraction and was initiated by the AMPLab of UC Berkeley [16].
In the overall structure of Spark, each Spark application uses a driver program to initiate parallel operations on the cluster. The driver program can manage multiple executor nodes simultaneously. In a distributed cluster environment, multiple worker nodes can read data from the HDFS file system and convert it to an RDD. An RDD is an immutable, distributed collection of objects. Each RDD is divided into multiple partitions, and these partitions are processed on different executor nodes.
Compared with the disk-based MapReduce calculation mode, Spark does not need to save the intermediate results of each iteration to disk, so it computes more efficiently. Spark-based algorithms have good scalability and adapt better to large-scale data sets. The advantage is more obvious for clustering algorithms that need many iterations.

k-Means Algorithm.
The process of the traditional k-means algorithm [17] is as follows.
Input: number of clusters z, data set d.
Output: z class clusters.
The algorithm steps are as follows.
(1) Select z points from data set d as the initial clustering centers.
Here z is the number of class clusters, and $c_x$ is the cluster center of the x-th class cluster $D_x$.
The time complexity of the k-means algorithm is $O(tzw)$, where t is the number of data points, z is the number of class clusters, and w is the number of iterations. This cost is relatively high and grows with the number of class clusters z.
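For orientation, the textbook form of the procedure can be sketched as follows. This is a minimal single-machine illustration of standard Lloyd-style k-means, not the authors' implementation; the function names `squared_distance` and `kmeans`, the `seed` and `max_iter` parameters, and the stopping rule (centers no longer change) are assumptions made for the example. The nested work in the assignment step is where the $O(tzw)$ cost comes from: every one of the t points is compared with all z centers in each of the w iterations.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, z, max_iter=100, seed=0):
    """Standard Lloyd-style k-means; returns (centers, labels, sse)."""
    rng = random.Random(seed)
    centers = rng.sample(points, z)  # step (1): pick z initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point is compared with all z centers.
        labels = [min(range(z), key=lambda x: squared_distance(p, centers[x]))
                  for p in points]
        # Update step: each center c_x becomes the mean of its cluster D_x.
        new_centers = []
        for x in range(z):
            members = [p for p, lab in zip(points, labels) if lab == x]
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[i] for m in members) / len(members) for i in range(dim)))
            else:
                new_centers.append(centers[x])  # keep the center of an empty cluster
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    # Sum of squared errors of the final assignment.
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse
```

On six points forming two well-separated groups of three, `kmeans(points, 2)` returns one center per group and equal labels within each group.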

k-Means Optimized by Spatial Information.
In every iteration of the k-means algorithm, each data point needs to calculate its distance to all k cluster centers. This calculation is highly redundant, especially when k is large, and it strongly affects the time efficiency of the algorithm. The more effective strategy for reducing this redundancy is the triangle inequality method, which greatly reduces the amount of computation without changing the clustering results of the k-means algorithm. However, in a single iteration, several distance calculations are still needed for each data point to find the nearest class cluster. In fact, as k increases, the number of calculations for a single data point gradually increases, so the redundancy of calculation is positively correlated with k. To address this shortcoming of the triangle inequality optimization strategy, the spatial position information of data points is further considered to reduce the redundant computation.
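The triangle inequality idea can be illustrated with one common pruning rule: if the distance between the current best center and another center is at least twice the distance from the point to the current best center, the other center cannot be strictly closer, so its distance need not be computed. The sketch below is an illustration of that rule only, under the assumption that pairwise center distances are precomputed once per iteration; the function names are hypothetical, and the paper does not state which variant of the triangle inequality strategy it applies.

```python
import math


def pairwise_center_distances(centers):
    """Distance matrix between all cluster centers (computed once per iteration)."""
    return [[math.dist(a, b) for b in centers] for a in centers]


def nearest_center_pruned(x, centers, center_dist):
    """Return (index of the nearest center, number of point-to-center distances computed)."""
    best = 0
    best_d = math.dist(x, centers[0])
    computed = 1
    for b in range(1, len(centers)):
        # If d(c_best, c_b) >= 2 * d(x, c_best), then d(x, c_b) >= d(x, c_best),
        # so c_b cannot be strictly closer and its distance is skipped.
        if center_dist[best][b] >= 2.0 * best_d:
            continue
        d = math.dist(x, centers[b])
        computed += 1
        if d < best_d:
            best, best_d = b, d
    return best, computed
```

The second return value makes the saving visible: for a point lying close to one center and far from the rest, only a single distance is computed, yet the selected center is the same as with an exhaustive scan.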

Spatial Position Relationship.
In order to further reduce the time complexity beyond what the triangle inequality strategy achieves, the algorithm in this paper introduces the spatial position relationship between data points and clustering centers in k-means clustering. The basic idea is as follows. For any data point, if its spatial position relationship with the k cluster centers is known, then the cluster center closest to it can be determined accurately. Instead of performing k distance calculations, the data point only needs to be assigned to the corresponding class. Therefore, it is necessary to design a method that can efficiently save the spatial position relationship between all data points and the k cluster centers. Because grid partitioning is highly efficient, grid cells are an appropriate structure for storing the spatial location information of data points.

Establish Spatial Location Information of Grid and Class Cluster.
First, the data set is meshed with a certain partition width in each dimension. Then, the position relationship between each grid containing data points and the k cluster centers is judged.
Taking two-dimensional data as an example, for any grid C, the method to determine the spatial position relationship between grid C and the k clustering centers is as follows.
(1) First, find the cluster center closest to the center of grid C among the k clustering centers and call it A, at distance $d_1$. Let r be the radius of the circle that circumscribes the grid and is centered at the grid center. Then the maximum distance from any point in grid C to A is $d_1 + r$.
(2) Then, calculate the distance to the other k − 1 clustering centers in turn. Taking B as an example, let the distance between B and the center of grid C be $d_2$. The minimum distance from any point in grid C to B is $d_2 - r$. Suppose the following inequality is satisfied: $d_2 - r > d_1 + r$. That is, the closest distance between any point in grid C and B is still greater than the furthest distance between any point in grid C and A, so no point in grid C can belong to class cluster B.
If the above inequality is not satisfied, some points inside grid C may belong to B. Grid C needs to record all class clusters to which its points may belong. In fact, when the number of grids is large enough, the vast majority of grids will have only one candidate class cluster. Only a small number of grids will have two or more candidate class clusters, so the average number of candidate class clusters per grid is slightly more than 1. By establishing the spatial location relationship between each grid and the k cluster centers, the location relationship between all data points and the k cluster centers is obtained.
Journal of Sensors
In grid division, the number of segments in one dimension is jNum. According to the coordinates of a data point w, the grid position $(i', j')$ of w can be quickly obtained by the following equation.
where δ is a positive number less than 1.
The relationship between w and the class clusters can then be determined from the relationship between the grid in which data point w resides and the k cluster centers, namely, which cluster centers w may belong to. In this way, distance calculation between w and all cluster centers can be avoided.
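The grid test described above can be sketched in a few lines. The code keeps a cluster as a candidate for a grid unless $d_2 - r > d_1 + r$, with r taken as half the diagonal of a grid cell. Several details are assumptions made for illustration and are not taken from the paper: the helper names, the use of a fixed cell `width`, and in particular the indexing rule `floor((coordinate - minimum) / width)`, which stands in for the paper's own grid position equation (with its δ term) that is not reproduced here.

```python
import math


def cell_index(point, mins, width):
    """Grid cell of a point: floor((coordinate - minimum) / width) per dimension."""
    return tuple(int(math.floor((c - m) / width)) for c, m in zip(point, mins))


def cell_center(index, mins, width):
    """Coordinates of the center of the grid cell with the given index."""
    return tuple(m + (i + 0.5) * width for i, m in zip(index, mins))


def candidate_clusters(center_of_cell, r, centers):
    """Clusters that may own a point of the cell; B is dropped when d2 - r > d1 + r."""
    dists = [math.dist(center_of_cell, c) for c in centers]
    d1 = min(dists)
    return [b for b, d2 in enumerate(dists) if d2 - r <= d1 + r]


def build_grid_table(points, centers, width):
    """Map every non-empty grid cell to its list of candidate clusters."""
    dim = len(points[0])
    mins = tuple(min(p[i] for p in points) for i in range(dim))
    r = width * math.sqrt(dim) / 2.0  # half diagonal of a cell
    table = {}
    for p in points:
        idx = cell_index(p, mins, width)
        if idx not in table:
            table[idx] = candidate_clusters(cell_center(idx, mins, width), r, centers)
    return mins, table
```

With the table built, a data point is assigned by looking up `table[cell_index(p, mins, width)]` and comparing distances only against that short list. Because the test is conservative, the true nearest center is always in the list.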

k-Means Optimization Algorithm in Spark Framework.
The algorithm flow of this paper is shown in Figure 1. The algorithm is implemented in parallel under the Spark framework. First, the SparkContext is initialized, and the data set, the number of cluster centers k, and the maximum number of iterations are determined. The data set is then read from the HDFS (Hadoop Distributed File System) file system and converted into an RDD. The Spark cluster partitions the RDD based on the SparkContext information. Each partition is processed on an executor node, which performs grid partitioning and establishes the spatial information between the grids and the k cluster centers. The initial cluster centers are then selected, and the iterative calculation starts.
In each iteration, a mapPartitions operation is first applied to each RDD partition to assign every data point to a cluster. The grid of a data point is obtained from its coordinates. Then, the relationship between grid and class cluster is used to determine the $t$ ($t \le k$) class clusters to which the data point may belong. The triangle inequality method is used to find the nearest class among these t cluster centers (instead of k), which reduces the redundant distance calculation. After all data points are assigned, the partial cluster results from the different executor nodes are combined by a reduceByKey operation. The k new clustering centers are obtained, and the spatial relationship between grid and class cluster is updated. Finally, the sum of squared errors is calculated to decide whether another iteration is needed.
The key point of the optimization strategy is to use the grid structure to preserve the spatial relationship between all data points and k cluster centers and obtain the possible belonging class cluster of any data points according to this relationship. The effect of the proposed algorithm is similar to that of the triangle inequality, which does not change the final cluster center of k-means algorithm after clustering. This strategy can further reduce the amount of computation based on the application of triangle inequality.
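The data flow of one iteration can be imitated without a cluster. The sketch below is plain Python that mimics the roles of mapPartitions (per-partition assignment producing partial sums) and reduceByKey (merging the partial sums per cluster); it calls no Spark API, all function names are hypothetical, and for brevity the assignment scans every center rather than the grid-filtered candidates described above.

```python
import math


def map_partition(partition, centers):
    """Per-partition step: assign points, emit (cluster, (coordinate sums, count))."""
    partial = {}
    for p in partition:
        k = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        sums, count = partial.get(k, ((0.0,) * len(p), 0))
        partial[k] = (tuple(s + c for s, c in zip(sums, p)), count + 1)
    return list(partial.items())


def reduce_by_key(pairs):
    """Merge the per-partition (sums, count) records that share a cluster key."""
    merged = {}
    for key, (sums, count) in pairs:
        if key in merged:
            old_sums, old_count = merged[key]
            merged[key] = (tuple(a + b for a, b in zip(old_sums, sums)),
                           old_count + count)
        else:
            merged[key] = (sums, count)
    return merged


def one_iteration(partitions, centers):
    """Run the map step on every partition, merge, and return the new centers."""
    pairs = [pair for part in partitions for pair in map_partition(part, centers)]
    merged = reduce_by_key(pairs)
    return [tuple(s / merged[k][1] for s in merged[k][0]) if k in merged else centers[k]
            for k in range(len(centers))]
```

Emitting one (sums, count) record per cluster and partition, rather than one record per point, keeps the amount of data exchanged between nodes proportional to the number of clusters.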

Time Complexity Analysis of the Algorithm.
The time complexity of the improved algorithm is effectively reduced. Reasonable selection of the grid partition width can ensure that most grids have only one possible cluster. Only a small number of grids may belong to two or more class clusters. In this way, for most data points, only one distance calculation is required instead of k. The time complexity of the improved algorithm is $O(tw)$, where t is the number of data points and w is the number of iterations.
Compared with the improved strategy of triangle inequality, the combination method using spatial position information has lower time complexity. Especially with the increase of k value, its advantage becomes more obvious.
For any data point, the advantage of using grid-based spatial location information is that cluster centers that are clearly too far away to own the point are filtered out in advance. This avoids a large number of redundant distance calculations.
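As a rough check on this claim, the distance computations can be counted directly. The sketch below uses the symbols of the analysis above (t data points, k cluster centers, w iterations) and introduces two symbols that are not in the original text: $\bar{c}$ for the average number of candidate class clusters per data point, and $G$ for the number of non-empty grids.

```latex
\begin{aligned}
N_{\text{k-means}} &= t \cdot k \cdot w, \\
N_{\text{grid}} &\approx \left( t \cdot \bar{c} + G \cdot k \right) \cdot w,
\qquad 1 \le \bar{c} \le k .
\end{aligned}
```

When the partition width is chosen so that $\bar{c}$ stays close to 1 and $G \cdot k$ is small relative to $t$, the second count grows like $t \cdot w$, in line with the complexity stated above. The term $G \cdot k$ models refreshing the relationship between grids and cluster centers in each iteration; it is an assumption of this sketch rather than a figure reported in the paper.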

The Design of Human Resource Management System

Feasibility Analysis
(1) Technical feasibility. To raise the level of management decision-making, many large and medium-sized enterprises are vigorously developing human resource management systems. However, as enterprise teams develop and expand, existing human resource management systems can no longer meet enterprise needs. Enterprises have therefore increased the research and development of such systems, and human resource management technology has become increasingly mature. A Java EE-based human resource management system is therefore technically feasible.
(2) Operational feasibility. A human resource management system designed on Java EE technology is convenient for every ordinary employee to use. It realizes good man-machine interaction and ensures the feasibility of system development and operation.
(3) Economic feasibility. The fundamental pursuit of enterprises is social and economic benefit, so maximizing the benefits of the enterprise is very important. The enterprise's ability to bear the cost of new technology also determines whether it can secure the maximum benefit. A human resource management system based on Java EE technology can effectively simplify the workflow of human resource management. It supports scientific and reasonable decisions in real time, which ensures the economic feasibility of the system.

3.1.2. Functional Requirement Analysis.
In the design of a human resource management system based on Java EE, it is necessary to ensure that system operation is efficient, simple, direct, powerful, and real-time. The general goal of system development is the systematic, standardized, and automatic processing of all kinds of information.
The functions of the human resource management system are derived from this general goal. They mainly include organization management, recruitment management, employee information, training, attendance, performance, salary and welfare, enterprise culture, and other management modules. The system example for system administrators is shown in Figure 2, and the system example for common employees is shown in Figure 3.

Analysis of Nonfunctional Requirements.
In the design of the human resource management system, nonfunctional requirement design includes the following two points.
(1) Performance requirements: system operation speed, response efficiency, result accuracy, and other aspects.
(2) Reliability and security requirements: reliability covers software failure frequency, ease of recovery, severity, and predictability; security covers user identity, authorization, and privacy. The software system must have a safe and reliable operating environment, the operating interface must be aesthetically pleasing and usable, and the software must be scalable, configurable, portable, and maintainable.

3.2. Overall Architecture of System Design.
Based on an MVC three-tier architecture development platform, the system in this paper is divided into three levels, namely, the interaction layer, the application layer, and the data layer. In the three-tier architecture, the client only provides device-facing application services, which gives the architecture better security than other development methods. Users access the data layer through the application layer, which effectively improves overall data security. The architecture diagram of the system is shown in Figure 4.

Interaction Layer.
In the interaction layer of the human resource management system in this paper, the C# language is used to design the interactive interface, and HTML5 is used to build forms. This allows the system to adapt to different screen sizes, widths, and heights. The position of interface operations can also be adjusted according to the user's requirements.
As a large-scale software framework, the human resource management system integrates multiple systems and improves the overall technical compatibility of the system. The application layer can be viewed as a factory pattern that is compatible with the functions of all subsystems. At the same time, the Web server parses the requests of the business system, and the corresponding business program is provided and run.

The Data Layer.
SQL database technology is introduced into the data layer. The SQL database connects the various functional components effectively and ensures the processing performance of the database and the effectiveness of data communication. At the same time, data can be processed offline. The data layer encapsulates data by transforming data operations into managed storage statements. Functions such as information processing, expansion, separation, and independence are added so that the portability of the system is fully improved. This makes it easier for more users to connect to the system successfully.

Key Service Functions of the System.
Considering actual human resource management needs, the Java EE framework is adopted for the human resource management system. The system includes user management, employee information, organization, attendance, salary and welfare, performance, recruitment, training, and other management modules. The user operation rights differ from module to module.

Main Functional Modules
(1) In the login management function module, the user enters the user name and password on the login interface. If the match is successful, the user enters the system. If the match fails, the system displays a message indicating that the user name or password is incorrect and refreshes the login page.
(2) In the attendance management function module, the attendance-related information of all employees of the enterprise can be recorded. This module provides search, add, modify, delete, and other functions.
The system establishes an SQL database, which can be optimized through the format of its SQL statements. Uniform, standardized code is derived from the object names of the database. Following the coding and debugging specification ensures good database design and improves overall programming efficiency. It also reduces unnecessary data redundancy to some extent and improves the running efficiency of the system database. To make the data functions dynamic, effective connections must be established between the foreground and background of the system, as well as between the database and the system code. The Java database connection is established by using Java EE technology. It provides a standard database interface, and the system database connection steps are as follows.
(1) Load the Java EE driver

Experiment and Analysis
The data set used in the test is the human resource data set of an enterprise. The programming language used in this experiment is Scala. To verify the effectiveness of the proposed algorithm, the traditional k-means algorithm and the Spark-based algorithm of reference [18] are selected for comparison. The algorithm in reference [18] adopts MLlib 1.6.2. The initial cluster centers were selected at random. All experiments were run 20 times, and the results were averaged.

Experimental Environment.
A Spark distributed cluster is adopted in the experiment. Hadoop and Spark are installed on five virtual machines (VMs). One of them is responsible for running and managing the driver program, and the other four serve as executor nodes.
Hardware configuration: 16 GB memory, 1024 GB hard disk.

Comparison of Algorithm Performance.
Table 1 compares the running time of the algorithm in this paper with that of traditional k-means and reference [18]. Table 2 compares the sum of squared errors of the algorithms. The speed increases $\partial_1$ and $\partial_2$ are calculated as follows.
Here $n_0$ is the running time of the k-means algorithm, $n_1$ is the running time of the algorithm in reference [18], and $n_2$ is the running time of the algorithm in this paper.
As can be seen from the experimental results, the operating efficiency of the algorithm in this paper is significantly improved compared with the traditional k-means and literature [18] algorithm. When k value is small, the speed improvement is relatively insignificant because the improvement strategy in literature [18] has been able to avoid most redundant calculations. However, with the increase of k value, the improvement effect of the algorithm in this paper becomes more obvious.
The comparison of the sum of squared errors shows that neither the algorithm in this paper nor the algorithm in literature [18] has a negative impact on the clustering quality of the original algorithm.

Scalability Comparison.
$4 \times 10^7$ data samples were used to test the scalability of the algorithm. Figure 6 compares the parallelization time of traditional k-means, literature [18], and the algorithm in this paper. The algorithm in this paper clusters faster. The running time of the proposed algorithm decreases as the number of executor nodes increases. However, because of the time overhead of the Spark cluster itself, the running time does not decrease linearly with the number of nodes. Figure 7 compares the acceleration ratios of the algorithms. The proposed algorithm has good scalability. As the cluster size grows, the acceleration ratio of the algorithm remains basically consistent with that of literature [18].

Conclusion
In order to improve the current state of human resource management, this paper puts forward a design method for a human resource management system. To address the high computational complexity of the traditional k-means algorithm, and considering the spatial location relationship between data points and clustering centers together with the advantages of grid division, this paper designs a clustering optimization algorithm that saves the spatial location information of data points. The comparison results of parallel experiments on the Spark platform show that the computational efficiency of the proposed algorithm is significantly improved and that it has better scalability. How to further improve the scalability of the system while ensuring system performance is the next research direction.

Data Availability
The labeled data set used to support the findings of this study is available from the corresponding author upon request.