Research on Clustering Method of Improved Glowworm Algorithm Based on Good-Point Set

As an important data analysis method in data mining, clustering analysis has been researched extensively and in depth. To address the limitation that the K-means clustering algorithm is sensitive to the distribution of the initial clustering centers, the Glowworm Swarm Optimization (GSO) algorithm is introduced to solve clustering problems. First, this paper introduces the basic ideas of the GSO algorithm, the K-means algorithm, and the good-point set, and analyzes the feasibility of combining them for clustering optimization. Next, it designs a clustering method based on an improved GSO algorithm with a good-point set: the improved GSO algorithm searches the data object space and provides initial clustering centers for the classical K-means algorithm, which then obtains better clustering results. The major improvement to the GSO algorithm is to optimize the initial distribution of the glowworm swarm by introducing the theory and method of the good-point set. Finally, the new clustering algorithm is tested on UCI data sets of different categories and sizes. Comparison and analysis show the advantages of the improved clustering algorithm in terms of sum of squared errors (SSE), clustering accuracy, and robustness.


Introduction
As an unsupervised data analysis method, clustering analysis is widely applied in fields such as data mining, pattern recognition, machine learning, and artificial intelligence [1]. Unlike classification, a clustering algorithm groups data objects by a similarity metric and a clustering criterion without any prior knowledge. As a branch of statistics, clustering analysis has been studied extensively. Clustering methods can be broadly classified into partition-based, hierarchy-based, and density-based methods. The K-means algorithm proposed by James MacQueen is a typical partition-based clustering algorithm [2]. However, its clustering result is strongly affected by the initial clustering centers and is very sensitive to outliers. Literature [3] optimizes the K-means algorithm by integrating the coding, crossover, and mutation ideas of the genetic algorithm (GA) with the local optimizing ability of K-means and proposes a GA-based K-means clustering algorithm. Hierarchy-based clustering methods mainly include the CURE algorithm [4] and the Chameleon algorithm [5]; in CURE, one cluster is represented by multiple points, which improves the processing of nonspherical data sets. Representative density-based clustering methods include the DBSCAN algorithm [6], which can effectively identify clusters of any shape but is very sensitive to manually set parameters (e.g., radius). Rodriguez and Laio put forward a new density-based density peaks clustering (DPC) algorithm [7] in 2014. In this algorithm, density peaks (i.e., clustering centers) are first selected manually through a "decision diagram," and the remaining data points are then allocated to the clustering centers to obtain the clustering result.
It is noteworthy that, in recent years, some scholars have started introducing heuristic swarm optimization algorithms into clustering analysis in different fields, improving the clustering effect by virtue of the global searching ability of swarm optimization. Literature [8] proposes a clustering analysis method that combines PSO and K-means and exploits the global searching ability of the particle swarm algorithm. In addition, the cuckoo algorithm, artificial bee colony algorithm, artificial fish swarm algorithm, and others [9][10][11][12] have also been introduced into research on clustering algorithms.
The GSO algorithm [13] proposed by Krishnanand and Ghose is a new swarm intelligence optimization algorithm that is more efficient in solving multimodal problems than traditional swarm intelligence optimization algorithms [14]. Aljarah and Ludwig put forward a new clustering-based GSO algorithm in 2013, in which the GSO algorithm is adjusted to solve the data clustering problem by locating multiple optimal centroids [15]. A new approach for cluster analysis based on the GSO algorithm and K-means has been proposed by Onan and Korukoglu [16]. Because of the multimodal nature of multimedia data, Pushpalatha and Ananthanarayana proposed the GSO-based Multimedia Document Clustering (GSOMDC) algorithm in 2015 to group multimedia documents into topics [17]. A fuzzy clustering algorithm based on the GSO algorithm (GSO-KFCM) was proposed by Cheng and Bao in 2017. In this algorithm, the optimal solution obtained by the GSO algorithm serves as the initial clustering center of the kernelized fuzzy c-means clustering algorithm [18].
This paper introduces the GSO algorithm into clustering analysis. It regards each glowworm as a feasible solution for a clustering center in the data object space, searches the data object space through the glowworm optimization process, and finds clustering centers by locating multiple extreme points. In this way, it combines the GSO algorithm with the K-means algorithm: the GSO algorithm provides initial clustering centers for K-means, which addresses the sensitivity of K-means to its initial clustering centers and thus yields better clustering results. Meanwhile, considering the effect of the initial distribution of the glowworm swarm on the search for clustering centers, this paper optimizes the initial glowworm swarm distribution in the GSO algorithm by introducing the theory and method of the good-point set [19,20], which improves the global searching performance of the GSO algorithm. The research in this paper consists of three main parts. Section 2 explains the relevant algorithms and theories and puts forward the optimization idea for a clustering-oriented GSO algorithm. Section 3 introduces the improved GSO algorithm based on the good-point set, combines it with the K-means algorithm, and designs the algorithm framework and implementation steps of the new clustering method (the GSOK_GP algorithm). Section 4 selects UCI data sets of different categories and sizes for clustering experiments and analysis of the GSOK_GP algorithm designed in this paper.

Description of Relevant Algorithms
Here, $c_i$ denotes the cluster center of cluster $C_i$. SSE is generally taken as an important indicator for evaluating a clustering result.
If the cluster centers have changed, that is, the new clustering center is different from the original one, return to Step 2 and iterate again until the clustering center points converge or the maximum number of iterations is reached.
It can be seen from the steps above that the initial clustering centers have a significant effect on the clustering result and operating efficiency of the K-means clustering algorithm. A poor choice may cause the algorithm to converge prematurely to a local optimum, which in turn leads to clustering results that differ greatly between runs.
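To make the role of the initial centers concrete, the following is a minimal Python sketch of K-means that takes its initial centers as an argument. It is an illustration under stated assumptions, not the paper's implementation: the function names (`sq_dist`, `assign`, `kmeans`) are hypothetical, and SSE is assumed here to be the sum of squared Euclidean distances from each object to its cluster center.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(data, centers):
    """Index of the nearest center for every data object."""
    return [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in data]

def kmeans(data, init_centers, max_iter=100):
    """K-means started from the given initial centers.

    Returns (centers, labels, sse). SSE is assumed to be the sum of
    squared distances to the assigned center.
    """
    centers = [list(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(data, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else old)
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    labels = assign(data, centers)
    sse = sum(sq_dist(x, centers[lab]) for x, lab in zip(data, labels))
    return centers, labels, sse

# Toy data with two well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, sse = kmeans(data, [(0, 0), (10, 10)])
```

Because `init_centers` is the only source of variation in this sketch, running it with different starting points on harder data is a simple way to observe the sensitivity described above.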

Main Ideas and Steps of GSO Algorithm.
In the GSO algorithm, each glowworm is regarded as a feasible solution of the target problem in space. Glowworms gather towards brighter glowworms through mutual attraction and movement, and multiple extreme points are found in the solution space of the target problem. In this way, the problem is solved. Its main ideas can be described as follows.
Step 1. Initialize the glowworm swarm $G = \{g_1, g_2, \ldots, g_n\}$. The number of glowworms $n$ in the swarm, step $s$, initial fluorescein value $l_0$, fluorescein volatilization rate $\rho$, domain change rate $\beta$, initial decision domain value $r_0$, domain threshold $r_{\max}$, and other related parameters need to be initialized and assigned.
Step 2. Calculate glowworm fitness based on the objective function. Calculate the fitness $f(g_i)$ of each glowworm $g_i$ at its location based on the specific objective function $\max f(x)$.
Step 3. Calculate the moving direction and step of each glowworm. Each glowworm $g_i$ searches for glowworms $g_j$ with a higher fluorescein value $l_j$ within its own decision radius $r_i$ and determines its next moving direction and step based on fluorescein value and distance.
Step 4. Update glowworm locations. Update the location of each glowworm $g_i$ based on the determined moving direction and step.
Step 5. Update the decision domain radius of each glowworm.
Step 6. Judge whether the algorithm has converged or reached the maximum number of iterations (itmax) and determine whether to enter the next round of iteration.
It can be seen from the steps above that optimizing the initial distribution of the glowworm swarm can improve the execution efficiency of the algorithm and help it avoid premature convergence to a local optimum.
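The steps above can be sketched as a single synchronous iteration in Python. This is a hedged illustration that assumes the commonly cited GSO update rules (fluorescein decay plus a fitness-proportional gain with coefficient `gamma`, roulette selection among brighter neighbours, a fixed-length move, and a clipped decision-radius update); the function name `gso_step`, the parameter names, and the default values are illustrative assumptions, and the variant used in this paper may differ in detail.

```python
import math
import random

def gso_step(pos, luc, radii, objective, rng,
             rho=0.4, gamma=0.6, beta=0.08, s=0.03, r_s=3.0, n_t=5):
    """One synchronous GSO iteration; returns (pos, luc, radii)."""
    n = len(pos)
    # Fluorescein update: decay the old value, add a share of the fitness.
    luc = [(1 - rho) * luc[i] + gamma * objective(pos[i]) for i in range(n)]
    new_pos, new_radii = [], []
    for i in range(n):
        # Neighbours: brighter glowworms inside the decision radius of i.
        nbrs = [j for j in range(n)
                if j != i and luc[j] > luc[i]
                and 0.0 < math.dist(pos[i], pos[j]) < radii[i]]
        if nbrs:
            # Roulette selection weighted by the fluorescein difference.
            weights = [luc[j] - luc[i] for j in nbrs]
            j = rng.choices(nbrs, weights=weights)[0]
            d = math.dist(pos[i], pos[j])
            # Move a fixed step s along the direction towards glowworm j.
            new_pos.append([a + s * (b - a) / d
                            for a, b in zip(pos[i], pos[j])])
        else:
            new_pos.append(list(pos[i]))  # no brighter neighbour: stay put
        # Decision-radius update, clipped to the interval [0, r_s].
        new_radii.append(
            min(r_s, max(0.0, radii[i] + beta * (n_t - len(nbrs)))))
    return new_pos, luc, new_radii

# Two glowworms; the objective peaks at (2, 0), so the second is brighter.
rng = random.Random(0)
peak = lambda p: -((p[0] - 2.0) ** 2 + p[1] ** 2)
pos, luc, radii = gso_step([[0.0, 0.0], [1.0, 0.0]], [5.0, 5.0],
                           [3.0, 3.0], peak, rng, s=0.1)
```

In this toy run the dimmer glowworm moves one step towards the brighter one, while the brighter glowworm, having no brighter neighbour, stays in place. Calling `gso_step` in a loop until convergence or itmax gives the full search.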
$P_n(k)$ is known as a good-point set and $r$ a good point. It has been proved by applicable theorems that, with respect to approximate integration, the order of the deviation $\varphi(n)$ is only relevant to $n$ and is irrelevant to the space dimension of the sample. Therefore, the good-point set can provide better support for calculation in high-dimensional spaces [20]. Meanwhile, for a point set object whose distribution is unknown, the deviation $\varphi(n)$ of the $n$ points $P_n = \{p_1, p_2, \ldots, p_n\}$ obtained by virtue of the good-point set is significantly smaller than that of $n$ points obtained by the random method. Based on this feature of the good-point set, a better initial distribution scheme can be provided for the swarm in a swarm intelligence algorithm.

Design of GSOK_GP Algorithm
On the basis of the analysis of the relevant algorithms above and the characteristics of clustering problems, this paper proposes an improved GSO algorithm based on the good-point set to solve clustering problems. Its main ideas can be described as follows. First, optimize the initial distribution of the glowworm swarm through the good-point set, so as to optimize the GSO algorithm. Second, use the optimized GSO algorithm to search for initial clustering centers among the data objects, exploiting its ability to find multiple extreme points, and obtain a clustering center point set. Third, select $k$ extreme points from the clustering center point set as the initial clustering centers of the K-means algorithm according to the maximum distance principle. Fourth, execute the K-means algorithm with these initial clustering centers to obtain the clustering result. The algorithm framework is shown in Figure 1, where $t \le \mathrm{itmax}$ means that the number of iterations does not exceed the maximum number of iterations and flag $> k$ means that the number of extreme points is greater than the number $k$ of initial clustering centers required.

Initial Swarm Optimization Based on Good-Point Set.
In essence, optimizing the initial distribution of the glowworm swarm means making the swarm represent the characteristics of the solution space more faithfully. A randomly generated glowworm swarm cannot cover all regions of the solution space in most cases. Therefore, distributing the glowworm swarm uniformly in the solution space is an effective strategy. A more uniform swarm distribution can be realized with the theory and method of the good-point set described above.
The comparison between Figures 2 and 3 indicates that the data point distribution produced by the exponential sequence method is more uniform and covers the solution space better. In the meantime, the structure of its good-point set is more stable; that is, the distribution is the same whenever $n$ is unchanged. Therefore, a better initial glowworm distribution can be obtained by applying the good-point set to the initial glowworm swarm distribution.
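A good-point set of this kind can be generated with a few lines of Python. The sketch below assumes the usual exponential-sequence construction, in which the good point is $r = (\{e^1\}, \ldots, \{e^d\})$ and the $i$-th point is the vector of fractional parts of $i \cdot r$; the function names and the scaling helper are illustrative assumptions.

```python
import math

def good_point_set(n, dim):
    """n points in the unit cube [0, 1)^dim (exponential-sequence method)."""
    # Good point r = (frac(e^1), ..., frac(e^dim)).
    r = [math.exp(j) % 1.0 for j in range(1, dim + 1)]
    # i-th point: fractional parts of i * r, for i = 1..n.
    return [[(i * rj) % 1.0 for rj in r] for i in range(1, n + 1)]

def scale_to_bounds(points, lower, upper):
    """Map unit-cube points into the box [lower, upper] of the data space."""
    return [[lo + p * (hi - lo) for p, lo, hi in zip(pt, lower, upper)]
            for pt in points]

# 50 initial glowworm positions in a 2-D data space with given bounds.
unit_points = good_point_set(50, 2)
swarm = scale_to_bounds(unit_points, [4.0, 2.0], [8.0, 4.5])
```

The construction involves no random numbers, so the same $n$ and dimension always give the same swarm, which matches the stability property noted above.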

Flow of GSOK_GP Algorithm.
When the improved glowworm algorithm is combined with the K-means algorithm to solve clustering problems, each glowworm individual is regarded as a feasible solution for a clustering center point. Because clustering center points are surrounded by the data points of the data objects, the density at a clustering center point is an extreme value of the data point densities within its domain. Therefore, the density of a glowworm individual in the data object set is taken as its fitness, and a superior initial clustering center point set is obtained by letting the glowworms search for density extreme values. The main algorithm flow is as follows.
Step 1. Initialize the glowworm swarm based on the good-point set. For the data set $X = \{x_1, x_2, \ldots, x_m\}$ to be clustered, initialize and assign the number of glowworms $n$ in the swarm, the initial locations of the glowworms, step $s$, initial fluorescein value $l_0$, fluorescein volatilization rate $\rho$, domain change rate $\beta$, initial decision domain value $r_0$, domain threshold $r_{\max}$, and other related parameters in the Euclidean space in which $X$ is bounded.
Step 4. Determine moving direction. Glowworm $g_i$ searches its decision domain for glowworms with higher fluorescein and selects one of them, $g_j$, by the roulette approach as the moving direction of the next step. Dset($g_i$) denotes the set of glowworms in the decision domain of $g_i$, the glowworms in it with higher fluorescein than $g_i$ form the candidate set, and each candidate is assigned a probability of being selected. Choose the glowworm $g_j$ with the maximum probability as the moving direction of glowworm $g_i$.
Step 5. Update location. Glowworm $g_i$ moves by the step $s$ towards glowworm $g_j$; doing this for every glowworm completes the location update.
Step 6. Update decision domain. $r_i(t)$ represents the decision radius of glowworm $g_i$ in iteration round $t$, $n_t$ represents the threshold on the number of glowworms in the domain, and $n_i(t)$ represents the number of glowworms within the decision radius.
Step 7. Judge the termination condition of the glowworm search and enter the next round of iteration if it is not met.
Step 8. The glowworm algorithm terminates, and $k$ extreme points are output as the initial clustering center points for the K-means algorithm.
Step 9. Execute the K-means algorithm and output the clustering result.

Density-Based Fitness Function.
In the GSOK_GP algorithm, a cluster center is a glowworm data point surrounded by adjacent points of lower local density; a cluster center can therefore be interpreted as a local optimum of the fitness.
It should be noted that the adjusted Euclidean distance is applied only while the GSO algorithm searches for the initial clustering centers. The general Euclidean distance is adopted in algorithm evaluation so that the results can be compared with those of other algorithms.
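As a rough illustration of a density-based fitness with an adjusted Euclidean distance, the Python sketch below makes two assumptions: each dimension is rescaled by its value range so that all dimensions contribute equally, and the density at a glowworm position is the number of data objects within a cutoff distance. The function names, the `cutoff` parameter, and both of these choices are illustrative assumptions rather than the paper's exact definitions.

```python
import math

def dimension_ranges(data):
    """Value range (max - min) of the data set in each dimension."""
    return [max(col) - min(col) for col in zip(*data)]

def weighted_distance(a, b, ranges):
    """Euclidean distance with each dimension rescaled by its value range.

    Dimensions with zero range carry no information and are skipped.
    """
    return math.sqrt(sum(((x - y) / r) ** 2
                         for x, y, r in zip(a, b, ranges) if r > 0))

def density_fitness(position, data, ranges, cutoff):
    """Assumed fitness: number of data objects within `cutoff` of position."""
    return sum(1 for x in data
               if weighted_distance(position, x, ranges) < cutoff)

# The second dimension has a much larger value range than the first.
data = [(0.0, 0.0), (0.1, 100.0), (0.2, 50.0), (1.0, 1000.0)]
ranges = dimension_ranges(data)
dense = density_fitness((0.1, 50.0), data, ranges, cutoff=0.2)
sparse = density_fitness((1.0, 1000.0), data, ranges, cutoff=0.2)
```

Without the rescaling, the second dimension would dominate every distance in this toy data; with it, a position near the three clustered objects scores higher than the isolated one.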

Selection of Extreme Point.
A relatively large distance between cluster centers is necessary in a clustering algorithm. Therefore, $k$ centers are selected from the multiple cluster centers found to constitute the initial clustering centers of the K-means algorithm; that is, the selection of $k$ extreme points from the extreme point set $E = \{e_1, e_2, \ldots, e_m\}$ as the initial clustering centers of the K-means algorithm follows the distance maximization principle. When $m > k$, the basic steps for selecting extreme points are as follows:
(1) Select the glowworm with the highest fitness as the first clustering center point.
(2) Calculate the distances from the other clustering center points to the first clustering center point, and select the one with the largest distance as the second clustering center point.
(3) Repeat step (2): calculate the sum of the distances from each remaining clustering center point to the clustering centers already selected, and select the one with the largest sum as the next clustering center point, until $k$ clustering center points are obtained.
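The selection steps above translate directly into a short greedy procedure. The Python sketch below follows them under the assumption that the extreme points and their fitness values are available as two parallel lists; the function name `select_initial_centers` is hypothetical.

```python
import math

def select_initial_centers(extremes, fitness, k):
    """Pick k extreme points by the distance-maximization rule.

    Assumes len(extremes) >= k and that fitness[i] belongs to extremes[i].
    """
    # (1) Start from the extreme point with the highest fitness.
    first = max(range(len(extremes)), key=lambda i: fitness[i])
    chosen = [first]
    while len(chosen) < k:
        rest = [i for i in range(len(extremes)) if i not in chosen]
        # (2)-(3) Add the point whose summed distance to the centers
        # selected so far is largest.
        nxt = max(rest, key=lambda i: sum(
            math.dist(extremes[i], extremes[c]) for c in chosen))
        chosen.append(nxt)
    return [extremes[i] for i in chosen]

# Four candidate extreme points with their fitness values; keep three.
extremes = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
fitness = [3, 9, 2, 1]
initial_centers = select_initial_centers(extremes, fitness, 3)
```

In this example the point (0, 0) is dropped because it lies close to the first center (1, 0), which is the behaviour the distance maximization principle is meant to produce.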

Experiment and Analysis
4.1. Experimental Environment.
In this paper, the GSOK_GP algorithm is implemented in Matlab, and the two UCI data sets shown in Table 1 are selected to test its effectiveness. The parameters of the GSO algorithm are designed with reference to the relevant literature, and the parameters of the M-GSO algorithm are selected as follows for the actual clustering problems: $n = 50$, $\rho = 0.4$, $\gamma = 0.6$, $\beta = 0.08$, $s = 1$, $l_0 = 5$, and $n_t = 6$, with a maximum of 100 iterations. SSE, clustering accuracy, and robustness are used to evaluate the clustering effect of the algorithm in this paper. SSE is computed from the Euclidean distances from all data objects to their cluster center points: for every cluster $C_i$, the distance from each of its data objects to the cluster center point $c_i$ is accumulated.
The clustering accuracy proposed by Gan et al. is taken as one of the clustering effect evaluation standards in this paper [21]. Clustering accuracy refers to the proportion of accurately classified samples among all samples. The clustering accuracy AC is defined as $AC = \frac{1}{N} \sum_{i=1}^{k} a_i$, where $k$ represents the number of categories in the data set, $N$ represents the total number of samples in the data set, and $a_i$ represents the number of samples accurately classified into category $i$.
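The accuracy measure can be computed as in the following Python sketch. Since cluster indices carry no category names, the sketch assumes a majority-vote mapping: within each cluster, the samples of the dominant true category are counted as accurately classified. This mapping and the function name `clustering_accuracy` are assumptions made for illustration.

```python
from collections import Counter

def clustering_accuracy(true_labels, cluster_labels):
    """Proportion of accurately classified samples among all samples.

    Assumption: each cluster is matched to its majority true category.
    """
    correct = 0
    for c in set(cluster_labels):
        members = [t for t, p in zip(true_labels, cluster_labels) if p == c]
        # Samples of the dominant category in this cluster count as correct.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)

# Six samples from two categories; one sample lands in the wrong cluster.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 0, 0, 0, 0]
ac = clustering_accuracy(true_labels, cluster_labels)
```

Note that a majority-vote mapping can assign two clusters to the same category; a one-to-one matching would be a stricter alternative.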
In addition, the robustness indicator proposed in literature [22] is used to assess algorithm stability in this paper. Robustness is calculated from the mean square error of the results of multiple experiments, using $AC^{*}$, the optimal value of clustering accuracy, and $\overline{AC}$, the average clustering accuracy obtained by running the algorithm multiple times. The smaller the resulting value is, the higher the algorithm robustness will be.
The Glass data set contains 214 sample objects, each with 9 attributes, classified into 6 categories in total. The experimental results for the Glass data set are shown in Table 3.
It can be seen from Tables 2 and 3 that the GSOK_GP algorithm is superior to the traditional K-means algorithm and the PSOK algorithm in SSE and average accuracy.
The robustness of the traditional K-means algorithm, the PSOK algorithm, and the GSOK_GP algorithm is compared in Table 4.
Table 4 indicates that the results of 20 independent runs of the GSOK_GP algorithm on the Iris data set are identical, which indicates high stability. On the Glass data set, the fluctuation across 20 independent runs is clearly smaller than that of the K-means algorithm and the PSOK algorithm. Therefore, the GSOK_GP algorithm shows better robustness in the experiments.

Conclusion
The traditional K-means clustering algorithm is widely used because of its simple principle and high execution efficiency. However, K-means relies on its initial clustering centers, which leads to large differences between clustering results, low accuracy, and a lack of stability. In this paper, the initial clustering centers of the K-means algorithm are optimized with an improved glowworm algorithm based on the good-point set, and the clustering effect is improved.
The GSOK_GP algorithm proposed in this paper is mainly intended for data object clustering problems under unsupervised learning conditions. It differs from traditional clustering methods in that it combines the GSO algorithm with the K-means algorithm to improve the clustering effect. In particular, to address the effect of the initial clustering centers on clustering results, this paper describes the data object space more faithfully by introducing the theory and method of the good-point set and obtains superior initial clustering center points through the searching ability of the GSO algorithm. Comparison and analysis show that the GSOK_GP algorithm has a better clustering effect and stability.
In addition, computing glowworm density becomes costly when the number of data objects is large, which reduces the computing efficiency of the GSOK_GP algorithm. Its convergence therefore needs further improvement before it can be applied well to clustering problems with large data volumes.

Figure 3: Data points (glowworms) distributed by the good-point set exponential sequence method.
3.3.2. Weighted Euclidean Distance.
The value ranges of the data objects differ greatly between dimensions. If the plain Euclidean distance is applied, attributes with a large value range have a greater effect on the distance between data points and therefore on the clustering result. Without prior knowledge, each dimension of the data object is assumed to have the same effect on the clustering result, so the Euclidean distance needs to be adjusted by allocating different weights to the dimensions while the glowworms search for the initial clustering centers.
Assumption 3. Value range of data object $x$ in each dimension is expressed as follows:

Table 1: Selection of experimental data set.

Table 2: Experimental results of Iris data set.

Table 3: Experimental results of Glass data set.

Table 4: Comparison of robustness of each algorithm (%).

Experimental Results and Analysis.
The results of 20 independent runs of the GSOK_GP algorithm on the Iris and Glass data sets are shown in Tables 2, 3, and 4. The results of 20 runs of the K-means algorithm and the PSOK algorithm are cited from literature [9]. The Iris data set contains 150 sample objects, each with 4 attributes, classified into 3 categories in total. The experimental results for the Iris data set are shown in Table 2.