As an important data analysis method in data mining, clustering analysis has been researched extensively and in depth. Aiming at the limitation of

As an unsupervised data analysis method, clustering analysis is widely applied in such fields as data mining, pattern recognition, machine learning, and artificial intelligence [

The GSO algorithm [

This paper introduces the GSO algorithm into clustering analysis. Each glowworm is regarded as a candidate clustering center in the data object space, the space is searched through the glowworm optimization process, and the clustering centers are obtained from the multiple extreme points found. In this way, it combines GSO algorithm and

Basic ideas of

Euclidean distance between data points

SSE of clustering results

where

Randomly select

Allocate other data points in data object

Recalculate clustering center

If

It can be learnt from the steps above that initial clustering centers have significant effect on the clustering result and operating efficiency of
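The iteration outlined in the steps above (random initial centers, nearest-center assignment, center recalculation, convergence check) can be sketched as follows. This is a minimal, Lloyd-style sketch rather than the authors' implementation; the function names `euclidean` and `center_based_clustering`, the `max_iter` cap, and the handling of empty clusters are assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_based_clustering(points, k, max_iter=100, seed=0):
    """Assign points to the nearest center, then recompute centers, until stable."""
    rng = random.Random(seed)
    # Randomly select k data points as the initial clustering centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Allocate every data point to its nearest clustering center.
        labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                  for p in points]
        # Recalculate each center as the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # assumption: keep an empty cluster's center
        if new_centers == centers:  # centers no longer move: converged
            break
        centers = new_centers
    return centers, labels
```

Because the initial centers are drawn at random, different seeds can lead to different final partitions on less well-separated data, which is the sensitivity the paragraph above refers to.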

In the GSO algorithm, each glowworm represents a feasible solution to the target problem. Glowworms attract one another and move toward brighter glowworms, so the swarm gathers around multiple extreme points in the solution space of the target problem. In this way, the problem is solved. The main steps are as follows.

Initialize glowworm swarm

Calculate glowworm fitness based on objective function. Calculate the fitness

Calculate the moving direction and step of glowworm. Each glowworm

Update glowworm locations. Update the location of each glowworm

Update the decision domain radius of glowworm.

Check whether the algorithm has converged or reached the maximum number of iterations (itmax), and decide whether to start the next iteration.

These steps show that optimizing the initial distribution of the glowworm swarm can improve execution efficiency and help the algorithm avoid premature convergence to a local optimum.
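The steps above can be sketched in code as follows. This is a minimal sketch of the basic GSO loop as it is commonly formulated (luciferin decay and enhancement, probabilistic choice of a brighter neighbor, fixed step, adaptive decision radius), not the authors' MATLAB implementation. The function name `gso`, all default parameter values, the fixed iteration count in place of a convergence test, and the clamping of positions to `bounds` are assumptions.

```python
import math
import random


def gso(objective, bounds, n=30, it_max=100, rho=0.4, gamma=0.6,
        beta=0.08, n_t=5, step=0.03, r_s=3.0, l0=5.0, seed=0):
    """Basic glowworm swarm optimization (maximization) inside a box."""
    rng = random.Random(seed)
    # Initialize the swarm at random positions (assumed uniform in the box).
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    luc = [l0] * n        # luciferin (brightness) of each glowworm
    radius = [r_s] * n    # decision domain radius of each glowworm
    for _ in range(it_max):
        # Luciferin update: decay the old value and add the current fitness.
        luc = [(1 - rho) * l + gamma * objective(p) for l, p in zip(luc, pos)]
        new_pos = [p[:] for p in pos]
        for i in range(n):
            dist = [math.dist(pos[i], pos[j]) for j in range(n)]
            # Neighbors: brighter glowworms inside the decision domain.
            nbrs = [j for j in range(n)
                    if j != i and dist[j] < radius[i] and luc[j] > luc[i]]
            if nbrs:
                # Pick a neighbor with probability proportional to the luciferin gap.
                weights = [luc[j] - luc[i] for j in nbrs]
                j = rng.choices(nbrs, weights=weights)[0]
                if dist[j] > 0:
                    # Move a fixed step toward the chosen neighbor, clamped to the box.
                    new_pos[i] = [
                        min(max(x + step * (y - x) / dist[j], lo), hi)
                        for x, y, (lo, hi) in zip(pos[i], pos[j], bounds)
                    ]
            # Decision radius update: aim for about n_t neighbors.
            radius[i] = min(r_s, max(0.0, radius[i] + beta * (n_t - len(nbrs))))
        pos = new_pos
    return pos, luc
```

For example, `gso(lambda p: -(p[0] ** 2 + p[1] ** 2), [(-3, 3), (-3, 3)])` runs the swarm on a single-peak function; on a multimodal objective the adaptive radius is what lets subgroups settle on different peaks.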

The basic definition and construction of a good-point set are as follows:

(1) Assume

(2) Assume

(3) Assume

(4) Assume

It has been proved by applicable theorems that, with respect to approximate integration, the order of deviation

Based on the analysis of the algorithms above and the characteristics of clustering problems, this paper proposes an improved GSO algorithm based on good-point set to solve clustering problems. Its main ideas are as follows. First, optimize the initial distribution of the glowworm swarm through the good-point set, so as to improve the GSO algorithm. Second, optimize the initial clustering centers of the data objects by using the ability of the optimized GSO algorithm to find multiple extreme points, which yields a set of clustering center points. Third, select

Flow of GSOK_GP algorithm.

In essence, optimizing the initial distribution of the glowworm swarm means making the swarm represent the characteristics of the solution space more faithfully. In most cases a randomly generated swarm cannot cover the whole solution space, so distributing the swarm uniformly over the solution space is an effective strategy. The theory and method of good-point set described above yield a more uniform distribution of the swarm.

Assume the initial glowworm swarm number is

(1) Square root sequence method:

(2) Cyclotomic field method:

(3) Exponential sequence method:

Assuming

Randomly distributed

The comparison between Figures

Glowworm individuals are deemed as the feasible solutions of a clustering center point when combining improved glowworm algorithm with

Initialize with the glowworm swarm based on good-point set. As for the data set

Calculate glowworm fitness, namely, the number

Update glowworm fluorescein.

Determine moving direction. Glowworm

Update location. Glowworm

Update decision domain.

Judge the termination condition of glowworm search and enter iteration of the next round.

Glowworm algorithm terminates, and

Execute

In the GSOK_GP algorithm, a cluster center is a glowworm data point surrounded by neighboring points of lower local density; a cluster center can therefore be interpreted as a local optimum of the fitness.

The value ranges of a data object can differ greatly across dimensions. If plain Euclidean distance is used, attributes with a large value range dominate the distance between data points and therefore have a disproportionate effect on the clustering result. Assuming that, without prior knowledge, every dimension of the data object should have the same effect on the clustering result, the Euclidean distance is adjusted by assigning a different weight to each dimension while the glowworms search for initial clustering centers.

Value range of data object

Note that this adjusted distance is applied only while the GSO algorithm searches for initial clustering centers. The standard Euclidean distance is used in algorithm evaluation, so that the results can be compared with those of other algorithms.
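One way to realize such a per-dimension weighting is to divide each coordinate difference by the value range of that dimension, so that every attribute contributes on the same scale. The sketch below illustrates that idea only; the exact weights used in the paper are not reproduced here, and the names `value_ranges` and `range_weighted_distance`, as well as the choice to skip constant dimensions, are assumptions.

```python
import math


def value_ranges(data):
    """Per-dimension value range (max - min) over the data set."""
    return [max(col) - min(col) for col in zip(*data)]


def range_weighted_distance(a, b, ranges):
    """Euclidean distance with each coordinate difference divided by its range."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r > 0:  # assumption: a constant dimension contributes nothing
            total += ((x - y) / r) ** 2
    return math.sqrt(total)
```

With data such as `[[0, 0], [1, 100], [2, 200]]`, the second attribute would dominate a plain Euclidean distance, whereas after range weighting both attributes contribute equally.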

A relatively large distance between cluster centers is necessary in clustering algorithm. Therefore, select

(1) First, select the glowworm with the highest fitness as the first clustering center point.

(2) Second, calculate the distance from each remaining candidate center point to the first clustering center point, and select the farthest one as the second clustering center point.

(3) Repeat step (2), calculating for each remaining candidate the sum of its distances to the clustering centers already selected, and select the one with the largest sum as the next clustering center point until
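The selection procedure in steps (1) to (3) can be sketched as follows, assuming it stops once the required number of centers has been chosen. The function name `select_centers` and the parameters `candidates`, `fitness`, and `k` are hypothetical names introduced for illustration.

```python
import math


def select_centers(candidates, fitness, k):
    """Pick k mutually distant centers from the candidate center points."""
    if k > len(candidates):
        raise ValueError("k exceeds the number of candidate centers")
    # Step (1): the candidate with the highest fitness is the first center.
    chosen = [max(range(len(candidates)), key=lambda i: fitness[i])]
    # Steps (2)-(3): repeatedly add the candidate whose summed distance
    # to the centers already selected is largest.
    while len(chosen) < k:
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        nxt = max(
            remaining,
            key=lambda i: sum(math.dist(candidates[i], candidates[c])
                              for c in chosen),
        )
        chosen.append(nxt)
    return [candidates[i] for i in chosen]
```

With a single center already selected, the summed distance reduces to the plain distance, so step (2) is just the first pass of the loop.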

MATLAB is used to implement the GSOK_GP algorithm, and the two UCI data sets shown in Table

Selection of experimental data set.

Data set | Number of dimensions | Number of categories | Number of samples |
---|---|---|---|
Iris | 4 | 3 | 150 |
Glass | 9 | 6 | 214 |

This paper evaluates clustering quality using SSE, clustering accuracy (AC), and robustness. SSE is the sum of the Euclidean distances from all data objects to their cluster center points. It is calculated as follows:
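As a minimal sketch of this measure as it is described in words above (a sum of Euclidean distances, not of squared distances), the computation could look like this. The function name `sse` and its argument layout (a label per point indexing into a list of centers) are assumptions.

```python
import math


def sse(points, labels, centers):
    """Sum of Euclidean distances from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) for p, lab in zip(points, labels))
```

For instance, the points `(0, 0)` and `(3, 4)` assigned to a center at `(0, 0)`, plus the point `(10, 0)` assigned to a center at `(10, 1)`, give 0 + 5 + 1 = 6.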

The clustering accuracy proposed by Gan et al. is taken as one of the clustering effect evaluation standards in this paper [

where

In addition, the robustness indicators proposed in literature [

where

The GSOK_GP algorithm was executed 20 times independently on each of the Iris and Glass data sets. The results are shown in Tables

Experimental results of Iris data set.

Algorithm | Mean SSE | Mean AC (%) | Standard deviation of AC |
---|---|---|---|
K-means | 102.57 | 83.95 | 0.0451 |
PSOK | 99.61 | 87.39 | 0.0420 |
GSOK_GP | 97.32 | 89.33 | 0 |

Experimental results of Glass data set.

Algorithm | Mean SSE | Mean AC (%) | Standard deviation of AC |
---|---|---|---|
K-means | 241.03 | 51.70 | 0.0157 |
PSOK | 233.23 | 52.20 | 0.0108 |
GSOK_GP | 225.08 | 53.50 | 0.0090 |

Comparison of robustness.

Algorithm | Iris | Glass |
---|---|---|
K-means | 10.7 | 5.73 |
PSOK | 8.07 | 3.65 |
GSOK_GP | 0 | 2.97 |

The Iris data set contains 150 samples, each with 4 attributes, divided into 3 categories. The experimental results for the Iris data set are shown in Table

The Glass data set contains 214 samples, each with 9 attributes, divided into 6 categories. The experimental results for the Glass data set are shown in Table

It can be learnt from Tables

Calculation results based on comparing the robustness of traditional

Table

Traditional

The GSOK_GP algorithm proposed in this paper is intended for clustering data objects under unsupervised learning conditions. It differs from traditional clustering methods in that it combines GSO algorithm and

In addition, we observed that computing glowworm density adversely affects the efficiency of the GSOK_GP algorithm when the number of data objects is large. The convergence of the GSOK_GP algorithm therefore needs further improvement before it can be applied effectively to clustering problems with large data volumes.

The authors declare that they have no conflicts of interest.

The work was supported by National Natural Science Foundation of China (nos. 91546108, 71271071, 71490725, and 71521001), fund of Provincial Excellent Young Talents of Colleges and Universities of Anhui Province (no. 2013SQRW115ZD), fund of Support Program for Young Talents of Colleges and Universities of Anhui Province, fund of Natural Science of Colleges and Universities of Anhui Province (no. KJ2016A162), fund of Social Science Planning Project of Anhui Province (no. AHSKYG2017D136), and fund of Scientific Research Team of Anhui Economic Management Institute (no. YJKT1417T01).