To address the problem of clustering the evaluation information of mass customers in service management, a new Gaussian kernel FCM (fuzzy C-means) clustering algorithm is proposed based on the idea of FCM. First, the paper defines a Euclidean distance between data points and clusters them adaptively, using a distance-threshold classification and a nearest-neighbor rule to delete redundant data. Second, the defects of the FCM algorithm are analyzed, and a solution algorithm is designed around the dual goals of short inner-class distances and long between-class distances. Finally, an example is given to illustrate the results in comparison with the existing FCM algorithm.
1. Introduction
Clustering is an unsupervised learning method that does not rely on predefined classes or training datasets with class labels. Objects are divided into classes or clusters on the basis of a feature similarity measurement. Hence, objects within the same cluster share high similarity, while objects in different clusters differ markedly. Traditional clustering methods are primarily based on partitions, hierarchies, grids, density, and models. As rapidly developing data mining imposes higher requirements on clustering, clustering algorithms based on sample attribution, preprocessing, similarity measurement, allocation and scheduling, update strategies, and measurement [1, 2] have been advanced and applied to data mining [3, 4]. Considering the fuzziness of the membership between sample points and cluster centers, the objective function-based fuzzy c-means (FCM) algorithm still prevails in theory and practice.
The core of an FCM algorithm is to design and determine the clustering centers. The design mainly consists of quantifying cluster centers, locating them, and devising an objective function accordingly. The number of cluster centers is set manually in most cases, or their optimal number is determined within a given range using information entropy and other methods. For example, Duan and Wang [5] indicated that the clustering center can be acquired from multiattribute information with broken-line fuzzy numbers. A novel clustering algorithm, Nei Mu, was proposed in [6], in which datasets are converted into data points of an attribute space to construct a directed graph of K-nearest neighbors. This algorithm improves the clustering of data with large density fluctuations and arbitrary distributions, but not all data points have K-nearest neighbors. Xue and Sha [7] initiated a coordinate-based density method using a gray prediction model within a clustering algorithm to determine the initial clustering center.
A clustering center should be determined and modified in a dynamic process. The existing determination methods largely include K-means clustering algorithms, partition- and density-based clustering algorithms, clustering algorithms based on the local density of data points, and KZZ algorithms. Among these, the K-means algorithm starts from a given initial center, whereas partition- and density-based clustering algorithms determine the initial clustering center from a density function of sample points using the max-min distance method or the maximum distance product method. Zhang and Wang [8] pointed out that the nearest data points can be bracketed to facilitate the location of other clustering centers while a high-constraint resolution is added to the objective function. Chiu [9] defined a measure for each data point to identify the initial clustering center. Agustin-Blas et al. [10] studied a grouping genetic algorithm, aiming to improve the performance of group clustering by coding and defining fitness functions. A semisupervised clustering algorithm was put forward in [11] via the kernel FCM clustering algorithm, with clustering errors containing labeled and unlabeled data used to design the objective function. Since FCM fails to deal with noise, an efficient kernel-induced FCM based on a Gaussian function was presented in [12] to improve the objective function.
The following cases are some of the existing FCM studies. Qian and Yao [13] focused on high sensitivity to the initial center point and introduced three incremental fuzzy clustering algorithms for large-scale sparse high-dimensional datasets. Niu and She [14] proposed a fast parallel clustering algorithm based on cluster initialization. By generating a hierarchical K-means clustering tree to autoselect the number of clusters, Hu [15] obtained better clustering results. Aiming at the high time complexity of traditional FCM algorithms, a single-pass Bayesian fuzzy clustering algorithm was advocated for large-scale data in [16], which boosted its performance in time complexity and convergence. Zhou et al. [17] introduced the neighborhood information of multidimensional data to improve the clustering algorithm, increasing the robustness of outliers and noise points. Chen and Liu [18] designed a clustering algorithm on the minimum connected dominating set to remedy the defect that common algorithms easily fall into local minimum points. Xie et al. [19] combined the GWO algorithm with the principle of maximum entropy in a multidimensional big data environment. Duan and Wang [5] described multiple attributes of the objects to be clustered as polygonal fuzzy numbers, and a clustering algorithm was designed accordingly. By advancing an adaptive algorithm for the entropy weight of the feature weight of FCM, Huang et al. [20] focused on the influence of the feature weight on a clustering algorithm. Taking the preference vectors’ clustering degree as a neighborhood similarity, Xu and Fan [21] aimed at constructing a heuristic clustering algorithm for multiattribute complex large group clustering and decision.
These studies focus on FCM-associated issues, but few results address big data scenarios. In view of the differences between clustering large numbers of data points and clustering small samples, the sample points of big data are simplified in this paper, making FCM more applicable to big data scenarios. Next, an FCM algorithm is designed that takes both long between-class distances and short inner-class distances into consideration, which traditional FCM algorithms fail to do. This study thus provides theoretical and practical guidance for data clustering in a big data environment.
2. Gaussian Kernel FCM Clustering Algorithm
Service resources are generally allocated through multiple channels, and the limited resources in one allocation channel trade off against those in another in terms of quantity. Group consistency is therefore out of reach when different resource consumers prefer different channels, leading to changing evaluation data. If the price mechanism fails to optimize service resource allocation, consumer demands should be considered alongside social benefits to attain a higher efficiency of resource allocation. Consumers primarily feature heterogeneity, conflicts of interest, and differences in evaluation forms, which necessitates decomposing the customer group: the large-scale consumer group is divided into several small clusters, thus simplifying resource coordination.
Suppose that the consumer subjects of a service resource are expressed as X = {x_1, …, x_n}, an individual consumer as x_i, i ∈ {1, …, n}, the number of channels (the data dimension) as p, and the evaluation data as x_ij, j ∈ {1, …, p}. Let u_ik be the membership of sample i in class k, giving the fuzzy matrix U = [u_ik]_{n×c} provided that there are c classes, and let y_k, k ∈ {1, …, c}, represent the cluster centers. The objective function of the Gaussian kernel FCM clustering algorithm [8] is

(1) J(U, V) = 2 ∑_{i=1}^{n} ∑_{k=1}^{c} u_{ik}^{m} (β − β φ(x_i, y_k)),

where φ(x_i, y_k) = exp(−‖x_i − y_k‖²/σ²), β is a characteristic constant of the Gaussian function, m is a fuzzy index that controls the fuzzy degree of the classification (the higher the index, the higher the fuzzy degree), and σ² is the variance of the given data. Hence,

(2) u_{ik} = [1/(β − β φ(x_i, y_k))]^{1/(m−1)} / ∑_{k=1}^{c} [1/(β − β φ(x_i, y_k))]^{1/(m−1)},

(3) y_k = ∑_{i=1}^{n} u_{ik}^{m} β φ(x_i, y_k) x_i / ∑_{i=1}^{n} u_{ik}^{m} β φ(x_i, y_k).
If ‖V_present − V_previous‖ ≤ ε, the iteration is discontinued, at which point the classification is optimal. Both traditional FCM algorithms and the Gaussian kernel FCM clustering algorithm focus on the inner-class distance rather than the between-class distance; to obtain better clustering results, both should be considered. Due to the large number of consumers to whom service resources are allocated, directly computing memberships leads to problems such as high computational complexity and slow convergence to the optimal solution, resulting in a decline in clustering efficiency. Therefore, data points should be preprocessed prior to clustering to reduce the number of data points that need clustering and to enhance the scalability of the clustering algorithm.
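As a minimal sketch, the update loop of equations (1)-(3) might look as follows. The random-data-point initialization, the σ² estimate, and the small guard constant are assumptions for illustration, not part of the original formulation:

```python
import numpy as np

def gaussian_kernel_fcm(X, c, m=2.0, beta=1.0, V0=None, max_iter=100, eps=1e-4, seed=0):
    """Sketch of the iteration defined by equations (1)-(3).
    Assumptions: sigma^2 is estimated as the overall variance of X,
    initial centers default to random data points, and a small guard
    constant avoids division by zero when phi(x_i, y_k) = 1."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    V = np.array(X[rng.choice(n, c, replace=False)]) if V0 is None else np.asarray(V0, dtype=float)
    sigma2 = X.var()
    for _ in range(max_iter):
        # phi(x_i, y_k) = exp(-||x_i - y_k||^2 / sigma^2)
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        phi = np.exp(-d2 / sigma2)
        # equation (2): membership update
        w = (1.0 / (beta - beta * phi + 1e-12)) ** (1.0 / (m - 1.0))
        U = w / w.sum(axis=1, keepdims=True)
        # equation (3): kernel-weighted center update
        g = (U ** m) * beta * phi
        V_new = (g.T @ X) / g.sum(axis=0)[:, None]
        converged = np.linalg.norm(V_new - V) <= eps  # ||V_present - V_previous|| <= eps
        V = V_new
        if converged:
            break
    return U, V
```

Passing explicit initial centers `V0` mirrors the later observation that the result is sensitive to initialization.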
3. Preprocessing of Evaluation Information of Consumers
Any pair (x_i, x_j) can be considered a constraint data pair. The Euclidean distance formula is deployed to calculate its distance:

(4) d(x_i, x_j) = ‖x_i − x_j‖ = √((x_i1 − x_j1)² + ⋯ + (x_ip − x_jp)²).
Set ε and γ in advance for d(x_i, x_j) (both ε and γ can take lower values for more accurate classification):
If d(x_i, x_j) ≤ ε, then x_i and x_j are considered extremely close and can be placed into one class.
If d(x_i, x_j) ≥ γ, then x_i is considered far from x_j, and bracketing the two together is next to impossible.
Data pairs with distances between ε and γ cannot be effectively identified.
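The ε/γ rule above can be sketched as a small helper; the function and label names are illustrative, not from the paper:

```python
import numpy as np

def pair_relation(xi, xj, eps=0.2, gamma=0.7):
    """Classify a constraint data pair by the Euclidean distance of
    formula (4) against the thresholds eps and gamma."""
    d = np.linalg.norm(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    if d <= eps:
        return "merge"        # extremely close: place into one class
    if d >= gamma:
        return "separate"     # far apart: clustering them is next to impossible
    return "undecided"        # between eps and gamma: cannot be identified
```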
To delete data points quickly, the characteristics of distances between data points and the possibility of clustering different data points should be considered and investigated in the preprocessing procedure. Deletion should be done via the following steps:
Step 1: take the data points x_i and x_j with the smallest distance in X satisfying d(x_i, x_j) ≤ ε, and initialize F_1 = ∅. Then, F_1 ∪ {x_i, x_j} → F_1 and X − {x_i, x_j} → X.
Step 2: take the mean value (x_i + x_j)/2 of x_i and x_j as a new data point, and identify a data point x_l in set X whose distance to this mean is less than ε. Then, F_1 ∪ {x_l} → F_1 and X − {x_l} → X.
Step 3: take {x_i, x_j, x_l} as a new data point with value ((x_i + x_j)/2 + x_l)/2. Repeat Step 2 until no new data points can be found, forming new sets F_1 and X.
Step 4: repeat Steps 1 to 3 for set X to form set Y, consisting of F_1, …, F_m and X, wherein the final mean values of the data points in F_1, …, F_m are taken as new data points, respectively.
Step 5: let the data points in Y be the nodes of a graph, in the sense of graph theory, with each edge weighted by the distance between its endpoints. If a distance is greater than γ, the corresponding edge is deleted, forming a connected network graph. Assuming that points a, b, and c form a cycle and that a is farthest from b, the probability that a and b form a cluster is lower than that for a and c, or for b and c, so the edge between a and b can be deleted. The resulting cycle-free graph G is the connected network graph.
Step 6: in graph G, sort the nodes by their number of adjacent points. Each node, together with the nondominated nodes around it, forms a cluster.
Step 7: since the edges of a cluster vary in length, it is difficult to generate an effective cluster set by taking their average value as the criterion. After deletion, each data point in a cluster is basically equidistant from the cluster center, so point estimation may be adopted to work out its expected value.
Given that cluster C_l has q neighbors, with sample mean x̄_l and sample standard deviation s_l, the criterion is |x_i − x̄_l|/(s_l/√q) ≤ t_{1−α}. An adaptive k-nearest-neighbor algorithm is used to search for the data points closest to, or within a set distance of, the given data point and merge them into the clustering components for clustering fusion. Therefore, an x_i that satisfies the formula is included in cluster C_l; otherwise, the data point is deleted from C_l. If a point qualifies for multiple classes, it is assigned by priority to the cluster whose center is nearest. The average value of all data points in cluster C_l then becomes the data point z_t.
Step 8: the data points are downsized using the approach above, and the pertinence of clustering on set Z is strengthened. The original dataset X becomes the set Z = {z_1, …, z_m}.
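Steps 1 to 4 above can be sketched as a greedy merging routine. This is a simplification under the assumption that each merged group is represented by its running mean; the graph-based Steps 5 to 8 are omitted:

```python
import numpy as np

def preprocess(X, eps=0.2):
    """Greedily merge the closest pair of representatives while their
    distance is at most eps, replacing each merged group by its running
    mean (the spirit of Steps 1-4). Returns the reduced point set and
    the lists of original indices behind each representative."""
    points = [np.asarray(x, dtype=float) for x in X]
    members = [[i] for i in range(len(points))]
    merged = True
    while merged:
        merged = False
        best, bi, bj = None, -1, -1
        # find the closest pair of current representatives
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                d = np.linalg.norm(points[i] - points[j])
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best is not None and best <= eps:
            # Steps 2/3: replace the pair by its mean value
            points[bi] = (points[bi] + points[bj]) / 2.0
            members[bi] += members[bj]
            del points[bj]
            del members[bj]
            merged = True
    return np.array(points), members
```

Each surviving representative plays the role of a final mean value of one F_i (or of a lone point of X).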
4. Clustering Algorithm of Consumer Evaluation
The number of clusters and the initial cluster centers must be determined before clustering with the FCM algorithm. The former may be obtained manually or by defining an interval range and selecting the best cluster number within it. From the perspective of consumer clustering, the number of clusters is taken as the number of evaluation channels, p, since clustering serves to coordinate the needs of consumers who prefer different service channels.
The initial cluster center will change with the objective function value of the optimal fuzzy classification. However, it is difficult to meet the difference requirements between classes. In general, scholars add a penalty function to the objective function of the existing model (1), or similar models, to maximize the between-class distance (1/(p(p − 1))) ∑_{q=1}^{p} ∑_{k=1}^{p} d(y_q, y_k). However, the following problems are encountered:
The inner-class distance is a function of the distances between each data point and the cluster center, weighted by powers of the memberships, while the between-class distance is the average of the distances between cluster centers. The two differ widely in magnitude and are thus not directly comparable. Incorporating both into a single objective function (to be minimized) may fail to simultaneously maximize the between-class distance and minimize the inner-class distance in an iteration, instead favoring the former.
Iteration termination is based on the difference of objective function values between iterations falling within a specific range, at which point the optimal cluster centers and membership function are obtained. As a likely nonconvex function with local optimal solutions, the objective function at the end of the iteration may not reach a small value; moreover, the objective values of two successive iterations may differ little while another pair differs greatly. The convergence of the algorithm therefore cannot be proven.
To maintain a short inner-class distance and a long between-class distance, partitioning the two indexes and setting a more appropriate iteration termination condition are necessary based on the above considerations. The determination of an optimized cluster center can then proceed. The steps are as follows:
Step 1: as the clustering result is sensitive to the selection of the initial cluster centers, the distance between cluster centers should be increased as much as possible. The dominant point with the most neighbors in dataset Z is taken as the first cluster center, the data point farthest from the dominant point as the second, the data point with the largest product of the distance to the two cluster centers as the third, and so on, until P initial clustering centers are solved.
Step 2: calculate u_ik and y_k by equations (2) and (3), respectively. β can be given or estimated by the sample variance (1/p) ∑_{j=1}^{p} [1/(n − 1)] ∑_{i=1}^{n} (x_ij − (1/n) ∑_{i=1}^{n} x_ij)².
Step 3: set the threshold α for the inner-class distance and the threshold β for the between-class distance (not to be confused with the Gaussian constant β in equation (1)). The variance s = (1/(p − 1)) ∑_{k=1}^{p} ‖y_k − ȳ‖² is used to characterize between-class differences, where ȳ = (1/p) ∑_{k=1}^{p} y_k.
Step 4: if JU,V≤α and s≥β, then the iteration is terminated, and the obtained uik and yk are the most suitable membership function and cluster center, respectively.
Step 5 (sample classification): work out the kernel-induced distance β − β φ(z_j, y_k) between each data point z_j, j ∈ {1, …, m}, and each cluster center, and assign each point to the class for which this value is minimal.
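Step 1's center initialization and Step 3's between-class variance can be sketched as follows. The sum-of-distances fallback used when neighbor counts from graph G are unavailable is an assumption for illustration:

```python
import numpy as np

def init_centers(Z, P, neighbor_counts=None):
    """Step 1 sketch: the dominant point (most neighbors) is the first
    center; each subsequent center maximizes the product of distances
    to the centers already chosen (so the second is simply the point
    farthest from the first)."""
    Z = np.asarray(Z, dtype=float)
    if neighbor_counts is not None:
        first = int(np.argmax(neighbor_counts))
    else:
        # fallback assumption: take the point with the smallest total
        # distance to all others as the "dominant" point
        D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
        first = int(np.argmin(D.sum(axis=1)))
    chosen = [first]
    while len(chosen) < P:
        prod = np.ones(len(Z))
        for c in chosen:
            prod *= np.linalg.norm(Z - Z[c], axis=1)  # distance product
        prod[chosen] = -1.0  # never re-pick an already chosen center
        chosen.append(int(np.argmax(prod)))
    return Z[chosen]

def between_class_variance(V):
    """Step 3's s = (1/(p-1)) * sum_k ||y_k - y_bar||^2."""
    ybar = V.mean(axis=0)
    return (np.linalg.norm(V - ybar, axis=1) ** 2).sum() / (len(V) - 1)
```

The iteration of Steps 2 to 4 would then stop once J(U, V) ≤ α and between_class_variance(V) ≥ β both hold.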
5. Simulation Research
Given that a service resource targets a large number of consumers and may be allocated via five channels, a random sample survey was conducted on 100 consumers to seek their service evaluation data on each channel. The consumer group is clustered to pursue a more effective allocation of resources.
Following the given steps, the clustering potential of each data point (consumer) is preprocessed based on the evaluation data. The distance between each pair of evaluation data points is calculated by formula (4), with ε = 0.2 and γ = 0.70:
Data close to each other are initially clustered by Steps 1 to 4 in Section 3 to obtain set Y (Y = {F_1, F_2, …, F_23, X}) composed of F_1, …, F_23 and X, where F_1 = {x_1, x_3}, F_2 = {x_58, x_77}, and so on. Each F_i is regarded as a new data point and each datum in X as a separate data point. Thus, the initial evaluation set is simplified to set Y with 72 data points.
By Step 5, the data in set Y are processed into the cycle-free connected graph G (see Figure 1) formed by the elements of Y, wherein isolated points and points not on the main connected graph are not drawn.
Each node of graph G is processed by Steps 6 and 7 to obtain dataset Z (including 41 data points) on the basis of the adaptive nearest-neighbor classification rule. Each z_i represents one or several data points in X; part of this correspondence is shown in Table 1.
Since the evaluation data involve 5 channels, 5 initial cluster centers are found by Step 1 in Section 4: y_1^(0) = (0.50, 0.42, 0.66, 0.53, 0.56), y_2^(0) = (0.96, 0.98, 0.17, 0.87, 0.06), y_3^(0) = (0.13, 0.01, 0.11, 0.04, 0.70), y_4^(0) = (0.88, 0.06, 0.59, 0.92, 1.00), and y_5^(0) = (0.03, 1.00, 0.28, 0.26, 0.44).
Given m = 2, u_ik and y_k are calculated with Steps 2 to 4. Suppose α = 130.5 and β = 0.3. The iteration stops when J = 130.43, s = 0.3227, and t = 14, so the cluster centers reach their optimal state after 14 iterations: y_1^(14) = (0.45, 0.32, 0.48, 0.24, 0.53), y_2^(14) = (0.92, 0.97, 0.18, 0.85, 0.09), y_3^(14) = (0.33, 0.19, 0.09, 0.30, 0.48), y_4^(14) = (0.87, 0.06, 0.59, 0.92, 0.99), and y_5^(14) = (0.27, 0.76, 0.15, 0.17, 0.27).
Perform Step 5 to classify samples, procuring a clustering result of Set Z.
Based on Table 2 and the correspondence between set Z and set X in Table 1, set X is recovered. This is the clustering result of the original evaluation data, as shown in Table 3.
For comparison, the distance ‖V_present − V_previous‖ (expressed as d_{i,j}) over 33 iterations under the termination rule of [9] is shown in Table 4.

Table 4: d_{i,j} after 33 iterations.

(i, j)  | d_{i,j} | (i, j)   | d_{i,j} | (i, j)   | d_{i,j} | (i, j)   | d_{i,j}
(1, 2)  | 0.0130  | (9, 10)  | 0.0276  | (17, 18) | 0.0024  | (25, 26) | 0.0023
(2, 3)  | 0.0141  | (10, 11) | 0.0249  | (18, 19) | 0.0021  | (26, 27) | 0.0024
(3, 4)  | 0.0157  | (11, 12) | 0.0199  | (19, 20) | 0.0019  | (27, 28) | 0.0027
(4, 5)  | 0.0177  | (12, 13) | 0.0134  | (20, 21) | 0.0019  | (28, 29) | 0.0029
(5, 6)  | 0.0205  | (13, 14) | 0.0085  | (21, 22) | 0.0018  | (29, 30) | 0.0032
(6, 7)  | 0.0239  | (14, 15) | 0.0055  | (22, 23) | 0.0019  | (30, 31) | 0.0035
(7, 8)  | 0.0272  | (15, 16) | 0.0038  | (23, 24) | 0.0020  | (31, 32) | 0.0039
(8, 9)  | 0.0287  | (16, 17) | 0.0029  | (24, 25) | 0.0021  | (32, 33) | 0.0044
Changes in d_{i,j} are shown in Figure 2.

Figure 2: Changes in d_{i,j}.
Figure 2 illustrates that d_{i,j} first increases, then decreases, and then increases again, without monotone convergence. The cluster centers may therefore not be optimal when the iteration stops at ‖V_present − V_previous‖ ≤ ε; a more suitable center satisfying the conditions, with smaller values of d_{i,j}, may appear after further iterations.
In this paper, the iteration ceased when J(U, V) ≤ α and s ≥ β, where the additional condition s ≥ β ensured an appropriate distance between different classes and made it easier to choose appropriate values for α and β.
6. Conclusion
Complex huge group clustering is the basis for the effective distribution of service resources and for group coordination; nevertheless, traditional FCM and its improved versions are incapable of processing the numerous data points to be clustered. Accordingly, the deletion of data points was studied in this paper by using a graph-based clustering algorithm, an adaptive clustering algorithm, and the Gaussian kernel clustering algorithm. Meanwhile, a new Gaussian kernel algorithm was proposed that accounts for both the inner-class distance and the between-class distance, which the traditional objective function fails to do.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] Li W., Li J., "Improvement of semi-supervised kernel clustering algorithm based on multi-factor stock selection," 2018, 33(3), 30–36.
[2] Zhen L., "Tactical berth allocation under uncertainty," 2015, 247(3), 928–944, doi: 10.1016/j.ejor.2015.05.079.
[3] Zhang Z., Lin J., Miao R., "Hesitant fuzzy language condensed hierarchical clustering algorithm and its application," 2019, 21.
[4] Zhen L., "Modeling of yard congestion and optimization of yard template in container ports," 2016, 90.
[5] Duan Y., Wang G., "A FCM clustering algorithm based on polygonal fuzzy numbers to describe multiple attribute index information," 2016, 12(3), 2203228.
[6] Ying D., Ying X., Ye J., "A novel clustering algorithm based on graph theory," 2009, 45(3), 47–50.
[7] Xue Y., Sha X., "On gray prediction model based on an improved FCM algorithm," 2017, (09), 29–32.
[8] Zhang H., Wang J., "Improved fuzzy C-means clustering algorithm based on selecting initial clustering center," 2009, 36(6), 206–208.
[9] Chiu S. L., "A cluster estimation method with extension to fuzzy model identification," Proceedings of the IEEE Conference on Control Applications, Part 2, Orlando, FL, USA, August 1994, 1240–1245.
[10] Agustin-Blas L. E., Salcedo-Sanz S., Jimenez-Fernandez S., Carro-Calvo L., Del Ser J., Portilla-Figueras J. A., "A new grouping genetic algorithm for clustering problems," 2012, 39, 9695–9703.
[11] Zhang H., Lu J., "Semi-supervised fuzzy clustering: a kernel-based approach," 2009, 22(6), 477–481, doi: 10.1016/j.knosys.2009.06.009.
[12] Ramathilagam S., Huang Y.-M., "Extended Gaussian kernel version of fuzzy c-means in the problem of data analyzing," 2011, 38(4), 3793–3805, doi: 10.1016/j.eswa.2010.09.040.
[13] Qian X., Yao L., "Extended incremental fuzzy clustering algorithm for sparse high-dimensional big data," 2019, 45(6), 75–81.
[14] Niu X., She K., "Research on fast parallel clustering algorithm for large scale data," 2012, 39(1), 134–137.
[15] Hu W., "Improved hierarchical K-means clustering algorithm," 2013, 49(2), 157–159.
[16] Liu J., Jiang Y., Wang J., Deng Z., Wang S., "Single pass bayesian fuzzy clustering," 2018, 29(9), 2664–2680.
[17] Zhou S., Xu W., Tian C., "Data-weighted fuzzy C-means clustering algorithm," 2014, 36(11), 2314–2319.
[18] Chen X., Liu R., "Improved clustering algorithm and its application in complex huge group decision-making," 2006, 28(11), 1695–1699.
[19] Xie F., Lei C., Li F., Huang D., Yang J., "Unsupervised hyperspectral feature selection based on fuzzy C-means and grey wolf optimizer," 2019, 40(9), 3344–3367, doi: 10.1080/01431161.2018.1541366.
[20] Huang H., Chang K., Yu H., "Research on adaptive entropy weight fuzzy C-means clustering algorithm," 2016, 36(12), 219–223.
[21] Xu X., Fan Y., "Improved ants-clustering algorithm and its application in multi-attribute large group decision making," 2011, 33(2), 346–349.