Adaptive Initialization Method Based on Spatial Local Information for k-Means Algorithm

The k-means algorithm is a widely used clustering algorithm in the data mining and machine learning communities. However, the initial guess of cluster centers seriously affects the clustering result: improper initialization can prevent the algorithm from reaching a desirable clustering. How to choose suitable initial centers is therefore an important research issue for the k-means algorithm. In this paper, we propose an adaptive initialization framework based on spatial local information (AIF-SLI), which takes advantage of the local density of the data distribution. As it is difficult to estimate density exactly, we develop two approximate estimates: density by t-nearest neighborhoods (t-NN) and density by ε-neighborhoods (ε-Ball), leading to two implementations of the proposed framework. Our empirical study on more than 20 datasets shows promising performance of the proposed framework and indicates that it has several advantages: (1) it can find reasonable candidates for the initial centers effectively; (2) it can reduce the number of iterations of k-means significantly; (3) it is robust to outliers; and (4) it is easy to implement.


Introduction
Clustering is the process of grouping a set of data objects into clusters based on information found in the data [1]; it has a long history in a variety of scientific disciplines, from statistics and computer science to biology, medicine, and even psychology. The main goals of clustering involve compressing, classifying, and extracting useful information from data.
Clustering algorithms can be roughly divided into two categories: hierarchical and partitional [2]. Hierarchical clustering algorithms recursively find nested clusters either in agglomerative (bottom-up) mode or in divisive (top-down) mode, whereas partitional algorithms find all clusters simultaneously as a partition of the dataset. Most hierarchical algorithms have quadratic or higher time complexity in the number of data points [3] and are therefore not suitable for large-scale big-data applications. Partitional algorithms, in contrast, often have lower time complexity and are used in many large-scale tasks, including the bag-of-features (BoF) method in computer vision [4], color quantization in graphics and image processing [5], the bag-of-words model for text classification [6], and, more recently, the pretraining of deep networks [7].
The k-means algorithm is undoubtedly one of the most popular and widely used clustering algorithms [8]. k-means is a hard partitional algorithm, which divides a dataset into a set of exhaustive and mutually exclusive clusters. That is, for a given dataset $X = \{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$, the k-means algorithm iteratively divides $X$ into $k$ clusters $C = \{C_1, \ldots, C_k\}$, subject to $X = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for all $1 \le i \ne j \le k$. The k-means algorithm usually generates clusters by optimizing a certain criterion function, the most intuitive and frequently used of which is the Sum of Squared Error (SSE), given by

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - c_i\|_2^2, \qquad (1)$$

where $\|\cdot\|_2$ denotes the Euclidean ($\ell_2$) norm and $c_i = (1/|C_i|) \sum_{x_j \in C_i} x_j$ is the center of cluster $C_i$, whose cardinality is $|C_i|$. As a result, finding the optimal clusters for the k-means algorithm amounts to minimizing this criterion function. Other criterion functions can also be used, such as city block ($\ell_1$ norm), Hamming distance, and cosine dissimilarity.
Given the initial centers, the k-means algorithm repeats two alternating procedures to group the data points into k desired clusters [9]: it first assigns each data point to the nearest of the k centers and then updates each cluster's center based on the assignments. The assignment and update procedures are repeated until either there is no further change in the criterion function or the maximum number of iterations is reached.
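The alternating assignment/update loop can be sketched in a few lines of Python (a minimal illustration with our own function names, not the MATLAB implementation used in the experiments later in this paper):

```python
import math

def sse(X, centers, labels):
    # Criterion (1): sum of squared Euclidean distances to the assigned centers
    return sum(sum((a - b) ** 2 for a, b in zip(x, centers[l]))
               for x, l in zip(X, labels))

def kmeans(X, centers, max_iter=100, tol=1e-6):
    centers = [list(c) for c in centers]
    prev = float("inf")
    for it in range(max_iter):
        # Assignment step: each point joins its nearest center
        labels = [min(range(len(centers)),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(x, centers[j])))
                  for x in X]
        # Update step: each center becomes the mean of its assigned points
        for j in range(len(centers)):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        cur = sse(X, centers, labels)
        # Stop when the relative SSE improvement drops below tol
        if math.isfinite(prev) and prev > 0 and (prev - cur) / prev < tol:
            break
        prev = cur
    return centers, labels, it + 1

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers, labels, iters = kmeans(X, [(0.0, 0.0), (5.0, 5.0)])
```

With well-placed initial centers, the loop above stops after very few iterations, which is exactly what a good initialization buys.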
Despite its popularity, the k-means algorithm has some drawbacks [10]: (1) it requires the user to specify the number of clusters in advance, or to run it independently for different values of k and let domain experts select the partition that appears most meaningful; (2) it can only detect compact, hyperspherical clusters that are well separated, and it is not well suited to high-dimensional tasks because it uses the Euclidean distance as its default similarity metric; (3) it is sensitive to noise and outliers, since even a few such points can significantly shift the center (mean) of their respective cluster; (4) it often converges to a local minimum of the criterion function, owing to its gradient-descent nature and the nonconvexity of the criterion function; (5) it is highly sensitive to the initial guess of the centers. Adverse effects of improper initialization include empty clusters, slower convergence, and a higher chance of getting stuck in bad local minima [5]. As discussed by Celebi et al. [10], all of these drawbacks except choosing the number of clusters can be remedied by a suitable adaptive initialization method.
As mentioned by Celebi and Kingravi in [11], initial centers should avoid both outliers and being too close to each other. Many studies focus on the initialization of k-means, including random selection, probability-based methods, and factor analysis. A more intuitive idea is to determine the initial centers according to the spatial distribution of the data points. That is, if we choose initial centers from regions with high local density of the data distribution, which is a kind of spatial local information, outliers will be prevented from being chosen. By additionally keeping the centers a certain distance apart, we obtain suitable initial centers for the k-means algorithm (as shown in Figure 2).
Inspired by the discussion above, we propose an adaptive initialization framework based on spatial local information (AIF-SLI), which takes advantage of the local density of the data distribution, and we develop two approximate density estimates, namely, density by t-nearest neighborhoods (t-NN) and density by ε-neighborhoods (ε-Ball), leading to two implementations of the proposed framework. Both implementations involve three steps. First, we find the high-density regions of the data based on an estimate of the local data density. Second, we mark the data points that belong to high-density regions as candidates for the initial centers. Third, we determine which candidates to select as initial centers while making sure that they are kept a certain distance apart.
The contributions of this paper are as follows: (1) we propose an adaptive framework for the initial guess of cluster centers based on spatial local information; (2) we derive two implementations of the proposed framework; (3) we give a comparative empirical study of the proposed framework and state-of-the-art techniques and analyse the rationality of the proposed framework.
Our empirical study on more than 20 datasets shows promising performance of the proposed framework and indicates that it has several advantages: (1) it can find suitable candidates for the initial centers effectively; (2) it can reduce the number of iterations of k-means significantly; (3) it is robust to outliers; and (4) it is easy to implement.
The rest of this paper is organized as follows. Section 2 reviews the related work on initialization for the k-means algorithm. Section 3 presents the proposed adaptive initialization framework based on spatial local information. The two implementations of the framework, t-nearest neighborhoods (t-NN) and ε-neighborhoods (ε-Ball), are derived in Section 4. Section 5 describes the datasets, feature normalization, performance criteria, and parameter settings of our experiments. Section 6 presents our extensive empirical study evaluating the performance of the proposed method. Section 7 analyses the complexity of our method and the suitability of initializing the k-means algorithm based on spatial local information. After that, we conclude this work.

Related Work
There is a considerable amount of literature on initialization methods for the k-means algorithm; we briefly review some of the commonly used ones. A comprehensive review can be found in the recent work by Celebi et al. [10].
A number of initialization methods select the initial centers in a probabilistic manner, implicitly or explicitly. The most widely used initialization for k-means is MacQueen's second method, which selects the initial centers randomly [12]. Due to its random nature, it inevitably selects outliers as initial centers on occasion, leading to more iterations or bad local minima. k-means with this initialization usually needs to be run several times with different initial partitions, choosing the partition with the smallest squared error. Forgy's method [13] is a random assignment method, which first assigns each point to one of the k clusters uniformly at random; the initial centers are then the centroids of the data points according to this assignment. This method is often confused with MacQueen's second method discussed above, whose first step is selecting initial centers rather than making an assignment. Both methods can be viewed as selecting (assigning) under a uniform distribution over the data points. The k-means++ method [14] chooses the first center arbitrarily and the $i$th center ($i \in \{2, 3, \ldots, k\}$) with probability $md(x_j)^2 / \sum_{l=1}^{n} md(x_l)^2$, where $md(x)$ denotes the minimum distance from a point $x$ to the previously selected centers. This implies that data points farther from the selected centers have a higher probability of being chosen as new initial centers. A parallel version of the k-means++ algorithm was developed in [15].
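The k-means++ rule described above can be sketched as follows (a simplified illustration; the helper names and the seeded random generator are our own, not part of the cited work):

```python
import random

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans_pp_init(X, k, rng):
    # First center: chosen arbitrarily (here, uniformly at random)
    centers = [rng.choice(X)]
    while len(centers) < k:
        # md(x)^2: squared distance to the nearest already-chosen center
        weights = [min(dist2(x, c) for c in centers) for x in X]
        total = sum(weights)
        # Sample a point with probability md(x)^2 / sum_j md(x_j)^2;
        # already-chosen points have weight 0 and are never re-selected
        r = rng.random() * total
        acc = 0.0
        for x, w in zip(X, weights):
            acc += w
            if w > 0 and acc >= r:
                centers.append(x)
                break
        else:
            centers.append(X[-1])  # numerical fallback, practically unreachable
    return centers

rng = random.Random(0)
X = [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1)]
centers = kmeans_pp_init(X, 2, rng)
```

Because distant points carry almost all of the sampling weight, the second center lands in the far group with overwhelming probability.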
Several works consider choosing initial centers a desired distance apart from each other. Ball and Hall's method [16] takes the centroid of the dataset $X$ as the first initial center, that is, $c_1 = (1/|X|) \sum_{x_i \in X} x_i$. It then repeatedly takes a data point that is at least $T$ units apart from the previously selected centers as the next initial center, until k centers are obtained. This method keeps the chosen centers a desired distance apart and thus prevents the initial centers from being too close to each other; however, it is difficult to determine a reasonable threshold $T$. The Maximin method [17, 18] takes the first center arbitrarily from dataset $X$. The second initial center is the point farthest from the first center, and each subsequent center is the point with the greatest minimum distance to the previously selected centers. It should be noted that in [18], the data point with the maximum norm in $X$ is chosen as the first center. Erisoglu et al. [19] choose the two of the $d$ features that best describe the variation in the dataset as the main axes, where $d$ is the dimension of the space in which $X$ lies. All initial centers are chosen based on the projection of $X$ onto these main axes: the data point with the maximum distance from the mean of the dataset is chosen as the first initial center, and the $i$th center is chosen as the point with the largest sum of distances from the previously selected centers, until k centers are obtained. These methods cannot guarantee that outliers are not selected as initial centers.
Recently, several studies have utilized factor-analysis methods to initialize the k-means algorithm. The PCA-Part method [20] uses a divisive hierarchical approach based on Principal Component Analysis (PCA). This method iteratively selects the cluster with the greatest SSE and divides it into two subclusters by a hyperplane that passes through the cluster center orthogonal to the direction of the principal eigenvector of the corresponding covariance matrix. In the first step, the entire dataset is taken as the cluster with the greatest SSE. The Var-Part method [20] is an approximation to PCA-Part that assumes the covariance matrix is diagonal; in this case, the hyperplane is orthogonal to the coordinate axis with the greatest variance. Celebi and Kingravi [11] use Otsu's algorithm to obtain an adaptive threshold for the PCA-Part and Var-Part algorithms [20], which originally take the cluster center as the threshold for dividing a cluster into two subclusters. This leads to a deterministic initialization method for k-means, and experiments show that it achieves promising performance compared to the original algorithms. Onoda et al. [21] acquire the initial centers by independent component analysis (ICA): they first calculate the $k$ independent components of $X$ and then choose the point with the least cosine distance from the $i$th independent component as the $i$th ($i \in \{1, 2, \ldots, k\}$) center.
Al-Daoud's method [22] first uniformly partitions the data space into $M$ grids. It then takes $k_m = k \cdot n_m / n$ points as initial centers from grid $m$, $m = 1, \ldots, M$, where $n_m$ is the number of data points in grid $m$ and $n$ is the total number of data points in dataset $X$. This method can be viewed as a density-based initialization method. However, it tends to ignore clusters with few data points, and it is not easy to decide how many grids $M$ are suitable.
The proposed initialization method bears some similarity to that of Al-Daoud and Roberts [22], in that the local density of the data points is taken into consideration when determining the initial centers. However, the proposed method differs markedly from Al-Daoud's method. In their work, local density is estimated from an artificial, hard grid segmentation, so different grid segmentations lead to different local density estimates. Our method estimates the local density of the data points according to their local spatial distribution, which can be regarded as a natural, soft segmentation, and the initial centers are determined by the distribution of the data points themselves.

Proposed Framework of AIF-SLI
Since reasonable initial centers should avoid outliers and should not be too close to each other, we propose an adaptive initialization framework based on spatial local information (AIF-SLI) for the k-means algorithm.
AIF-SLI takes advantage of the local density of the data points. For a given dataset $X = \{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$, we first estimate the local density by defining a function $\rho(x)$ that describes the local density at each data point $x \in X$. For a point $x$, $\rho(x)$ describes the compactness of the data points within a small region containing $x$ as an inner point. We then find the regions whose density is higher than a threshold $\theta$. The initial centers of the k-means algorithm are chosen from these regions of higher local density.
Generally, the proposed initialization framework involves three steps: (1) estimate the local density at each data point $x \in X$; (2) find the regions with density higher than a threshold and mark the data points in these regions as candidates for the initial centers; (3) determine the initial centers from the candidates while making sure that they are separated by a certain distance. The workflow of the proposed framework is demonstrated in Figure 1.
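The three steps can be sketched generically in Python (illustrative code; `density` stands for any local-density estimate, such as the t-NN or ε-Ball estimates derived in Section 4, and all names are our own):

```python
def aif_sli(X, k, density, min_sep):
    # Step 1: estimate the local density of every data point
    D = [density(x, X) for x in X]
    # Step 2: mark points whose density reaches a threshold (here the median,
    # in the spirit of Algorithm 1) as candidates for the initial centers
    theta = sorted(D)[len(D) // 2]
    candidates = [x for x, d in zip(X, D) if d >= theta]
    # Step 3: greedily pick k candidates that stay at least min_sep apart
    centers = [candidates[0]]
    for x in candidates:
        if len(centers) == k:
            break
        if all(sum((a - b) ** 2 for a, b in zip(x, c)) >= min_sep ** 2
               for c in centers):
            centers.append(x)
    return centers

# Toy data: two dense groups and one isolated outlier
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
     (20.0, 20.0)]
eps_ball = lambda x, pts: sum(1 for y in pts
                              if sum((a - b) ** 2 for a, b in zip(x, y)) <= 1.0)
centers = aif_sli(X, 2, eps_ball, min_sep=2.0)
```

On this toy data the outlier falls below the density threshold and is never considered, while the two dense groups each contribute one well-separated center.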
It should be noted that although it is used here to initialize the k-means algorithm, the proposed framework can easily be extended to other clustering algorithms, such as the Gaussian Mixture Model.

Two Implements for AIF-SLI
In this section, we present the two implementations of the proposed framework, one based on t-nearest neighborhoods (t-NN) and one based on ε-neighborhoods (ε-Ball), leading to two adaptive initialization methods. We also give an algorithm for determining the initial centers from the candidates.

AIF-SLI Based on t-Nearest Neighborhoods (t-NN).
In this subsection, we present an approximate estimate of the local density of a dataset based on t-nearest neighborhoods (t-NN), a natural choice for approximating local density. Inspired by Laplacian eigenmaps [23], we first construct an adjacency graph by t-NN and obtain a Gram matrix from the constructed graph. The Gram matrix can be precomputed and loaded into memory before clustering. The entries of the Gram matrix are computed by

$$W_{ij} = \begin{cases} 1, & x_j \in N_t(x_i) \ \text{or} \ x_i \in N_t(x_j), \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$

where $N_t(x_i)$ denotes the t-nearest neighborhood of $x_i$. We then introduce a vector $D$ with $n$ components whose entries are given by $D_i = \sum_{j=1}^{n} W_{ij}$. The vector $D$ provides a natural measure of the local density of the data points: the larger the value of $D_i$, the higher the local density around $x_i$.
Algorithm 1 shows the detailed steps of the approximate version of the AIF-SLI method based on t-nearest neighborhoods (t-NN). In step (4), median($D$) denotes the median value of the vector $D$ (in our experiments, we use the median function in MATLAB); other criteria can also be used to determine the threshold $\theta$, such as the mean or a certain quantile.
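A pure-Python sketch of the t-NN density estimate (illustrative code mirroring our reading of equation (2); for clarity it recomputes all pairwise distances, whereas the Gram matrix would be precomputed in practice):

```python
def tnn_density(X, t):
    # D_i = sum_j W_ij, where W_ij = 1 iff x_j is among the t nearest
    # neighbors of x_i or x_i is among the t nearest neighbors of x_j
    n = len(X)
    d2 = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
          for i in range(n)]
    nn = [set(sorted((j for j in range(n) if j != i),
                     key=lambda j: d2[i][j])[:t]) for i in range(n)]
    return [sum(1 for j in range(n) if j != i and (j in nn[i] or i in nn[j]))
            for i in range(n)]

# A tight group of five points plus one distant outlier: the point in the
# middle of the group accumulates the most graph edges, the outlier the fewest
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (10.0, 10.0)]
D = tnn_density(X, t=2)
```

Points that many others count among their neighbors accumulate extra edges, so $D_i$ grows with local crowding even though every point has exactly t outgoing edges.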

AIF-SLI Based on ε-Neighborhoods (ε-Ball).
In this subsection, we present an approximate estimate of the local density of a dataset based on ε-neighborhoods (ε-Ball). We again construct an adjacency graph and a Gram matrix, whose entries are computed by

$$W_{ij} = \begin{cases} 1, & \|x_i - x_j\|_2 \le \epsilon, \\ 0, & \text{otherwise}. \end{cases} \qquad (3)$$

The definition in (3) is more intuitive than that in (2): equation (3) counts the number of data points in a small region centered at $x_i$, and more points in a region indicates that the data there are more compact. The value of $\epsilon$ is also an issue to consider, and we set it to a weighted half of the average distance of the points in $X$, that is, $\epsilon = 0.5 \times \mathrm{MeanDist}$, where

$$\mathrm{MeanDist} = \frac{2}{n(n-1)} \sum_{i < j} d(x_i, x_j), \qquad (4)$$

and $d(x_i, x_j)$ is the distance between data points $x_i$ and $x_j$. The algorithm for density by ε-neighborhoods (ε-Ball), which we name AIF-SLI-ε-Ball, follows the same steps as Algorithm 1; its details are shown in Algorithm 2.
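The ε-Ball estimate and the MeanDist rule above can be sketched as follows (illustrative code with our own names):

```python
def mean_dist(X):
    # Average pairwise Euclidean distance of the points in X (equation (4))
    n = len(X)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(sum((a - b) ** 2 for a, b in zip(X[i], X[j])) ** 0.5
               for i, j in pairs) / len(pairs)

def eps_ball_density(X, eps):
    # D_i counts the points inside the eps-ball centered at x_i (equation (3))
    return [sum(1 for y in X
                if sum((a - b) ** 2 for a, b in zip(x, y)) <= eps ** 2)
            for x in X]

X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (10.0, 10.0)]
eps = 0.5 * mean_dist(X)          # the paper's choice: half the mean distance
D = eps_ball_density(X, eps)
```

With this radius the four clustered points each count all four group members, while the outlier counts only itself.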

Algorithm 3: Determining initial centers from candidates.

Determining Initial Centers from Candidates.
In this subsection, we describe how to determine the initial centers from the candidates. We propose an algorithm very similar to the Maximin method [17], applied to the candidate set obtained by Algorithm 1 or Algorithm 2; Algorithm 3 shows its detailed steps. Without loss of generality, we choose the first data point in the candidate set as the first initial center in our experiments, which makes our algorithm deterministic. It should be noted that any other method, such as k-means++, could also be used to determine the initial centers from the candidates and would likewise lead to promising performance.
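The Maximin-style selection from the candidate set can be sketched as follows (our own illustrative code; as in the paper, the first candidate is taken as the first center, making the procedure deterministic):

```python
def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def select_centers(candidates, k):
    # Deterministic: the first candidate is the first initial center
    centers = [candidates[0]]
    while len(centers) < k:
        # Maximin rule: take the candidate with the greatest minimum
        # distance to the previously selected centers
        best = max(candidates,
                   key=lambda x: min(dist2(x, c) for c in centers))
        centers.append(best)
    return centers

candidates = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 5.1), (2.5, 2.4)]
centers = select_centers(candidates, 3)
```

Each selected center is pushed as far as possible from the ones already chosen, which keeps the initial centers well separated.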

Experiment Setup
5.1. Dataset Description. We use 25 common datasets from the UCI machine learning repository [24] in our experiments. Table 1 gives descriptions of these UCI datasets. For each dataset, the number of clusters (k) is set equal to the true number of classes, as is common in the related literature [9-11, 20].

5.2. Performance Criteria.
As in [10, 11], the performance of the initialization methods is quantified using two effectiveness (quality) criteria and one efficiency (speed) criterion. (i) Initial SSE. This is the SSE value calculated after the initialization phase, before the clustering phase. It measures the effectiveness of an initialization method by itself.
(ii) Final SSE. This is the SSE value calculated after the clustering phase. It measures the effectiveness of an initialization method when its output is refined by k-means; a lower value indicates more compact clusters. Note that the SSE is the objective function of the k-means algorithm, that is, equation (1).
(iii) Number of Iterations. This is the number of iterations that k-means requires to reach convergence when initialized by a particular method.
It is an efficiency measure independent of programming language, implementation style, compiler, and CPU architecture.
All methods compared in this paper are implemented using MATLAB and executed on a PC with 2.5 GHz CPU and 4 GB RAM.

5.3. Attribute Normalization.
Feature normalization is a common preprocessing step in the computer vision and machine learning communities. Normalization is necessary to prevent features with large ranges from dominating the distance calculations and also to avoid numerical instabilities in the computations. Two commonly used normalization schemes are linear scaling to unit range (min-max normalization) and linear scaling to unit variance (z-score normalization). Min-max normalization is given by

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad (5)$$

where $x_{\min}$ and $x_{\max}$ denote the minimum and maximum values of the feature, respectively; it scales the attribute to the [0, 1] interval. z-score normalization maps the attribute approximately to the [−2, 2] interval using

$$x' = \frac{x - \mu}{\sigma}, \qquad (6)$$

where $\mu$ and $\sigma$ denote the mean and standard deviation of the feature, respectively. Several studies report that the former is preferable to the latter in clustering research, since the latter is likely to eliminate valuable between-cluster variation [20, 25]. In this paper, we apply min-max normalization to map the attributes of each dataset to the [0, 1] interval.

5.4. Parameter Settings. In our AIF-SLI-t-NN algorithm, the choice of the number of neighbors, t, is an important issue. We analyse the final SSE for different choices of t on some of the datasets described in Table 1, and the results are shown in Table 2.
Based on this table, we set $t = \max\{20, 0.1 \times n/k\}$ in our experiments, a choice that is sufficiently discriminative without being overly sensitive to outliers. The convergence of k-means is controlled by two criteria: the number of iterations reaches a maximum allowable value (set to 100 in our experiments), or the relative improvement in SSE between two consecutive iterations drops below a threshold, that is, $|\mathrm{SSE}_{i+1} - \mathrm{SSE}_i| / \mathrm{SSE}_i < 10^{-6}$.
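The min-max normalization of Section 5.3 can be sketched as follows (illustrative code, our own; each feature column is scaled to [0, 1] independently):

```python
def min_max_normalize(X):
    # Equation (5) applied per feature: x' = (x - min) / (max - min)
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi))
            for row in X]

X = [(0.0, 10.0), (5.0, 20.0), (10.0, 30.0)]
Xn = min_max_normalize(X)
```

Constant features (where max equals min) are mapped to 0 to avoid division by zero.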

Experimental Results
In this section, we compare the two implementations of our AIF-SLI method, AIF-SLI-t-NN and AIF-SLI-ε-Ball, with five state-of-the-art methods: MacQueen's second method (random) [12], the Maximin method (Maximin) [18], k-means++ (Kpp) [14], Var-Part [20], and PCA-Part [20]. All seven methods are executed 100 times and the mean is reported for each performance criterion. Tables 3, 4, 5, and 6 present the performance measurements for each method with respect to initial SSE, final SSE, percentage comparison of final SSE, and number of iterations, respectively. We note that the Maximin method [18], Var-Part [20], PCA-Part [20], and our two algorithms, AIF-SLI-t-NN and AIF-SLI-ε-Ball, are deterministic. Table 3 shows that PCA-Part and Var-Part obtain the lowest initial SSE on most datasets, with our two algorithms achieving the second-best values; this may be because PCA-Part and Var-Part take the mean of each feature as the threshold. Table 4 shows that PCA-Part and Var-Part obtain the largest final SSE, which indicates that they do not produce compact clusters after clustering, whereas AIF-SLI-t-NN and AIF-SLI-ε-Ball obtain the lowest final SSE on most datasets, indicating that they produce more compact clusters than the other methods. Table 5 reports the percentage comparison of the final SSE. We take Kpp as the baseline, denoted by the number "1"; the values in this table are the percentages of the other algorithms relative to Kpp, where "+" ("−") denotes a value larger (smaller) than Kpp's. Taking dataset 4 as an example, the value −0.8% under random means that the final SSE of random is smaller than that of Kpp by 0.8%, and +9.0% under Maximin means that the final SSE of Maximin is larger than that of Kpp by 9.0%. The last row of the table gives the average percentage of each algorithm relative to Kpp and shows that both AIF-SLI-t-NN and AIF-SLI-ε-Ball achieve better final SSE than Kpp. Table 6 reports the number of iterations from initialization until the k-means algorithm terminates; it shows that AIF-SLI-t-NN and AIF-SLI-ε-Ball converge quickly on most datasets.
From the experimental results, the two proposed algorithms reduce the number of iterations of k-means significantly, and the final SSE results in Tables 4 and 5 show that both algorithms produce compact clusters. This indicates that our initial centers are suitable for the k-means algorithm, and this suitability comes from the advantage of spatial local information. The two proposed algorithms are approximate estimates of spatial local information; if more refined approaches were used to estimate the spatial local information, better results might be obtained.

Discussion
7.1. Complexity Analysis. Table 7 gives the average CPU time over 100 runs for the seven algorithms mentioned above on a subset of the datasets in Table 1 (as CPU time depends on implementation style, compiler, and CPU architecture, we do not use it as a performance criterion in our experiments). The CPU time covers preprocessing (including the calculation of the Gram matrix), initialization, and the k-means running time. Table 7 shows that for small-scale, low-dimensional datasets, such as datasets 4 and 5, all algorithms run quickly. However, when the number of data points is large but the dimension is low, as in datasets 1, 13, and 25, both AIF-SLI-t-NN and AIF-SLI-ε-Ball need more time than the others. As the dimension grows, taking datasets 8, 18, and 22 as examples, the PCA-Part algorithm takes the longest running time, followed by AIF-SLI-t-NN, Var-Part, and AIF-SLI-ε-Ball. The AIF-SLI-ε-Ball algorithm always runs faster than AIF-SLI-t-NN, owing to the lower computational cost of finding the ε-Balls compared with the t-nearest neighborhoods.
Table 8 shows the CPU time of a single run of each component of the AIF-SLI-t-NN algorithm; these components include calculating the vector $D$ (including constructing the Gram matrix), initialization, and running k-means. Calculating the vector $D$ takes more than 90% of the running time on large datasets, because finding the t-nearest neighborhoods of every data point has computational complexity $O(n^2)$ and is therefore time consuming.
It should be noted that the Gram matrix has size $n \times n$, which can be very large when $n$ is of the order of a million and infeasible when $n$ is of the order of a billion. In our experiments, we use the single command in MATLAB to convert the data from 8-byte to 4-byte precision to save memory, especially for dataset 13. Moreover, if the parameter $t$ of the t-nearest neighborhood algorithm is small enough, the Gram matrix is very sparse, and a sparse matrix can be used in place of the full Gram matrix. The computational complexity and memory optimization of the proposed method when the Gram matrix is sparse are worthy of further study.
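As a rough illustration of the sparse alternative mentioned above (our own pure-Python sketch, not the MATLAB code used in the experiments), one can keep only each point's t neighbor entries instead of the full n × n matrix, cutting memory from O(n²) to O(nt):

```python
def sparse_tnn_graph(X, t):
    # Keep only t (squared distance, neighbor index) pairs per point
    # instead of a full n x n Gram matrix: memory O(n*t) rather than O(n^2)
    n = len(X)
    graph = []
    for i in range(n):
        d2 = sorted((sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
                    for j in range(n) if j != i)
        graph.append(d2[:t])
    return graph

X = [(0.0,), (1.0,), (2.0,), (10.0,)]
g = sparse_tnn_graph(X, t=2)
```

The density vector and all subsequent steps only ever touch these t entries per row, so nothing else of the matrix needs to be materialized.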
The AIF-SLI-ε-Ball algorithm has similar characteristics to AIF-SLI-t-NN, and we omit its analysis owing to limited space.

Rationality Analysis.
In this section, we present a rationality analysis of the proposed framework. Figure 2 shows a dataset generated by three Gaussian functions with different means and variances. From Figure 2, we can confirm that, by utilizing spatial local information, which takes advantage of the local density of the dataset, the initial centers are chosen from regions of high local density and outliers are avoided, which is reasonable as discussed above. At the same time, the chosen initial centers are very close to the final centers, which explains why our method can reduce the number of iterations significantly.

Conclusion
In this paper, we propose an adaptive initialization framework based on spatial local information (AIF-SLI) for the k-means algorithm, designed by taking advantage of the local density of the data points. We first describe the framework of AIF-SLI based on defining a function to

Figure 1 :
Figure 1: The workflow of the proposed framework.

Figure 2 :
Figure 2: Rationality analysis of the two algorithms derived from the AIF-SLI framework. The red points denote the candidates, the yellow triangles are the initial centers chosen from the candidates, and the final centers (the centers after clustering) are shown as green diamonds.

Table 1 :
UCI dataset descriptions ($n$: number of data points, $d$: number of data dimensions, and $k$: number of classes).

Table 2 :
Final SSE for different choices of t in AIF-SLI-t-NN.

Table 3 :
Initial SSE comparison of the initialization methods. The optimal results are marked in bold font and the second-best results in italic font. The last row summarizes the numbers of optimal and second-best results.

Table 4 :
Final SSE comparison of the initialization methods.

Table 5 :
Percentage comparison for final SSE.

Table 6 :
Number of iterations comparison of the initialization methods.

Table 7 :
The average CPU time (ms).

Table 8 :
CPU time (ms) of each component of AIF-SLI-t-NN.