A Robust k-Means Clustering Algorithm Based on Observation Point Mechanism

College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China
National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, China
Department of Trace Inspection Technology, Criminal Investigation Police University of China, Shenyang 110854, China
Key Laboratory of Trace Inspection and Identification Technology of the Ministry of Public Security, Shenyang 110854, China


Introduction
Clustering is an important research branch of data mining. The k-means algorithm is one of the most popular clustering methods [1]. When performing k-means clustering, we usually use a local search to find the solution [2,3], i.e., selecting k points μ_1, μ_2, ..., μ_k as the initial cluster centers and then optimizing them by an iterative process to minimize the following objective function (see, for example, [4,5]):

\min_{\mu_1, \ldots, \mu_k} \sum_{i=1}^{k} \sum_{X_j \in C_i} \lVert X_j - \mu_i \rVert^2, \quad (1)

where X_j is the j-th data point belonging to the i-th cluster C_i. It is well known that the solution of equation (1) is affected by the initial values of μ_i (i = 1, 2, ..., k).
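The local search described above can be sketched as plain Lloyd iteration. The following is a minimal NumPy sketch (not the implementation used in this paper): it alternates an assignment step and an update step until the centers stop moving.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration minimizing the sum of squared distances
    to the nearest center (objective (1)); an illustrative sketch."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center for each point
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The result of this iteration depends on the random initial centers, which is exactly the sensitivity to initial values noted above.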
In order to choose μ_i properly, the k-means++ algorithm [6] picks out a set of points as the initial centers whose distances to each other are as large as possible. However, this method for choosing the initial centers is sensitive to outliers [7-9]. Some methods use subsets of the original data set to determine μ_i. For instance, the CLARA [10] and CLARANS [11] algorithms use PAM [12] to calculate the initial cluster centers from random subsets of the original data set. The sampling-based methods weaken the sensitivity because the sampling process can discard some outliers in the original data set, but it cannot guarantee that all outliers are excluded. Therefore, the outliers remaining in the subsets still affect the clustering results. Automatic clustering algorithms are attracting more and more attention from the academic community, e.g., the density-based spatial clustering of applications with noise (DBSCAN) algorithm [13-15], the depth-difference-based clustering algorithm [16], and Tanir's method [17]. Recently, a new automatic clustering algorithm named I-nice was proposed in [18]. Inspired by the observation point mechanism of the I-nice algorithm, we propose a two-stage k-means clustering algorithm in this paper to find the cluster centers from a subset of the original data set with all outliers removed. In the first stage, we select a small subset of the original data set based on a set of nondegenerate observation points.
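The D²-weighted seeding of k-means++ can be sketched as follows (a minimal sketch, not the cited implementation). It also illustrates the sensitivity noted above: an outlier has a large squared distance to every chosen center, so it is disproportionately likely to be picked as the next center.

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """D^2 seeding as in k-means++: each new center is sampled with
    probability proportional to the squared distance to the nearest
    already-chosen center, spreading the seeds far apart."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```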
The subset contains only the higher-density points of the original data set and excludes the outliers. Therefore, it is a good representation of the original data set for finding the proper cluster centers. In the second stage, we perform the k-means algorithm on the subset to obtain a set of cluster centers, and then the other points in the original data set can be clustered accordingly.
Selecting the subset in the first stage is based on a set of d + 1 nondegenerate observation points assigned to the data space R^d, where d is the dimension of the data points. For each observation point, we compute the set of distances between it and all data points in the original data set. This set of distances generates a distance distribution with respect to the observation point. From the distance distribution, we identify the dense areas and extract the subset of data points in the dense areas. Then, we take the intersection of the d + 1 subsets of data points obtained from the dense areas of those d + 1 distance distributions. After refining this intersection subset, we obtain a subset of the original data set that is free of outliers. Therefore, it can be used to find the proper cluster centers. Finally, we conduct convincing experiments to validate the effectiveness of our proposed algorithm, and the experimental results demonstrate that it is robust to outliers. The remainder of this paper is organized as follows. We describe the related mathematical principles of our algorithm in Section 2. The details of the two-stage k-means clustering algorithm and its pseudocode are presented in Section 3. In Section 4, we present a series of experiments to validate the feasibility of our proposed algorithm. Finally, we summarize the conclusions and future work in Section 5.
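The distance distributions of the first stage come from a simple computation: for each observation point, the Euclidean distances to all data points form a one-dimensional set. A minimal sketch (the function name is ours):

```python
import numpy as np

def generated_distance_sets(X, obs_points):
    """Distances from every data point to each observation point;
    row i is the one-dimensional generated distance set for
    observation point i."""
    return np.linalg.norm(X[None, :, :] - obs_points[:, None, :], axis=2)
```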

Mathematical Principles
Given an observation point O, the generated distance set of D with respect to O is D̄ = {d(X, O) : X ∈ D}. By the triangle inequality,

|d(X, O) − d(Y, O)| ≤ d(X, Y)

for every X, Y ∈ D. Hence, the distance between two data points in D is not smaller than the difference of their corresponding distances in D̄. Therefore, for any positive number r and any point X ∈ D, the number of points in D whose distances to X are less than r is not greater than the number of points in D̄ whose distances to d(X, O) are less than r. In particular, if X is a proper cluster center in D, then d(X, O) is a point in D̄ that has many points close to it. That is to say, if X is a dense point in D, it also corresponds to a dense point in D̄.
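This bound is the reverse triangle inequality and can be checked numerically; a small sketch:

```python
import numpy as np

def generated_gap_bounded(X, Y, O):
    """Reverse triangle inequality: the gap between the generated
    distances of X and Y (w.r.t. observation point O) never exceeds
    the true distance d(X, Y)."""
    d = np.linalg.norm
    return abs(d(X - O) - d(Y - O)) <= d(X - Y) + 1e-12
```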
Unfortunately, the converse is not true. Because two elements of D̄ with a small difference may correspond to two points in D with a large distance, a proper cluster center of D̄ may not correspond to a proper cluster center of D. Hence, we can deduce that D̄ retains only partial clustering information about D. In order to obtain more clustering information about D, one possible way is to choose more observation points to generate more distance sets and then combine all those pieces of clustering information together. This is the main idea behind our new algorithm. We provide the following two theorems to guarantee the correctness of the abovementioned statements.
Let O_0, O_1, ..., O_d ∈ R^d be d + 1 observation points, and let A be the d × d matrix whose i-th row is O_i − O_0 (i = 1, 2, ..., d). If the determinant of A is not equal to zero, we say A is nondegenerate and O_0, O_1, ..., O_d is a set of nondegenerate points.
Remark 1. For the convenience of calculation, we can choose the observation points so that each O_i (i = 1, 2, ..., d) differs from O_0 in only one coordinate. Thus, if we have obtained the distance between X and O_0, computing the squared distance between X and O_i (i = 1, 2, ..., d) reduces to three addition operations, which decreases the time complexity of generating those distance sets.
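One concrete choice consistent with Remark 1 (an assumption on our part, not stated explicitly in the source) is O_i = O_0 + e_i, with e_i the i-th unit vector. Then ||X − O_i||² = ||X − O_0||² − 2(X_i − O_{0,i}) + 1, so each squared distance follows from the one already computed by a constant-time update:

```python
import numpy as np

def squared_dists_to_offsets(X, O0, d0_sq):
    """Assuming O_i = O_0 + e_i (a hypothetical choice consistent with
    Remark 1), each squared distance ||X - O_i||^2 follows from the
    already-known d0_sq = ||X - O_0||^2 via
        ||X - O_i||^2 = ||X - O_0||^2 - 2*(X_i - O0_i) + 1,
    i.e., a handful of additions instead of a full distance computation."""
    return d0_sq - 2.0 * (X - O0) + 1.0  # component i gives ||X - O_i||^2
```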
Remark 2. Suppose two distinct data points X_1 and X_2 had the same generated distance to every observation point; then they could not be told apart, although X_1 ≠ X_2. By Theorem 1, this cannot happen for a nondegenerate set of observation points, so all distinct cluster points can be distinguished by choosing a set of nondegenerate points as the observation points. Thus, d + 1 is the minimum number of observation points needed to distinguish all the cluster centers of the original data set.
Let a_{ij} = d(X_j, O_i) for i = 0, 1, ..., d and j = 1, 2. If |a_{i1} − a_{i2}| is small for each i = 0, 1, ..., d, then solving the resulting system of equations yields a constant M, depending only on the observation points, such that d(X_1, X_2) is correspondingly small.

Remark 3. If we normalize the original data set D, for example by performing min-max normalization on D, we can deduce an explicit upper bound on M.

Remark 4. Suppose D̄ is a generated distance set of D with respect to an observation point O. We cannot conclude that two elements of D̄ with a small difference correspond to two data points of D with a small distance. But, by Theorem 2, if all d + 1 pairs of generated distances of X and Y have a small difference, then X must have a small distance to Y. This can be used to adjust the density of the selected subset.

Remark 5.
The observation point mechanism aims to transform the original multidimensional data points into one-dimensional distance points, which is different from the landmark point or representative point mechanisms. The landmark points [19] are the core of the landmark-based spectral clustering (LSC) algorithm, which generates some representative data points as landmarks and represents the remaining data points as linear combinations of these landmarks. The representative points [20] are a subset of the original data set used in the ultrascalable spectral clustering (U-SPEC) algorithm to alleviate the huge computational burden of spectral clustering. The observation points are designed to enhance the robustness of k-means clustering, while the landmark points and representative points are used to speed up spectral clustering.

The Proposed Two-Stage k-Means Clustering Algorithm
Given a data set D with N objects, we want to partition D into k clusters. The main idea of our two-stage k-means clustering algorithm is that we only need to deal with a small subset of D that has a clustering structure similar to that of D. In order to select a proper subset with this property, we need to discard all outliers in D and retain a portion of the points that are close to the cluster centers.

Description of Algorithm
First of all, we conduct the normalization operation on the original data set D.
Obviously, the transformation on D is a composition of a translation transformation and a dilation transformation. The dilation factor 1/(M − m) is the same for each dimension; hence, the dilation transformation does not change the cluster structure. Because the translation transformation also does not change the cluster structure of a data set, the cluster structure of the normalized data set is totally the same as that of D. We also note that the value of every component of X_j lies in the interval [0, 1]. Let {O_0, O_1, ..., O_d} be the set of observation points, and denote by D_j the generated distance set of the normalized data with respect to the observation point O_j. Theorem 1 shows that we can identify each X by its (d + 1)-dimensional vector of generated distances, and hence it is reasonable to expect that the clustering structure of D can be deduced from those d + 1 distance sets.
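The normalization step can be sketched as follows; note the single global factor 1/(M − m) shared by all dimensions, rather than a per-dimension scaling, so that the cluster structure is preserved exactly:

```python
import numpy as np

def minmax_normalize(D):
    """Global min-max normalization: one translation plus one dilation
    with the same factor 1/(M - m) for every dimension, so the cluster
    structure is unchanged and all components fall in [0, 1]."""
    m, M = D.min(), D.max()
    return (D - m) / (M - m)
```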

Selecting a Representative Subset of D in the First Stage.
For each D_i, we can obtain a set S_i consisting of all candidate higher-density points of D_i by using grid-based clustering methods (e.g., [21]). For example, first, we arrange D_i in ascending order. Second, a fixed value δ_i is selected to be a quantile of diff(D_i), where diff(D_i) denotes the sequence of first-order differences of the sorted D_i. Third, for each s in D_i, we count the number of elements of D_i in the interval (s − δ_i, s + δ_i). Thus, we obtain a sequence of positive integers, where each member indicates the relative density of the corresponding element of D_i. Finally, we select those s in D_i whose counts are either local maxima or beyond a threshold. In the following experiments, we set δ_i to be two times the p-th percentile of diff(D_i) for some p. Denote by N the cardinality of D_i. If N is small, we usually choose a smaller p, for example, p = 75. If N is very large, we choose a bigger p, e.g., p = 99. Otherwise, we choose a proper p between them, e.g., p = 90. Now, we have obtained d + 1 sets S_0, S_1, ..., S_d, each containing all the higher-density points of the corresponding distance set. By the triangle inequality, we have the following property: if X is a dense point of D, then d(X, O_i) is a dense point of D_i for every i. According to this property, we can select the subset S of D consisting of the points whose distances to the i-th observation point are in S_i for all i ∈ {0, 1, ..., d}. Recall that each point X ∈ D has been mapped to a (d + 1)-dimensional vector of generated distances. By Remark 4 of Theorem 2, all d + 1 pairs of corresponding components of two points belonging to the same cluster differ only slightly. But it is possible that some data points have some components close to those of one cluster center and other components close to those of another cluster center.
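The percentile-based density selection for one distance set can be sketched as below. The count threshold (here, the median neighbour count) is our own choice for illustration, since the text keeps local maxima or counts beyond an unspecified threshold:

```python
import numpy as np

def dense_points_1d(Di, p=90):
    """Select the higher-density elements of a one-dimensional distance
    set: delta is twice the p-th percentile of the first-order
    differences of the sorted set, and an element is kept when its
    neighbour count within (s - delta, s + delta) reaches the median
    count (an illustrative threshold, not the paper's exact rule)."""
    Ds = np.sort(np.asarray(Di))
    delta = 2.0 * np.percentile(np.diff(Ds), p)
    # number of elements inside (s - delta, s + delta) for each s
    counts = np.searchsorted(Ds, Ds + delta, side='left') - \
             np.searchsorted(Ds, Ds - delta, side='right')
    return Ds[counts >= np.median(counts)]
```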
In such a case, a few outliers may survive the above selection criterion. To discard these few outliers in S and to decrease the number of elements of S, we need to refine S. We adopt the following criterion, which follows from Remark 4 of Theorem 2.
Suppose X_1 has been selected. Given a data point X_2, if, for every i ∈ {0, 1, ..., d}, the generated distances of X_1 and X_2 with respect to O_i differ by less than a threshold, then X_2 is regarded as represented by X_1 and need not be kept. We denote S = {Y_1, Y_2, ..., Y_m} and initialize two sets, S_a and S_c, which hold the candidate points and the confirmed dense points, respectively. We also maintain a counter indicating the density of each data point in S_a. First, we put Y_1 into S_a and set its counter to 1. We then sequentially choose the data points in S and dynamically construct S_c and S_a according to the following process. Suppose we choose Y_i from S; we compute the distance between Y_i and each data point in S_a. If some of these distances are less than a threshold value δ, we add 1 to the counter of each corresponding data point and then discard Y_i. Meanwhile, if the counter of a data point in S_a becomes bigger than another threshold value n, we remove this data point from S_a and add it into S_c. But if every point in S_a has a distance to Y_i bigger than δ, we check whether there is a point in S_c whose distance to Y_i is less than δ; we discard Y_i if there is and add Y_i to S_a if not. Finally, we obtain a set S_c that closely represents the original data set, with a size smaller than that of the original data set. Furthermore, none of the outliers of the original data set are included in this selected subset.
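The refinement loop described above can be sketched as follows. The promotion rule and the handling of far points (sent to the candidate set S_a) reflect our reading of the text rather than a verbatim implementation:

```python
import numpy as np

def refine(S, delta, n):
    """Refinement sketch: S_a holds candidate points with density
    counters; a candidate whose counter exceeds n is promoted to S_c;
    a point within delta of an existing candidate only bumps counters
    and is then discarded."""
    S = np.asarray(S)
    Sa, counts, Sc = [S[0]], [1], []
    for Y in S[1:]:
        near = (np.linalg.norm(np.array(Sa) - Y, axis=1) < delta) if Sa \
               else np.array([], dtype=bool)
        if near.any():
            # bump counters of close candidates, then discard Y
            for j in np.flatnonzero(near):
                counts[j] += 1
            # promote saturated candidates to S_c (pop back-to-front)
            for j in sorted(np.flatnonzero(np.array(counts) > n), reverse=True):
                Sc.append(Sa.pop(j)); counts.pop(j)
        elif Sc and (np.linalg.norm(np.array(Sc) - Y, axis=1) < delta).any():
            continue  # already represented by a confirmed dense point
        else:
            Sa.append(Y); counts.append(1)  # a new candidate
    return np.array(Sc) if Sc else np.empty((0, S.shape[1]))
```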

Clustering S c and D in the Second Stage.
Since the selected set has discarded all outliers and is much smaller than the original data set, the running time decreases significantly when performing the k-means algorithm on the selected subset. Furthermore, because the subset closely represents the original data set, the cluster centers found on it are also suitable as the cluster centers of the original data set. Once the cluster centers have been identified, it is easy to cluster the whole data set. The pseudocode of our proposed algorithm is presented in Algorithm 1.
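The second stage thus reduces to running k-means on S_c and broadcasting the resulting centers back to the full data set; a minimal assignment sketch:

```python
import numpy as np

def assign_to_centers(D, centers):
    """Second stage: centers found on the refined subset S_c are reused
    for the whole data set; every point joins its nearest center."""
    d2 = ((D[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```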

Analysis of Computational Complexity.
In this section, we analyze the computational complexity of the proposed algorithm. When running the classical k-means algorithm, each iteration needs to compute the distances between every data point in the whole data set and the updated cluster centers, which has a time complexity of O(Nkd). In our algorithm, the time cost of the first stage mainly consists of four parts. The first part generates d + 1 one-dimensional distance sets, with a time complexity of O(Nd). The second part finds the intervals containing the local density maxima, with a time complexity of O(N). The third part selects S, with a time complexity of O(Nd). In the fourth part, we refine S and obtain the subset S_c. We note that the time complexity of the fourth part of the first stage is usually much less than O(N²): since many data points are discarded while constructing the sets S_c and S_a, we do not have to compute distances with all N data points in S.

Experimental Results and Analysis
In this section, we conduct a series of experiments on 6 synthetic data sets and 3 benchmark data sets (UCI [22] and KEEL [23]) to validate the effectiveness of the proposed two-stage k-means clustering algorithm. The synthetic data sets can be downloaded from BaiduPan (https://pan.baidu.com/s/1MfS8JfQdJLHYSlpZdndLUQ) with the extraction code "p3mc." We first present the clustering results of our proposed algorithm and the k-means algorithm on two synthetic data sets, i.e., data set #1 and data set #2. The experimental results are shown in Figures 1 and 2. For simplicity, we only use the experimental results on data set #1 to explain the advantage of our proposed algorithm.
There are two clusters in data set #1, each including 41 data points. The data points obey 2-dimensional normal distributions with mean vectors (3, 11) and (12, 5) and covariance matrices diag(3, 2) and diag(2, 3), respectively. There are also two outliers in data set #1. Figure 1(b) shows the data points selected from the normalized version of data set #1 shown in Figure 1(a). In Figure 1(b), we can see that the outliers have been removed in the first stage of our proposed method. Figure 1(c) shows the clustering result of the k-means algorithm. We can see that the outliers seriously impact the clustering result of the k-means algorithm, although there are only two outlier points in data set #1. The clustering result of our proposed method is presented in Figure 1(d), where the cluster centers are found correctly without the disturbance of the outliers. Similar results can be found in Figure 2 for data set #2, which includes 7 clusters and 10 outliers.
The experimental results show that our proposed two-stage k-means clustering algorithm is not sensitive to outliers and obtains better clustering results than the k-means clustering algorithm.
Furthermore, we choose another four synthetic data sets as shown in Figure 3 (2-dimensional illustrations only) and three real-world data sets to compare the clustering performance of our proposed algorithm with the k-means algorithm. The details of these data sets and the experimental results are summarized in Table 1, where N is the number of elements of the data set, t is the proportion of outliers in the data set, k is the number of clusters, d is the dimension of the data points, p is the percentile number, n_c is the cardinality of the selected subset, ARI_kmeans and Time_kmeans are the adjusted Rand index (ARI) and time consumption of the k-means algorithm, and ARI_our and Time_our are the ARI and time consumption of our proposed algorithm. From Table 1, we can see that our proposed algorithm obtains larger ARIs with lower time consumption than the k-means clustering algorithm on these synthetic data sets. For the real data sets without outliers, our algorithm obtains ARIs comparable to those of the k-means algorithm. Nevertheless, the ARIs of the k-means algorithm are severely degraded when outliers are deliberately placed in the real data sets, while the experimental results in Table 1 demonstrate that our proposed clustering algorithm is robust to the outliers. Table 2 shows the details of the comparison on four large-scale synthetic data sets. The variables in Table 2 have the same meaning as those in Table 1. The comparison of running time between our proposed algorithm and the k-means algorithm in Table 2 shows that our algorithm has less time consumption than the k-means algorithm. In particular, the superiority of our proposed method in time consumption is more obvious for data sets with larger size and dimension.
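For completeness, the ARI reported in Tables 1 and 2 can be computed from the pairwise contingency counts; a standalone sketch:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index (ARI): pairwise agreement between two
    partitions, corrected for chance; 1.0 means identical clusterings
    up to a relabeling."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))  # contingency counts
    a = Counter(labels_true)
    b = Counter(labels_pred)
    sum_ij = sum(comb(v, 2) for v in pairs.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```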
Furthermore, the most time-consuming procedure in our algorithm, i.e., the selection of the high-density distances for each generated distance set, can be run in parallel, which makes our algorithm easy to extend to clustering tasks on large-scale data sets.

In addition, we provide a real application, tyre inclusion identification, to validate the clustering performance of our proposed algorithm. Figure 4 shows two tyres with different kinds of inclusions, where each picture contains 1027 × 768 pixels. Figures 5 and 6 present the clustering results of our proposed algorithm and the k-means clustering algorithm on Tyre #1 and Tyre #2, respectively. From these figures, we can see that our proposed method accurately identifies the cluster centers without the disturbance of outliers. The inclusions in the tyres are clearly recognized by our proposed algorithm, while the k-means clustering algorithm does not find the inclusions distinctly; e.g., Figures 6(b), 6(d), and 6(f) include not only the inclusions but also the tyre traces. Overall, the experimental results demonstrate that our algorithm achieves better clustering performance than the classical k-means clustering algorithm when handling clustering tasks disturbed by outliers.

Conclusions and Future Work
In this paper, we proposed a robust two-stage k-means clustering algorithm that accurately identifies the cluster centers without the disturbance of outliers. As a direct application of the observation point mechanism of I-nice [18], we select a small subset from the original data set based on a set of nondegenerate observation points in the first stage. In the second stage, we use the k-means clustering algorithm to cluster the selected subset and take the resulting cluster centers as the cluster centers of the original data set. The theoretical analysis and experimental verification demonstrate the feasibility and effectiveness of the proposed clustering algorithm. Future studies will focus on three directions. First, we will try to use the k-nearest neighbors (kNN) method to improve the selection of observation points. Second, we will seek further real applications of the two-stage k-means clustering algorithm. Third, we will extend our proposed algorithm to cluster big data based on the random sample partition model [24].
Data Availability

The data used in our manuscript can be accessed by readers via our BaiduPan (https://pan.baidu.com/s/1MfS8JfQdJLHYSlpZdndLUQ) with the extraction code "p3mc."

Conflicts of Interest
The authors declare that they have no conflicts of interest.