Robust K-Median and K-Means Clustering Algorithms for Incomplete Data

Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply classical clustering algorithms for complete data, such as K-median and K-means. In practice, however, it is often hard to estimate the missing values accurately, which degrades clustering performance. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of a robust cluster objective function. A minimax robust optimization (RO) formulation is presented to provide clustering results that are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.


Introduction
In data mining and machine learning, data sets commonly contain observations with missing feature values. Such incomplete data occur in a wide array of application domains for various reasons, including improper data collection, the high cost of obtaining some feature values, and missing responses in questionnaires. For example, online shopping users may rate only a small fraction of the available books, movies, or songs, which leads to massive amounts of missing feature values (Marlin [1]). The theoretical study of pattern recognition for incomplete data was first conducted by Sebestyen [2] under certain probabilistic assumptions. Expectation maximization algorithms have also been proposed to compute maximum likelihood estimates for missing data in Dempster et al. [3]. Early empirical studies on incomplete data are reported by Dixon [4] and Jain and Dubes [5].
Clustering analysis has been regarded as an effective method to extract useful features and explore potential data patterns. Due to the presence of missing feature values, there is an urgent need to cluster incomplete data in many fields, such as image analysis [6], information retrieval [7], and clinical medicine [8]. To cluster incomplete data, the basic approach is the two-step method, which first estimates the missing feature values using imputation and then applies classical clustering methods. Troyanskaya et al. [9] investigate three imputation based clustering methods for gene microarray data, namely the singular value decomposition, weighted K-nearest neighbors (KNN), and row average methods. They conclude that the KNN method appears to provide a more robust and sensitive result for missing value estimation than the others. Miyamoto et al. [10] use a similar imputation based fuzzy c-means (FCM) method to handle incomplete data. Acuna and Rodriguez [11] and Farhangfar et al. [12] compare the performance of different imputation methods for missing values in classification problems, including single imputation methods, such as the mean, median, hot deck, and Naive-Bayes methods, and the polytomous regression based multiple imputation method. Saravanan and Sailakshmi [13] propose fuzzy probabilistic c-means algorithms that impute the missing values using the genetic algorithm.
Besides the imputation based methods, Hathaway and Bezdek [14] propose four strategies to make the classical FCM clustering algorithm applicable to incomplete data. The simplest, the whole data strategy (WDS), deletes all incomplete samples and applies the FCM algorithm to the remaining complete data. This strategy is useful only when few samples contain missing values. To calculate distances for incomplete samples while running FCM, the partial distance strategy (PDS) can be used. PDS has also been used in pattern recognition in Dixon [4] and in fuzzy clustering with missing values in Miyamoto et al. [10] and Timm and Kruse [15]. The third and fourth strategies can be viewed as iterative imputation based methods. The optimal completion strategy (OCS) imputes the missing values by the maximum likelihood estimate in an iterative optimization procedure, and the nearest prototype strategy (NPS) is a simple modification of OCS, in which missing elements are imputed considering only the nearest prototype. Clustering methods without elimination or imputation for incomplete data have also been proposed. Shibayama [16] uses the principal component analysis (PCA) method to capture the structure of incomplete data, and Honda and Ichihashi [17] propose linear fuzzy clustering methods based on local PCA. Zhang and Chen [18] propose a kernel-based FCM clustering algorithm for incomplete data, which estimates the missing feature values based on the fuzzy membership and cluster prototype. Sadaaki et al. [19] further combine linear fuzzy clustering with the PDS, OCS, and NPS proposed by Hathaway and Bezdek [14].
Both direct imputation and iterative imputation methods (such as OCS and NPS) assume that a missing feature value can be well estimated by a single value. However, it is usually hard to obtain accurate estimates of the missing values, and thus clustering methods based on imputation are sensitive to the estimation accuracy. To address this issue, Li et al. [20] use nearest-neighbor intervals to represent the missing values and extend FCM by defining a new distance function for interval data. Interval data have been verified as an effective way to handle missing values and have been further used to propose effective clustering methods. Li et al. [21] also represent the missing values by interval data but search for appropriate imputations of the missing values within the intervals using the genetic algorithm. Wang et al. [22] use an improved backpropagation (BP) neural network to estimate the interval data for missing values. Zhang et al. [23] propose an improved interval construction method based on preclassification results and use particle swarm optimization to search for the optimal clustering. Zhang et al. [8] represent the missing values by probabilistic information granules and design an efficient trilevel alternating optimization method to find the optimal clustering results and the optimal missing values simultaneously.
Recently, robust optimization has been widely accepted as an effective method to handle uncertain or missing data and has been used in data mining and machine learning, for example in the minimax probability machine [24-27], robust support vector machines [28,29], and robust quadratic regression [30]. This paper aims at designing robust clustering algorithms for incomplete data. The improved interval construction method based on preclassification is used to obtain the interval data for missing values. Based on the interval data representation, we present robust K-median and K-means clustering algorithms. Different from the existing algorithms, which use either the interval distance function or optimal imputation [20,21,23], we reformulate the clustering problem as a minimax robust optimization problem based on interval data.
Specifically, for given cluster prototype and membership matrices, we introduce the concept of a robust clustering objective function, which is the maximum of the clustering objective function when the missing values vary within the constructed intervals. The proposed algorithms then aim at finding the cluster prototype and membership matrices that minimize the robust clustering objective function. For both the robust K-median and K-means clustering problems, we give equivalent reformulations of the robust objective function and present effective solution methods. Compared with existing methods, the proposed algorithms are insensitive to estimation errors of the constructed intervals, especially when the missing rate is high. Comparisons and analysis of numerical experimental results on UCI data sets also validate the effectiveness of the proposed robust algorithms.
Compared with existing algorithms, the advantages of the proposed robust clustering algorithms are twofold. First, our algorithms can cluster incomplete data without imputing the missing feature values and provide robust clustering results, which are insensitive to estimation errors. Our experiments also validate the effectiveness of the proposed algorithms in terms of robustness and accuracy by comparison with existing algorithms. Second, the proposed algorithms are easy to understand and implement. Specifically, the time complexity of the robust K-median and K-means clustering algorithms is $O(nmKT)$ and $O(nmT(K + \log n))$, respectively, where $n$ is the number of objects, $m$ is the dimension of features, $K$ is the number of clusters, and $T$ is the number of iterations. Our algorithms have similar computational complexity to the classical K-median and K-means clustering algorithms and are more efficient than the clustering algorithm for incomplete data proposed by Zhang et al. [8], whose time complexity is $O(nmK^2T)$ (when $\log n \le K^2$ for the robust K-means clustering algorithm).
The paper is organized as follows. Section 2 reviews the classical K-median and K-means algorithms and presents the robust K-median and K-means clustering problems. Section 3 gives effective algorithms for the proposed robust optimization problems. Section 4 reports experimental results. Finally, Section 5 concludes the paper with directions for further research.

K-Median and K-Means Clustering for Complete Data.
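Consider the problem of clustering a set of $n$ objects $N = \{1, \ldots, n\}$ into $K$ clusters. For each object $i \in N$, we have a set of $m$ features $\{x_{ij} : j \in M\}$, where $x_{ij}$ describes the $j$th feature of object $i$ quantitatively. Let $x_i = (x_{i1}, \ldots, x_{im})^T$ be the feature vector of object $i$ and $X = (x_1, \ldots, x_n)$ be the feature matrix or data set.
The task of clustering can be formulated as an optimization problem, which minimizes the following clustering objective function:
\[ J(U, V) = \sum_{i \in N} \sum_{k=1}^{K} u_{ik} \, \| x_i - v_k \|_p^p \tag{1} \]
under the following constraints:
\[ \sum_{k=1}^{K} u_{ik} = 1, \qquad u_{ik} \in \{0, 1\}, \qquad \forall i \in N, \; k = 1, \ldots, K, \tag{2} \]
where $p = 1, 2$. For $k = 1, \ldots, K$, $v_k \in \mathbb{R}^m$ is the $k$th cluster prototype and, for any $i \in N$, $u_{ik}$ indicates whether object $i$ belongs to the $k$th cluster. K-median and K-means are effective algorithms to solve the clustering problem for $p = 1$ and $p = 2$, respectively. In the following, let the cluster prototype matrix be $V = (v_1, \ldots, v_K)$ and the membership matrix be $U = (u_{ik})$. Both algorithms solve the clustering problem iteratively as follows.
Step 1. Set $t = 0$ and initialize the cluster prototype matrix $V^0$.
Step 2. Let $t = t + 1$ and update the membership matrix $U^t$ by fixing $V^{t-1}$: assign each object $i \in N$ to a cluster whose prototype is nearest to $x_i$.
Step 3. Update the cluster prototype matrix $V^t$ by fixing the membership matrix $U^t$. When $p = 1$, for any $k = 1, \ldots, K$ and $j \in M$, set $v^t_{kj}$ as the median of the $j$th feature values of the objects in cluster $k$. When $p = 2$, for any $k = 1, \ldots, K$, set $v^t_k$ as the centroid of the objects in cluster $k$; that is,
\[ v^t_k = \frac{\sum_{i \in N} u^t_{ik} \, x_i}{\sum_{i \in N} u^t_{ik}}. \]
Step 4. If, for all $i \in N$ and $k = 1, \ldots, K$, we have $u^t_{ik} = u^{t-1}_{ik}$, then stop and return $U^t$ and $V^t$; otherwise, go to Step 2.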
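The alternating scheme above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `classical_cluster`, the caller-supplied initial prototypes `V0`, and the rule of keeping the old prototype when a cluster becomes empty are all assumptions made for the example.

```python
import numpy as np

def classical_cluster(X, V0, p=2, max_iter=100):
    """Alternate between membership and prototype updates.

    p = 1 gives K-median (coordinate-wise medians), p = 2 gives K-means
    (centroids). V0 holds the initial prototypes, one row per cluster.
    """
    X = np.asarray(X, dtype=float)
    V = np.array(V0, dtype=float)          # copy, so V0 is not modified
    labels = np.full(X.shape[0], -1)
    for _ in range(max_iter):
        # Membership update: nearest prototype under sum_j |x_ij - v_kj|^p.
        dist = (np.abs(X[:, None, :] - V[None, :, :]) ** p).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        # Stopping rule: the membership did not change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Prototype update: median for p = 1, centroid for p = 2.
        for k in range(V.shape[0]):
            members = X[labels == k]
            if len(members) > 0:           # assumed rule: keep old prototype if empty
                V[k] = np.median(members, axis=0) if p == 1 else members.mean(axis=0)
    return labels, V
```

The membership here is stored as a label vector instead of a binary matrix, which is equivalent under the one-cluster-per-object constraint.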

Robust K-Median and K-Means Clustering for Incomplete Data.
Due to various reasons, the feature matrix $X$ may contain missing components. For example, when $|M| = 3$, for a certain object $i \in N$, we may have $x_i = (1, 0.5, ?)^T$, which indicates that the third feature value of object $i$ is missing. We refer to a data set $X$ as an incomplete data set if it contains at least one missing feature value; that is, there exist $i \in N$ and $j \in M$ such that $x_{ij} = ?$. To describe the incomplete data set, for any $i \in N$, we further partition the feature set $M$ of object $i$ into two subsets:
\[ M_i^0 = \{ j \in M : x_{ij} = ? \}, \qquad M_i^1 = M \setminus M_i^0. \]
In practice, it is difficult to obtain accurate estimates of missing feature values. Thus, in this paper, we represent missing values by intervals. Specifically, for any $i \in N$, we use an interval $[x_{ij}^-, x_{ij}^+]$ to represent the unknown feature value where $j \in M_i^0$ and use $x_{ij}$ to represent the known feature value where $j \in M_i^1$. To simplify notation, in the following, let $x_{ij} = (x_{ij}^- + x_{ij}^+)/2$ and $\delta_{ij} = (x_{ij}^+ - x_{ij}^-)/2$ for any $j \in M_i^0$, and $\delta_{ij} = 0$ for any $j \in M_i^1$. For details on how to construct these intervals for missing values, see Li et al. [20] and Zhang et al. [23].
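As a small illustration of this notation, the following sketch converts interval bounds into the midpoint and half-width arrays used in the rest of the section. It assumes, purely for the example, that missing entries are encoded as NaN and that the bounds are given as arrays of the same shape as the data; the function name `center_and_radius` is hypothetical.

```python
import numpy as np

def center_and_radius(X, lower, upper):
    """Return interval midpoints and half-widths.

    X     : data with NaN marking missing entries (assumed encoding)
    lower : lower interval bounds, only read where X is NaN
    upper : upper interval bounds, only read where X is NaN
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Known entries keep their value and get a zero half-width.
    center = np.where(missing, (lower + upper) / 2.0, X)
    radius = np.where(missing, (upper - lower) / 2.0, 0.0)
    return center, radius
```

For the object (1, 0.5, ?) with the third feature represented by the interval [2, 4], this gives the midpoint vector (1, 0.5, 3) and half-widths (0, 0, 1).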
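This paper aims at designing robust clustering methods such that the worst-case performance of the clustering output can be guaranteed. The logic of the proposed method can be explained as a two-player game: a clustering decision-maker first makes the clustering decision, and then an adversarial player chooses the values of the missing features from the given intervals. Thus, a robust clustering decision-maker selects the clustering for which the worst-case cluster objective function is minimized. Formally, the robust cluster objective function is
\[ J_R(U, V) = \max \Big\{ \sum_{i \in N} \sum_{k=1}^{K} u_{ik} \, \| y_i - v_k \|_p^p \; : \; y_{ij} \in [x_{ij}^-, x_{ij}^+], \; j \in M_i^0; \;\; y_{ij} = x_{ij}, \; j \in M_i^1; \;\; i \in N \Big\}, \]
and the robust clustering problem is
\[ \text{(RCP)} \qquad \min_{U, V} \; J_R(U, V) \quad \text{subject to constraints (2)}. \]
(RCP) is a discrete minimax problem. When there is no missing data, that is, $M_i^0 = \emptyset$ for any $i \in N$, (RCP) reduces to the classical clustering problem (1). Since problem (1) is NP-hard [31,32], finding the globally optimal solution of (RCP) is a challenging task. In the next section, we propose effective robust K-median and K-means algorithms for (RCP).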

Robust K-Median Clustering Algorithm.
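In this subsection, we provide a robust K-median clustering algorithm for (RCP) when $p = 1$. We first show how to simplify the robust cluster objective function $J_R(U, V)$, written in terms of the interval midpoints $x_{ij}$ and half-widths $\delta_{ij}$ defined above:
\begin{align}
J_R(U, V) &= \max_{y_{ij} \in [x_{ij} - \delta_{ij}, \, x_{ij} + \delta_{ij}]} \; \sum_{i \in N} \sum_{k=1}^{K} u_{ik} \sum_{j \in M} | y_{ij} - v_{kj} | \\
&= \sum_{i \in N} \sum_{j \in M} \max \Big\{ \sum_{k=1}^{K} u_{ik} \, | x_{ij} - v_{kj} - \delta_{ij} |, \; \sum_{k=1}^{K} u_{ik} \, | x_{ij} - v_{kj} + \delta_{ij} | \Big\} \tag{7} \\
&= \sum_{i \in N} \sum_{k=1}^{K} u_{ik} \sum_{j \in M} \max \big\{ | x_{ij} - v_{kj} - \delta_{ij} |, \; | x_{ij} - v_{kj} + \delta_{ij} | \big\}, \tag{8}
\end{align}
where (7) uses the fact that the maximum of a convex function over a convex set is attained at extreme points and (8) uses constraints (2). Since $\max\{|a - b|, |a + b|\} = |a| + |b|$ and $\delta_{ij} = 0$ for any $i \in N$ and $j \in M_i^1$, we further have
\[ J_R(U, V) = \sum_{i \in N} \sum_{k=1}^{K} u_{ik} \sum_{j \in M} | x_{ij} - v_{kj} | + \sum_{i \in N} \sum_{j \in M_i^0} \delta_{ij}. \tag{9} \]
Equation (9) shows that the existence of missing values increases the cluster objective function. Based on (9), the robust K-median clustering algorithm can be given in Algorithm 1.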
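The simplification can be checked numerically on a toy instance. The sketch below compares a brute-force worst case, obtained by enumerating both endpoints of every interval (which suffices because the objective is convex in the missing values), with the closed form in which each half-width is added to the absolute deviation from the midpoint. The function names and the label-vector encoding of the membership are assumptions made for the example.

```python
import itertools
import numpy as np

def worst_case_l1_bruteforce(center, radius, labels, V):
    """Maximize the l1 objective over all interval endpoints (exponential cost)."""
    n, m = center.shape
    free = [(i, j) for i in range(n) for j in range(m) if radius[i, j] > 0]
    best = -np.inf
    for signs in itertools.product((-1.0, 1.0), repeat=len(free)):
        Y = center.copy()
        for (i, j), s in zip(free, signs):
            Y[i, j] += s * radius[i, j]    # move the entry to one end of its interval
        best = max(best, np.abs(Y - V[labels]).sum())
    return best

def worst_case_l1_closed_form(center, radius, labels, V):
    """Closed form: |midpoint - prototype| + half-width, summed over all entries."""
    return (np.abs(center - V[labels]) + radius).sum()
```

The brute-force version is only meant as a sanity check; its cost doubles with every missing entry.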
Output. The cluster prototype matrix $V^*$ and membership matrix $U^*$.
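The steps of Algorithm 1 are not reproduced in this text, so the following is only a sketch of what an alternating scheme based on the simplified objective could look like. It assumes the data are given as interval midpoints and half-widths (zero half-width for known entries), that the initial prototypes `V0` are supplied by the caller, and that an empty cluster keeps its previous prototype; none of these choices is confirmed by the source.

```python
import numpy as np

def robust_k_median(center, radius, V0, max_iter=100):
    """Alternating minimization of sum_j (|x_ij - v_kj| + delta_ij) over clusters."""
    center = np.asarray(center, dtype=float)
    radius = np.asarray(radius, dtype=float)
    V = np.array(V0, dtype=float)
    labels = np.full(center.shape[0], -1)
    for _ in range(max_iter):
        # Robust cost of assigning object i to cluster k.
        cost = (np.abs(center[:, None, :] - V[None, :, :])
                + radius[:, None, :]).sum(axis=2)
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                           # membership unchanged: stop
        labels = new_labels
        # The half-widths do not depend on V, so the coordinate-wise median
        # of the midpoints minimizes the robust l1 cost of each cluster.
        for k in range(V.shape[0]):
            members = center[labels == k]
            if len(members) > 0:            # assumed rule for empty clusters
                V[k] = np.median(members, axis=0)
    objective = (np.abs(center - V[labels]) + radius).sum()
    return labels, V, objective
```

Because the half-width term is the same for every cluster, the assignments coincide with those of K-median applied to the midpoints; the uncertainty only enters the reported objective value.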

Robust K-Means Clustering Algorithm.
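In this subsection, a robust K-means clustering algorithm for (RCP) when $p = 2$ is proposed. Similarly to the analysis of $J_R(U, V)$ when $p = 1$, we first simplify the robust cluster objective function as follows:
\[ J_R(U, V) = \sum_{i \in N} \sum_{k=1}^{K} u_{ik} \sum_{j \in M} \max \big\{ (x_{ij} - v_{kj} - \delta_{ij})^2, \; (x_{ij} - v_{kj} + \delta_{ij})^2 \big\}. \]
Since $\max\{(a - b)^2, (a + b)^2\} = a^2 + b^2 + 2|a||b|$, we have
\[ J_R(U, V) = \sum_{i \in N} \sum_{k=1}^{K} u_{ik} \sum_{j \in M} \big[ (x_{ij} - v_{kj})^2 + 2 \delta_{ij} | x_{ij} - v_{kj} | + \delta_{ij}^2 \big]. \]
To minimize $J_R(U, V)$, we need to update $U$ and $V$ in an alternating manner. Specifically, when the value of $V$ is fixed, each object $i \in N$ can be assigned to any cluster in the following index set:
\[ K_i = \arg\min_{k = 1, \ldots, K} \; \sum_{j \in M} \big[ (x_{ij} - v_{kj})^2 + 2 \delta_{ij} | x_{ij} - v_{kj} | \big]. \tag{12} \]
When the value of $U$ is fixed, for each cluster $k = 1, \ldots, K$, let $N_k = \{ i \in N : u_{ik} = 1 \}$. Then the optimal value $v_k^*$ can be obtained by solving the following piecewise quadratic convex optimization problem:
\[ \min_{v_k \in \mathbb{R}^m} \; \sum_{i \in N_k} \sum_{j \in M} \big[ (x_{ij} - v_{kj})^2 + 2 \delta_{ij} | x_{ij} - v_{kj} | \big]. \tag{13} \]
Note that optimization problem (13) is decomposable in $j$. Thus, to obtain the optimal value $v_{kj}^*$, it is sufficient to solve the following subproblem:
\[ \min_{v_{kj} \in \mathbb{R}} \; \sum_{i \in N_k} \big[ (x_{ij} - v_{kj})^2 + 2 \delta_{ij} | x_{ij} - v_{kj} | \big]. \tag{14} \]
Procedure 1 (procedure of solving Subproblem (14)).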
Step 2. Identify potential minimum points.
Procedure 1 solves Subproblem (14) by enumerating all potential minimum points. It is easy to see that Procedure 1 can be implemented in $O(n_k \log n_k)$ time, where $n_k = |N_k|$, if the ranking step uses an efficient sorting method, such as heapsort.
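One way to enumerate the potential minimum points of the one-dimensional subproblem is sketched below, assuming it has the form of minimizing the sum over objects of (x_i - v)^2 + 2 delta_i |x_i - v|, with x_i the interval midpoints and delta_i the half-widths. Between two consecutive sorted midpoints the objective is a single quadratic, so its stationary point, clipped to that piece, is the only candidate there. The function name is hypothetical, and for brevity the candidates are evaluated directly, which costs O(n^2) for n points; reaching the O(n log n) bound stated above would require evaluating them with prefix sums instead.

```python
import numpy as np

def solve_subproblem(x, delta):
    """Minimize f(v) = sum_i [(x_i - v)^2 + 2*delta_i*|x_i - v|] over scalar v."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = x.size
    # Rank the midpoints (heapsort, as suggested in the text).
    order = np.argsort(x, kind="heapsort")
    xs, ds = x[order], delta[order]
    # For the piece with s points on the left (s = 0..n), f'(v) = 0 gives
    # v = mean(x) + (delta mass on the right - delta mass on the left) / n.
    left = np.concatenate(([0.0], np.cumsum(ds)))
    right = ds.sum() - left
    stationary = xs.mean() + (right - left) / n
    # Clip each stationary point to its own piece to get the candidates.
    lo = np.concatenate(([-np.inf], xs))
    hi = np.concatenate((xs, [np.inf]))
    candidates = np.clip(stationary, lo, hi)
    # Evaluate f at every candidate and keep the best one.
    diffs = xs[None, :] - candidates[:, None]
    values = (diffs ** 2 + 2.0 * ds[None, :] * np.abs(diffs)).sum(axis=1)
    best = values.argmin()
    return candidates[best], values[best]
```

For midpoints (0, 1, 10) with half-widths (0, 0, 5), the minimizer is 16/3 rather than the plain mean 11/3: the uncertain point pulls the prototype toward itself, since being far from an uncertain value is penalized more heavily in the worst case.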
Based on the above discussion, the robust K-means clustering algorithm can be described in Algorithm 2.
Output. The cluster prototype matrix $V^*$ and membership matrix $U^*$.
Step 2. Let $t = t + 1$ and update $U^t$ by fixing $V^{t-1}$: for any $i \in N$, randomly select $k^*$ that belongs to the index set (12).
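Since only part of Algorithm 2 survives in this text, the following is a hedged sketch of an alternating scheme consistent with the description: objects are assigned by the robust squared cost, and each prototype coordinate is obtained by solving the one-dimensional subproblem. The inputs are assumed to be interval midpoints and half-widths, the initial prototypes `V0` are supplied by the caller, and ties in the assignment are broken by the lowest cluster index rather than randomly; these are simplifications for the example.

```python
import numpy as np

def _prototype_coord(x, delta):
    """Minimize sum_i [(x_i - v)^2 + 2*delta_i*|x_i - v|] over scalar v."""
    order = np.argsort(x)
    xs, ds = x[order], delta[order]
    left = np.concatenate(([0.0], np.cumsum(ds)))
    # Stationary point of each quadratic piece, clipped to that piece.
    stationary = xs.mean() + (ds.sum() - 2.0 * left) / xs.size
    cand = np.clip(stationary,
                   np.concatenate(([-np.inf], xs)),
                   np.concatenate((xs, [np.inf])))
    diffs = xs[None, :] - cand[:, None]
    vals = (diffs ** 2 + 2.0 * ds[None, :] * np.abs(diffs)).sum(axis=1)
    return cand[vals.argmin()]

def robust_k_means(center, radius, V0, max_iter=100):
    center = np.asarray(center, dtype=float)
    radius = np.asarray(radius, dtype=float)
    V = np.array(V0, dtype=float)
    n, m = center.shape
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Worst-case squared cost: (|x_ij - v_kj| + delta_ij)^2 summed over j.
        cost = ((np.abs(center[:, None, :] - V[None, :, :])
                 + radius[:, None, :]) ** 2).sum(axis=2)
        new_labels = cost.argmin(axis=1)    # lowest index wins ties (assumption)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(V.shape[0]):
            idx = labels == k
            if idx.any():                   # assumed rule: keep old prototype if empty
                for j in range(m):
                    V[k, j] = _prototype_coord(center[idx, j], radius[idx, j])
    objective = ((np.abs(center - V[labels]) + radius) ** 2).sum()
    return labels, V, objective
```

When all half-widths are zero, the prototype update reduces to the ordinary centroid and the sketch behaves like classical K-means.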

Computational Complexity. It is well known that the time complexity of the classical K-median and K-means algorithms is $O(nmKT)$, where $n = |N|$ is the number of objects, $m = |M|$ is the dimension of features, $K$ is the number of clusters, and $T$ is the number of iterations. We will show that the proposed robust K-median clustering algorithm has an $O(nmKT)$ time complexity and the robust K-means clustering algorithm has an $O(nmT(K + \log n))$ time complexity.
Specifically, the initialization step of Algorithm 1 takes $O(mK)$ time to initialize the cluster prototype matrix. For a given cluster prototype matrix, Algorithm 1 takes $O(nmK)$ time to update the membership matrix. Note that the median of $n$ scalars can be computed in $O(n)$ time [33]. Hence, updating the cluster prototype matrix takes $O(nm)$ time, and the time complexity of the robust K-median clustering algorithm is $O(nmKT)$. For the robust K-means clustering algorithm, it is easy to see that the first two steps of Algorithm 2 take $O(mK)$ and $O(nmK)$ time, respectively. Let $|N_k| = n_k$. For given $k$ and $j$, Procedure 1 takes $O(n_k \log n_k)$ time to compute $v^t_{kj}$. Therefore, Step 3 of Algorithm 2 takes $O(nm \log n)$ time, since $\sum_{k=1}^{K} n_k \log n_k \le n \log n$. Note that the last step of Algorithm 2 takes $O(nK)$ time. Thus, the time complexity of the robust K-means clustering algorithm is $O(nmT(K + \log n))$.
In addition, it is easy to see that both the robust K-median and robust K-means clustering algorithms have a space complexity of $O(m(n + K))$. Therefore, compared with the classical K-median and K-means algorithms, the proposed robust clustering algorithms consume comparable computational resources.

Numerical Experiments
In this section, we compare the proposed robust clustering algorithms with others on two data sets from the UCI machine learning repository. Section 4.1 describes the data sets and experimental setup, and Section 4.2 reports and discusses the experimental results.

Data Sets and Experimental Setup.
Two widely used data sets, Iris and Seeds, are used to test the performance of the proposed algorithms. The Iris data set consists of 150 objects, and each object has four features of Iris flowers: sepal length, sepal width, petal length, and petal width. The Iris data set includes three clusters, Setosa, Versicolour, and Virginica, and each cluster contains 50 objects. The optimal cluster prototypes of the Iris data have been reported by Hathaway and Bezdek [34]. The Seeds data set consists of 210 kernels of three different varieties of wheat, and each kernel has seven real-valued features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove.
We generate the missing values under the missing completely at random (MCAR) mechanism as in Hathaway and Bezdek [14] and Li et al. [20]. Specifically, we randomly select a specified percentage of components and designate them as missing. To make the incomplete data tractable, we also make sure that the following constraints are satisfied: (1) each object retains at least one feature; (2) each feature has at least one value present in the incomplete data set.
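A simple way to generate such incomplete data is rejection sampling: draw a random set of entries of the required size and redraw whenever one of the two constraints is violated. The sketch below illustrates this idea; the function name, the NaN encoding of missing entries, and the retry limit are assumptions made for the example and are not taken from the cited works.

```python
import numpy as np

def make_mcar(X, missing_rate, seed=0, max_tries=1000):
    """Mark a fraction of entries as missing (NaN), completely at random."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    n_missing = int(round(missing_rate * n * m))
    for _ in range(max_tries):
        mask = np.zeros(n * m, dtype=bool)
        mask[rng.choice(n * m, size=n_missing, replace=False)] = True
        mask = mask.reshape(n, m)
        # Constraint (1): every object keeps at least one feature.
        # Constraint (2): every feature keeps at least one observed value.
        if (~mask).any(axis=1).all() and (~mask).any(axis=0).all():
            X_missing = X.astype(float)     # astype returns a copy
            X_missing[mask] = np.nan
            return X_missing
    raise RuntimeError("constraints not satisfied; try a lower missing rate")
```

At the missing rates used in the experiments (up to 20%), a valid mask is typically found within the first few draws.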
In addition to the Iris and Seeds data sets with artificially generated missing values, we also test the proposed algorithms on a real-world incomplete data set, the Stone Flakes data set [35], which consists of 79 prehistoric stone flake objects described by eight attributes. These objects belong to three different historic ages. The Stone Flakes data set is incomplete: 6 objects are incomplete, with 10 missing feature values in total. Li et al. [20] use the $q$ nearest neighbors to construct intervals for missing feature values and, from their numerical experiments, $q = 6$ is a good choice. To further test the impact of the interval size on the clustering performance of the proposed robust clustering algorithms, the interval for the missing value $x_{ij}$ is constructed as $[(1 - \varepsilon)\hat{x}_{ij}, (1 + \varepsilon)\hat{x}_{ij}]$, where $\hat{x}_{ij}$ is estimated by the $q$ nearest neighbors and $\varepsilon \in (0, 1)$.
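The interval construction can be sketched as follows. The text does not spell out how the nearest-neighbor estimate is computed, so this example makes several assumptions that should be read as placeholders: missing entries are encoded as NaN, neighbors are ranked by a rescaled squared distance over the features observed in both objects, and the estimate is the mean of the neighbors' values. The function name `knn_intervals` is hypothetical.

```python
import numpy as np

def knn_intervals(X, q=6, eps=0.10):
    """Build [(1 - eps) * est, (1 + eps) * est] for every missing entry."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    observed = ~np.isnan(X)
    lower, upper = X.copy(), X.copy()       # known entries: degenerate intervals
    for i, j in zip(*np.where(~observed)):
        # Candidate neighbors: objects whose j-th feature is observed.
        cand = np.where(observed[:, j])[0]
        shared = observed[cand] & observed[i]          # features known in both
        sq = np.where(shared, (X[cand] - X[i]) ** 2, 0.0).sum(axis=1)
        count = shared.sum(axis=1)
        # Rescale by the number of shared features; no overlap means "far".
        dist = np.where(count > 0, sq * m / np.maximum(count, 1), np.inf)
        nearest = cand[np.argsort(dist)[:q]]
        estimate = X[nearest, j].mean()                # assumed: mean of neighbors
        a, b = (1 - eps) * estimate, (1 + eps) * estimate
        lower[i, j], upper[i, j] = min(a, b), max(a, b)  # also valid if estimate < 0
    return lower, upper
```

The returned bounds can be converted to midpoints and half-widths as (lower + upper) / 2 and (upper - lower) / 2 before running a robust clustering routine.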

Results and Discussion.
We first test and compare the performance of the proposed robust K-median algorithm (labelled "RKM1") on both the Iris and Seeds data sets under different missing rates from 0% to 20%. The classical K-median algorithms have also been modified based on WDS, PDS, and NPS to handle incomplete data sets. Since the performance of the K-median algorithm depends on the initial cluster prototypes, we repeat each algorithm 100 times and report the averaged performance.
Tables 1 and 2 report the averaged performance of different K-median algorithms on the incomplete Iris and Seeds data, respectively. The first column in each table gives the missing rate. The second to seventh columns give the averaged misclassification rates by comparison with the true clustering result, where the fifth to seventh columns correspond to the RKM1 algorithm with different values of $\varepsilon$ ranging from 0.05 to 0.15. In Table 1, the eighth to thirteenth columns give the averaged cluster prototype errors of the different algorithms, which are computed between $V^*$ and $\tilde{V}$, where $V^*$ represents the cluster prototypes given by a certain K-median algorithm and $\tilde{V}$ denotes the actual cluster prototypes of the Iris data set without missing values. Since the actual cluster prototypes of the Seeds data set are unknown, such results are not reported in Table 2.
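Because cluster indices returned by a clustering algorithm are arbitrary, comparing them with the true clustering requires matching predicted clusters to true classes first. The text does not state how this matching is done; the sketch below assumes the best one-to-one relabeling is used, found by trying all permutations, which is practical for the three-cluster data sets considered here. The function name is hypothetical.

```python
import itertools
import numpy as np

def misclassification_rate(true_labels, pred_labels, K):
    """Fraction of objects misclassified under the best cluster relabeling."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best_correct = 0
    for perm in itertools.permutations(range(K)):
        mapped = np.array(perm)[pred_labels]   # relabel predicted clusters
        best_correct = max(best_correct, int((mapped == true_labels).sum()))
    return 1.0 - best_correct / true_labels.size
```

For larger numbers of clusters, the permutation search would need to be replaced by an assignment solver, since the number of permutations grows factorially with K.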
From Tables 1 and 2, we have the following observations.
(1) When there is no missing value, that is, the missing rate is equal to zero, all K-median algorithms give the same results. As the missing rate increases, in most cases, both the misclassification rate and the prototype error of all algorithms become larger.
(2) When the missing rate is small, the missing data have little adverse effect on the performance of the proposed RKM1. For example, the misclassification rate of RKM1 when the missing rate is around 5% is even smaller than that of RKM1 when the missing rate is zero.
(3) When the missing rate is large, compared with the WDS, PDS, and NPS based K-median algorithms, RKM1 provides clustering results with lower misclassification rates and prototype errors.
(4) Experimental results also show that the interval size affects the performance of RKM1. Specifically, as the value of $\varepsilon$ increases from 0.05 to 0.15, in most cases, the misclassification rate of RKM1 first decreases and then increases. However, when the missing rate is high (20%), RKM1 with a small value of $\varepsilon$ provides the best clustering performance.
The proposed robust K-means algorithm (labelled "RKM2") is also tested on both the Iris and Seeds data sets and compared with the WDS, PDS, and NPS based K-means algorithms. Tables 3 and 4 report the averaged performance of these algorithms over 100 repetitions of each algorithm.
Tables 3 and 4 also validate the robustness of the proposed RKM2 against missing values. When there are missing values, RKM2 provides robust clustering results with smaller misclassification rates and prototype errors than the WDS, PDS, and NPS based K-means algorithms. For example, when the missing rate is 5%, the misclassification rate given by RKM2 with $\varepsilon = 0.10$ on the Seeds data set is only 10.34%, while the best misclassification rate given by the other K-means algorithms is 12.10%. The impact of the interval size on the performance of RKM2 is similar to that on RKM1; that is, in most cases, RKM2 with $\varepsilon = 0.10$ provides the best clustering performance in terms of both misclassification rate and prototype error.
Finally, we test the performance of the proposed robust clustering algorithms on a real-world incomplete data set, the Stone Flakes data set. Following the above discussion, we set $\varepsilon = 0.10$ for both RKM1 and RKM2. Figure 1 shows the numbers of misclassifications of the different algorithms. From Figure 1, we see that RKM1 provides the lowest misclassification rate and RKM2 provides the second best performance.

Conclusion
This paper considers the clustering problem for incomplete data. To reduce the effect of missing values on clustering performance, this paper represents the missing values by interval data and introduces the concept of a robust cluster objective function, which is defined as the worst-case cluster objective function when the missing values vary within the constructed intervals. We then propose a robust clustering model that aims at minimizing the robust cluster objective function. Robust K-median and K-means algorithms are designed to solve the proposed robust clustering problem. The time complexity of the robust K-median and K-means clustering algorithms is $O(nmKT)$ and $O(nmT(K + \log n))$, respectively. Numerical experiments on both artificially generated and real-world incomplete data sets show that the proposed algorithms are robust against missing data and provide better clustering performance than the existing WDS, PDS, and NPS based K-median and K-means algorithms.
Both the K-median and K-means algorithms address the clustering of incomplete data under hard constraints; that is, each object belongs to exactly one cluster. To cluster incomplete data under soft constraints, we will further study robust fuzzy K-median and K-means clustering algorithms in the future.

Figure 1: Numbers of misclassification of different algorithms on the Stone Flakes data set.

Table 1: Performance of different K-median algorithms on the IRIS data.

Table 3: Performance of different K-means algorithms on the IRIS data.

Table 4: Misclassification rates of different K-means algorithms on the Seeds data.