
Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply classical complete-data clustering algorithms, such as K-median and K-means. In practice, however, accurate estimates of the missing values are often hard to obtain, which deteriorates clustering performance. To enhance robustness, this paper represents the missing values by interval data and introduces the concept of a robust cluster objective function. A minimax robust optimization (RO) formulation is presented to provide clustering results that are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.

In the fields of data mining and machine learning, it is common for data sets to contain observations with missing feature values. Such incomplete data occur in a wide array of application domains for various reasons, including flaws in the data collection process, the high cost of obtaining some feature values, and missing responses in questionnaires. For example, online shopping users may rate only a small fraction of the available books, movies, or songs, which leads to massive amounts of missing feature values, Marlin [

Clustering analysis has been regarded as an effective method to extract useful features and explore potential data patterns. Due to the presence of missing feature values, there is an urgent need to cluster incomplete data in many fields, such as image analysis [

Besides the imputation based methods, Hathaway and Bezdek [

Both direct imputation and iterative imputation (such as OCS and NPS) methods assume that each missing feature value can be well estimated by a single value. However, accurate estimates of the missing values are usually hard to obtain, and thus imputation-based clustering methods are sensitive to the estimation accuracy. To address this issue, Li et al. [

Recently, robust optimization has been widely accepted as an effective method to handle uncertain or missing data and used in the field of data mining and machine learning, such as the minimax probability machine [

Specifically, for given cluster prototype and membership matrices, we introduce the concept of a robust clustering objective function, defined as the maximum of the clustering objective function as the missing values vary within the constructed intervals. The proposed algorithms then seek optimal cluster prototype and membership matrices that minimize this robust clustering objective function. For both the robust K-median and robust K-means clustering problems, we give equivalent reformulations of the robust objective function and present effective solution methods. Compared with existing methods, the proposed algorithms are insensitive to estimation errors of the constructed intervals, especially when the missing rate is high. Comparisons and analysis of numerical experimental results on UCI data sets also validate the effectiveness of the proposed robust algorithms.

Compared with existing algorithms, the advantages of the proposed robust clustering algorithms are twofold. First, our algorithms can cluster incomplete data without imputing the missing feature values and provide robust clustering results that are insensitive to estimation errors. Our experiments also validate the effectiveness of the proposed algorithms in terms of robustness and accuracy in comparison with existing algorithms. Second, the proposed algorithms are easy to understand and implement. Specifically, the time complexity of the robust K-median and K-means clustering algorithms is

The paper is organized as follows. Section

Consider the problem of clustering a set of

The task of clustering can be reformulated as an optimization problem, which minimizes the following clustering objective function:

Both algorithms solve the clustering problem in iterative ways as follows.

Set iteration index

Let

Update the cluster prototype matrix

If, for any
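The alternating scheme in the steps above can be sketched as follows. This is an illustrative implementation for complete data only; the deterministic first-K initialization is our own simplification (any K distinct objects would do):

```python
import numpy as np

def cluster(X, K, metric="means", max_iter=100):
    """Classical alternating scheme shared by K-means and K-median
    on complete data.

    metric="means":  squared Euclidean distance, prototypes are cluster means.
    metric="median": city-block (L1) distance, prototypes are coordinate-wise medians.
    """
    n, d = X.shape
    V = X[:K].astype(float).copy()      # Step 0: initial prototypes
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Step 1: membership update -- assign each object to its nearest prototype.
        diff = X[:, None, :] - V[None, :, :]
        D = (diff ** 2).sum(2) if metric == "means" else np.abs(diff).sum(2)
        new_labels = D.argmin(1)
        # Step 2: prototype update from the current memberships.
        for k in range(K):
            members = X[new_labels == k]
            if len(members):
                V[k] = members.mean(0) if metric == "means" else np.median(members, 0)
        # Step 3: stop as soon as no membership changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, V
```

Using `metric="median"` (L1 distance with coordinate-wise medians) gives the K-median variant; both variants stop once the memberships no longer change.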

Due to various reasons, the feature matrix

In practice, it is difficult to obtain accurate estimations of missing feature values. Thus, in this paper, we represent missing values by intervals. Specifically, for any
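As a concrete sketch of the interval representation, the snippet below encloses each missing entry in an interval. The particular rule used here (centre at the mean of the feature's observed values, half-width proportional to the observed feature range) is only an illustrative assumption, not the paper's exact construction:

```python
import numpy as np

def build_intervals(X, theta=0.10):
    """Replace each missing entry (np.nan) of X by an interval.

    Illustrative rule only: centre each interval at the mean of the
    feature's observed values, with half-width theta times the observed
    feature range. Observed entries become degenerate intervals L == U.
    """
    L, U = X.copy(), X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        obs = col[~np.isnan(col)]
        centre, half = obs.mean(), theta * (obs.max() - obs.min())
        miss = np.isnan(col)
        L[miss, j] = centre - half
        U[miss, j] = centre + half
    return L, U
```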

This paper aims at designing robust clustering methods such that the worst-case performance of the cluster output can be guaranteed. The logic of the proposed method can be explained as a two-player game: a clustering decision-maker first makes a clustering decision, and then an adversarial player chooses values of the missing features from the constructed intervals. Thus, a robust clustering decision-maker selects the clustering such that the worst-case cluster objective function is minimized.

To introduce robust clustering problem, we first define the following robust cluster objective function:

(RCP) is a discrete minimax problem. When there is no missing data, that is,
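Because the adversary's inner maximisation in (RCP) is over a box, it decouples across entries, and both |x − v| and (x − v)² attain their maximum over an interval [l, u] at an endpoint. A sketch of evaluating the robust objective this way, for given prototypes and memberships (p = 1 for K-median, p = 2 for K-means):

```python
import numpy as np

def worst_case_objective(L, U, V, labels, p=1):
    """Worst-case clustering objective when each feature value is only
    known to lie in [L[i, j], U[i, j]] (observed entries have L == U).

    The inner maximisation decouples entrywise, and a convex function of
    x attains its maximum over an interval at an endpoint, so the
    adversarial value of each entry is whichever bound is farther from
    the assigned prototype. p=1 gives the K-median objective, p=2 the
    K-means objective.
    """
    P = V[labels]                                     # prototype of each object
    worst = np.maximum(np.abs(L - P), np.abs(U - P))  # endpoint rule
    return float((worst ** p).sum())
```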

In this subsection, we provide a robust K-median clustering algorithm for (RCP) when

For any

For any
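A useful identity behind the K-median case is max over x in [l, u] of |x − v| = |m − v| + r, with midpoint m = (l + u)/2 and radius r = (u − l)/2. Summed over a cluster, the radii contribute only a constant, so the worst-case L1 prototype update reduces to a coordinate-wise median of the interval midpoints. A minimal sketch of this update (our own derivation from the endpoint rule, shown for illustration):

```python
import numpy as np

def robust_kmedian_prototype(Lk, Uk):
    """Prototype update for one cluster in robust K-median (illustrative).

    For x in [l, u], max |x - v| = |m - v| + r, where m = (l + u) / 2 and
    r = (u - l) / 2. Summing over the cluster's objects, the radii add a
    constant, so the coordinate-wise median of the interval midpoints
    minimises the worst-case L1 cost.
    """
    return np.median((Lk + Uk) / 2.0, axis=0)
```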

In this subsection, we provide a robust K-means clustering algorithm for (RCP) when

To minimize

For

Subproblem (

Procedure

Based on the above discussion, the robust K-means clustering algorithm can be described in Algorithm
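For the K-means case, the same midpoint-radius identity gives a worst-case per-coordinate cost of (|m − v| + r)², which is convex in v, so each coordinate of the prototype can be found by a one-dimensional search. The numerical sketch below uses ternary search; the paper's own procedure solves this subproblem exactly, so this is only an illustrative substitute:

```python
import numpy as np

def robust_kmeans_prototype(Lk, Uk, iters=200):
    """Prototype update for one cluster in robust K-means (numerical sketch).

    Per coordinate, the worst-case cost of one object is
    (|m - v| + r)^2 with m = (l + u) / 2 and r = (u - l) / 2, which is
    convex in v; ternary search over [min lower bound, max upper bound]
    finds each coordinate's minimiser.
    """
    m, r = (Lk + Uk) / 2.0, (Uk - Lk) / 2.0
    v = np.empty(Lk.shape[1])
    for j in range(Lk.shape[1]):
        f = lambda t: float(((np.abs(m[:, j] - t) + r[:, j]) ** 2).sum())
        lo, hi = float(Lk[:, j].min()), float(Uk[:, j].max())
        for _ in range(iters):
            t1, t2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
            if f(t1) < f(t2):
                hi = t2
            else:
                lo = t1
        v[j] = (lo + hi) / 2.0
    return v
```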

For any

For any

For any

It is well known that the time complexity of the classical K-median and K-means algorithms is

Specifically, the initialization step of Algorithm

For the robust K-means clustering algorithm, it is easy to see that the first two steps of Algorithm

In addition, it is easy to see that both the robust K-median and robust K-means clustering algorithms have a space complexity of

In this section, we compare the proposed robust clustering algorithms with others on two data sets from the UCI machine learning repository. Section

Two widely used data sets, Iris and Seeds, are used to test the performance of the proposed algorithms. The Iris data set consists of 150 objects, each described by four features of Iris flowers: sepal length, sepal width, petal length, and petal width. It comprises three clusters, Setosa, Versicolour, and Virginica, each containing 50 objects. The optimal cluster prototypes of the Iris data have been reported by Hathaway and Bezdek [

We generate the missing values under the missing completely at random (MCAR) mechanism as in Hathaway and Bezdek [

each object retains at least one feature;

each feature has at least one value present in the incomplete data set.
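A sketch of such a constrained MCAR deletion scheme (illustrative; the deletion order and random generator are our own choices):

```python
import numpy as np

def make_mcar(X, rate, seed=0):
    """Delete entries completely at random (MCAR) at the given rate,
    while keeping the two constraints of the experimental setup:
    every object retains at least one feature, and every feature
    retains at least one observed value."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    target = int(round(rate * n * d))
    Xm = X.astype(float).copy()
    removed = 0
    for flat in rng.permutation(n * d):
        if removed == target:
            break
        i, j = divmod(int(flat), d)
        # Delete only if the row and the column keep another observed entry.
        if (~np.isnan(Xm[i, :])).sum() > 1 and (~np.isnan(Xm[:, j])).sum() > 1:
            Xm[i, j] = np.nan
            removed += 1
    return Xm
```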

In addition to the Iris and Seeds data sets with artificially generated missing values, we also test the proposed algorithms on a real-world incomplete data set, the Stone Flakes data set [

Li et al. [

We first test and compare the performance of the proposed robust K-median algorithm (labelled “RKM1”) on both the Iris and Seeds data sets under different missing rates from

Tables

Performance of different K-median algorithms on the IRIS data.

Misclassification rate:

| Missing rate (%) | WDS | PDS | NPS | RKM1 (0.05) | RKM1 (0.10) | RKM1 (0.15) |
|---|---|---|---|---|---|---|
| 0 | 17.2 | 17.2 | 17.2 | 17.2 | 17.2 | 17.2 |
| 5 | 18.2 | 20.9 | 18.1 | 16.7 | 16.5 | 17.3 |
| 10 | 20.1 | 23.0 | 19.8 | 17.3 | 17.1 | 18.5 |
| 15 | 24.9 | 26.3 | 24.2 | 19.2 | 19.3 | 20.2 |
| 20 | 25.1 | 28.2 | 26.5 | 21.6 | 22.1 | 23.6 |

Prototype error:

| Missing rate (%) | WDS | PDS | NPS | RKM1 (0.05) | RKM1 (0.10) | RKM1 (0.15) |
|---|---|---|---|---|---|---|
| 0 | 0.126 | 0.126 | 0.126 | 0.126 | 0.126 | 0.126 |
| 5 | 0.213 | 0.322 | 0.217 | 0.165 | 0.160 | 0.178 |
| 10 | 0.263 | 0.419 | 0.283 | 0.218 | 0.221 | 0.249 |
| 15 | 0.488 | 0.599 | 0.493 | 0.563 | 0.553 | 0.572 |
| 20 | 2.319 | 2.773 | 2.185 | 1.391 | 1.508 | 1.857 |

Misclassification rates of different K-median algorithms on the Seeds data.

| Missing rate (%) | WDS | PDS | NPS | RKM1 (0.05) | RKM1 (0.10) | RKM1 (0.15) |
|---|---|---|---|---|---|---|
| 0 | 10.62 | 10.62 | 10.62 | 10.62 | 10.62 | 10.62 |
| 5 | 11.76 | 13.48 | 12.43 | 10.14 | 10.23 | 11.73 |
| 10 | 11.24 | 21.81 | 14.95 | 12.24 | 11.67 | 12.49 |
| 15 | 13.38 | 22.67 | 17.05 | 12.81 | 12.65 | 13.35 |
| 20 | 17.86 | 32.48 | 16.05 | 14.38 | 14.19 | 15.94 |

From Tables

When there is no missing value, that is, the missing rate is equal to zero, all K-median algorithms give the same results. As the missing rate increases, in most cases, both the misclassification rate and prototype error of all algorithms become larger.

When the missing rate is small, the missing data have little adverse effect on the performance of the proposed RKM1. For example, the misclassification rate of RKM1 when the missing rate is around

When the missing rate is large, compared with the WDS-, PDS-, and NPS-based K-median algorithms, RKM1 provides clustering results with lower misclassification rates and prototype errors.

Experimental results also show that the interval size affects the performance of RKM1. Specifically, as the value of

The proposed robust K-means algorithm (labelled “RKM2”) is also tested on both Iris and Seeds data sets and compared with the WDS, PDS, and NPS based K-means algorithms. Tables

Performance of different K-means algorithms on the IRIS data.

Misclassification rate:

| Missing rate (%) | WDS | PDS | NPS | RKM2 (0.05) | RKM2 (0.10) | RKM2 (0.15) |
|---|---|---|---|---|---|---|
| 0 | 17.8 | 17.8 | 17.8 | 17.8 | 17.8 | 17.8 |
| 5 | 19.2 | 21.5 | 19.1 | 17.1 | 17.4 | 18.3 |
| 10 | 21.3 | 23.0 | 20.1 | 18.1 | 18.3 | 19.2 |
| 15 | 25.1 | 26.1 | 24.8 | 21.2 | 22.5 | 23.7 |
| 20 | 25.8 | 27.0 | 26.7 | 24.8 | 24.4 | 25.6 |

Prototype error:

| Missing rate (%) | WDS | PDS | NPS | RKM2 (0.05) | RKM2 (0.10) | RKM2 (0.15) |
|---|---|---|---|---|---|---|
| 0 | 0.165 | 0.165 | 0.165 | 0.165 | 0.165 | 0.165 |
| 5 | 0.243 | 0.147 | 0.238 | 0.485 | 0.513 | 0.626 |
| 10 | 0.193 | 0.208 | 0.316 | 0.761 | 0.729 | 0.831 |
| 15 | 0.52 | 0.637 | 0.496 | 1.653 | 1.721 | 1.796 |
| 20 | 2.641 | 2.871 | 2.373 | 2.587 | 2.673 | 2.639 |

Misclassification rates of different K-means algorithms on the Seeds data.

| Missing rate (%) | WDS | PDS | NPS | RKM2 (0.05) | RKM2 (0.10) | RKM2 (0.15) |
|---|---|---|---|---|---|---|
| 0 | 10.76 | 10.76 | 10.76 | 10.76 | 10.76 | 10.76 |
| 5 | 12.10 | 13.29 | 12.71 | 10.62 | 10.34 | 11.26 |
| 10 | 11.05 | 22.62 | 15.38 | 13.67 | 13.28 | 14.61 |
| 15 | 13.48 | 23.67 | 17.38 | 14.33 | 13.97 | 15.56 |
| 20 | 19.00 | 35.86 | 16.95 | 15.81 | 15.16 | 16.33 |

Tables

Finally, we test the performance of the proposed robust clustering algorithm on a real-world incomplete data set, the Stone Flakes data set. From the above discussion, we set

Numbers of misclassifications of different algorithms on the Stone Flakes data set.

This paper considers the clustering problem for incomplete data. To reduce the effect of missing values on the performance of clustering results, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function, which is defined as the worst-case cluster objective function when the missing values vary in the constructed intervals. Then, we propose a robust clustering model which aims at minimizing the robust cluster objective function. Robust K-median and K-means algorithms are designed to solve the proposed robust clustering problem. The time complexity of the robust K-median and K-means clustering algorithms is

Both the K-median and K-means algorithms cluster incomplete data with hard constraints; that is, each object belongs to exactly one cluster. To cluster incomplete data with soft constraints, we will further study robust fuzzy K-median and fuzzy K-means clustering algorithms in the future.

The authors declare that there is no conflict of interest regarding the publication of this paper.

This work was supported by the Major Program of the National Natural Science Foundation of China under Grant nos. 41427806 and 61503211, the Natural Science Foundation of Beijing under Grant 9152002, and the Project of China Ocean Association under Grant DYXM-125-25-02.