Mathematical Problems in Engineering, Hindawi Publishing Corporation, Volume 2015, Article ID 535932, doi:10.1155/2015/535932

Research Article

K-Nearest Neighbor Intervals Based AP Clustering Algorithm for Large Incomplete Data

Cheng Lu (1,2), Shiji Song (1), Cheng Wu (1), and Hui Zhang (1)

(1) Department of Automation, Tsinghua University, Beijing 100084, China
(2) Army Aviation Institute, Beijing 101123, China

Received 15 January 2015; Accepted 2 March 2015; Published 16 September 2015

Copyright © 2015 Cheng Lu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Affinity Propagation (AP) algorithm is an effective algorithm for clustering analysis, but it cannot be directly applied to incomplete data. In view of the prevalence of missing data and the uncertainty of missing attributes, we put forward a modified AP clustering algorithm based on K-nearest neighbor intervals (KNNI) for incomplete data. Based on an Improved Partial Data Strategy, the proposed algorithm estimates the KNNI representation of missing attributes by using the attribute distribution information of the available data. The similarity function is then modified to handle interval data, so that the improved AP algorithm becomes applicable to incomplete data. Experiments on several UCI datasets show that the proposed algorithm achieves impressive clustering results.

1. Introduction

With the development of sensor and database technologies, increasing attention has been paid to the Big Data issue. But such data are often difficult to analyze. Cluster analysis is one of the most common methods of analyzing data: it partitions a set of objects into groups so that the data in each cluster share common traits. Affinity Propagation (AP) is a relatively new clustering algorithm, introduced by Frey and Dueck, which can handle large datasets in a relatively short time and obtain satisfactory results. AP is superior to many other clustering algorithms in terms of processing efficiency and clustering quality, and it requires neither a prespecified number of clusters nor initial cluster centers. Thus, AP has attracted the attention of many scholars, and various improvements have emerged.

As a common and effective clustering algorithm, the original AP clustering algorithm is only applicable to complete data like other traditional clustering algorithms. However, in practice, many datasets suffer from incompleteness due to various reasons, such as bad sensors, mechanical failures to collect data, illegible images due to low pixels and noises, and unanswered questions in surveys. Therefore, some strategies should be employed to make AP applicable to such incomplete datasets.

In the literature, several approaches to handling incomplete data have been proposed, including listwise deletion (LD), imputation, model-based methods, and direct analysis. These methods are closely related at the level of concrete implementation. LD ignores samples with missing values, which may discard a great deal of sample information. Imputation and model-based methods are usually based on the assumption that attributes are missing at random; they substitute the missing values with appropriate estimates to construct a complete dataset. However, imputation can be inefficient, and it often leads to unsatisfactory results. For incomplete data, many methods have been proposed in pattern recognition to reduce the impact of missing values on clustering performance. An important empirically oriented study was done by Dixon. The expectation-maximization (EM) algorithm is a commonly used iterative algorithm based on maximum likelihood estimation for missing data analysis. At present, neither statistical nor machine learning methods for dealing with missing data fully meet practical needs, and the various methods for handling missing data remain to be further optimized.

Though incomplete data appear everywhere, principled clustering methods for such data still deserve further research. Existing research on improved clustering models is mainly concentrated on the fuzzy C-means (FCM) algorithm (Bezdek, 1981). In 1998, imputation and discarding/ignoring were proposed by Miyamoto et al. for handling missing values in FCM. In 2001, Hathaway and Bezdek proposed four strategies to improve the FCM clustering of incomplete data and proved the convergence of the algorithms: the whole data strategy (WDS), the partial distance strategy (PDS), the optimal completion strategy (OCS), and the nearest prototype strategy (NPS). In addition, Hathaway and Bezdek used triangle inequality-based approximation schemes (NERFCM) to cluster incomplete relational data. Li et al. put forward an FCM algorithm based on nearest neighbor intervals for incomplete data. Zhang and Chen introduced a kernel method into the standard FCM algorithm.

However, FCM algorithms are sensitive to the initial centers, which makes the clustering results unstable; when some data are missing, the selection of the initial cluster centers becomes even more important. To address this issue, we consider the AP algorithm, which requires neither initial cluster centers nor the number of clusters. Three strategies for AP clustering of incomplete datasets were proposed in our previous research. Those strategies were simple and easy to implement, directly handling incomplete datasets with the AP algorithm; however, they did not exploit the information the dataset carries about missing attributes, which can degrade clustering quality. In this paper, based on the Improved Partial Data Strategy (IPDS), a modified AP algorithm for incomplete data based on K-nearest neighbor intervals (KNNI-AP) is proposed. First, missing attributes are represented by KNNI on the basis of the IPDS, which makes the representation robust. Second, the clustering problem is transformed into one with interval-valued data, which may provide more accurate clustering results. Third, the AP algorithm simultaneously considers all data points as potential centers, which makes the clustering results more stable and accurate.

The remainder of this paper is organized as follows. Section 2 presents a description of AP algorithm and AP clustering algorithm for interval-valued data (IAP) based on clustering objective function minimization. The KNNI representation of missing attributes and the novel KNNI-AP algorithm are introduced in Section 3. Section 4 presents clustering results of several UCI datasets and a comparative study of our proposed algorithm with KNNI-FCM and other methods for handling missing values using AP. We conclude this work and discuss the future work in Section 5.

2. AP Clustering Algorithm for Interval-Valued Data

2.1. AP Clustering Algorithm

The AP algorithm and the K-means algorithm have similar objective functions, but AP simultaneously considers all data points as potential centers.

Let \(X = \{x_1, x_2, \ldots, x_N\}\) be a complete dataset, where \(x_i \in \mathbb{R}^M\). The goal of AP is to find an optimal exemplar set \(X_C = \{x_{c_1}, x_{c_2}, \ldots, x_{c_k}\}\), \(1 < k < N\), by minimizing the clustering error function

(1) \(J(C) = \sum_{i=1}^{N} d^2(x_i, C(x_i))\),

where \(C(x_i)\) represents the exemplar for a given \(x_i\). Each data point corresponds to exactly one cluster, and each exemplar is an actual data point that serves as the center of its cluster.

First, the AP algorithm takes each data point as a candidate exemplar and calculates the similarity between every pair of sample points. The similarity can be set according to the specific application; common choices include similarity coefficient functions and distance functions such as the Euclidean, Manhattan, and Mahalanobis distances. In the traditional clustering problem, similarity is usually set as the negative squared Euclidean distance:

(2) \(s(i,j) = -d^2(x_i, x_j) = -\|x_i - x_j\|_2^2, \quad i \neq j\),

where \(s(i,j)\), stored in a similarity matrix, represents the suitability of sample \(x_j\) as the exemplar of sample \(x_i\). The diagonal entry \(s(i,i)\), called the "preference," is set for each sample; the greater this value, the more likely the corresponding point is selected as an exemplar. Because all samples are equally suitable as centers, the preferences are set to a common value \(P\). The number of identified exemplars is influenced by \(P\), which can be varied to obtain different numbers of clusters. Frey and Dueck suggested setting the preference to the median of the input similarities (resulting in a moderate number of clusters) or to their minimum (resulting in a small number of clusters). We also employ the Preference Range algorithm to set the preference values and obtain more accurate clustering results.
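As an illustration of (2) and the median-preference default, the similarity matrix can be computed as in the following sketch (NumPy-based; the function name and toy data are ours, not from the paper):

```python
import numpy as np

def similarity_matrix(X, preference=None):
    """Negative squared Euclidean similarities as in Eq. (2); the diagonal
    holds the preference P (median of off-diagonal similarities by default)."""
    N = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=2)          # s(i, j) = -||x_i - x_j||^2
    if preference is None:
        # Frey and Dueck's moderate-cluster default: median similarity
        preference = np.median(S[~np.eye(N, dtype=bool)])
    np.fill_diagonal(S, preference)
    return S

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
S = similarity_matrix(X)
```

Raising the common preference value on the diagonal produces more exemplars; lowering it produces fewer.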

To select appropriate cluster centers, the AP algorithm exchanges two kinds of messages: responsibility \(r(i,j)\) and availability \(a(i,j)\). \(r(i,j)\), sent from sample \(i\) to sample \(j\), reflects how well suited sample \(j\) is to serve as the cluster center for sample \(i\). \(a(i,j)\), sent from sample \(j\) to sample \(i\), reflects how appropriate it is for sample \(i\) to choose sample \(j\) as its exemplar. The message-passing procedure terminates after a fixed number of iterations or when the changes in the messages fall below a threshold.

2.2. AP Clustering Algorithm for Interval-Valued Data (IAP)

The AP algorithm must be adjusted to deal with interval data. Let \(\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}\) be an M-dimensional interval-valued dataset, where the \(i\)th sample is \(\bar{x}_i = \{\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{iM}\}\) with \(\bar{x}_{il} = [x_{il}^-, x_{il}^+]\) \((1 \le l \le M)\). To find the optimal exemplar set \(\bar{X}_C = \{\bar{x}_{c_1}, \bar{x}_{c_2}, \ldots, \bar{x}_{c_k}\}\) \((1 < k < N)\), we minimize the clustering error function

(3) \(J(C) = \sum_{i=1}^{N} d^2(\bar{x}_i, C(\bar{x}_i))\),

where \(C(\bar{x}_i)\) represents the exemplar for a given \(\bar{x}_i\). The similarity becomes

(4) \(s(i,j) = -d^2(\bar{x}_i, \bar{x}_j) = -\|\bar{x}_i - \bar{x}_j\|_2^2, \quad i \neq j\).

The interval Euclidean distance can be defined as

(5) \(\|\bar{x}_i - \bar{x}_j\|_2^2 = \sum_{l=1}^{M} (x_{il}^+ - x_{jl}^+)^2 + \sum_{l=1}^{M} (x_{il}^- - x_{jl}^-)^2 + \sum_{l=1}^{M} \left(\frac{(x_{il}^+ - x_{jl}^+) + (x_{il}^- - x_{jl}^-)}{2}\right)^2\).
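A direct transcription of (5) might look as follows (a sketch; intervals are stored as [lower, upper] pairs, and the names are ours):

```python
import numpy as np

def interval_sqdist(xi, xj):
    """Squared interval Euclidean distance of Eq. (5).
    xi, xj: arrays of shape (M, 2), each row an interval [lower, upper]."""
    lo = xi[:, 0] - xj[:, 0]      # lower-bound differences x_il^- - x_jl^-
    hi = xi[:, 1] - xj[:, 1]      # upper-bound differences x_il^+ - x_jl^+
    mid = (lo + hi) / 2.0         # averaged-difference term of Eq. (5)
    return float(np.sum(hi ** 2) + np.sum(lo ** 2) + np.sum(mid ** 2))

# degenerate intervals (point data): all three sums contribute the same term
d2 = interval_sqdist(np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
```

Note that for degenerate intervals the three sums coincide, so the value is three times the ordinary squared Euclidean distance; since AP only compares similarities, this constant factor is harmless.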

The similarity matrix of \(\bar{X}\) can be calculated accordingly. The two kinds of messages, both initialized to zero, are then updated alternately as follows:

(6)
\(r(i,j) \leftarrow s(i,j) - \max_{j' \neq j} \{a(i,j') + s(i,j')\}\),
\(a(i,j) \leftarrow \min\Big\{0,\; r(j,j) + \sum_{i' \notin \{i,j\}} \max\{0, r(i',j)\}\Big\}, \quad i \neq j\),
\(a(j,j) \leftarrow \sum_{i' \neq j} \max\{0, r(i',j)\}\).

To avoid numerical oscillation, a damping factor \(\lambda\) is introduced:

(7) \(R_t = (1-\lambda)\hat{R}_t + \lambda R_{t-1}, \qquad A_t = (1-\lambda)\hat{A}_t + \lambda A_{t-1}\),

where \(\hat{R}_t\) and \(\hat{A}_t\) denote the messages computed by (6) at iteration \(t\) before damping.

The procedure of IAP can be described as follows.

Input is the similarity matrix S and the preference P .

Output is the clustering result.

Step 1.

Initialize the responsibilities and availabilities to zero: \(r(i,j) = 0\), \(a(i,j) = 0\).

Step 2.

Update the responsibilities.

Step 3.

Update the availabilities.

Step 4.

Terminate the message-passing procedure after a fixed number of iterations or when the changes in the messages fall below a threshold; otherwise, go to Step 2.
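Steps 1-4 can be sketched in vectorized form as follows (an illustrative NumPy implementation of the updates (6) with the damping (7), not the authors' code; the toy similarity matrix at the end is ours):

```python
import numpy as np

def affinity_propagation(S, lam=0.5, max_iter=200, conv_iter=20):
    """Run the damped message-passing loop on a similarity matrix S whose
    diagonal holds the preferences; return the exemplar indices."""
    N = S.shape[0]
    R = np.zeros((N, N))            # responsibilities r(i, j)
    A = np.zeros((N, N))            # availabilities a(i, j)
    last, stable = None, 0
    for _ in range(max_iter):
        # r(i,j) <- s(i,j) - max_{j' != j} {a(i,j') + s(i,j')}
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(N), idx]
        AS[np.arange(N), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(N), idx] = S[np.arange(N), idx] - second
        R = (1 - lam) * Rnew + lam * R
        # a(i,j) <- min{0, r(j,j) + sum_{i' not in {i,j}} max(0, r(i',j))}
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = Anew.diagonal().copy()        # a(j,j) is not clipped at zero
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = (1 - lam) * Anew + lam * A
        exemplars = np.flatnonzero(A.diagonal() + R.diagonal() > 0)
        if last is not None and np.array_equal(exemplars, last):
            stable += 1
            if stable >= conv_iter:        # centers fixed long enough: stop
                break
        else:
            stable = 0
        last = exemplars
    return exemplars

# toy 1-D data: two well-separated pairs should yield two exemplars
pts = np.array([0.0, 0.1, 5.0, 5.1])
S = -(pts[:, None] - pts[None, :]) ** 2
np.fill_diagonal(S, np.median(S[~np.eye(4, dtype=bool)]))
ex = affinity_propagation(S)
```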

3. AP Algorithm for Incomplete Data Based on K-Nearest Neighbor Intervals (KNNI-AP)

3.1. K-Nearest Neighbor Intervals of Missing Attributes

As a common method of handling missing data, neighbor imputation has been widely used in many areas. Imputation approximates the value of a function at an unobserved point from its values at neighboring points. The simplest variant, nearest neighbor imputation, copies the value of the single nearest sample and ignores all other neighbors; it is easy to implement and commonly used. An improved method is K-nearest neighbor imputation, where a missing attribute is replaced by the mean of that attribute over the K nearest neighbors. Subsequently, García-Laencina et al. proposed a K-nearest neighbor imputation method based on weighted distances that combines multiple sources of information. Huang and Zhu introduced a pseudodistance neighbor imputation method. All of these approaches produce point imputations, which cannot fully represent the uncertainty of missing attributes.

To produce a robust estimation, K-nearest neighbor intervals (KNNI) of missing attributes are proposed. Let \(X = \{x_1, x_2, \ldots, x_N\}\) be an M-dimensional incomplete dataset that contains at least one incomplete sample with some (but not all) attribute values missing. For an incomplete sample \(x_i = \{x_{i1}, x_{i2}, \ldots, x_{iM}\}\), its K nearest neighbors must be found first.

On the basis of the Improved Partial Data Strategy (IPDS), the attribute information of both complete samples and the nonmissing attributes of incomplete samples can be fully used. The distance between samples \(a\) and \(b\) is obtained as

(8) \(\|x_a - x_b\|_2 = \sqrt{\sum_{j=1}^{M} d_j(x_{aj}, x_{bj})^2 \times \frac{M}{\omega}}\),

where

(9) \(d_j(x_{aj}, x_{bj}) = \begin{cases} 0, & (1-m_{aj})(1-m_{bj}) = 0, \\ d_N(x_{aj}, x_{bj}), & \text{otherwise}, \end{cases}\)
\(d_N(x_{aj}, x_{bj}) = \dfrac{|x_{aj} - x_{bj}|}{\max(x_j) - \min(x_j)}\),
\(m_{ij} = \begin{cases} 1, & x_{ij} \text{ is missing}, \\ 0, & x_{ij} \text{ is not missing}, \end{cases}\)

with \(1 \le j \le M\) and \(1 \le i \le N\). Here \(d_j(x_{aj}, x_{bj})\) is the distance between the two samples on the \(j\)th attribute, \(\omega\) is the number of attributes observed in both samples, and \(M\) is the total number of attributes. \(\max(x_j)\) and \(\min(x_j)\) are the maximum and minimum of the observed values of the \(j\)th attribute, and \(m_{ij}\) indicates whether an attribute value is missing.
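The IPDS distance of (8)-(9) can be sketched as follows (missing attributes encoded as NaN; the encoding, names, and toy values are our illustrative choices):

```python
import numpy as np

def ipds_distance(xa, xb, ranges):
    """Partial distance of Eqs. (8)-(9): attributes missing (NaN) in either
    sample contribute 0, the rest are range-normalised, and the sum is
    rescaled by M / omega, where omega counts attributes observed in both."""
    M = len(xa)
    both = ~(np.isnan(xa) | np.isnan(xb))        # (1 - m_aj)(1 - m_bj) = 1
    omega = both.sum()
    d = (xa[both] - xb[both]) / ranges[both]     # d_N: normalised differences
    return float(np.sqrt(np.sum(d ** 2) * M / omega))

# three attributes, one missing in sample a; ranges are max(x_j) - min(x_j)
dist = ipds_distance(np.array([0.0, np.nan, 1.0]),
                     np.array([1.0, 2.0, 1.0]),
                     ranges=np.array([1.0, 1.0, 1.0]))
```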

According to the nearest neighbor principle, a sample and its nearest neighbors share the same or similar attributes. Therefore, for a sample \(a\), the range of a missing attribute lies essentially between the minimum and maximum of the corresponding attribute values of its K nearest neighbors. The K-nearest neighbor interval of the sample can thus be determined, and the dataset can be converted into an interval dataset: each missing attribute \(x_{aj}\) is represented by its K-nearest neighbor interval \(\bar{x}_{aj} = [x_{aj}^-, x_{aj}^+]\), and each nonmissing attribute \(x_{cj}\) is rewritten in interval form \(\bar{x}_{cj} = [x_{cj}^-, x_{cj}^+]\) with \(x_{cj}^- = x_{cj}^+ = x_{cj}\), so the original values are unchanged. This yields the interval dataset \(\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}\).

The choice of K is critical for the intervals to represent the missing attributes effectively. If K is too small, the interval values may not express the missing attribute correctly, likely leading to a biased estimation; in the extreme case K = 1, KNNI degrades into NNI. If K is too large, the interval values also cannot correctly characterize the missing attribute values: in the extreme case K = N (the number of samples in the dataset), the missing attribute interval is the range of all samples on that attribute, which is too wide to represent the missing attribute properly. This confuses the attribute characteristics of different clusters and produces unreasonable clustering results. K is related not only to the ratio of missing attributes but also to the distribution of the samples and the relevant clusters. Thus, the choice of K directly affects the accuracy of clustering.

To investigate the choice of K, we randomly selected three classes of three-dimensional data to form a dataset with 900 samples and a missing rate of 15%, and ran KNNI-AP with K ranging from 1 to 50. The test results show that the clustering results stabilize once K exceeds about 10, whereas results for very small K are unstable. Repeating this process for randomly missing data with different dimensions and sample sizes, we found that choosing K as the cube root of the number of samples is appropriate. Therefore, in this paper K is set to the cube root of the number of samples, rounded to the nearest integer.
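Combining the IPDS distance with the cube-root rule for K, the conversion of an incomplete dataset into its KNNI interval form might be sketched as follows (NaN marks missing values; the function name and toy data are ours):

```python
import numpy as np

def knni_intervals(X, K=None):
    """Turn an incomplete dataset (NaN = missing) into interval form:
    missing attributes become [min, max] over the K nearest neighbours
    under the IPDS distance; observed attributes become degenerate
    intervals. K defaults to the rounded cube root of the sample count."""
    N, M = X.shape
    if K is None:
        K = int(round(N ** (1.0 / 3.0)))
    rng = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    rng[rng == 0] = 1.0                     # guard constant attributes
    D = np.zeros((N, N))                    # IPDS distance matrix, Eq. (8)
    for a in range(N):
        for b in range(a + 1, N):
            both = ~(np.isnan(X[a]) | np.isnan(X[b]))
            d = (X[a, both] - X[b, both]) / rng[both]
            D[a, b] = D[b, a] = np.sqrt(np.sum(d ** 2) * M / both.sum())
    lower, upper = X.copy(), X.copy()
    for i in range(N):
        for j in np.flatnonzero(np.isnan(X[i])):
            nbrs = np.argsort(D[i])[1:K + 1]        # K nearest, self excluded
            vals = X[nbrs, j]
            vals = vals[~np.isnan(vals)]            # neighbours observed on j
            lower[i, j], upper[i, j] = vals.min(), vals.max()
    return np.stack([lower, upper], axis=2)         # shape (N, M, 2)

# sample 2 misses attribute 1; its two nearest neighbours are samples 0 and 1
Iv = knni_intervals(np.array([[1.0, 1.0], [1.1, 1.2], [0.9, np.nan],
                              [5.0, 5.0], [5.1, 5.2]]))
```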

3.2. AP Algorithm for Incomplete Data Based on KNNI (KNNI-AP)

KNNI-AP deals with the clustering problem for incomplete data by transforming the dataset into an interval-valued one. The interval \(\bar{x}_{aj} = [x_{aj}^-, x_{aj}^+]\) of a missing attribute is wide if the \(j\)th attribute is dispersed within clusters and narrow if it is compact, so KNNI can better represent the uncertainty of missing attributes. The lower and upper boundaries of a missing attribute interval are determined by the distribution of that attribute within clusters, that is, by the geometrical structure of the clusters, which reflects to some extent the shape of the clusters and the sample distribution of the dataset. This makes the proposed KNNI-AP robust.

For an M -dimensional incomplete dataset X = { x 1 , x 2 , , x N } , the procedure of KNNI-AP can be described as follows.

Step 1.

Set K as the cube root of the sample numbers rounded to the nearest integer.

Step 2.

Compute the distance between each pair of samples based on the IPDS and construct the similarity matrix \(S_1\).

Step 3.

Form the corresponding interval dataset \(\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}\). For each missing attribute \(x_{aj}\), find its K nearest neighbors using \(S_1\); \(x_{aj}\) is represented by \(\bar{x}_{aj} = [x_{aj}^-, x_{aj}^+]\), and each nonmissing attribute \(x_{cj}\) is rewritten in interval form \(\bar{x}_{cj} = [x_{cj}^-, x_{cj}^+]\) with \(x_{cj}^- = x_{cj}^+ = x_{cj}\).

Step 4.

Calculate the similarity matrix \(S\) of \(\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}\). Choose the AP parameters: the maximum number of iterations (default 2000); convergence, declared if the estimated cluster centers stay fixed for convits iterations (default 50); the decreasing step of the preference (default 0.01); and the damping factor (default 0.5).

Step 5.

Apply the Preference Range algorithm to compute the range of the preference. Initialize the preference as \(P = P_{\min} - p_{step}\); on each pass update \(P = P + p_{step}\).

Step 6.

Apply the IAP algorithm to generate \(C\) clusters. If the cluster number is known, judge whether \(C\) equals the given number of clusters; otherwise, calculate a series of Silhouette (Sil) values corresponding to the clustering results with different numbers of clusters.

Step 7.

If the cluster number is known, the algorithm terminates when \(C\) equals the given number of clusters; otherwise, it terminates when Sil is largest.
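The preference scan of Steps 5-7 (for the known-cluster-number case) can be sketched abstractly as follows; here `cluster_count` is a hypothetical stand-in for a full IAP run at a given preference, and all parameter values are illustrative:

```python
def scan_preference(cluster_count, p_min, p_step, target_k, p_max=0.0):
    """Raise the preference P from p_min in p_step increments (Steps 5-7)
    until the clustering produced at P has the target number of clusters;
    return that P, or None if the scan exhausts the range."""
    P = p_min
    while P <= p_max:
        if cluster_count(P) == target_k:   # Step 6/7 check
            return P
        P += p_step                        # Step 5 update
    return None

# toy stand-in: the number of clusters grows as the preference rises
found = scan_preference(lambda P: 1 if P < -5 else (3 if P > -1 else 2),
                        p_min=-10.0, p_step=0.5, target_k=2)
```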

4. Simulation Analysis

4.1. Incomplete Datasets

In order to test the proposed clustering algorithm, we use artificially generated incomplete datasets. The scheme for artificially generating an incomplete dataset X is to randomly select a specified percentage of components and designate them as missing. The random selection of missing attribute values should satisfy the following :

each original feature vector x k retains at least one component;

each attribute has at least one value present in the incomplete dataset X .

That is, no row of the data matrix is completely empty, and no column is completely empty. In the following experiments, we test the performance of the proposed algorithm on commonly used UCI datasets: Iris, Wine, Wisconsin Diagnostic Breast Cancer (WDBC), and Wholesale customers, which are taken from the UCI Machine Learning Repository and are often used as standard databases to test the performance of clustering algorithms.
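The random-missingness scheme with the two constraints above can be sketched as follows (NaN marks a missing component; the rejection rule is our reading of the constraints, and the toy data are illustrative):

```python
import numpy as np

def make_incomplete(X, missing_rate, seed=None):
    """Mark a fraction of entries as missing (NaN) such that every row and
    every column keeps at least one observed value."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    Xm = X.astype(float).copy()
    target = int(round(missing_rate * N * M))
    removed = 0
    for cell in rng.permutation(N * M):    # visit cells in random order
        if removed == target:
            break
        i, j = divmod(int(cell), M)
        observed = ~np.isnan(Xm)
        # skip deletions that would empty row i or column j
        if observed[i].sum() > 1 and observed[:, j].sum() > 1:
            Xm[i, j] = np.nan
            removed += 1
    return Xm

Xm = make_incomplete(np.arange(20.0).reshape(5, 4), missing_rate=0.25, seed=0)
```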

The Iris dataset contains 150 four-dimensional attribute vectors. The Wine dataset used in this paper contains 178 thirteen-dimensional attribute vectors. The WDBC dataset comprises 569 samples with 30 attributes each. The Wholesale customers dataset refers to clients of a wholesale distributor and contains 440 six-dimensional attribute vectors.

4.2. Compared Algorithms

To test the clustering performance, we take AP based on the K -nearest neighbor mean (KNNM-AP), AP based on the nearest neighbor (1NN-AP), AP based on IPDS (IPDS-AP), and FCM based on the K -nearest neighbor interval (KNNI-FCM) as compared algorithms. IPDS-AP directly deals with incomplete dataset using AP algorithm, and the others are imputation algorithms using different methods to handle missing values. For KNNM-AP, missing attributes are calculated by the K -nearest neighbor mean; for 1NN-AP, missing attributes are replaced by the nearest neighbor; for KNNI-FCM, missing attributes are calculated by KNNI similar to KNNM-AP.

4.3. Evaluation Method

To evaluate the quality of clustering results, we use misclassification ratio and Fowlkes-Mallows index .

The Fowlkes-Mallows (FM) index measures clustering performance based on external criteria. In general, the larger the FM value, the better the clustering performance. The FM index is defined as

(10) \(\mathrm{FM} = \frac{a}{\sqrt{w_1 \cdot w_2}} = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}\),

where \(w_1 = a + b\) and \(w_2 = a + c\).

Let \(C_1\) be a clustering structure of the dataset and \(C_2\) a defined partition of the data. We classify a pair of samples \((x_u, x_v)\) from the dataset using the following terms.

SS: if both samples belong to the same cluster of the clustering structure C 1 and to the same group of partition C 2 .

SD: if samples belong to the same cluster of C 1 and to different groups of C 2 .

DS: if samples belong to different clusters of C 1 and to the same group of C 2 .

DD: if both samples belong to different clusters of C 1 and to different groups of C 2 .

Let \(a\), \(b\), \(c\), and \(d\) be the numbers of SS, SD, DS, and DD pairs, respectively. Then \(a + b + c + d = W\), the total number of pairs in the dataset; that is, \(W = N(N-1)/2\), where \(N\) is the total number of samples.

The misclassification rate is the proportion of observations allocated to the incorrect group: the number of incorrect classifications divided by the total number of samples.
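The FM index of (10) follows directly from the SS/SD/DS pair counts; a small sketch (the function name and label encoding are ours):

```python
import numpy as np

def fm_index(labels_pred, labels_true):
    """Fowlkes-Mallows index of Eq. (10) computed from pair counts:
    a = SS pairs, b = SD pairs, c = DS pairs."""
    n = len(labels_pred)
    a = b = c = 0
    for u in range(n):
        for v in range(u + 1, n):
            same_c1 = labels_pred[u] == labels_pred[v]
            same_c2 = labels_true[u] == labels_true[v]
            if same_c1 and same_c2:
                a += 1
            elif same_c1:
                b += 1
            elif same_c2:
                c += 1
    return a / np.sqrt((a + b) * (a + c))
```

The index equals 1 for a perfect match up to relabelling of the clusters, which is why it is preferred over raw label agreement.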

4.4. Experimental Results and Discussion

For the four datasets, the damping factor is \(\lambda = 0.85\), the decreasing step of the preference is \(p_{step} = 0.01\), the maximum number of iterations is \(n_{run} = 2000\), and the convergence condition is \(n_{conv} = 100\). Because missing data were generated randomly, different trials lead to different results, with significant variation from trial to trial. To reduce this variation, Figures 1-4 and Tables 1-4 give results averaged over 30 trials on the incomplete Iris, Wine, WDBC, and Wholesale customers datasets. The figures intuitively reflect the behavior of the algorithms, and the tables accurately characterize their clustering results. In particular, 30 trials were generated for each row in each table, and within each trial the same incomplete dataset was used for every algorithm, so that the results can be compared fairly. In the tables, the optimal solution in each row is highlighted in bold, and the suboptimal solution is italic.

Averaged results of 30 trials using incomplete Iris dataset.

Missing rate (%) Misclassification ratio (%) Fowlkes-Mallows index
IPDS 1NN KNNM KNNI KNNI-FCM IPDS 1NN KNNM KNNI KNNI-FCM
0 10 7.33 7.33 7.33 10.67 0.8306 0.8668 0.8668 0.8668 0.8196
5 10.37 8.51 8.71 7.69 10.53 0.8259 0.8504 0.8477 0.8616 0.8214
10 8.96 8.40 8.67 8.13 10.56 0.8461 0.8516 0.8479 0.8553 0.8209
15 9.98 8.76 8.87 8.49 10.56 0.8342 0.8469 0.8454 0.8504 0.8227
20 10.51 9.24 9.31 8.44 10.27 0.8277 0.8397 0.8392 0.8512 0.8262
25 9.16 9.40 9.40 9.02 10.62 0.8443 0.8377 0.8375 0.8428 0.8235

Averaged results of 30 trials using incomplete Wine dataset.

Missing rate (%) Misclassification ratio (%) Fowlkes-Mallows index
IPDS 1NN KNNM KNNI KNNI-FCM IPDS 1NN KNNM KNNI KNNI-FCM
0 8.99 8.99 8.99 8.99 8.99 0.8303 0.8303 0.8303 0.8303 0.8291
5 16.94 9.24 9.13 8.93 9.19 0.7272 0.8257 0.8277 0.8305 0.8256
10 23.88 9.38 9.07 8.99 9.19 0.6553 0.8230 0.8283 0.8299 0.8259
15 24.72 9.69 9.33 9.30 9.61 0.6417 0.8178 0.8284 0.8245 0.8187
20 24.80 10.34 9.78 9.66 10.06 0.6338 0.8068 0.8162 0.8183 0.8111
25 29.92 10.93 9.97 10.17 10.22 0.6164 0.7970 0.8133 0.8097 0.8088

Averaged results of 30 trials using incomplete WDBC dataset.

Missing rate (%) Misclassification ratio (%) Fowlkes-Mallows index
IPDS 1NN KNNM KNNI KNNI-FCM IPDS 1NN KNNM KNNI KNNI-FCM
0 6.50 6.50 6.50 6.50 7.21 0.8866 0.8866 0.8866 0.8866 0.8758
5 6.49 6.49 6.50 6.46 7.19 0.8868 0.8868 0.8866 0.8872 0.8760
10 6.43 6.48 6.49 6.47 7.21 0.8879 0.8869 0.8868 0.8871 0.8758
15 6.64 6.65 6.49 6.45 7.21 0.8849 0.8847 0.8871 0.8878 0.8758
20 6.57 6.55 6.47 6.46 7.19 0.8863 0.8861 0.8872 0.8875 0.8761
25 6.66 6.46 6.42 6.39 7.21 0.8848 0.8873 0.8880 0.8884 0.8758

Averaged results of 30 trials using incomplete Wholesale dataset.

Missing rate (%) Misclassification ratio (%) Fowlkes-Mallows index
IPDS 1NN KNNM KNNI KNNI-FCM IPDS 1NN KNNM KNNI KNNI-FCM
0 12.73 12.73 12.73 12.73 13.86 0.8084 0.8084 0.8084 0.8084 0.8047
5 15.01 13.02 12.38 11.71 13.61 0.7745 0.8039 0.8118 0.8203 0.8071
10 14.79 12.61 12.41 11.74 13.55 0.7846 0.8105 0.8125 0.8197 0.8076
15 16.59 12.27 12.26 11.97 13.35 0.7523 0.8128 0.8134 0.8192 0.8096
20 13.97 12.30 12.35 11.94 13.45 0.7922 0.8133 0.8149 0.8182 0.8084
25 14.65 12.65 12.50 12.83 13.26 0.7803 0.8094 0.8127 0.8045 0.8099

Averaged clustering results of 30 trials using incomplete Iris dataset.

Averaged clustering results of 30 trials using incomplete Wine set.

Averaged clustering results of 30 trials using incomplete WDBC dataset.

Averaged clustering results of 30 trials using incomplete Wholesale dataset.

To test the clustering performance, the clustering results of KNNI-AP, 1NN-AP, KNNM-AP, IPDS-AP, and KNNI-FCM are compared. From the figures and tables, it can be seen that KNNI-AP, 1NN-AP, and KNNM-AP reduce to regular AP, and KNNI-FCM reduces to regular FCM, for 0% missing data. In the other cases, the different methods for handling missing attributes in AP and FCM lead to different misclassification ratios and FM indices. As the missing rate grows, the uncertainty of the dataset increases; therefore, the misclassification ratio generally increases and the FM index generally decreases. However, because the missing attributes are explicitly handled, the clustering results on incomplete data are sometimes similar to or even better than the results on complete data.

In terms of misclassification ratio, KNNI-AP is always the best performer except on the incomplete Wine dataset with 25% missing attributes, the incomplete WDBC dataset with 10% missing attributes, and the incomplete Wholesale customers dataset with 25% missing attributes. In these three cases, KNNI-AP gives suboptimal solutions, except in the last case, where its result is still better than those of IPDS-AP and KNNI-FCM. As for the FM index, KNNI-AP is always the best performer except on the incomplete Iris and Wine datasets with 25% missing attributes, the incomplete WDBC dataset with 10% missing attributes, and the incomplete Wholesale dataset with 25% missing attributes. In these four cases, KNNI-AP gives suboptimal solutions, except in the last case, where its result is still better than that of IPDS-AP. From the figures and tables, in general, the larger the FM value, the smaller the misclassification ratio, except in the 25% cases of the incomplete Iris and Wholesale datasets. Both the misclassification ratio and the FM index are external criteria, so they evaluate the quality of the clustering results accurately.

KNNI-AP, IPDS-AP, 1NN-AP, and KNNM-AP are all based on the AP algorithm. IPDS-AP ignores missing attributes in incomplete data and scales the partial distances by the reciprocal of the proportion of components used, based on the range of feature values; the distribution information about missing attributes implicitly embodied in the rest of the data is not taken into account. 1NN-AP substitutes each missing attribute with the corresponding attribute of the nearest neighbor and then applies AP to the completed dataset. Similarly, KNNM-AP fills each missing attribute with the mean of that attribute over the K nearest neighbors. Compared with IPDS-AP, KNNI-AP makes full use of the attribute distribution information of the dataset, including the complete data and the nonmissing attributes of incomplete data, since the missing attributes are represented by KNNI on the basis of the IPDS. Compared with the other two methods, KNNI-AP achieves an interval estimation of missing attributes and exploits the improved IAP, which represents the uncertainty of missing attributes and makes the representation more robust. Furthermore, clustering with interval data has an advantage over clustering with point data, because it can express the uncertainty of missing attributes to some degree, thus yielding more accurate clustering performance.

KNNI-AP and KNNI-FCM are both based on KNNI; the difference between them is the clustering algorithm they use. AP has the advantage that it works with any meaningful measure of similarity between data samples. Unlike most prototype-based clustering algorithms (e.g., K-means), AP does not require a vector space structure, and the exemplars are chosen among the observed data samples rather than computed as hypothetical averages of cluster samples. These characteristics make AP clustering particularly suitable for applications in many fields. In our experiments, the clustering results of KNNI-AP are far better than those of KNNI-FCM, and in most cases the other AP-based methods also outperform KNNI-FCM, which shows that AP makes the clustering results more stable and accurate.

5. Conclusion

In this paper, we have studied incomplete data clustering using AP algorithm and presented an AP clustering algorithm based on KNNI for incomplete data. The proposed algorithm is based on the IPDS, estimating the KNNI representation of missing attributes by using K -nearest neighbor interval principle. The proposed algorithm has three main advantages. First, missing attributes are represented by KNNI on the basis of IPDS, which are robust. Second, the interval estimations use the attribute distribution information of datasets sufficiently, which is superior in expressing the uncertainty of missing attributes and enhances the robustness of missing attributes representation. Third, AP algorithm does not require a vector space structure and the exemplars are chosen among the observed data samples and are not computed as hypothetical averages, which makes the clustering results more stable and accurate than other center algorithms.

The results reported in this paper show that our proposed KNNI-AP algorithm is general, simple, and appropriate for the AP clustering with incomplete data. It can be understood that the final clustering results depend on the choice of K and P for KNNI-AP. In the future, our work will focus on the selection of K and P with theoretical basis and the improvement on the similarity measurement of AP when the missing percentage is large, which will be helpful to extend KNNI-AP to solve clustering incomplete data with various missing percentages.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China under Grant nos. 41427806 and 61273233, the Research Fund for the Doctoral Program of Higher Education under Grant nos. 20120002110035 and 20130002130010, the Project of China Ocean Association under Grant no. DY125-25-02, and Tsinghua University Initiative Scientific Research Program under Grant no. 20131089300.

References

1. Lazer D., Kennedy R., King G., Vespignani A. The parable of Google flu: traps in big data analysis. Science, 2014, 343(6176): 1203–1205. 10.1126/science.1248506
2. Frey B. J., Dueck D. Clustering by passing messages between data points. Science, 2007, 315(5814): 972–976. 10.1126/science.1136800
3. Frey B. J., Dueck D. Response to comment on 'Clustering by passing messages between data points'. Science, 2008, 319(5864): 726d.
4. Wang K., Zhang J., Li D., Zhang X., Guo T. Adaptive affinity propagation clustering. Acta Automatica Sinica, 2007, 33(12): 1242–1246.
5. He Y., Chen Q., Wang X., Xu R., Bai X., Meng X. An adaptive affinity propagation document clustering. Proceedings of the 7th International Conference on Informatics and Systems (INFOS '10), IEEE, March 2010, 1–7.
6. Little R. J., Rubin D. B. Statistical Analysis with Missing Data. 2nd ed., John Wiley & Sons, New York, NY, USA, 2002. 10.1002/9781119013563
7. Dixon J. K. Pattern recognition with partly missing data. IEEE Transactions on Systems, Man and Cybernetics, 1979, 9(10): 617–621. 10.1109/tsmc.1979.4310090
8. Dempster A. P., Laird N. M., Rubin D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1977, 39(1): 1–38.
9. Bezdek J. C. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, 1981.
10. Miyamoto S., Takata O., Umayahara K. Handling missing values in fuzzy c-means. Proceedings of the 3rd Asian Fuzzy Systems Symposium, 1998, 139–142.
11. Hathaway R. J., Bezdek J. C. Fuzzy c-means clustering of incomplete data. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2001, 31(5): 735–744. 10.1109/3477.956035
12. Hathaway R. J., Bezdek J. C. Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm. Pattern Recognition Letters, 2002, 23(1–3): 151–160. 10.1016/s0167-8655(01)00115-5
13. Li D., Gu H., Zhang L. A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data. Expert Systems with Applications, 2010, 37(10): 6942–6947. 10.1016/j.eswa.2010.03.028
14. Zhang D.-Q., Chen S.-C. Clustering incomplete data using kernel-based fuzzy C-means algorithm. Neural Processing Letters, 2003, 18(3): 155–162. 10.1023/b:nepl.0000011135.19145.1b
15. Lu C., Song S., Wu C. Affinity propagation clustering with incomplete data. In: Computational Intelligence, Networked Systems and Their Applications, Springer, Berlin, Germany, 2014, 239–248.
16. Ohmann J. L., Gregory M. J. Predictive mapping of forest composition and structure with direct gradient analysis and nearest-neighbor imputation in coastal Oregon, U.S.A. Canadian Journal of Forest Research, 2002, 32(4): 725–741. 10.1139/x02-011
17. Acuna E., Rodriguez C. The treatment of missing values and its effect on classifier accuracy. In: Classification, Clustering, and Data Mining Applications, Springer, Berlin, Germany, 2004, 639–647.
18. García-Laencina P. J., Sancho-Gómez J.-L., Figueiras-Vidal A. R. Pattern classification with missing data: a review. Neural Computing and Applications, 2010, 19(2): 263–282. 10.1007/s00521-009-0295-6
19. Huang X., Zhu Q. A pseudo-nearest-neighbor approach for missing data recovery on Gaussian random data sets. Pattern Recognition Letters, 2002, 23(13): 1613–1622. 10.1016/S0167-8655(02)00125-3
20. UCI Machine Learning Repository, 2014, http://archive.ics.uci.edu/ml/datasets.html
21. Halkidi M., Batistakis Y., Vazirgiannis M. Cluster validity methods: part I. ACM SIGMOD Record, 2002, 31(2): 40–45. 10.1145/565117.565124