The Affinity Propagation (AP) algorithm is an effective algorithm for clustering analysis, but it cannot be directly applied to incomplete data. In view of the prevalence of missing data and the uncertainty of missing attributes, we propose a modified AP clustering algorithm based on
With the development of sensor and database technologies, increasing attention has been paid to the Big Data issue [
As a common and effective clustering algorithm, the original AP algorithm, like other traditional clustering algorithms, is only applicable to complete data. In practice, however, many datasets suffer from incompleteness for various reasons, such as faulty sensors, mechanical failures during data collection, images rendered illegible by low resolution and noise, and unanswered questions in surveys. Therefore, some strategies should be employed to make AP applicable to such incomplete datasets.
In the literature, several approaches to handling incomplete data have been proposed, including listwise deletion (LD), imputation, model-based methods, and direct analysis [
Though incomplete data appear everywhere, principled clustering methods for such data still deserve further research. Existing research on improved clustering models is mainly concentrated on the fuzzy c-means (FCM) algorithm.
However, FCM algorithms are sensitive to the initial centers, which makes the clustering results unstable. In particular, when some data are missing, the selection of the initial cluster centers becomes even more important. To address this issue, we consider the AP algorithm, which requires neither initial cluster centers nor a prespecified number of clusters. Three strategies for AP clustering of incomplete datasets have been proposed in our previous research [
The remainder of this paper is organized as follows. Section
AP algorithm and
Let a complete dataset
First, the AP algorithm takes each data point as a candidate exemplar and calculates the attractiveness between sample points, that is, the similarity between any two sample points. The similarity can be set according to the specific application; similarity measures mainly include similarity coefficient functions and distance functions. Common distance functions are the Euclidean, Manhattan, and Mahalanobis distances. In the traditional clustering problem, the similarity is usually set as the negative of the squared Euclidean distance:
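This similarity computation can be sketched as follows (a minimal NumPy sketch; the function name and the round-off clipping are illustrative, not taken from the paper):

```python
import numpy as np

def similarity_matrix(X):
    """Pairwise similarity s(i, k) = -||x_i - x_k||^2, the usual
    choice of negative squared Euclidean distance for AP clustering."""
    # Squared distances via the identity ||x_i||^2 + ||x_k||^2 - 2 x_i . x_k
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return -np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
S = similarity_matrix(X)
# S[0, 1] = -1.0, S[0, 2] = -4.0
```

In practice the diagonal of this matrix is then overwritten with the preference values before the message passing starts.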
To select appropriate cluster centers, the AP algorithm exchanges two different kinds of messages between points: responsibility (
The AP algorithm should be adjusted to deal with interval data. Let an
The Euclidean distance can be defined as
Similarity matrix of
To avoid numerical oscillation, the damping factor
The procedure of IAP can be described as follows.
Input is the similarity matrix
Output is the clustering result.
Initialize responsibility (
Update the responsibilities.
Update the availabilities.
Terminate the message-passing procedure after a fixed number of iterations or when the changes in the messages fall below a threshold. Otherwise go to Step
As a common method to handle missing data, neighbor imputation has been widely used in many areas [
To produce a robust estimation,
On the basis of the Improved Partial Data Strategy (IPDS), the attribute information of both complete samples and the non-missing attributes of incomplete samples can be fully used. The distance between sample
According to the principle of the nearest neighbor approach, a sample and its nearest neighbors share the same or similar attributes. Therefore, for sample
The selected
We randomly selected three kinds of three-dimensional data to form a dataset. For example, the number of samples is 900, and the missing rate is 15%. The KNNI-AP algorithm is used, respectively, when
The KNNI-AP algorithm proposed here deals with the clustering problem for incomplete data by transforming the dataset into an interval-valued one. The range of the missing attribute interval
For an
Set
The distance between sample
Form the corresponding interval dataset
Calculate the similarity matrix
Apply the Preference Range algorithm to compute the range of the preference. Initialize the preference:
Apply the IAP algorithm to generate
If the cluster number is known, the algorithm terminates when
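The interval construction at the heart of these steps can be sketched as follows. The partial-distance weighting (reciprocal of the fraction of jointly observed components) and the [min, max] interval over the K nearest donors follow common conventions for IPDS and nearest-neighbor imputation; the paper's exact formulas may differ in detail, and the function names are illustrative:

```python
import numpy as np

def partial_distance(x, y):
    """IPDS-style partial distance: compare only coordinates observed
    in both samples, scaled by the reciprocal of the fraction used."""
    both = ~np.isnan(x) & ~np.isnan(y)
    if not both.any():
        return np.inf  # no shared coordinates to compare
    frac = both.sum() / x.size
    return np.sqrt(np.sum((x[both] - y[both]) ** 2) / frac)

def knn_interval_impute(X, k=3):
    """Replace each missing attribute of sample i by the [min, max]
    interval over the k nearest donors (by partial distance) that
    observe it; known values become degenerate intervals [v, v]."""
    n = X.shape[0]
    lo, hi = X.copy(), X.copy()
    for i in range(n):
        miss = np.flatnonzero(np.isnan(X[i]))
        if miss.size == 0:
            continue
        d = np.array([partial_distance(X[i], X[j]) if j != i else np.inf
                      for j in range(n)])
        for a in miss:
            # Donors must observe attribute a and have a finite distance.
            donors = np.flatnonzero(~np.isnan(X[:, a]) & np.isfinite(d))
            nearest = donors[np.argsort(d[donors])[:k]]
            lo[i, a] = X[nearest, a].min()
            hi[i, a] = X[nearest, a].max()
    return lo, hi
```

The resulting pair (lo, hi) is the interval-valued dataset that the IAP algorithm then clusters.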
In order to test the proposed clustering algorithm, we use artificially generated incomplete datasets. The scheme for artificially generating an incomplete dataset
each original feature vector
each attribute has at least one value present in the incomplete dataset
That is, at least one attribute value exists for each feature vector, and at least one value exists for each attribute: no row and no column of the data is entirely empty. In the following experiments, we test the performance of the proposed algorithm on commonly used UCI datasets: Iris, Wine, Wisconsin Diagnostic Breast Cancer (WDBC), and Wholesale customers, which are taken from the UCI Machine Learning Repository [
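The generation scheme above can be sketched as follows, assuming a random masking procedure that respects both constraints; the function `make_incomplete` and its greedy skipping rule are illustrative, not the paper's exact procedure:

```python
import numpy as np

def make_incomplete(X, missing_rate, rng=None):
    """Randomly delete a fraction of attribute values while guaranteeing
    that no row and no column of the result becomes entirely empty."""
    rng = np.random.default_rng(rng)
    n, m = X.shape
    n_missing = int(round(missing_rate * n * m))
    Xm = X.astype(float).copy()
    removed = 0
    for pos in rng.permutation(n * m):
        if removed == n_missing:
            break
        i, j = divmod(pos, m)
        # Skip any deletion that would empty a whole row or column.
        row_left = np.count_nonzero(~np.isnan(Xm[i]))
        col_left = np.count_nonzero(~np.isnan(Xm[:, j]))
        if row_left > 1 and col_left > 1:
            Xm[i, j] = np.nan
            removed += 1
    return Xm
```

Averaging over several such random masks (30 trials in the experiments below) smooths out the variability introduced by which entries happen to be deleted.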
The Iris dataset contains 150 four-dimensional attribute vectors. The Wine dataset used in this paper contains 178 three-dimensional attribute vectors. The WDBC dataset comprises 569 samples, each with 30 attributes. The Wholesale customers dataset refers to clients of a wholesale distributor and contains 440 six-dimensional attribute vectors.
To test the clustering performance, we take AP based on the
To evaluate the quality of the clustering results, we use the misclassification ratio and the Fowlkes-Mallows index [
The Fowlkes-Mallows (FM) index measures clustering performance based on external criteria. In general, the larger the FM value, the better the clustering performance. The FM index is defined as
SS: if both samples belong to the same cluster of the clustering structure
SD: if samples belong to the same cluster of
DS: if samples belong to different clusters of
DD: if both samples belong to different clusters of
The misclassification rate is the proportion of observations allocated to the incorrect group: the number of incorrect classifications divided by the total number of samples.
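Both evaluation measures can be sketched from the pair counts above. The majority-vote mapping used here for the misclassification rate is one common convention, not necessarily the paper's exact procedure:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels_true, labels_pred):
    """FM = SS / sqrt((SS + SD) * (SS + DS)), using the SS/SD/DS
    pair counts over all sample pairs."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            ss += 1          # SS: same cluster in both
        elif same_pred:
            sd += 1          # SD: same cluster, different true group
        elif same_true:
            ds += 1          # DS: different cluster, same true group
    return ss / sqrt((ss + sd) * (ss + ds)) if ss else 0.0

def misclassification_rate(labels_true, labels_pred):
    """Map each cluster to its majority true class, then count errors."""
    errors = 0
    for c in set(labels_pred):
        members = [labels_true[i] for i in range(len(labels_true))
                   if labels_pred[i] == c]
        errors += len(members) - Counter(members).most_common(1)[0][1]
    return errors / len(labels_true)
```

For a perfect clustering the FM index is 1 and the misclassification rate is 0; any merging or splitting of the true groups lowers the former and raises the latter.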
For the four datasets, damping factor
Averaged results of 30 trials using incomplete Iris dataset.

Missing rate (%) | Misclassification ratio (%)             | Fowlkes-Mallows index
                 | IPDS   1NN    KNNM   KNNI   KNNI-FCM    | IPDS    1NN     KNNM    KNNI    KNNI-FCM
0                | 10     7.33   7.33   7.33   10.67       | 0.8306  0.8668  0.8668  0.8668  0.8196
5                | 10.37  –      8.71   –      10.53       | 0.8259  –       0.8477  –       0.8214
10               | 8.96   –      8.67   –      10.56       | 0.8461  –       0.8479  –       0.8209
15               | 9.98   –      8.87   –      10.56       | 0.8342  –       0.8454  –       0.8227
20               | 10.51  –      9.31   –      10.27       | 0.8277  –       0.8392  –       0.8262
25               | 9.16   –      –      –      10.62       | –       0.8377  0.8375  –       0.8235
Averaged results of 30 trials using incomplete Wine dataset.

Missing rate (%) | Misclassification ratio (%)             | Fowlkes-Mallows index
                 | IPDS   1NN    KNNM   KNNI   KNNI-FCM    | IPDS    1NN     KNNM    KNNI    KNNI-FCM
0                | 8.99   8.99   8.99   8.99   8.99        | 0.8303  0.8303  0.8303  0.8303  0.8291
5                | 16.94  9.24   –      –      9.19        | 0.7272  0.8257  –       –       0.8256
10               | 23.88  9.38   –      –      9.19        | 0.6553  0.8230  –       –       0.8259
15               | 24.72  9.69   –      –      9.61        | 0.6417  0.8178  –       –       0.8187
20               | 24.80  10.34  –      –      10.06       | 0.6338  0.8068  –       –       0.8111
25               | 29.92  10.93  –      –      10.22       | 0.6164  0.7970  –       –       0.8088
Averaged results of 30 trials using incomplete WDBC dataset.

Missing rate (%) | Misclassification ratio (%)             | Fowlkes-Mallows index
                 | IPDS   1NN    KNNM   KNNI   KNNI-FCM    | IPDS    1NN     KNNM    KNNI    KNNI-FCM
0                | 6.50   6.50   6.50   6.50   7.21        | 0.8866  0.8866  0.8866  0.8866  0.8758
5                | –      –      6.50   –      7.19        | –       –       0.8866  –       0.8760
10               | –      6.48   6.49   –      7.21        | –       0.8869  0.8868  –       0.8758
15               | 6.64   6.65   –      –      7.21        | 0.8849  0.8847  –       –       0.8758
20               | 6.57   6.55   –      –      7.19        | 0.8863  0.8861  –       –       0.8761
25               | 6.66   6.46   –      –      7.21        | 0.8848  0.8873  –       –       0.8758
Averaged results of 30 trials using incomplete Wholesale dataset.

Missing rate (%) | Misclassification ratio (%)             | Fowlkes-Mallows index
                 | IPDS   1NN    KNNM   KNNI   KNNI-FCM    | IPDS    1NN     KNNM    KNNI    KNNI-FCM
0                | 12.73  12.73  12.73  12.73  13.86       | 0.8084  0.8084  0.8084  0.8084  0.8047
5                | 15.01  13.02  –      –      13.61       | 0.7745  0.8039  –       –       0.8071
10               | 14.79  12.61  –      –      13.55       | 0.7846  0.8105  –       –       0.8076
15               | 16.59  12.27  –      –      13.35       | 0.7523  0.8128  –       –       0.8096
20               | 13.97  –      12.35  –      13.45       | 0.7922  0.8133  –       –       0.8084
25               | 14.65  –      –      12.83  13.26       | 0.7803  0.8094  –       –       0.8045
Averaged clustering results of 30 trials using incomplete Iris dataset.
Averaged clustering results of 30 trials using incomplete Wine set.
Averaged clustering results of 30 trials using incomplete WDBC dataset.
Averaged clustering results of 30 trials using incomplete Wholesale dataset.
To test the clustering performance, the clustering results of KNNI-AP, 1NN-AP, KNNM-AP, IPDS-AP, and KNNI-FCM are compared. From the figures and tables, it can be seen that for 0% missing data KNNI-AP, 1NN-AP, and KNNM-AP reduce to regular AP, and KNNI-FCM reduces to regular FCM. In the other cases, the different ways of handling missing attributes in AP and FCM lead to different clustering results, that is, to different misclassification ratios and FM indices. As the missing rate grows, the uncertainty of the dataset increases; therefore, the misclassification ratio generally increases and the FM index generally decreases. However, because the missing attributes are explicitly handled, the clustering results of the algorithms for incomplete data can sometimes be similar to, or even better than, the results on complete data.
In terms of the misclassification ratio, KNNI-AP is always the best performer except for the incomplete Wine dataset with 25% missing attributes, the incomplete WDBC dataset with 10% missing attributes, and the incomplete Wholesale customers dataset with 25% missing attributes. In these three cases, KNNI-AP gives near-optimal solutions, except in the last case, where its result is still better than those of IPDS-AP and KNNI-FCM. As for the FM index, KNNI-AP is always the best performer except for the incomplete Iris and Wine datasets with 25% missing attributes, the incomplete WDBC dataset with 10% missing attributes, and the incomplete Wholesale dataset with 25% missing attributes. In these four cases, KNNI-AP gives near-optimal solutions, except in the last case, where its result is still better than that of IPDS-AP. From the figures and tables, in general, the larger the FM value, the smaller the misclassification ratio, except for the 25% cases of the incomplete Iris and Wholesale datasets. The misclassification ratio and the FM index, both based on external criteria, thus provide a consistent evaluation of the quality of the clustering results.
KNNI-AP, IPDS-AP, 1NN-AP, and KNNM-AP are all based on the AP algorithm. IPDS-AP ignores the missing attributes of incomplete data and scales the partial distances by the reciprocal of the proportion of components used, based on the range of feature values; the distribution information about the missing attributes that is implicitly embodied in the other data is not taken into account. 1NN-AP substitutes each missing attribute with the corresponding attribute of the nearest neighbor, after which the AP algorithm handles the completed dataset. Similarly, in KNNM-AP the missing attributes are filled in with the mean value of the corresponding attributes of the nearest neighbors. Compared with IPDS-AP, KNNI-AP makes full use of the attribute distribution information of the dataset, including the complete data and the non-missing attributes of the incomplete data, by representing the missing attributes with KNNI on the basis of IPDS. Compared with the other two methods, KNNI-AP achieves an interval estimation of the missing attributes and, taking advantage of the improved IAP, represents the uncertainty of the missing attributes, which makes the representation more robust. Furthermore, clustering with interval data has an advantage over clustering with point data in that it can express the uncertainty of the missing attributes to some degree, thus yielding more accurate clustering performance.
Comparing KNNI-AP with KNNI-FCM, both methods are based on KNNI, and the difference between them is the clustering algorithm they use. AP has the advantage that it works with any meaningful measure of similarity between data samples. Unlike most prototype-based clustering algorithms (e.g.,
In this paper, we have studied incomplete-data clustering using the AP algorithm and presented an AP clustering algorithm based on KNNI for incomplete data. The proposed algorithm builds on the IPDS, estimating the KNNI representation of missing attributes by using
The results reported in this paper show that the proposed KNNI-AP algorithm is general, simple, and appropriate for AP clustering with incomplete data. It should be noted that the final clustering results depend on the choice of
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was partially supported by the National Natural Science Foundation of China under Grant nos. 41427806 and 61273233, the Research Fund for the Doctoral Program of Higher Education under Grant nos. 20120002110035 and 20130002130010, the Project of China Ocean Association under Grant no. DY1252502, and Tsinghua University Initiative Scientific Research Program under Grant no. 20131089300.