An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets

Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe data objects. Accordingly, many clustering methods can process datasets that are either purely numeric or purely categorical. Recently, algorithms that can handle mixed data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets. However, it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed type datasets and an adaptive AP clustering algorithm to cluster them. Several real world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.


Introduction
With the development of information technology and the wide use of computers and networks, the explosion of data in almost all fields provides a totally new perspective for data scientists towards knowledge discovery and future decision making. Because of the urgent need for data processing, researchers have developed new techniques that can extract useful information and knowledge from vast amounts of data. In this context, data mining is an effective and attractive approach to meet these requirements.
Clustering is one of the most commonly encountered data mining techniques implemented to extract knowledge in many areas, some of which are community detection [1], pattern recognition [2, 3], bioinformatics [4], and spatial database applications, for example, GIS or astronomical data [5, 6]. The general purpose of clustering is to partition a dataset consisting of n points embedded in d-dimensional space into k clusters, such that the data points within the same cluster are more similar to each other than to data points in other clusters [7][8][9]. Because of their simplicity and ease of implementation in a wide variety of scenarios, distance-based clustering methods, for instance, k-means, k-medians, k-medoids, and hierarchical clustering, are widely used and deeply researched. The main problems of distance-based clustering methods are defining a proper similarity measure to discriminate the similarity or dissimilarity between different data points and aggregating the most similar elements into appropriate clusters in an unsupervised way. Thus, the problem of clustering can be reduced to the problem of finding a distance function for the data type at hand [10][11][12]. Traditional clustering methods use the Euclidean distance measure to calculate the similarity (or dissimilarity) of two data points [13, 14]. It is suitable for datasets that are purely numeric. Actually, datasets in the real world are more complicated. A large amount of data is mixed, containing both numeric attributes, like height, age, and so forth, and categorical attributes, like male or female, on or off, and so forth. In this case, however, the Euclidean distance measure fails to judge the similarity of two data points when attributes are of categorical or mixed type.
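To make this failure mode concrete, the following small sketch (an illustrative example of ours, not from the paper; the integer encoding of the categorical attribute is our own assumption) shows how Euclidean distance invents an ordering when categorical values are naively encoded as numbers:

```python
import numpy as np

# Two mixed-type records: one numeric attribute (age) and one categorical
# attribute (color), naively encoded as integers: red=0, green=1, blue=2.
a = np.array([30.0, 0.0])  # age 30, red
b = np.array([30.0, 2.0])  # age 30, blue
c = np.array([30.0, 1.0])  # age 30, green

d_ab = np.linalg.norm(a - b)  # Euclidean distance red vs. blue
d_ac = np.linalg.norm(a - c)  # Euclidean distance red vs. green

# Euclidean distance claims red is twice as far from blue as from green,
# an ordering that the categorical attribute never implied.
print(d_ab, d_ac)  # 2.0 1.0
```

Any permutation of the integer codes would change the distances, even though the data are unchanged; this is why a dedicated categorical distance is needed.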
Up to the present, researchers have developed many ways of dealing with mixed data. Similarity based agglomerative clustering (SBAC) [15], a hierarchical agglomerative algorithm presented by Li and Biswas and based on the Goodall similarity measure [16], works well with mixed numeric and categorical attributes, but its amount of computation grows rapidly when clustering large datasets, which is not acceptable. Huang [17] proposed the k-prototypes clustering method, which divides the dataset into two distinct parts, one for the numeric attributes and another for the categorical attributes, and handles the two components separately. Due to the information loss in dealing with cluster centers and the simple binary distance measure between two categorical attribute values in Huang's algorithm, Ahmad and Dey [18] developed a modified cost function, based on a k-means type algorithm, that alleviates the shortcomings of Huang's cost function. In Ahmad and Dey's algorithm, the distance computation between two values of a single categorical attribute considers not only the attribute they belong to but also the other attributes, including the numeric ones. They also proposed an approach for computing the significance of a numeric attribute based on the attribute value distributions within the data. Ji et al. [19] improved Ahmad and Dey's algorithm with a novel fuzzy k-prototypes algorithm integrating the mean and a fuzzy centroid to represent the prototype of a cluster. Like many other fuzzy k-means type algorithms, Ji's algorithm also needs the determination of the fuzzy coefficient value.
The novel affinity propagation clustering (APC) algorithm based on message passing is a more powerful approach proposed by Frey and Dueck [20] in 2007. Traditional distance-based clustering methods require metric similarities, that is, symmetry, nonnegativity, and the triangle inequality. Compared to the traditional approaches, the affinity propagation algorithm's ability to also take general nonmetric similarities as input makes it suitable for exploratory data analysis using unusual measures of similarity [21]. For instance, AP has been used to identify key sentences and air-travel routing on the basis of nonstandard optimization criteria [20]. Furthermore, affinity propagation is a completely data-driven analysis technique that partitions the data points into different clusters and identifies exemplars among them by simultaneously considering all data points as possible exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges [22]. However, the original AP method assumes that features are numeric valued, which means the algorithm cannot process features with categorical or mixed type values.
Based on the AP algorithm and Ahmad and Dey's mixed similarity measure architecture [18], this paper proposes an adaptive affinity propagation clustering method for datasets with mixed numeric and categorical attributes, using a novel similarity measure as the cost function. The key innovative points of the paper are as follows. (1) This paper applies the AP algorithm to cluster datasets with mixed type attributes for the first time. The rest of the paper is organized as follows. We start in Section 2 with a brief review of the affinity propagation clustering algorithm and the distance measure for mixed type datasets. In Section 3, the novel similarity measure for mixed type data is introduced, and then the novel adaptive AP approach is described in detail. Section 4 presents the experimental methodology and results on several benchmark datasets as well as comparisons with the selected baseline algorithms. Discussions and conclusions are given in Section 5.

Background
2.1. Description of AP. Exemplar-based clustering, such as the popular k-centers and k-medians clustering methods, partitions the dataset by identifying a subset of representative elements (exemplars), so that the sum of distances between data points and their exemplars is minimized [23]. Traditional clustering analysis methods usually start with an initialization step in which the algorithm selects k initial data centers as exemplars and allocates the other data points based on their distances to the exemplars. It is obvious that different initial selections lead to different clustering results. On the contrary, AP runs on an entirely different mechanism. Firstly, all data points are considered as potential exemplars and are viewed as nodes in a network. Secondly, a number of real-valued messages are iteratively transmitted along the edges of the network until a relevant set of exemplars and corresponding clusters is identified [20]. Details of the framework are as follows [24].
AP takes as input a matrix of real-valued similarities between data points. Let {s(i, k)}, i = 1, ..., N, k = 1, ..., N, be a set of N^2 real-valued variables, where s(i, k) indicates the similarity between two objects x_i and x_k. AP defines s(i, k) as the negative of the square of their Euclidean distance; that is, s(i, k) = -||x_i - x_k||^2, i ≠ k. The self-similarities s(k, k) are referred to as "preferences" and influence the probability of a point being an exemplar. If there is no a priori knowledge, the preferences are set to a common value so that each data point is regarded as a potential exemplar with equal probability.
As mentioned above, in the AP algorithm data points exchange information by passing messages. Two kinds of messages are produced in the procedure and each takes into account a different kind of competition. One is called "responsibility" r(i, k), which is sent from data point i to candidate exemplar point k. r(i, k) reflects the accumulated evidence for how well suited point k is to serve as the exemplar for point i, taking other potential exemplars for point i into consideration. The responsibility r(i, k) is updated as follows:

r(i, k) = s(i, k) - max_{k' ≠ k} {a(i, k') + s(i, k')}. (1)

The other one is called "availability" a(i, k), gathering evidence from data points as to whether each candidate exemplar would make a good exemplar. It is sent from the candidate exemplar point k to point i, reflecting the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar. Besides, the support from other points that point k should be an exemplar is taken into account. The availability a(i, k) is updated as follows:

a(i, k) = min {0, r(k, k) + Σ_{i' ∉ {i, k}} max {0, r(i', k)}},  i ≠ k. (2)

The "self-availability" a(k, k) reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from the other points. a(k, k) is updated differently, as follows:

a(k, k) = Σ_{i' ≠ k} max {0, r(i', k)}. (3)

After iterative message passing, exemplars can be identified by finding, for each point i, the point k that maximizes a(i, k) + r(i, k). If k = i, point i itself is selected as an exemplar; otherwise, point k is the exemplar of point i. Furthermore, in order to avoid numerical oscillations in some circumstances when updating the messages, a damping factor λ is introduced into the iteration process, applied to both message matrices:

M_t = λ M_{t-1} + (1 - λ) M_t, (4)

where t indicates the iteration number and M stands for either the responsibility or the availability matrix. The primary advantage of the AP algorithm is that it does not need a preassigned number of clusters, unlike k-means methods, which require the value of k to be specified. This is because AP considers each data point as a potential exemplar, and the probability of being an exemplar depends on the shared value of the preference: with a greater preference value, AP generates more clusters. Another advantage is that AP accepts only the collection of similarities as input, which eliminates the need to deal with the raw dataset directly. This is an instrumental feature with which researchers can determine the similarity input matrix using various distance measures that are suitable for the objects of clustering. Moreover, wide-ranging applications [20] of AP demonstrate its ability to process large datasets rapidly and effectively.
However, AP has some limitations as well. The specific value of the preferences for the clustering procedure is a double-edged sword. Frey and Dueck [20] suggested setting the shared value of the preferences to the median of the input similarities, resulting in a moderate number of clusters. If there is no a priori knowledge, it is difficult to determine a suitable preference value, since different values lead to completely different results. In addition, the damping factor λ requires an appropriate value. Equation (4) shows that a larger damping factor makes it harder to fall into oscillations but reduces the convergence rate, while a smaller one results in a faster convergence rate but with a risk of non-convergence by the time the message-passing procedure is terminated. Frey and Dueck [20] suggested a default damping factor of λ = 0.5 to keep a balance between convergence and oscillation.
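As a concrete illustration, the message-passing updates (1)-(4) can be written in a few lines of NumPy. This is a minimal sketch of ours (fixed damping, fixed iteration count, no convergence check), not the authors' implementation; the preferences are assumed to have been placed on the diagonal of the similarity matrix beforehand.

```python
import numpy as np

def affinity_propagation(S, max_iter=200, damping=0.5):
    """Minimal AP message passing on an N x N similarity matrix S.
    Preferences are assumed to already sit on the diagonal of S."""
    N = S.shape[0]
    rows = np.arange(N)
    R = np.zeros((N, N))  # responsibilities r(i, k)
    A = np.zeros((N, N))  # availabilities  a(i, k)
    for _ in range(max_iter):
        # (1): r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf            # exclude the maximizer itself
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new      # (4)
        # (2)/(3): availabilities from the positive responsibilities
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]     # keep raw r(k,k) on the diagonal
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()    # (3): a(k,k) before clipping
        A_new = np.minimum(A_new, 0)       # (2): clip off-diagonal at 0
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new      # (4)
    # each point's exemplar maximizes a(i,k) + r(i,k)
    return (A + R).argmax(axis=1)

# Usage: two well-separated 1-D groups; preference = median similarity.
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
S = -(X[:, None] - X[None, :]) ** 2
np.fill_diagonal(S, np.median(S))
print(affinity_propagation(S))  # two exemplars, one per group
```

With the median preference, the two tight groups each elect one exemplar; raising the preference toward zero would split them into more clusters.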

2.2. Distance and Significance.
Based on Huang's cost function [17], Ahmad and Dey developed a distance and significance computation framework that not only considers the distances between pairs of distinct values within an attribute but also takes their co-occurrence with other attributes into account [18]. Two parts are introduced to generate the distance matrix of a mixed dataset.
The first step is to calculate the distance between each pair of values of a categorical attribute. For the given mixed type dataset, let A_i denote a categorical attribute, in which x and y are two of its values. Let A_j denote another categorical attribute and w a subset of the values of A_j. Accordingly, ~w denotes the complementary set of w. The conditional probability P_j(w | x) is the probability that a data point having value x for A_i has a value belonging to w for A_j. Likewise, P_j(~w | y) denotes the conditional probability that a data point having value y for A_i has a value belonging to ~w for A_j.
The distance between the pair of values x and y of A_i with regard to the attribute A_j and a particular subset w is defined as follows:

d_w(x, y) = P_j(w | x) + P_j(~w | y). (5)

The distance between attribute values x and y for A_i concerning attribute A_j is denoted by δ_ij(x, y) and is given by

δ_ij(x, y) = P_j(W | x) + P_j(~W | y), (6)

where W is the subset w of values of A_j that maximizes the quantity P_j(w | x) + P_j(~w | y). Considering that both P_j(W | x) and P_j(~W | y) lie between 0 and 1.0, in order to restrict the value of δ_ij(x, y) between 0 and 1.0, δ_ij(x, y) is redefined as

δ_ij(x, y) = P_j(W | x) + P_j(~W | y) - 1.0. (7)

For a dataset with m attributes, including categorical attributes and numeric attributes which have been discretized, the distance between two distinct values x and y of any categorical attribute A_i is given by

δ(x, y) = (1 / (m - 1)) Σ_{j=1,...,m, j≠i} δ_ij(x, y). (8)
Using (5) to (8), it is possible to compute the distance between two distinct values of the categorical attributes and of the discretized numeric attributes.
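The maximizing subset W in (6) need not be searched exhaustively: a value u of A_j belongs to W exactly when P_j(u | x) ≥ P_j(u | y), so δ_ij(x, y) = Σ_u max(P_j(u | x), P_j(u | y)) - 1. The following sketch is our own illustration of the scheme in (5)-(8) on a hypothetical toy dataset, not the authors' code:

```python
import numpy as np

def delta_ij(data, i, j, x, y):
    """delta_ij(x, y): distance between values x, y of categorical
    attribute i with respect to attribute j (equations (5)-(7))."""
    rows_x = data[data[:, i] == x]
    rows_y = data[data[:, i] == y]
    total = 0.0  # accumulates P_j(W | x) + P_j(~W | y)
    for u in set(data[:, j]):
        px = np.mean(rows_x[:, j] == u)  # P_j(u | x)
        py = np.mean(rows_y[:, j] == u)  # P_j(u | y)
        total += max(px, py)             # u joins W iff px >= py
    return total - 1.0

def delta(data, i, x, y):
    """Equation (8): average delta_ij over all other attributes j."""
    m = data.shape[1]
    return sum(delta_ij(data, i, j, x, y)
               for j in range(m) if j != i) / (m - 1)

# Hypothetical toy dataset: attribute 0 co-occurs with attribute 1.
data = np.array([["a", "p"], ["a", "p"], ["a", "q"],
                 ["b", "q"], ["b", "q"], ["b", "p"]])
print(delta(data, 0, "a", "b"))  # about 1/3: partially overlapping co-occurrence
```

Because the co-occurrence distributions of "a" and "b" overlap, their distance is a graded value rather than the all-or-nothing 0/1 of a simple matching measure.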
In the second step, the significance of each numeric attribute is determined. The significance of an attribute defines the importance of that attribute in the dataset [25, 26]. To compute the significance of a numeric attribute, it must first be discretized into S intervals. Thus each interval can be assigned a categorical value u[1], u[2], ..., u[S]. Therefore, using (5) to (8), in the same way as it is computed for categorical values, the distance δ(u[r], u[s]) between every pair of discretized values can be computed, and the significance w_i of the numeric attribute is taken as the average of these distances over all pairs of intervals:

w_i = (2 / (S (S - 1))) Σ_{r<s} δ(u[r], u[s]). (9)

Method
The main idea of our proposed algorithm is to obtain a clustering result for mixed data by applying the AP clustering approach with a similarity measure designed for mixed attributes. We develop the method in the following sections.
Set i = 0;
for each numeric attribute A_i in dataset D do
    Compute the similarity matrix based on (10) as the input;
    Calculate the median of the similarities as the shared value of the preference;
    Perform the AP algorithm using (1)-(4) to obtain a clustering result;
    Discretize attribute A_i into S_i intervals according to the clustering result;
    i = i + 1;
end for
Establish a new dataset D', a pure categorical dataset composed of the discretized numeric attributes and the original categorical attributes;
for each attribute A_i in dataset D' do
    Calculate the distance between two distinct values of any categorical attribute using (5)-(8);
    Compute the significance (weight) of each numeric attribute using (9), in which the number of intervals S is replaced by S_i;
end for

Algorithm 1: Pseudocode of computing significance.
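Assuming a categorical distance function delta(u, v) such as the one from Section 2 is available, the significance step at the end of Algorithm 1 reduces to averaging delta over all pairs of interval labels, as in equation (9). A small sketch (the function names and the toy distance table are our own illustration):

```python
from itertools import combinations

def significance(interval_labels, delta):
    """Significance (weight) of a discretized numeric attribute:
    the average co-occurrence distance over all pairs of its
    interval labels, in the spirit of equation (9)."""
    pairs = list(combinations(interval_labels, 2))
    return sum(delta(u, v) for u, v in pairs) / len(pairs)

# Hypothetical distances between three AP-produced intervals u1..u3.
dist = {("u1", "u2"): 0.2, ("u1", "u3"): 0.5, ("u2", "u3"): 0.3}
delta_fn = lambda u, v: dist.get((u, v), dist.get((v, u), 0.0))
w = significance(["u1", "u2", "u3"], delta_fn)
print(w)  # (0.2 + 0.5 + 0.3) / 3
```

An attribute whose intervals co-occur very differently with the other attributes receives a large weight; one whose intervals are interchangeable receives a weight near zero.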

3.1. Improved Similarity Measure. The main advantage of Ahmad and Dey's distance measure is that it treats the distance between a pair of values of an attribute as a function of their co-occurrence probabilities with the sets of values of the other categorical attributes. Therefore, the distance reflects the degree of difference between two categorical values rather than a binary judgment of being the same or different (0 or 1). On the other hand, the significance of an attribute is not user-defined as in other algorithms but is computed from the discretization of the numeric attributes. In other words, the algorithm by itself decides which attribute is more important and assigns a higher weight to it.
However, the distance measure also has some limitations. In the process of discretizing the numeric attributes, the number of intervals S is user-defined for each problem and is the same for all numeric attributes, which causes inaccurate discretization because the algorithm takes no account of the different distributions of the distinct attributes. Besides, one has to test the parameter S to find a suitable number of intervals for discretizing the numeric values.
As mentioned in Section 2.1, the AP algorithm separates data objects into suitable clusters without a preassigned number of classes, since each data point is viewed as a potential exemplar. Therefore, we propose an improved similarity measure based on Ahmad and Dey's work in which the discretization operation is replaced by an AP clustering discretization approach. Data objects are allocated to clusters as naturally as they are distributed. Furthermore, the different numbers of intervals emphasize the distinctions between the attributes, which influence the significance values.
For a given mixed dataset, let A_i, i = 1, ..., p, denote a numeric attribute whose values are {v_1, v_2, ..., v_N}, where p is the number of numeric attributes and N is the number of data points. The similarity between two values v_i and v_j is defined by

s(v_i, v_j) = -(v_i - v_j)^2, (10)

where S can be viewed as an N × N matrix and the similarity s(v_i, v_j) indicates the negative squared error for points v_i and v_j. The novel method for computing the significance of numeric attributes is listed in Algorithm 1. Figure 1 illustrates the performance of three different discretization techniques: the raw data and the results of the equal width, equal frequency, and AP methods are shown in the figure. It shows that the AP method best reflects the distribution regularity of the data points.
Let D denote the N × N similarity matrix, and we define the similarity between two data objects X_i and X_j as follows:

s(X_i, X_j) = -( Σ_{t=1}^{m_r} (w_t (x_it - x_jt))^2 + Σ_{t=1}^{m_c} δ(x_it, x_jt)^2 ), (11)

where Σ_{t=1}^{m_r} (w_t (x_it - x_jt))^2 denotes the distance between objects X_i and X_j over the numeric attributes only, w_t is the significance of the tth numeric attribute described in Section 3.1, and Σ_{t=1}^{m_c} δ(x_it, x_jt)^2 denotes the distance between X_i and X_j in terms of the categorical attributes only. The similarities are set to a negative squared error to match the input form of the AP algorithm.
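A direct transcription of the combined measure can look as follows; this is a sketch with hypothetical argument names, where delta is any categorical distance taking values in [0, 1] and weights holds the significances w_t:

```python
def mixed_similarity(xi, xj, weights, delta, num_idx, cat_idx):
    """Negative squared distance between two mixed records:
    s(Xi, Xj) = -( sum_t (w_t * (x_it - x_jt))^2   over numeric t
                  + sum_t delta(x_it, x_jt)^2      over categorical t )."""
    num = sum((weights[t] * (float(xi[t]) - float(xj[t]))) ** 2
              for t in num_idx)
    cat = sum(delta(xi[t], xj[t]) ** 2 for t in cat_idx)
    return -(num + cat)

# One numeric attribute (weight 0.5) and one categorical attribute with
# a simple 0/1 distance, for illustration only.
simple_delta = lambda a, b: 0.0 if a == b else 1.0
s = mixed_similarity([1.0, "a"], [3.0, "b"], {0: 0.5},
                     simple_delta, num_idx=[0], cat_idx=[1])
print(s)  # -(0.5 * 2)^2 - 1^2 = -2.0
```

Filling an N × N matrix with these values (and the preference on the diagonal) produces the input that the adapted AP algorithm consumes.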

3.2. Adaptive AP Algorithm.
In Section 2.1, we discussed the advantages and limitations of the AP algorithm. The shared value of the preference p is the key quantity that determines the clustering performance as well as the number of classes in the result. In some cases, the target number of clusters K is preassigned, while it is hard to define an appropriate value of p, because there is no one-to-one correspondence between the output number of classes and the value of p: a certain range of values arrives at the same number of clusters, yet with different distributions. To search for the optimal p value for a given number of classes, an adaptive strategy is proposed: after the tth run of AP, which produces K_t clusters, the preference is rescaled by a coefficient α > 1 that is a function of K_t. We name α the coarse tuning coefficient and β the fine tuning coefficient. When the value of K_t is much greater than the target value K, a relatively greater value of α should be employed so that the p value may decrease quickly. On the contrary, when K_t is close to the target K, a smaller value of α should be used. So we set the coarse tuning coefficient as α = √(K_t − K) + 0.5. In this way, the algorithm is able to tune the value of p dynamically according to the current cluster number K_t.
While the coarse tuning strategy makes the algorithm reach the right number of clusters, the fine tuning steps lead to better clustering performance. In the iteration stage where K_t ≠ K, the fine tuning term is switched off (its factor is set to 0); when entering the stage where K_t = K, it is switched on (the factor is set to 1). The value of β is important for scanning the local area to maximize the energy function. Referring to the settings in [27], β is defined as β = 0.01 p_0, where p_0 denotes the initial value of p. The scanning stage may be terminated after the energy function decreases or after a fixed number of fine tuning iterations.
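One plausible reading of the coarse tuning loop is sketched below. The multiplicative update of p and the handling of the K_t < K direction are our assumptions (the text specifies α only for overshooting); run_ap stands for any routine that runs AP with preference p and returns the number of clusters found.

```python
import math

def coarse_tune_preference(run_ap, p0, K_target, max_runs=30):
    """Rerun AP, rescaling the (negative) preference p by
    alpha = sqrt(|K_t - K|) + 0.5 until K_t matches the target K."""
    p, K_t = p0, None
    for _ in range(max_runs):
        K_t = run_ap(p)
        if K_t == K_target:
            break
        alpha = math.sqrt(abs(K_t - K_target)) + 0.5
        # Enlarging |p| lowers the cluster count; shrinking |p| raises it.
        p = alpha * p if K_t > K_target else p / alpha
    return p, K_t

# Stand-in for a real AP run: the cluster count shrinks as |p| grows.
fake_ap = lambda p: max(1, int(10 / abs(p)))
p, K = coarse_tune_preference(fake_ap, -1.0, K_target=2)
print(K)  # 2
```

Because α grows with the gap |K_t − K|, the search takes large steps while far from the target and small steps near it, which is exactly the coarse/fine division of labor described above.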
On the other hand, the damping factor λ is another parameter that controls the convergence and the speed of the algorithm. Our intention is that, in the absence of oscillation, the algorithm should acquire a faster convergence speed. An adaptive mechanism for λ is adopted to balance the contradiction between oscillation and convergence.
Although maintaining a larger λ close to 1 may avoid numerical oscillations much more easily, a corresponding decline in the updating rate of the availabilities and responsibilities becomes inevitable: the algorithm needs more iterations than with a smaller λ to obtain a comparable result. Therefore, a λ that changes along with the iterations of the algorithm is a better choice. According to this conception, we have designed an adaptive mechanism for λ as follows:

λ_t = λ_max − (λ_max − λ_min) (t / T)^c, (12)

where t denotes the current number of iterations and T denotes the maximum number of iterations. λ_max and λ_min denote the maximum and minimum values of λ, respectively. We introduce the coefficient c to adjust the rate of descent of λ: when the value of c is greater than 1, λ declines slowly at first and then sharply. We recommend setting c greater than 1 to guarantee a smooth iteration process.
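The adaptive damping factor described above amounts to a one-line schedule; the parameter names and the default values (λ_max = 0.9, λ_min = 0.5, c = 2) are our illustrative assumptions:

```python
def damping_schedule(t, T, lam_max=0.9, lam_min=0.5, c=2.0):
    """lambda_t = lam_max - (lam_max - lam_min) * (t / T)**c.
    With c > 1 the decline is flat early (stable updates) and
    sharper late (faster convergence)."""
    return lam_max - (lam_max - lam_min) * (t / T) ** c

# Early iterations keep lambda near lam_max; the drop comes late.
vals = [round(damping_schedule(t, 100), 3) for t in (0, 50, 90, 100)]
print(vals)  # [0.9, 0.8, 0.576, 0.5]
```

At t = 0 the schedule returns λ_max and at t = T it returns λ_min, so the message updates are heavily damped exactly while the exemplar competition is still unsettled.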

3.3. The Proposed Algorithm. Based on the above explanations, the pseudocode of the proposed algorithm is listed in Algorithm 2.
Some of the compared algorithms use the same similarity measure, so they obtain close values in the result. Our method clustered 635 of the 690 data objects into their desired clusters, while the other four algorithms correctly clustered 388, 383, 609, and 578 objects, respectively. Our approach thus achieves the best result in this comparison.

Conclusion
Extracting knowledge and information from mixed data meets the urgent needs of real world applications. Affinity propagation is a novel unsupervised clustering method presented in recent years. In this paper, we proposed a new approach for clustering mixed numeric and categorical data based on the AP method. We made contributions in three aspects. Firstly, we extended the AP method to deal with mixed type datasets, removing its numeric data limitation, and the results have shown the feasibility of this extension.
Secondly, an improved mixed similarity measure was proposed to compute the distances between pairs of values of a categorical attribute and to obtain the weight coefficients of the numeric attributes. Finally, we improved the original AP by employing adaptive strategies. Our approach works well not only for mixed data clustering but also for clustering pure numeric or categorical data, which has been demonstrated in the experiments by comparison with other clustering algorithms. The experimental results illustrate the efficiency of the proposed method on several real life mixed type datasets. However, like many other algorithms with parameter tuning problems, we introduce several user-defined parameters, and it is not always clear which value is best for these parameters. Our future work will focus on further improvement of the AP algorithm and its applications in various fields.

(2) This paper proposes a novel mixed similarity measure based on Ahmad and Dey's work. (3) The method improves the original AP clustering algorithm with adaptive strategies.

Figure 1: Different discretization techniques. (a) The raw data, assigned a random y-axis component so that the data points can be distinguished. (b) The data points divided by the equal width discretization method, which obviously does not work well. (c) The data points divided by the equal frequency method, which is better than the equal width method but still not the best. (d) The data points divided by the proposed AP method, which gives the best partition of the three.