A Global-Relationship Dissimilarity Measure for the k-Modes Clustering Algorithm

The k-modes clustering algorithm has been widely used to cluster categorical data. In this paper, we first analyze the k-modes algorithm and its dissimilarity measure. Based on this analysis, we propose a novel dissimilarity measure named GRD. GRD considers not only the relationships between the object and all cluster modes but also the differences among attributes. Finally, experiments were conducted on four real data sets from UCI. The results show that GRD achieves better performance than the two existing dissimilarity measures used in the k-modes and Cao's algorithms.


Introduction
Clustering is an important technique in data mining, and its main task is to group the given data based on some similarity/dissimilarity measures [1]. Most clustering techniques rely largely on distances to measure the dissimilarity between different objects [2][3][4]. However, these methods work only on data sets with numeric attributes, which limits their use in solving categorical data clustering problems [5].
Some researchers have made great efforts to quantify the relationships among different categorical attributes. Guha et al. [6] proposed a hierarchical clustering method termed ROCK, which can measure the similarity between a pair of objects [7]. In ROCK, the number of links is computed as the number of common neighbors between two objects [8]. However, two deficiencies remain: (1) two parameters must be assigned in advance and (2) a large amount of computation is involved [9]. For these reasons, some researchers have developed new algorithms such as QROCK [10], DNNS [11], and GE-ROCK [12] to modify or improve the ROCK algorithm. To remove the numeric-only limitation of the k-means algorithm, Huang et al. [13,14] proposed the k-modes algorithm, which extends the k-means algorithm by using (1) a simple matching dissimilarity measure for categorical attributes; (2) modes in place of means for clustering; and (3) a frequency-related strategy to update modes to minimize the clustering costs [15]. In fact, the idea of simple matching has been used in many clustering algorithms, such as the fuzzy k-modes algorithm [16], the fuzzy k-modes algorithm with fuzzy centroids [17], and the k-prototype algorithm [14]. However, simple matching often results in clusters with weak intrasimilarity [18] and disregards the dissimilarity hidden between categorical values [19].
In this paper, a Global-Relationship Dissimilarity (GRD) measure for the k-modes clustering algorithm is proposed. This dissimilarity measure considers not only the relationships between the object and all cluster modes but also the differences of various attributes instead of simple matching. The clustering effectiveness of k-modes based on GRD (KBGRD) is demonstrated on four standard data sets from the UCI Machine Learning Repository [20].
The remainder of this paper is organized as follows. Section 2 reviews and analyzes the dissimilarity measure used in k-modes. Section 3 proposes the new dissimilarity measure GRD. Section 4 describes the KBGRD algorithm in detail. Section 5 illustrates the performance and stability of KBGRD. Finally, concluding remarks are given in Section 6.

Computational Intelligence and Neuroscience

an object. Practical data usually contain categorical attributes [21]. We first define the term "data set" [22].

k-Modes Dissimilarity Measure. The k-modes clustering algorithm improves on the k-means algorithm [4] by using a simple dissimilarity measure for categorical data. It also adopts a frequency-related strategy to update the modes during clustering so as to minimize the clustering costs. These extensions remove the numeric-only limitation of the k-means algorithm and enable the clustering process to be used on large categorical data sets from real-world databases [22].

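Definition 2.
Let $IS = \{U, A, V, f\}$ be a categorical data set information system as defined in Definition 1, and let $a_j \in A$. For any object $x_i \in U$ and cluster mode $z_l$ with $1 \le l \le k$, $\mathrm{Dis}_0(z_l, x_i)$ is the simple matching dissimilarity measure between object $x_i$ and the mode $z_l$ of the $l$th cluster, which is defined as follows:

$$\mathrm{Dis}_0(z_l, x_i) = \sum_{j=1}^{m} \delta\bigl(f(z_l, a_j), f(x_i, a_j)\bigr). \quad (1)$$

In (1), $m$ is the number of attributes, $f(x_i, a_j)$ is the value of $x_i$ on attribute $a_j$, and $\delta(p, q) = 0$ if $p = q$ and $1$ otherwise.

There are nine objects $\{x_1, x_2, \ldots, x_9\}$ with four attributes $\{A_1, A_2, A_3, A_4\}$ and three initial cluster modes, as shown in Table 1. To determine the appropriate cluster for $x_1$, the dissimilarity between $x_1$ and each of the three cluster modes must be computed. According to (1), $\mathrm{Dis}_0(z_1, x_1) = \mathrm{Dis}_0(z_2, x_1) = \mathrm{Dis}_0(z_3, x_1) = 1$. Therefore, it is impossible to determine exactly to which cluster the object $x_1$ should be assigned.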
Table 1: An artificial data set.

The dissimilarity between an object and a cluster mode should take into account the relationships between the object and all cluster modes as well as the differences among attributes. When the k-modes dissimilarity measure computes the dissimilarity on a given attribute, it simply matches the object against the mode and ignores the differences among attributes. For example, for attribute "A4" in Table 1, the value of almost all objects and cluster modes is "E"; "A4" should contribute more to dissimilarity than other attributes. However, the k-modes dissimilarity treats all attributes equally.
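The tie described above can be reproduced with a minimal Python sketch of the simple matching measure. The object and modes below are hypothetical values chosen for illustration (they are not the contents of Table 1), and the function name `simple_matching` is an assumption of this sketch.

```python
from typing import Sequence


def simple_matching(mode: Sequence[str], obj: Sequence[str]) -> int:
    """Count the attributes on which the object and the mode disagree."""
    if len(mode) != len(obj):
        raise ValueError("mode and object must have the same number of attributes")
    return sum(1 for z, x in zip(mode, obj) if z != x)


# Hypothetical data (not Table 1): one object and three modes over four attributes.
x1 = ("A", "B", "C", "E")
modes = [
    ("A", "B", "D", "E"),  # differs from x1 on the third attribute only
    ("A", "C", "C", "E"),  # differs from x1 on the second attribute only
    ("B", "B", "C", "E"),  # differs from x1 on the first attribute only
]

distances = [simple_matching(z, x1) for z in modes]
print(distances)  # [1, 1, 1]
```

Because every mode is at distance 1, simple matching alone gives no basis for choosing a cluster for this object, which is the situation the text describes.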

Global-Relationship Dissimilarity Measure
Definition 3. Let $IS = \{U, A, V, f\}$ be a categorical data set information system as defined in Definition 1, and let $a_j \in A$. For any object $x_i \in U$ and cluster mode $z_l$ with $1 \le l \le k$, $\mathrm{Dis}(z_l, x_i)$ is the new dissimilarity measure between object $x_i$ and the mode $z_l$ of the $l$th cluster, which is defined as (2). In (2), $m$ is the dimension of the data set; the similarity function $\mathrm{Sim}(\cdot, \cdot)$ and its constraints are given by (3)-(6), where $k$ is the number of cluster modes.
As shown in Table 1, the dissimilarity of $x_1$ with each of the three cluster modes must be computed to determine the cluster to which $x_1$ should be assigned. According to (2)-(6), the three dissimilarities can be obtained. Hence, $x_1$ can be assigned to cluster "2" definitely.

KBGRD Algorithm
In this section, we give the concrete procedure of the k-modes based on GRD (KBGRD) algorithm. In addition, the computational complexity of KBGRD is analyzed.

KBGRD Algorithm Description
Definition 4. Let $IS = \{U, A, V, f\}$ be a categorical data set information system as defined in Definition 1, and let $a_j \in A$. The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective function of the k-modes algorithm is defined as follows:

$$F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, \mathrm{Dis}(z_l, x_i). \quad (7)$$

In (7), $w_{li}$ is a binary variable that indicates whether object $x_i$ belongs to the $l$th cluster: $w_{li} = 1$ if $x_i$ belongs to the $l$th cluster and $0$ otherwise; $Z = [z_1, z_2, \ldots, z_k]$; and $z_l$ is the $l$th cluster mode with categorical attributes $a_1, a_2, \ldots, a_m$.

Update and Convergence Analysis. The steps of the KBGRD algorithm are presented below. Here $Z^{(t)}$ and $W^{(t)}$ denote the cluster modes and the membership matrix at the $t$th iteration, respectively.
In each iteration, $W$ and $Z$ are updated by the following formulae. When $Z$ is given, $W$ is updated by (8). When $W$ is given, $Z$ is updated by (9), where the categories of attribute $a_j$ are $\{a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(n_j)}\}$ and $n_j$ is the number of categories of attribute $a_j$ for $1 \le j \le m$. Now we consider the convergence of the KBGRD algorithm.
Proof. For a given $Z$, we have $F(W, \hat{Z}) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, \mathrm{Dis}(z_l, x_i)$. The update of $W$ assigns each object to the mode with the minimum dissimilarity according to (8), and the dissimilarities between objects and modes are independent of one another. So updating $W$ by (8) minimizes $F(W, \hat{Z})$.
Proof. For a given $W$, $F(\hat{W}, Z)$ can be written in terms of inner sums, all of which are nonnegative and independent. Minimizing $F(\hat{W}, Z)$ is then equivalent to maximizing each inner sum. When $z_{lj} = a_j^{(r)}$ is chosen according to (9), each inner sum is maximized. So updating $Z$ by (9) minimizes $F(\hat{W}, Z)$.
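As an illustration of a frequency-related mode update, the following Python sketch applies the classical k-modes rule, in which each attribute of the mode takes the most frequent category among the objects of the cluster. This is an assumption made for illustration: the exact update (9) used with GRD is not reproduced here and may differ. The function name `update_mode`, the tie-breaking rule, and the sample cluster are all hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Row = Tuple[str, ...]


def update_mode(members: Sequence[Row]) -> Row:
    """Return, for each attribute, the most frequent category among the members."""
    if not members:
        raise ValueError("cannot update the mode of an empty cluster")
    mode = []
    for column in zip(*members):  # iterate attribute by attribute
        counts = Counter(column)
        best = max(counts.values())
        # Ties are broken by the alphabetically smallest category
        # (an arbitrary but deterministic choice made for this sketch).
        mode.append(min(c for c, n in counts.items() if n == best))
    return tuple(mode)


# Hypothetical cluster of four objects with three attributes.
cluster = [("a", "x", "p"), ("a", "y", "p"), ("b", "y", "q"), ("a", "y", "q")]
print(update_mode(cluster))  # ('a', 'y', 'p')
```

Under simple matching, choosing the most frequent category per attribute minimizes the number of mismatches within the cluster, which is why the classical update cannot increase the objective.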

Input: data set U and initial cluster number k; Output: clusters.

Sub function Cluster(U, modes)
Begin:
(1) for l = 0 to k // k is the number of clusters.
(2) for i = 1 to n // n is the number of objects.
(1) for l = 0 to k // k is the number of clusters.
(2) for i = 1 to n // n is the number of objects.

Pseudocodes and Complexity Analysis. The pseudocode of the KBGRD algorithm is presented in Pseudocode 1.
The subfunction Cluster() computes the dissimilarity between each object and each cluster mode and assigns each object to the cluster with the minimum dissimilarity. The subfunction Fun() computes the value of the objective function.
The main function is a controller that controls the iterations of the algorithm. We first choose $k$ distinct objects as initial modes. Line 2 initializes the clusters; Line 3 computes the original clustering result and "new Dissimilarity." Lines 4-9 iteratively update the modes and clusters. When "new Dissimilarity" no longer changes, the iteration stops. Referring to the pseudocode shown in Pseudocode 1, the computational complexity of the KBGRD algorithm is analyzed as follows. We consider only the major computational steps.
We first consider the computational complexity of the two subfunctions. The computational complexity of computing the dissimilarity is $O(k \cdot n \cdot m)$, where $k$ is the number of modes, $n$ is the number of objects in the data set, and $m$ is the dimension of the data set. The computational complexity of assigning the $i$th object to the $l$th cluster is $O(k \cdot n)$. So the computational complexity of updating all clusters is $O(k \cdot n \cdot (m + 1))$, that is, $O(k \cdot n \cdot m)$. The computational complexity of computing the objective function is $O(k \cdot n \cdot m)$.
Suppose that the number of iterations is $t$; then the whole computational cost of the KBGRD algorithm is $O(t(k \cdot n \cdot m) + t(k \cdot n \cdot m)) = 2\,O(t \cdot k \cdot n \cdot m)$, that is, $O(t \cdot k \cdot n \cdot m)$. This shows that the computational cost scales linearly with the number of objects, the number of attributes, and the number of clusters.
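The control flow described in this subsection (a main loop that calls Cluster() and Fun() until the objective stops changing) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the dissimilarity is passed in as a parameter, simple matching is used as a stand-in because the GRD equations are not reproduced here, the mode update is the classical frequency-based rule, and all function names, the `seed` parameter, and the demo data are hypothetical. The limit of 500 iterations follows the experimental setting reported in the paper.

```python
import random
from collections import Counter


def simple_matching(mode, obj):
    """Stand-in dissimilarity: number of mismatching attributes."""
    return float(sum(z != x for z, x in zip(mode, obj)))


def cluster(data, modes, dis):
    """Assign each object to the mode with the smallest dissimilarity."""
    return [min(range(len(modes)), key=lambda l: dis(modes[l], x)) for x in data]


def fun(data, modes, labels, dis):
    """Objective value: total dissimilarity of objects to their assigned modes."""
    return sum(dis(modes[l], x) for x, l in zip(data, labels))


def update_modes(data, labels, modes):
    """Per cluster, take the most frequent category of every attribute."""
    new_modes = []
    for l, old in enumerate(modes):
        members = [x for x, lab in zip(data, labels) if lab == l]
        if not members:  # keep the old mode if a cluster became empty
            new_modes.append(old)
            continue
        new_modes.append(
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        )
    return new_modes


def k_modes(data, k, dis=simple_matching, max_iter=500, seed=0):
    """Main controller: iterate until the objective value no longer changes."""
    rng = random.Random(seed)
    modes = rng.sample(sorted(set(data)), k)  # k distinct objects as initial modes
    labels = cluster(data, modes, dis)
    cost = fun(data, modes, labels, dis)
    for _ in range(max_iter):
        modes = update_modes(data, labels, modes)
        labels = cluster(data, modes, dis)
        new_cost = fun(data, modes, labels, dis)
        if new_cost == cost:  # objective unchanged: stop
            break
        cost = new_cost
    return labels, modes, cost


# Hypothetical demo data: two well-separated groups of identical objects.
data = [("a", "x")] * 3 + [("b", "y")] * 3
labels, modes, cost = k_modes(data, k=2)
print(cost)  # 0.0
```

Each pass of the loop evaluates the dissimilarity for every object, mode, and attribute once in `cluster` and once more (for assigned pairs only) in `fun`, which matches the per-iteration cost discussed above. Passing the dissimilarity as a parameter is a design choice of this sketch that would let a different measure be substituted without changing the controller.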

Experimental Environment and Evaluation Indexes. The experiments were conducted on a PC with an Intel i3 processor and 4 GB of memory running the Windows 7 operating system. All algorithms were implemented in Java using Eclipse.
To evaluate the performance of the clustering algorithms, the evaluation indexes Accuracy (AC) and RandIndex are employed in the experiments. Let $C = \{C_1, C_2, C_3\}$ be the set of three classes in the data set and $P = \{P_1, P_2, P_3\}$ be the set of three clusters generated by the clustering algorithm. Given a pair of objects $(x_i, x_j)$ in the data set, we refer to it as (1) a if both objects belong to the same class in $C$ and the same cluster in $P$; (2) b if the two objects belong to the same class in $C$ and two different clusters in $P$; (3) c if the two objects belong to two different classes in $C$ and to the same cluster in $P$; (4) d if the two objects belong to two different classes in $C$ and two different clusters in $P$.
Let $s_1$, $s_2$, $s_3$, and $s_4$ be the numbers of pairs of types a, b, c, and d, respectively. RandIndex [23] is defined as follows:

$$\mathrm{RandIndex} = \frac{s_1 + s_4}{s_1 + s_2 + s_3 + s_4}.$$

Accuracy (AC) is defined as follows:

$$\mathrm{AC} = \frac{\sum_{l=1}^{k} a_l}{n},$$

where $k$ is the number of clusters, $n$ is the number of objects, and $a_l$ is the number of objects that are correctly assigned to the $l$th cluster ($1 \le l \le k$).
Four categorical data sets from the UCI Machine Learning Repository are used to evaluate the clustering performance: QSAR Biodegradation (QSAR), Chess, Mushroom, and Nursery. Information about the data sets is tabulated in Table 2.
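The two evaluation indexes can be computed with a short Python sketch. RandIndex follows the pair-counting definition above. For AC, the sketch assumes that the objects "correctly assigned" to a cluster are those belonging to its majority class; this matching rule is an assumption of the sketch, since the text does not spell it out. The function names and the sample labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rand_index(classes, clusters):
    """Fraction of object pairs on which the classes and the clusters agree."""
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")
    if len(classes) < 2:
        raise ValueError("at least two objects are required")
    agree = 0
    total = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class == same_cluster:  # pair of type a or type d
            agree += 1
        total += 1
    return agree / total


def accuracy(classes, clusters):
    """AC, crediting each cluster with its majority class (assumed rule)."""
    if len(classes) != len(clusters) or not classes:
        raise ValueError("classes and clusters must be non-empty and equal in length")
    correct = 0
    for c in set(clusters):
        members = [y for y, p in zip(classes, clusters) if p == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(classes)


# Hypothetical labels for six objects.
true_classes = [0, 0, 0, 1, 1, 1]
found_clusters = [0, 0, 1, 1, 1, 1]
print(rand_index(true_classes, found_clusters))  # 10 agreeing pairs out of 15
print(accuracy(true_classes, found_clusters))    # 5 correct objects out of 6
```

In this example the 15 pairs split into 4 of type a, 2 of type b, 3 of type c, and 6 of type d, so RandIndex is (4 + 6) / 15.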

Experimental Results and Analysis.
In the experiments, we compare the KBGRD algorithm with the original k-modes algorithm and Cao's algorithm [24]. The three algorithms are run sequentially on all data sets. Each algorithm requires the number of modes (ClusterNum) as an input parameter. We randomly select ClusterNum distinct objects as the initial cluster modes. The number of iterations of each algorithm is limited to 500.
Note that there are very few missing values in the Mushroom data set; we use the optimal completion strategy to deal with them. In the optimal completion strategy, the missing values in the data set are viewed as additional variables [25,26].
First, we set ClusterNum to the number of classes in the data set. The average RandIndex over ten runs on the four data sets for the three algorithms is summarized in Table 3, and the average AC over ten runs is summarized in Table 4. As shown in Tables 3 and 4, KBGRD achieves the highest RandIndex and AC; that is, it performs better than the other algorithms under the same conditions.
In real-world applications, the number of initial cluster modes is unknown. We evaluated clustering stability by setting different values of ClusterNum (10, 15, 20, 25, 30, and 35) for each data set and used RandIndex to evaluate the clustering results. The average RandIndex over ten runs on the four data sets for the three algorithms is summarized in Tables 5-8, where the last column shows the average RandIndex of each algorithm over the six values of ClusterNum. As shown in Tables 5-8, KBGRD achieves the highest RandIndex; that is, it performs better than the other algorithms on the four data sets. Additionally, KBGRD has the highest stability compared with the other algorithms.

Conclusion
This paper analyzes the advantages and disadvantages of k-modes algorithms for categorical data. Based on this analysis, we propose a novel dissimilarity measure (GRD) for clustering categorical data. This measure is used to improve the performance of the existing k-modes algorithm. The computational complexity of the KBGRD algorithm has been analyzed and is linear in the number of data objects, attributes, and clusters. We have tested the KBGRD algorithm on four real data sets from UCI. Experimental results show that the KBGRD algorithm is effective and stable in clustering categorical data sets.

Conflicts of Interest
The authors declare that they have no conflicts of interest.