With the advent of the
Clustering is a form of unsupervised learning that aims to find underlying structure in unlabeled data. Objects are partitioned into homogeneous groups, or clusters, so that items within a cluster are highly similar to one another and dissimilar to objects in other clusters. Many clustering methods have been proposed and developed over the decades (for a recent survey, see [
For numerical data, the k-means algorithm is among the most widely used clustering methods.
However, the k-means algorithm cannot be applied directly to categorical data.
In this paper, we develop a new clustering method for categorical data based on community detection techniques [
Compared to prior work, our scheme highlights the following features:
We propose a novel clustering method called CDClustering for categorical data using community detection techniques. Our scheme uses a simple heuristic to determine the distance threshold for graph construction. It is also deterministic, as opposed to the traditional k-modes algorithm, whose output depends on random initialization.
We evaluate our scheme on ten real categorical datasets and compare it against random initialization and two other initialization methods. The results show that our technique is more accurate than the competitors in most cases.
The remainder of the paper is organized as follows. Section
As in
In [
Wu et al. [
Along with the heuristics for cluster initialization discussed above, there are many proposals for improving the dissimilarity measure of the standard k-modes algorithm.
There is a vast literature on community detection in graphs. For a recent comprehensive survey, we refer to [
Newman and Girvan [
Many methods for optimizing the modularity have been proposed over the last ten years, such as agglomerative greedy [
In this section, we review several key concepts used in the remainder of the paper.
The Notations table summarizes the notation used throughout this paper.
Let
The
Given a set of data points
The original k-modes algorithm proceeds as follows:
(1) Select k initial modes, one for each cluster.
(2) Allocate each object to the cluster whose mode is nearest to it, and update the mode of that cluster after each allocation using the most frequent value of each attribute.
(3) After all objects have been allocated to clusters, retest the dissimilarity of each object against the current modes. If an object's nearest mode belongs to a cluster other than its current one, reallocate the object to that cluster and update the modes of both clusters.
(4) Repeat (3) until no object has changed clusters after a full pass over the whole dataset.
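A bare-bones, batch-update variant of these steps can be sketched as follows. Note two assumptions: the original algorithm updates modes after every single allocation, while this sketch recomputes them once per pass, and the `init_idx` parameter stands in for the usual random choice of initial modes.

```python
from collections import Counter

def kmodes(X, k, init_idx, max_iter=100):
    """Bare-bones k-modes for a list X of equal-length categorical tuples.

    init_idx lists the indices of the k initial modes (step 1); the classic
    scheme draws them at random, which is why its output varies across runs.
    """
    def dist(a, b):                 # simple matching dissimilarity
        return sum(u != v for u, v in zip(a, b))

    def mode(rows):                 # most frequent value per attribute
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

    modes = [X[i] for i in init_idx]
    labels = [min(range(k), key=lambda c: dist(x, modes[c])) for x in X]
    for _ in range(max_iter):       # steps (3)-(4): reallocate until stable
        modes = [mode([x for x, l in zip(X, labels) if l == c]) or modes[c]
                 for c in range(k)]
        new = [min(range(k), key=lambda c: dist(x, modes[c])) for x in X]
        if new == labels:
            break
        labels = new
    return labels, modes
```

With a fixed `init_idx` the run is deterministic, which is exactly the property the random-initialization variant lacks.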
Given a simple graph
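For reference, the modularity of a partition in its standard Newman-Girvan form is given below; the notation here is the conventional one and may differ from the paper's own symbols.

```latex
Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j),
```

where $A$ is the adjacency matrix, $k_i$ the degree of node $i$, $m$ the number of edges, and $\delta(c_i, c_j) = 1$ exactly when nodes $i$ and $j$ are assigned to the same community.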
Using Figure
Similarly, for the clustering in Figure
Modularity of two different clusterings.
Since its introduction in 2008, the Louvain method has become one of the most widely used algorithms for modularity optimization.
We demonstrate the Louvain method in Figure
Louvain method for modularity optimization.
This greedy agglomerative algorithm has several advantages as stated in [
Note that in the Louvain method, moving a node to gain better modularity is restricted to neighboring (connected) communities. Therefore, each detected community belongs to one and only one connected component; in other words, a community never spans different connected components of the graph.
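This property is easy to check empirically. The sketch below uses NetworkX's Louvain implementation (which may differ in minor details from the original method) and verifies that no detected community straddles two connected components:

```python
import networkx as nx

# Two disjoint triangles: since Louvain only merges nodes into
# neighboring (connected) communities, no detected community can
# straddle the two connected components.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2)])        # component A
G.add_edges_from([(10, 11), (11, 12), (10, 12)])  # component B

communities = nx.community.louvain_communities(G, seed=42)
components = list(nx.connected_components(G))

# every community is a subset of exactly one connected component
assert all(any(c <= comp for comp in components) for c in communities)
```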
To build the graph, we connect two objects by an edge whenever the Hamming distance between them is at most a threshold R.
In this paper, we propose a simple heuristic to estimate
In other words, given the cumulative distribution function (CDF) of pairwise distances, we can estimate
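One concrete reading of this CDF-based heuristic is to take the smallest distance at which the empirical CDF reaches a target fraction of all pairs. The function name and the `target_fraction` knob below are illustrative assumptions, not the paper's exact rule:

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance between two equal-length categorical tuples."""
    return sum(u != v for u, v in zip(a, b))

def estimate_threshold(X, target_fraction):
    """Smallest distance R such that at least target_fraction of all
    pairwise Hamming distances are <= R (a cut on the empirical CDF).

    target_fraction is an assumed tuning knob, not the paper's exact rule.
    """
    dists = sorted(hamming(a, b) for a, b in combinations(X, 2))
    idx = max(0, int(target_fraction * len(dists)) - 1)
    return dists[idx]
```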
Estimation of the distance threshold R from the CDF of pairwise Hamming distances; one panel per dataset (Soybean, Mushroom, Zoo, LungCancer, BreastCancer, Dermatology, Vote, Nursery, Chess, Heart).
We also observe that the
Now we describe our community detection-based clustering scheme (named CDClustering), which is outlined in Algorithm
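A simplified sketch of this pipeline, covering pairwise Hamming distances, threshold-graph construction, and Louvain community detection, is given below. The post-processing used when the number of communities differs from the desired number of clusters is omitted, and the helper name `cd_clustering` is ours:

```python
from itertools import combinations
import networkx as nx

def cd_clustering(X, R, seed=1):
    """Sketch of the pipeline: connect objects whose Hamming distance is
    at most R, then cluster via Louvain community detection. Fixing the
    seed keeps the run deterministic."""
    n = len(X)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i, j in combinations(range(n), 2):
        if sum(u != v for u, v in zip(X[i], X[j])) <= R:
            G.add_edge(i, j)
    labels = [None] * n
    for c, comm in enumerate(nx.community.louvain_communities(G, seed=seed)):
        for i in comm:
            labels[i] = c
    return labels
```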
The complexity of CDClustering is dominated by the computation of all pairwise Hamming distances and by the Louvain method. All pairwise Hamming distances are computed in O(n²m) time for n objects with m categorical attributes.
Comparison of time complexity.
Clustering method  Time complexity
Cao et al. [
Khan and Ahmad [
CDClustering
In this section, we evaluate the performance of the proposed scheme. The real-world datasets and evaluation metrics are described in Sections
We pick ten purely categorical datasets from the UCI Machine Learning Repository [
Table
Dataset properties.
Dataset  #objects  #attributes  #classes  R  avg.intra.dist  avg.inter.dist  AC  #edges  #comp  top  Runtime (ms)
Soybean  47  21  4  7  5.54  12.48  -  246  3  47  31
Mushroom  8,124  22  2  11  10.11  12.68  0.7244  12,924,407  1  5,366  8,284
Zoo  101  16  7  2  2.40  7.75  -  701  7  100  109
LungCancer  32  56  3  22  24.20  26.09  0.5938  160  4  29  110
BreastCancer  699  9  2  6  4.27  7.84  -  115,068  1  699  93
Dermatology  366  34  6  10  11.23  16.41  -  9,588  4  364  125
Vote  435  16  2  8  6.41  10.81  -  47,102  1  432  93
Nursery  12,960  8  5  3  5.03  5.67  0.4156  5,721,840  1  12,960  2,543
Chess  3,196  36  2  9  9.65  10.35  0.6004  2,395,174  1  2,389  2,012
Heart  303  13  5  5  6.58  7.85  -  7,755  2  302  94
The column R shows the Hamming distance threshold estimated from the CDF (see Figure
To evaluate the performance of clustering algorithms, we use the same metrics as in [
We demonstrate how to find the best confusion matrix and compute the precision, recall, and accuracy metrics in the following example.
Assume that a dataset of ten objects belongs to three ground-truth classes, labeled 1, 2, and 3, and that a clustering algorithm assigns the objects to three clusters, labeled a, b, and c, as shown in the table below.
To find the best confusion matrix
The best mapping in this case is a → 3, b → 1, c → 2, which matches 5 of the 10 objects and hence yields an accuracy of 0.5.
Ground-truth and predicted labels.
Object id  1  2  3  4  5  6  7  8  9  10
Ground-truth label  1  1  1  2  2  2  2  3  3  3
Predicted label  b  b  a  b  a  c  b  a  a  c
A confusion matrix.
   1  2  3
a  1  1  2
b  2  2  0
c  0  1  1
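The best mapping can be found by brute force over all cluster-to-class permutations, which is feasible for small numbers of clusters. The sketch below reuses the labels from the example table:

```python
from itertools import permutations

# Ground-truth and predicted labels from the example table above.
truth = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
pred = ['b', 'b', 'a', 'b', 'a', 'c', 'b', 'a', 'a', 'c']

clusters = sorted(set(pred))    # ['a', 'b', 'c']
classes = sorted(set(truth))    # [1, 2, 3]

def matches(mapping):
    """Number of objects whose mapped cluster label equals the true class."""
    return sum(mapping[p] == t for p, t in zip(pred, truth))

# Try every cluster-to-class mapping and keep the best one.
best = max((dict(zip(clusters, perm)) for perm in permutations(classes)),
           key=matches)
accuracy = matches(best) / len(truth)
```

For larger numbers of clusters the same maximization can be done in polynomial time with the Hungarian algorithm instead of enumerating permutations.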

For comparison, we choose the algorithms by Cao et al. [
The clustering results for the ten categorical datasets are summarized in Tables
Accuracy: our scheme outperforms or equals the other methods in 7 cases, with particularly large margins on the LungCancer, BreastCancer, Dermatology, and Nursery datasets.
Precision: our scheme outperforms or equals other methods in 7 cases.
Recall: our scheme outperforms or equals other methods in 7 cases.
Clustering results for soybean small data.
Confusion matrix
Class  

D1  D2  D3  D4  
D1  10  0  0  0 
D2  0  10  0  0 
D3  0  0  10  0 
D4  0  0  0  17 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.8044 

0.9787 

PR  0.7969 

0.9773 

RE  0.8005 

0.9853 

Clustering results for Mushroom data.
Confusion matrix
Class  

Poisonous  Edible  
Poisonous  4093  2124 
Edible  115  1792 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.7206 

0.8288  0.7244 
PR  0.7448 

0.8688  0.7990 
RE  0.7167 

0.8228  0.7151 
Clustering results for Zoo data.
Confusion matrix
Class  

a  b  c  d  e  f  g  
a  37  0  0  0  0  0  0 
b  0  13  0  0  0  0  1 
c  0  0  20  0  0  0  1 
d  0  0  0  9  8  0  0 
e  4  0  0  0  0  0  0 
f  0  0  0  0  0  4  3 
g  0  0  0  1  0  0  0 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.7041  0.6733 

0.8218 
PR  0.5876  0.5996 

0.5688 
RE  0.5893  0.6233 

0.6861 
Clustering results for LungCancer data.
Confusion matrix
Class  

a  b  c  
a  5  2  2 
b  5  7  1 
c  3  0  7 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.5227  0.5313  0.4375 

PR  0.5590  0.5833  0.4468 

RE  0.5283  0.5393  0.4470 

Clustering results for BreastCancer data.
Confusion matrix
Class  

Benign  Malignant  
Benign  432  8 
Malignant  26  233 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.8174  0.9113  0.6323 

PR  0.8283  0.9292  0.5535 

RE  0.7996  0.8773  0.5336 

Clustering results for Dermatology data.
Confusion matrix
Class  

Seborrheic dermatitis  Psoriasis  Lichen planus  Chronic dermatitis  Pityriasis rosea  Pityriasis rubra pilaris  
Seborrheic dermatitis  61  0  2  0  49  0 
Psoriasis  0  111  0  0  0  0 
Lichen planus  0  0  70  0  0  0 
Chronic dermatitis  0  1  0  52  0  0 
Pityriasis rosea  0  0  0  0  0  1 
Pityriasis rubra pilaris  0  0  0  0  0  19 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.5683  0.5984  0.6175 

PR  0.5318  0.5548  0.6841 

RE  0.5028  0.5393  0.6165 

Clustering results for Nursery data.
Confusion matrix
Class  

Not_recom  Recommend  Very_recom  Priority  Spec_prior  
Not_recom  1440  0  132  1484  1264 
Recommend  0  0  0  0  0 
Very_recom  0  0  0  0  0 
Priority  1440  2  196  1924  758 
Spec_prior  1440  0  0  858  2022 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.3331  0.3673  0.2804 

PR  0.2902  0.2978  0.2304 

RE 

0.2273  0.2044  0.2569 
Clustering results for Congressional Vote data.
Confusion matrix
Class  

Republican  Democrat  
Republican  160  48  
Democrat  8  219 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.8603  0.8644  0.8506 

PR  0.8554  0.8568  0.8484 

RE  0.8732  0.8730  0.8672 

Clustering results for Chess data.
Confusion matrix
Class  

Win  Nowin  
Win  1562  0  
Nowin  1277  357 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.6390 

0.7040  0.6004 
PR  0.5184  0.5449  0.5312 

RE  0.5394  0.5806  0.5540 

Clustering results for HeartDisease data.
Confusion matrix
Class  

0  1  2  3  4  
0  99  9  0  1  0 
1  24  10  4  2  2 
2  14  25  30  28  8 
3  20  9  1  2  1 
4  7  2  1  2  2 
Performance comparison
Random  Cao  Khan  Proposed  

AC  0.3895  0.3069  0.4422 

PR  0.3159  0.2763 

0.3271 
RE  0.3219  0.2641  0.3467 

To better understand the performance of CDClustering, we revisit Table
Rather than using the
Dataset with
Set of
Number of clusters
Hamming distance between
Hamming distance threshold
Simple graph for
The author declares that there are no conflicts of interest regarding the publication of this paper.