
A novel hybrid clustering method, named KCM, is proposed; it combines the K-means (KM) and fuzzy C-means (CM) algorithms to reduce the processing time of CM while keeping the accuracy close to that of CM.

Clustering is a method of separating data into relevant categories or clusters, grouping similar objects together and keeping distinctly different ones apart. Being an unsupervised approach, it helps to recognize and extract hidden patterns within the data. Distance measures, such as the Euclidean and Manhattan distances as special cases of the Minkowski distance, play an important role in clustering algorithms. Clustering techniques enjoy several advantages: they require neither domain knowledge nor labeled data, and they can deal with a wide variety of data, including noise and outliers.
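To make the special-case relationship concrete, here is a minimal sketch of the Minkowski distance, of which the Manhattan ($p=1$) and Euclidean ($p=2$) distances are particular cases; the function name and sample vectors are ours for illustration only:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # Manhattan distance: 7.0
print(minkowski(x, y, 2))  # Euclidean distance: 5.0
```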

Clustering methods may be categorized into two general types: hard and soft. Hard clusters possess well-defined boundaries, with each object assigned to exactly one cluster; the KM algorithm is a typical example. In soft (fuzzy) clustering, an object may belong to several clusters with different degrees of membership, as in the CM algorithm.

We begin with a review of the current literature on classical and fuzzy clustering methods. Huang [

The present study proposes a hybrid clustering algorithm, named KCM, that combines the KM and CM algorithms with the aim of improving the processing time of the CM method. The performances of the KM, CM, and KCM techniques are then compared in terms of accuracy and processing time using data simulated from sub-Gaussian distributions. The methods are also applied to three standard real datasets to determine and compare the precision and accuracy of the investigated algorithms. Finally, KM, CM, and KCM are compared using Minkowski distances. The objective is to identify the combinations of clustering method and distance measure with the highest precision, accuracy, and cluster quality in terms of compactness and distinctiveness.

By definition, clustering groups a sample of vectors into clusters such that vectors within the same cluster are more similar to one another than to vectors in other clusters.

KM is one of the most popular clustering algorithms [

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ data objects and let $c_1, c_2, \ldots, c_k$ be the initial centers. KM seeks a hard partition $C_1, \ldots, C_k$ that minimizes the within-cluster sum of squares

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 .$$

Starting from the initial centers, the algorithm alternates two steps until the centers stabilize: each object is assigned to its nearest center,

$$C_j = \bigl\{ x_i : \lVert x_i - c_j \rVert \le \lVert x_i - c_l \rVert,\; l = 1, \ldots, k \bigr\},$$

and each center is recomputed as the mean of its cluster,

$$c_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .$$
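The KM iteration can be sketched in a few lines. This is a minimal pure-Python illustration (the paper's experiments use R); the function name, seed, and toy data are our own choices:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each object goes to its nearest center.
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
                  for p in points]
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members))
                               if members else centers[j])
        if new_centers == centers:  # centers stabilized: converged
            break
        centers = new_centers
    return centers, labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, 2)
```

On the two well-separated toy blobs above, the assignment/update loop converges in a couple of iterations regardless of which two points are sampled as initial centers.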

In CM (fuzzy C-means), each object $x_i$ belongs to every cluster $j$ with a membership degree $u_{ij} \in [0, 1]$ satisfying $\sum_{j=1}^{k} u_{ij} = 1$. The algorithm minimizes the fuzzy objective

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \lVert x_i - c_j \rVert^2, \qquad m > 1,$$

where $m$ is the fuzzifier controlling the degree of overlap between clusters. Minimization alternates the membership update

$$u_{ij} = \left( \sum_{l=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{2/(m-1)} \right)^{-1}$$

and the center update

$$c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} x_i}{\sum_{i=1}^{n} u_{ij}^{m}},$$

until the change in the centers (or in the membership matrix) falls below a tolerance.
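The alternating membership/center updates can likewise be sketched in pure Python. This is our own minimal illustration (not the authors' R code), using a simple deterministic initialization of our choosing rather than a random one:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fcm(points, c, m=2.0, iters=100):
    """Minimal fuzzy C-means; returns (centers, memberships u[i][j])."""
    # Simple deterministic initialization: evenly spaced data points.
    centers = [list(points[i * len(points) // c]) for i in range(c)]
    u = []
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1)).
        u = []
        for p in points:
            d = [max(dist(p, cj), 1e-12) for cj in centers]  # avoid division by zero
            u.append([1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1)) for l in range(c))
                      for j in range(c)])
        # Center update: weighted mean with weights u_ij^m.
        for j in range(c):
            w = [u[i][j] ** m for i in range(len(points))]
            s = sum(w)
            centers[j] = [sum(w[i] * points[i][k] for i in range(len(points))) / s
                          for k in range(len(points[0]))]
    return centers, u

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, u = fcm(pts, 2)
```

Each row of `u` sums to 1 by construction, and the dominant membership of each point identifies its (hard) cluster.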

A novel approach called the KCM method is proposed herein that combines the KM and CM methods. The combination is meant to overcome the limitations of each method while retaining their advantages: a long computational time is one of the disadvantages of the CM method, while fast execution is one of the advantages of the KM method. The goal of the hybrid method is to obtain a fuzzy method faster than CM whose accuracy is close to that of CM. In the proposed technique, KM is initially applied to the individual data objects to generate an intermediate set of cluster centers; CM is then applied to these centers rather than to the full dataset.

The hybrid method is especially suitable for large datasets, since the observations are first reduced to the cluster centers computed by KM. The performance of the proposed approach is evaluated by comparing it with the KM and CM algorithms in terms of both accuracy and processing time; it is shown that the proposed technique outperforms CM in processing time, yielding results in shorter times than the CM algorithm. Given a set of $n$ data objects, the KCM algorithm proceeds as follows:

(1) Choose the number of final clusters $k$ and an intermediate number of groups $k' > k$.

(2) Run KM on the $n$ data objects to partition them into $k'$ groups.

(3) Replace each group by its center, reducing the data to $k'$ representative points.

(4) Run CM on the $k'$ centers to obtain $k$ fuzzy clusters and the membership degrees of the centers.

(5) Assign to each original object the fuzzy memberships of the KM center that represents it.

(6) Return the final cluster centers and the membership matrix for the full dataset.
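The two-stage pipeline (KM to compress the data, CM on the resulting centers, memberships inherited by the original objects) can be sketched as follows. This is our own minimal pure-Python illustration under the assumptions just stated; `k_intermediate`, the deterministic initializations, and all other names are illustrative choices, not the authors' implementation:

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50):
    """Stage 1: hard KM compresses the data to k representative centers."""
    # Deterministic initialization: evenly spaced data points.
    centers = [tuple(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            g = [p for p, l in zip(points, labels) if l == j]
            if g:
                centers[j] = tuple(sum(col) / len(g) for col in zip(*g))
    return centers, labels

def fcm(points, c, m=2.0, iters=50):
    """Stage 2: fuzzy CM on the (much smaller) set of KM centers."""
    centers = [list(points[i * len(points) // c]) for i in range(c)]
    u = []
    for _ in range(iters):
        u = []
        for p in points:
            d = [max(dist2(p, cj) ** 0.5, 1e-12) for cj in centers]
            u.append([1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1)) for l in range(c))
                      for j in range(c)])
        for j in range(c):
            w = [u[i][j] ** m for i in range(len(points))]
            s = sum(w)
            centers[j] = [sum(w[i] * points[i][k] for i in range(len(points))) / s
                          for k in range(len(points[0]))]
    return centers, u

def kcm(points, k_intermediate, c, m=2.0):
    """KCM sketch: KM reduces the data to k_intermediate centers, CM clusters
    those centers, and each object inherits its representative's memberships."""
    reps, labels = kmeans(points, k_intermediate)
    centers, u_reps = fcm(reps, c, m)
    u = [u_reps[labels[i]] for i in range(len(points))]
    return centers, u

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
centers, u = kcm(pts, k_intermediate=4, c=2)
```

Because CM runs on only `k_intermediate` points instead of all `n`, the expensive membership updates scale with the number of KM centers, which is the source of the claimed speed-up over plain CM.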

Simulated datasets are used to evaluate the KM, CM, and KCM clustering methods, and an external clustering evaluation criterion is used for the comparisons. The Rand index compares an induced clustering structure with the true partition of the data by counting, over all pairs of objects, how often the two partitions agree. Let $a$ be the number of pairs placed in the same cluster in both partitions, $d$ the number of pairs placed in different clusters in both, and $b$ and $c$ the numbers of pairs grouped together in one partition but separated in the other. The index is

$$R = \frac{a + d}{a + b + c + d},$$

which ranges from 0 to 1, with 1 indicating perfect agreement.

The quantities $a$, $b$, $c$, and $d$ are computed over all $\binom{n}{2}$ pairs of the $n$ objects.
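A direct pair-counting implementation of the Rand index, as a small illustrative sketch (function name and example labelings are ours):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index: fraction of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]  # together in partition 1?
        same_pred = labels_pred[i] == labels_pred[j]  # together in partition 2?
        if same_true == same_pred:  # counts both a (together/together) and d (apart/apart)
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical partitions up to relabeling
```

Note that the index is invariant under relabeling of the clusters, which is why it is suitable for comparing an induced clustering with a reference partition.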

We used R 3.3.3 on a PC with a Core i5-3210 CPU and 4 GB of RAM to run all experiments in the following sections. For a fair comparison, the termination condition of each algorithm was left at the default of the standard R implementation.


In this simulation study, real data and simulated data generated from sub-Gaussian and multivariate normal distributions were used. For clustering the data with the proposed KCM method, the Euclidean, Manhattan, and Minkowski distances were used. In addition, the results obtained from the KM, CM, and KCM algorithms were compared in terms of processing time (in milliseconds) and accuracy. A dataset of 15,000 observations with 30 attributes and stability parameter in the range of 0.5 to 2 was generated.

Comparison of the KM, CM, and KCM algorithms in terms of accuracy based on the Euclidean, Manhattan, and Minkowski distances.

Comparison of the KM, CM, and KCM algorithms in terms of processing time based on the Euclidean, Manhattan, and Minkowski distances.

We ran each algorithm 100 times and computed the average accuracy and processing time. We classify the results as follows.

The accuracy of all three algorithms generally increases with the value of the stability parameter.

Generally, the processing time of the KCM algorithm is less than that of the CM algorithm. For example, when the number of clusters is 40, the processing times of CM and KCM are as shown in the following table.

Processing time of CM versus KCM with the Euclidean, Manhattan, and Minkowski distances (40 clusters); each entry gives the KCM time as a percentage of the CM time.

| Stability parameter | Euclidean | Manhattan | Minkowski |
|---|---|---|---|
| 0.5 | KCM = 6.25% CM | KCM = 1.82% CM | KCM = 0.83% CM |
| 1 | KCM = 9.09% CM | KCM = 1.15% CM | KCM = 0.79% CM |
| 1.5 | KCM = 11.11% CM | KCM = 1.06% CM | KCM = 2.56% CM |
| 2 | KCM = 11.76% CM | KCM = 0.59% CM | KCM = 1% CM |

In this section, the KM, CM, and KCM algorithms are tested on the Iris (150 × 4), Wine (178 × 13), and Lens (24 × 4) datasets. The Euclidean (Euc), Manhattan (Man), and Minkowski (Min) distance measures are used to examine how the distance influences the overall clustering performance. The performance of the three techniques is compared based on the following parameters:

Precision $= \dfrac{TP}{TP + FP}$

Accuracy $= \dfrac{TP + TN}{TP + TN + FP + FN}$

A true positive ($TP$) is an object correctly assigned to a given cluster, and a false positive ($FP$) is an object wrongly assigned to it; a true negative ($TN$) is an object correctly excluded from the cluster, and a false negative ($FN$) is an object wrongly excluded from it. Precision is reported per cluster, while accuracy summarizes the overall agreement.
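These two measures follow directly from the four counts; a trivial sketch with illustrative numbers of our choosing:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

print(precision(45, 5))         # 0.9
print(accuracy(45, 40, 5, 10))  # 0.85
```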

Performance analysis of the KM, CM, and KCM clustering techniques on the benchmark Iris, Lens, and Wine datasets. Note that each dataset has 3 clusters; the best precision and accuracy for each dataset are shown in bold font.

| Dataset | Performance parameter | Cluster | KM-Euc | KM-Man | KM-Min | CM-Euc | CM-Man | CM-Min | KCM-Euc | KCM-Man | KCM-Min |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iris | Precision | 1 | 0.898 | 0.898 | 0.898 | 0.858 | 0.905 | 0.861 | 0.627 | 0.627 | 1 |
| | | 2 | 1 | 1 | 0.645 | 0.655 | 0.691 | 1 | 1 | 1 | 0.895 |
| | | 3 | 0.644 | 0.645 | 1 | 1 | 1 | 0.670 | 1 | 1 | 0.631 |
| | | Average | 0.847 | 0.847 | 0.848 | 0.837 | 0.865 | 0.844 | | | 0.842 |
| | Accuracy | | 0.880 | 0.880 | 0.880 | 0.880 | | 0.796 | 0.796 | 0.880 | 0.874 |
| Lens | Precision | 1 | 0.167 | 0.167 | 0.25 | 0.286 | 0.267 | 0.278 | 0.200 | 0.238 | 0.286 |
| | | 2 | 0.273 | 0.200 | 0.238 | 0.200 | 0.250 | 0.25 | 0.167 | 0.286 | 0 |
| | | 3 | 0.250 | 0.286 | 0.278 | 0.167 | 0.200 | 0.238 | 0.295 | 0 | 0.238 |
| | | Average | 0.230 | 0.217 | | 0.217 | 0.239 | | 0.221 | 0.175 | 0.175 |
| | Accuracy | | 0.522 | 0.507 | | 0.547 | 0.533 | | 0.504 | 0.504 | 0.504 |
| Wine | Precision | 1 | 0.356 | 0.517 | 0.957 | 0.345 | 0.858 | 0.957 | 0.360 | 0.504 | 1 |
| | | 2 | 0.595 | 0.430 | 0.356 | 0.577 | 0.605 | 0.577 | 0.605 | 1 | 0.595 |
| | | 3 | 0.957 | 1 | 0.595 | 0.956 | 0.386 | 0.345 | 0.957 | 0.556 | 0.338 |
| | | Average | 0.636 | 0.649 | 0.636 | 0.626 | 0.616 | 0.626 | 0.641 | | 0.644 |
| | Accuracy | | 0.719 | 0.685 | 0.719 | 0.719 | | 0.711 | 0.720 | 0.686 | 0.695 |

On the Iris dataset, the average precision of the clusters formed by KCM-Euc and KCM-Man was greater than that of the clusters formed by KM and CM with any of the three distances. CM-Man recorded greater accuracy than any combination of KM or KCM; otherwise, distance and algorithm type had no significant effect on accuracy. On the Lens dataset, average precision was generally low for all the algorithms examined, while the accuracy values of KM, CM, and KCM were acceptable but remained near 0.50. On the Wine dataset, distance and algorithm type had a significant effect on both accuracy and average precision: average precision did not exceed 0.70, with the highest average recorded by KCM-Man, and CM-Man again recorded greater accuracy than any combination of KM or KCM.

In this paper, the two most famous clustering techniques, namely, the K-means (KM) and fuzzy C-means (CM) algorithms, were combined into a hybrid method called KCM.

It was found that the KM algorithm had shorter processing times than the CM and KCM algorithms for all values of the stability parameter.

The accuracy of the KM, CM, and KCM algorithms increased as the stability parameter increased.

Using the real datasets revealed that the Iris dataset yielded higher precision values for clusters with the three distances. The clusters formed by the combined KCM-Euc were observed to be more distinct. Using the Lens dataset led to poor precision levels but acceptable accuracy values for all the combinations. With the Wine dataset, medium precision levels were achieved with all the combinations. CM-Man and KCM-Euc yielded the most compact clusters, while KCM-Man yielded the most distinct ones. In general, the Iris dataset not only formed the most compact and distinct clusters, but also yielded higher precision and accuracy levels for KM, CM, and KCM clusters with the three distances than did the Lens or Wine datasets.

Finally, we recall that the computation time of a clustering method depends on the algorithm and its implementation, the programming language, and the hardware. Therefore, the best method can be chosen based on the complexity of the clustering problem at hand.

The codes and data are available upon request.

The authors declare that they have no conflicts of interest.