DBSCAN is a base algorithm for density-based clustering. It can find clusters of different shapes and sizes in large amounts of data containing noise and outliers. However, it fails to handle the local density variation that exists within a cluster. A good clustering method should therefore allow significant density variation within a cluster, because insisting on homogeneous clusters may generate a large number of small, unimportant clusters. In this paper, an enhancement of the DBSCAN algorithm is proposed that detects clusters of different shapes and sizes which differ in local density. Our proposed method, VMDBSCAN, first finds the “core” of each cluster generated by DBSCAN. Then, it “vibrates” points toward the cluster that has the maximum influence on them. In this way, our proposed method can find the correct number of clusters.

Unsupervised clustering is an important data analysis task that tries to organize a data set into separate groups with respect to a distance or, equivalently, a similarity measure [

Clustering methods can be categorized into two main types: fuzzy clustering and hard clustering. In fuzzy clustering, data points can belong to more than one cluster, with associated probabilities [

Partitioning algorithms minimize a given clustering criterion by iteratively relocating data points between clusters until a (locally) optimal partition is attained. The most popular partition-based clustering algorithms are the

Hierarchical algorithms provide a hierarchical grouping of the objects. They can be divided into two approaches: the bottom-up (agglomerative) approach and the top-down (divisive) approach. In the agglomerative approach, each object initially represents its own cluster, and at the end all objects belong to a single cluster. In the divisive approach, all objects initially belong to a single cluster, which is split until each object constitutes its own cluster. Hierarchical algorithms create nested relationships between clusters, which can be represented as a tree structure called a dendrogram [

Between the clusters, one can determine the distance as the distance of the two nearest objects in the two clusters (single linkage clustering) [

Density-based algorithms like DBSCAN [

Grid-based algorithms quantize the object space into a finite number of cells (hyper-rectangles) and then perform the required operations on the quantized space. The advantage of this approach is the fast processing time that is in general independent of the number of data objects. The popular grid-based algorithms are STING [

Model-based algorithms find good approximations of model parameters that best fit the data. They can be either partitional or hierarchical, depending on the structure or model they hypothesize about the data set and the way they refine this model to identify partitionings. They are closer to density-based algorithms in that they grow particular clusters so that the preconceived model is improved. However, they sometimes start with a fixed number of clusters and they do not use the same concept of density. Most popular model-based clustering methods are EM [

Fuzzy algorithms suppose that no hard clusters exist on the set of objects, but one object can be assigned to more than one cluster. The best known fuzzy clustering algorithm is FCM (Fuzzy

Categorical data algorithms are specifically developed for data where Euclidean, or other numerical-oriented, distance measures cannot be applied.

The rest of the paper is organized as follows. Section

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [

OPTICS [

In order to adapt DBSCAN to data consisting of multiple processes, an improvement is needed to account for differences in the mth nearest distances across processes. Roy and Bhattacharyya [

Pascual et al. [

Another enhancement of the DBSCAN algorithm is DENCLUE [

EDBSCAN (an Enhanced Density-Based Spatial Clustering of Application with Noise) [

DD_DBSCAN [

In VDBSCAN [

CHAMELEON [

Most clustering algorithms are not robust to noise and outliers, which makes density-based algorithms especially important in this case. However, most density-based clustering algorithms are not able to handle local density variation. DBSCAN [

The DBSCAN [

The neighborhood within a radius

If the

Given a set of data objects,

An object

An object

According to the above definitions, clustering the data objects in an attribute space only requires finding all maximal density-connected spaces. These density-connected spaces are the clusters; every object not contained in any cluster is considered noise and can be ignored.

DBSCAN [

If the number of neighbors is greater than or equal to MinPts, a cluster is formed. The starting point and its neighbors are added to this cluster, and the starting point is marked as visited. The algorithm then repeats this evaluation for all the neighbors recursively.

If the number of neighbors is less than MinPts, the point is marked as noise.

If a cluster is fully expanded (all points within reach are visited), then the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
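The expansion procedure just described can be sketched in a few lines of stdlib-only Python. This is an illustrative sketch, not the authors' implementation; the function names (`region_query`, `dbscan`) are our own, and a production alternative would be `sklearn.cluster.DBSCAN`.

```python
import math

def region_query(points, i, eps):
    # Indices of all points within distance eps of points[i] (including i).
    return [j for j, q in enumerate(points)
            if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    # Minimal DBSCAN sketch: returns one label per point; -1 marks noise.
    labels = [None] * len(points)        # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1               # noise (may later become a border point)
            continue
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                     # expand the cluster from the seed list
            j = seeds.pop()
            if labels[j] == -1:          # noise reachable from a core point
                labels[j] = cluster_id   # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is itself a core point
                seeds.extend(j_neighbors)
    return labels
```

On two well-separated groups of four nearby points plus one distant point, with `eps=0.5` and `min_pts=3`, this sketch assigns each group its own label and marks the distant point as noise.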

One of the problems with DBSCAN arises when there is wide density variation within a cluster.

To overcome this problem, a new algorithm, VMDBSCAN, based on DBSCAN is proposed in this section. It first clusters the data objects using DBSCAN. Then, it finds the density functions for all data objects within each cluster. The data object with the minimum density function value becomes the core of that cluster. After that, it computes the density variation of each data object with respect to the density of its own cluster's core against the densities of all other clusters' cores. According to this density variance, data objects are moved toward the new core, which is the core of another cluster that has the maximum influence on the tested data object.

We intuitively present some definitions.

Suppose that

The influence function we choose is one that measures the distance between two data objects, such as the Euclidean distance function.
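For illustration, taking the influence function to be the Euclidean distance, the density function of an object reduces to the sum of its distances to the other objects, and the core is the object minimising this sum. The following is a sketch under that assumption; the function names are ours, not the paper's.

```python
import math

def influence(p, q):
    # Influence function assumed here: Euclidean distance between
    # two data objects p and q (tuples of coordinates).
    return math.dist(p, q)

def density_function(i, points):
    # Density function of object i: sum of the influences of all other
    # objects on it. The cluster core is the object minimising this value.
    return sum(influence(points[i], points[j])
               for j in range(len(points)) if j != i)
```

For example, among the three collinear points (0, 0), (1, 0), and (2, 0), the middle point has density 2 while the endpoints have density 3, so the middle point would be the core.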

Given a

According to Definitions

Core, the core object for each cluster, is the object that has the minimum density function value according to Definition

Total Density Function

In addition, given the initial clusters produced by the density-based clustering method, we can adopt the influence function (Definition

Our main idea is the vibration of data objects according to the density of the data object with respect to core (Definition

We use

Formally, we can describe our proposed algorithm as follows.

Calculate the Density Function for all the data objects.

Do clustering for the data objects using traditional DBSCAN algorithm.

Calculate the Density Function for all the data objects again, and then find out the core of each generated cluster.

For each data object, if its Total Density Function with respect to some other core is greater than with respect to its own core, then vibrate the data object toward that other core.
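The core-finding and vibration steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the step size `eta` stands in for the paper's learning rate, and treating "maximum influence" as "nearest core" under the Euclidean influence function is our simplifying assumption.

```python
import math

def cluster_density(i, members, points):
    # Density function of object i within its cluster: sum of Euclidean
    # distances (the assumed influence function) to the other members.
    return sum(math.dist(points[i], points[j]) for j in members if j != i)

def find_cores(points, labels):
    # Core of each DBSCAN cluster = the member with the minimum
    # density function value (labels as returned by DBSCAN; -1 = noise).
    cores = {}
    for cid in set(labels) - {-1}:
        members = [i for i, l in enumerate(labels) if l == cid]
        cores[cid] = min(members,
                         key=lambda i: cluster_density(i, members, points))
    return cores

def vibrate(points, cores, eta=0.1):
    # Vibration sketch: move every point a step of size eta toward the
    # core with maximum influence on it (here: the nearest core).
    moved = []
    for p in points:
        c = min(cores.values(), key=lambda j: math.dist(p, points[j]))
        q = points[c]
        moved.append(tuple(x + eta * (y - x) for x, y in zip(p, q)))
    return moved
```

On two collinear three-point clusters, the middle point of each cluster is selected as its core, and vibration with `eta=0.5` moves each remaining point halfway toward its nearest core while the cores themselves stay fixed.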

The proposed method of the algorithm is described as pseudo code in Algorithm

VMDBSCAN() (pseudocode, steps (1)–(18))

The first step initializes the value of learning rate

We evaluated our proposed algorithm on several artificial and real data sets.

We use three artificial two-dimensional data sets, since the results are easily visualized. The first data set is shown in Figure

(a) 208 data points with one cluster. (b) DBSCAN applied

Figure

Figure

(a) 256 data points with two clusters. (b) DBSCAN applied

Figure

(a) 5743 data points with five clusters. (b) DBSCAN applied

We use the iris data set from the UCI (

We apply another data set, the Haberman data set from UCI (

Comparison of average error index between the results of DBSCAN and our proposed VMDBSCAN on real data set.

Dataset | True clusters | Clusters found (DBSCAN) | Clusters found (VMDBSCAN) | DBSCAN error % | VMDBSCAN error % |
---|---|---|---|---|---|
IRIS | 3 | 2 | 3 | 45.33 | 20.00 |
Haberman | 2 | 1 | 2 | 33.33 | 27.78 |
Glass | 6 | 3 | 4 | 68.22 | 62.61 |

We apply another data set, the Glass data set from UCI (

We have proposed an enhancement algorithm based on DBSCAN to cope with the problems of one of the most widely used clustering algorithms. Our proposed algorithm, VMDBSCAN, gives far more stable estimates of the number of clusters than the existing DBSCAN over many different types of data of different shapes and sizes. Future work will focus on determining the best value of the parameter