We revisit the classic DBSCAN algorithm and propose a series of strategies to improve both its efficiency and its robustness to varying densities. Unlike the original DBSCAN, we first use binary locality-sensitive hashing (LSH), which enables a faster region query for the
Clustering studies [
To address the issues related to DBSCAN, this paper proposes a fast clustering algorithm, BLSHDBSCAN, based on collision-ordered LSH and binary
(i) Using the binary LSH to query the
(ii) It constructs a binary-KNN representation that maps the data into Hamming space for the subsequent clustering operation, which greatly improves the clustering speed.
(iii) It introduces a core point distinguishing method based on the influence space and designs a way to compute the influence space on the binary dataset to boost the clustering speed. Because the influence space is sensitive to density, the improved method also achieves better clustering quality and efficiency than the original DBSCAN.
(iv) It introduces a seed point selection method, based on influence space and the
The rest of the paper is organized as follows.
In Section
DBSCAN is a typical density-based spatial clustering algorithm. It has two important parameters
Suppose that we are given a dataset
If
A point
A point
In the
If
If point
To find the cluster, DBSCAN starts with an arbitrary object
However, the neighborhood query must compute the distance between the query object and all other objects by linear search, which incurs a huge I/O cost. To solve the problem, we propose the following improvements: to accelerate the region query, using the binary LSH rather than linear search to query the
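For reference, the baseline that these improvements target can be sketched as follows. This is a minimal sketch of textbook DBSCAN with a linear-search region query, not the authors' MATLAB implementation; the names `region_query`, `dbscan`, `eps`, and `min_pts` are illustrative assumptions.

```python
import math

def region_query(points, i, eps):
    """Linear search: compare point i with every point (O(n) per query)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # every neighbor becomes a seed
    return labels
```

Every seed triggers another full scan of the dataset, which is the cost that both a faster region query and a smaller seed set aim to reduce.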
LSH is commonly used for fast neighbor queries. It involves two steps: index construction and object query. In index construction, a set of hash functions projects similar data points into the same hash bucket with high probability. In object query, it uses a filter-and-refine framework: the query object is hashed into buckets through the same hash functions, and all the data points in those buckets are adopted as candidates, whose similarity with the query object is then calculated to find the
If
If
where
LSH uses different hash function families for different distance functions. In this paper, binary hash function family based on
The index structure of LSH can be summarized as the following two steps [
(i) Giving a set of hash functions
(ii) Selecting an integer
When using the binary LSH in object query, for each query object
Basic LSH adopts all data objects which have the same conflicting bucket number with the query object as the candidates, and then it compares the similarity between the candidates and the query object to find the
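The filter-and-refine query described above can be illustrated with a small sketch. It assumes sign-of-random-projection hashing as the binary hash family, which is a stand-in and not necessarily the family used in this paper; such hyperplanes pass through the origin, so they group points by direction rather than by Euclidean distance. The helper names (`make_hash`, `build_index`, `query`) and parameters (`n_tables`, `n_bits`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def make_hash(dim, n_bits, rng):
    """One binary hash: n_bits random hyperplanes, each contributing one sign bit."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    def h(x):
        return tuple(int(sum(w * v for w, v in zip(p, x)) >= 0.0) for p in planes)
    return h

def build_index(points, n_tables, n_bits, seed=0):
    """Index construction: hash every point into one bucket per table."""
    rng = random.Random(seed)
    hashes = [make_hash(len(points[0]), n_bits, rng) for _ in range(n_tables)]
    tables = [defaultdict(list) for _ in range(n_tables)]
    for i, x in enumerate(points):
        for h, table in zip(hashes, tables):
            table[h(x)].append(i)
    return hashes, tables

def query(points, hashes, tables, q, k):
    """Object query: filter by bucket collisions, then refine by exact distance."""
    collisions = Counter()
    for h, table in zip(hashes, tables):
        collisions.update(table.get(h(q), []))   # candidates sharing q's bucket
    candidates = sorted(collisions, key=lambda i: math.dist(points[i], q))
    return candidates[:k]
```

The `collisions` counter also records in how many tables each candidate shares a bucket with the query, which is the quantity a collision-ordered variant could rank by before the refinement step.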
However, in the
As we all know, the neighbor structure contains strong data class information. It is possible to effectively judge the similarity between the data objects through it [
The details can be described as follows. For any data object
Different from the original DBSCAN, this paper uses binary LSH rather than linear search to query the
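One way to picture a binary-KNN representation is sketched below. It assumes that each object is encoded as an n-bit indicator vector of its k nearest neighbors, so that objects with similar neighbor structure lie close in Hamming space; this encoding is an assumption made for illustration, and the function names are hypothetical.

```python
def binary_knn_codes(knn):
    """knn[i] lists the neighbor indices of object i.

    Returns one integer bitmask per object; bit j is set when object j
    is among the neighbors of object i.
    """
    codes = []
    for neighbors in knn:
        code = 0
        for j in neighbors:
            code |= 1 << j
        codes.append(code)
    return codes

def hamming(a, b):
    """Hamming distance between two bitmask codes (XOR, then count set bits)."""
    return bin(a ^ b).count("1")
```

With bitmask codes, comparing two objects reduces to one XOR and a bit count, independent of the dimensionality of the original data.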
Density-based clustering aims to find the areas where the density exceeds a threshold. In DBSCAN, it uses global parameters
For further explanation, we give the following definitions.
For
For
For
For
The influence space
Also, to calculate the
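As an illustration, the sketch below computes the influence space in its commonly used form, the intersection of an object's k nearest neighbors with its reverse k nearest neighbors. It assumes precomputed neighbor lists; the core-point threshold `min_size` is a hypothetical parameter, since the exact criterion is not reproduced here.

```python
def influence_space(knn):
    """knn[i] lists the k nearest neighbor indices of object i.

    Returns, for each object i, the set of objects j such that
    j is a neighbor of i and i is a neighbor of j.
    """
    knn_sets = [set(n) for n in knn]
    return [
        {j for j in knn_sets[i] if i in knn_sets[j]}
        for i in range(len(knn_sets))
    ]

def is_core(space, min_size):
    """Hypothetical core test: the influence space must be large enough."""
    return len(space) >= min_size
```

Because the test depends on mutual neighborhood rather than on a global radius, it adapts to the local density around each object.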
To improve the efficiency of the algorithm, on the one hand, we need to improve the efficiency of the neighbor query, which has been addressed by LSH and the binary-KNN representation in Section
In the cluster expansion of DBSCAN, all points in the neighborhood are selected as the seeds for the next region query. However, our core point distinguishing method proposed in Section
Seed points and
In this section, we introduce a seed point selection method based on influence space and the similarity between each neighborhood. In a neighborhood of a core point, it ascends the points which are in the influence space by
For
The detailed explanation for how to select the seed point in its
Here we explain why we choose objects with the lowest
The steps of the improved algorithm are as follows:
The flowchart is shown in Figure
Flowchart of the improved algorithm.
To evaluate our approach, we demonstrate the advantages of BLSHDBSCAN in three aspects: neighborhood query time, clustering speed, and clustering quality, on both synthetic and real-world datasets. All the experiments are implemented in MATLAB under the Windows operating system. In the following experiments, we compare BLSHDBSCAN with both DBSCAN and ISDBSCAN. We choose DBSCAN because it is the original algorithm, so comparing clustering quality and speed against it shows the effect of the improvements directly. We choose ISDBSCAN because both algorithms use the influence space to enhance robustness on datasets with various densities. Unlike ISDBSCAN, BLSHDBSCAN performs the clustering in Hamming space and adopts a seed point selection strategy to reduce the frequency of neighborhood queries. Comparing the two algorithms therefore further illustrates the effectiveness of our improved strategy.
Unlike DBSCAN, which queries the neighborhood by linear search, BLSHDBSCAN uses the binary LSH to query the
Test datasets summary.
Dataset  Objects  Dimensions 

Synthetic dataset  5000  5 
Synthetic dataset  10000  5 


Synthetic dataset  60000  5 
Figure
Comparison of query time.
BLSHDBSCAN introduces the binary influence space to improve the clustering quality on datasets with various densities. To illustrate the positive influence of our improved strategy on the clustering quality, we carry out experiments on several synthetic datasets, which are introduced in Table
Overview of the synthetic datasets used in experiments.
Dataset  Objects  Dimensions  Labels  Characteristics of datasets 

Dataset 1  1502  2  2  Different shape 
Dataset 2  1419  2  7  Different shape, size, and density 
Since DBSCAN uses the global parameters
Best cluster results in different datasets.
From the first row of Figure
In this section, we compare the clustering efficiency and clustering accuracy on real-world datasets. We use run time to represent the clustering efficiency and the correct rate to represent the clustering accuracy. Because all points in the experimental datasets are labeled, the correct rate in this section is obtained by comparing the clustering results of each algorithm with the original labels of the data points.
BLSHDBSCAN adopts several strategies to enhance the clustering speed. It uses binary LSH, a fast neighborhood query algorithm, to speed up the region query. It adopts the binary-KNN representation method to carry out the clustering operation in Hamming space. It also selects a few seed points instead of all neighbors for cluster expansion to decrease the frequency of region queries. Each of these methods improves the clustering efficiency to some extent. To illustrate their effect, we select several datasets from the UCI repository and compare the run time and clustering accuracy of BLSHDBSCAN, DBSCAN, and ISDBSCAN. Table
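A correct rate of this kind can be computed in several ways. The sketch below assumes a majority-vote mapping, in which each cluster is assigned the most frequent true label among its members and noise points count as errors; the mapping actually used in the experiments is not stated, so this is only one plausible reading, and `correct_rate` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def correct_rate(cluster_ids, true_labels, noise=-1):
    """Fraction of points whose true label matches the majority label of their cluster.

    Points marked as noise are never counted as correct (an assumption).
    """
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, true_labels):
        if c != noise:
            by_cluster[c].append(y)
    # For each cluster, the size of its largest single-label group is "correct".
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return correct / len(true_labels)
```

A majority-vote score can be inflated by producing many small clusters, so it is best read together with the number of clusters found.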
Overview of the UCI datasets used in experiments.
Dataset  Objects  Dimensions  Labels 

Iris  150  4  3 
Contraceptive Method Choice  1437  9  3 
Letter Recognition  20000  16  26 
Table
Correct rate comparison in different UCI datasets.
Dataset  DBSCAN  ISBDBSCAN  BLSHDBSCAN 

Iris  69.33%  88%  89.33% 
Contraceptive Method Choice  42.70%  44.89%  42.70% 
Letter Recognition  50.2%  64.45%  64.425% 
Comparison of run time of the DBSCAN, ISDBSCAN, and BLSHDBSCAN.
Correct rates shown in Table
From Table
In conclusion, on the small-scale datasets, BLSHDBSCAN can greatly improve the clustering accuracy, just as ISDBSCAN does. On the large-scale dataset, it achieves higher accuracy and efficiency than DBSCAN; compared with ISDBSCAN, it sharply decreases the run time while maintaining the same level of accuracy.
In this paper, an improved DBSCAN algorithm is proposed to improve the robustness to various densities and the clustering efficiency of the algorithm. The improved strategy includes using binary LSH instead of linear search for region query; designing a binary representation method based on
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported in part by the National Key Basic Research Program of China (973 Program) (Project 2015CB856001) and by the Guizhou Provincial Key Laboratory of Public Big Data (2017BDKFJJ002 and 2017BDKFJJ004); this research is also funded by a project of the Guizhou Provincial Education Department (KY[2016]124) and a project of the Department of Science and Technology of Guizhou Province (LH[2014]7628).