A Fast Multiscale Clustering Approach Based on DBSCAN

Multiscale analysis brings great benefits, allowing people to observe objects or problems from different perspectives, and clustering on multiscale data has practical significance. At present, there is a lack of research on clustering large-scale data under the premise that the clustering results of small-scale datasets have already been obtained. Clustering large-scale datasets with traditional methods has two disadvantages: (1) the clustering results of small-scale datasets are not utilized, and (2) traditional methods incur more runtime overhead. Aiming at these shortcomings, this paper proposes a multiscale clustering framework based on DBSCAN. The framework uses DBSCAN to cluster small-scale datasets, then introduces the algorithm Scaling-Up Cluster Centers (SUCC), which generates the cluster centers of large-scale datasets by merging the clustering results of small-scale datasets instead of mining the raw large-scale data. We show experimentally that, compared to the classical algorithm DBSCAN and the leading algorithms DBSCAN++ and HDBSCAN, SUCC provides not only competitive performance but also reduced computational cost. In addition, under the guidance of experts, the accuracy of SUCC is even more competitive.


Introduction
Clustering is one of the vital data mining and machine learning techniques; it aims to group similar objects into the same cluster and separate dissimilar objects into different clusters [1]. It is prominent and has recently attracted significant attention from researchers and practitioners across different domains of science and engineering [2]. Thousands of papers have been published [3][4][5][6]. However, these investigations concentrate on clustering from a single perspective.
A scale can be equated with a generic concept, a level of abstraction, or a perspective of observation: the same problem or system can be perceived at different scales based on particular needs [7]. This is called the multiscale phenomenon, and it has been widely applied in academic fields such as geoscience [8,9] and mathematics [10]. Multiscale clustering has gained popularity in recent years. However, existing studies either focus on specific application scenarios [11][12][13] or rely on specific clustering methods [14][15][16], and they perform clustering on raw data, which is highly time consuming. Figure 1 shows two implementations of multiscale clustering, which can help users conduct clustering analysis on multiscale datasets. The straightforward way of obtaining the knowledge at every scale is to apply a traditional mining algorithm to each scale's dataset, which is very inefficient. In this study, we concentrate on Scaling-Up Cluster Centers (SUCC) from small scale to large scale, avoiding repetitive clustering on raw datasets while maintaining competitive efficiency.
1.1. Motivations. The multiscale phenomenon is a widely accepted concept that helps people understand problems at different levels of abstraction. Clustering analysis on multiscale datasets considerably influences knowledge discovery when capturing the essence of a problem. For example, assume that a telecommunication supplier has many filiales, and each filiale performs clustering analysis for its own customers. After that, the controlling corporation also needs to conduct clustering analysis for all customers. Clustering the data at each scale with a traditional method, i.e., reclustering, is inefficient, while SUCC improves efficiency by merging the cluster centers of the small-scale data to obtain the clusters of the large scale. In addition, the concept of "fine-grained and more fine-grained" in the RFID domain [17][18][19] and the divide-and-conquer strategy, i.e., separating tags into many small subsets and then handling them in sequence [20][21][22][23], motivate our work.

1.2. Contributions.
We provide a mathematical model (the multiscale clustering framework) and design a novel algorithm named Scaling-Up Cluster Centers (SUCC) that transfers cluster centers from small scale to large scale, which avoids repetitive clustering on raw datasets.
We conducted extensive experiments to evaluate our work. The experimental results show that SUCC is efficient and reduces runtime consumption with competitive accuracy compared to traditional methods and the leading algorithms, which must process the raw data, which in most instances is much larger than the set of cluster centers obtained from the small-scale data.
1.3. Structure of the Paper. The rest of this paper is organized as follows. Section 2 discusses the related work. The problem is described in detail in Section 3. The multiscale clustering framework and the algorithm SUCC are designed in Section 4. Section 5 details the comparison experiments, and Section 6 concludes our work and provides some future research directions.

Related Work
Clustering has no single precise definition, and many different clustering approaches based on various techniques exist.
In this section, we review clustering approaches from two aspects.

2.1. Conventional Clustering Techniques.
Han et al. suggested the following four categories for dividing clustering techniques: partitioning methods, hierarchical methods, density-based methods, and grid-based methods [24].
Partitioning methods assign a dataset into k clusters such that each cluster contains at least one element. The k-means algorithm, proposed by MacQueen in 1967, is the most classical representative of partitioning methods [25] and is one of the best-known and simplest clustering algorithms [26]. Dunn developed the fuzzy c-means algorithm (FCM) for clustering data in 1973 [27]. FCM [28], later improved by Bezdek in 1981, allows one element to belong to two or more clusters, unlike k-means, where each element is assigned to exactly one cluster. However, the time complexity of k-means is much lower than that of FCM; thus, k-means works faster than FCM [29]. The advantages of partitioning algorithms are that they are (1) relatively scalable and simple and (2) suitable for finding spherical-shaped clusters. Their disadvantages include (1) the need for the user to specify the number of clusters in advance; (2) high sensitivity to initialization, noise, and outliers; and (3) failure to handle nonconvex clusters [30].
Hierarchical clustering can be divided into agglomerative methods and divisive methods [31]. The former follow the bottom-up strategy, starting with single-element clusters and merging these small clusters into larger and larger ones until all elements are in a single cluster or certain termination conditions are satisfied. The latter follow the top-down strategy, starting with one cluster containing all objects and breaking it into smaller and smaller clusters until each cluster contains only one element or certain termination conditions are satisfied. There are three forms of the agglomerative method, namely, single-linkage clustering [32], complete-linkage clustering [33], and average-linkage clustering [34]. Later, some enhanced hierarchical clustering methods were introduced, such as BIRCH [35,36], CURE [37], ROCK [38], and CHAMELEON [39]. Ester et al. demonstrated an algorithm called DBSCAN (density-based spatial clustering of applications with noise) [40], which discovers clusters of arbitrary shapes and is efficient for large spatial databases. Other well-known density-based techniques [41] are SNOB, proposed by Wallace and Dowe in 1994, and MCLUST [42], introduced by Fraley and Raftery in 1998. Among these methods, the expectation-maximization (EM) algorithm is the most popular [41,43].
Grid-based methods partition the space into a finite number of cells that form a grid structure on which all clustering operations are performed. The main advantages of this approach are its fast processing time [24], the absence of distance computations, and the ease of determining which clusters are neighbouring. Interesting grid-based techniques include STING (statistical information grid approach) [44] by Wang et al. in 1997, a highly scalable algorithm that can decompose the dataset into various levels of detail. CLIQUE [45], developed by Agrawal et al. in 1998, can be considered both a density-based and a grid-based clustering method. It automatically finds subspaces of high-dimensional data space that allow better clustering than the original space. In CLIQUE, the accuracy of the clustering result may be degraded in exchange for the simplicity of the method.

2.2. Multilevel Clustering. In recent years, researchers have introduced the idea of stratification into clustering. Psomopoulos et al. presented a multiple-level clustering algorithm for detecting inter- and intragenome gene clusters by introducing the notion of a hierarchy; this method is experimentally proven to be consistent with the phylogenetic relation and position of the genes involved [46]. Vu et al. developed a new multithreaded tool, fMLC, which addresses the problem of clustering large-scale DNA sequences [12]. A multilevel clustering for star/galaxy separation was designed in 2016, consisting of three phases: coarsening clustering, representative data clustering, and merging [47]. In 2019, Zunic et al. proposed a multilevel clustering algorithm used in the internal banking payment system of a bank in Bosnia and Herzegovina and explained how the parameters affect the results and execution time of the algorithm [13]. [11] presented a multilevel clustering approach that addresses the issue that conventional algorithms spend too much execution time clustering physical activity data. [48] recently proposed MCT to identify local communities termed microcosms. These algorithms all aim at a specific application and solve the corresponding problem. To settle the problem of clustering partially labeled data, Liu et al. proposed a feasible solution [49]. [16] applied the multilevel idea to spectral clustering and proposed a multilevel approximate spectral clustering method, while Nouretdinov et al. applied it to conformal clustering and proposed MLCC [15]. [14] improved synchronization clustering with an idea of "divide and collect" and presented MLSynC. Each of these algorithms improves a specific algorithm by introducing the idea of multilevel. This paper instead proposes a general framework that transfers cluster centers from small scale to large scale and is independent of the underlying conventional algorithm.

Problem Description
To facilitate the discussion in the remainder of the paper, we outline some basic concepts in this section. Formally, let X = {x_1, x_2, ⋯, x_n} be a dataset of n objects, where each object x_i = {I_i1, I_i2, ⋯, I_id} has d attributes denoted by A. Clustering is the process of distributing the objects in X into k clusters c_1, c_2, ⋯, c_k.

Definition 1. Scale partition: a set can be divided into one or more groups by some scale, and each object in a group has the same scale value.
Definition 2. Multiscale datasets: if there are multiple scales on a dataset, these scales have a partial order relation, denoted by "≺". A multiscale dataset consists of the multiple partition sets induced by these scales.
Example 2. Let π_1 be mod 4, π_2 be mod 2, and π_3 be mod 1; then π_1 ≺ π_2 ≺ π_3, where π_1 is the small scale, π_3 is the large scale, and π_2 is small relative to π_3 and large relative to π_1. The multiscale dataset on X is {X/π_1, X/π_2, X/π_3}.

Definition 3. Similarity: let x_i and x_j be two elements of X or two cluster centers; then the similarity S(x_i, x_j) between x_i and x_j is defined by formula (1), where |A| is the number of attributes (equal to d) and w_t is the weight of the t-th attribute.

From Figure 1, k_2 can be obtained by two mathematical models: (1) k_2 = M(D_2), D_2 = T(D_1), and (2) k_2 = T(k_1), k_1 = M(D_1), where T(D_1) is the scale partition, T(k_1) transfers knowledge from the small scale, and M mines knowledge from a dataset. In multiscale data mining, M(D_1) is assumed to be already accomplished, so we apply the second model to mine the knowledge on the large-scale dataset.
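Since formula (1) is not reproduced above, the following is only a minimal sketch of one plausible attribute-weighted similarity; the helper name `similarity`, the per-attribute term, and the normalization are our assumptions, and the paper's exact form may differ.

```python
import numpy as np

def similarity(x_i, x_j, weights):
    """Hypothetical stand-in for formula (1): an attribute-weighted
    similarity in [0, 1], where weights[t] plays the role of w_t."""
    x_i, x_j, weights = map(np.asarray, (x_i, x_j, weights))
    per_attr = 1.0 / (1.0 + np.abs(x_i - x_j))  # 1 when the t-th attributes agree
    return float(np.dot(weights, per_attr) / weights.sum())
```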

Proposed Framework
In this section, we first propose the framework of multiscale clustering; then, the scaling-up algorithm is designed to implement the framework; finally, an example is given to explain our work.
4.1. Framework. In this paper, cluster centers are used as the medium of knowledge transfer from small scales to large scales. The framework of multiscale clustering is shown in Figure 2. It calculates the cluster centers of the large scale from the cluster centers of the small scales rather than repeatedly clustering the large-scale datasets. The process is as follows: (1) construct the multiscale datasets, (2) cluster the small-scale datasets using a conventional method, (3) compute the similarities among the small-scale cluster centers and construct the relation matrix of cluster centers, (4) construct the relation graph of cluster centers, and (5) produce the connected components of the graph. In particular, formula (2) is applied to construct the graph of cluster centers, where S(x_i, x_j) is computed according to formula (1).
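Formula (2) is likewise not reproduced above; a plausible reading, assumed here, is that two centers are linked whenever their similarity reaches a threshold δ (the worked example in the Analysis subsection uses δ = 0.6). A minimal sketch of steps (3) and (4), reusing the `similarity` sketch from Section 3:

```python
import numpy as np

def relation_matrix(centers, weights, delta=0.6):
    """Steps (3)-(4): threshold pairwise similarities among small-scale
    cluster centers to obtain the relation (adjacency) matrix M."""
    m = len(centers)
    M = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            if similarity(centers[i], centers[j], weights) >= delta:
                M[i][j] = M[j][i] = 1  # link centers that are close enough
    return M
```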

4.2. Design.
The workflow of multiscale clustering is implemented by three algorithms. Algorithm 1 and Algorithm 2 are invoked sequentially, and Algorithm 2 uses Algorithm 3 to obtain the connected components. In Algorithm 1, the variable lengthD and the label list s represent the length of the original dataset D and the predicted labels of the small scale, whose length is lengthD (line 1). The dataset is divided into multiple scales, and the smallest-scale datasets will be clustered (line 2). Then, every partition of the scale is clustered with a conventional method; the result is a list of cluster centers and their corresponding cluster labels (lines 3-6); in particular, the label values of list i+1 start from the label values of list i plus the number of clusters of D_ssi. Algorithm 2 produces the clusters of the large scale. p_centers is the combination of all cluster centers of the small scale, and center_dic is a dictionary used to record each cluster center and its corresponding cluster label list (line 1). The mapping between each cluster center and the index list of its cluster's objects is established (lines 2-9). Then, the relation matrix is computed (lines 10-12). Finally, it returns the clustering result (cluster labels) of the large scale, where getConnectCompont is invoked to produce new clusters by merging the close clusters of the small scale (lines 13-22).
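The paper's pseudocode is not reproduced here; the following is only a minimal sketch of Algorithm 1's label-offset idea, assuming each partition is a NumPy array and taking a cluster's center to be the mean of its members (DBSCAN itself does not output centers).

```python
from sklearn.cluster import DBSCAN
import numpy as np

def cluster_small_scales(partitions, eps, min_samples=10):
    """Sketch of Algorithm 1: cluster each small-scale partition with
    DBSCAN; labels of partition i+1 start where partition i's end."""
    centers, labels, offset = [], [], 0
    for part in partitions:
        pred = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(part)
        found = sorted(set(pred) - {-1})          # -1 marks noise points
        for k in found:
            centers.append(part[pred == k].mean(axis=0))
            labels.append(offset + k)
        offset += len(found)                      # shift labels for the next partition
    return np.asarray(centers), labels
```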
Algorithm 3 applies depth-first search to the relation matrix M and generates lists of cluster-center indexes that are closely related; the indexes in each list form a new cluster center of the large scale.
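A minimal recursive sketch of such a traversal (the function name is ours; Python's default recursion limit may need raising for very large m):

```python
def get_connected_components(M):
    """Sketch of Algorithm 3: DFS over relation matrix M; each returned
    index list is merged into one large-scale cluster center."""
    m, seen, components = len(M), set(), []

    def dfs(i, comp):
        seen.add(i)
        comp.append(i)
        for j in range(m):
            if M[i][j] and j not in seen:  # follow edges of the relation graph
                dfs(j, comp)

    for i in range(m):
        if i not in seen:
            comp = []
            dfs(i, comp)
            components.append(comp)
    return components
```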

4.3. Analysis.
In this section, we analyze execution time and memory consumption in detail. Notice that the characteristic of our framework is that clustering at the large scale is based on the cluster centers of the small scale, not on the primitive large-scale dataset. So, we need to consider Algorithm 2, which invokes Algorithm 3. Algorithm 3 computes one point's connected component using recursion; let m be the number of small-scale cluster centers; then both its time and memory costs are O(m). There are three main tasks in Algorithm 2: (1) construct the dictionary relating each cluster center to the index list of the data elements belonging to that cluster, (2) compute the relation matrix among the cluster centers, and (3) generate the large-scale clusters by invoking Algorithm 3. As an example, after setting δ = 0.6, the cluster centers' relation graph is constructed; then, we merge the second and third cluster centers.
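As a rough back-of-the-envelope check of task (2) (our own arithmetic, not taken from the paper): formula (1) is evaluated once per pair of the m small-scale cluster centers, and each evaluation touches d attributes, so

```latex
\underbrace{\binom{m}{2}}_{\text{center pairs}} \cdot \underbrace{O(d)}_{\text{one evaluation of formula (1)}} = O(m^2 d).
```

Since m is typically far smaller than the number n of raw objects, this is the sense in which SUCC avoids the cost of reclustering the raw large-scale data.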

Performance Evaluations
In this section, we first run the conventional algorithm DBSCAN on the small-scale datasets to get their clustering results, then run SUCC to obtain the clustering result of the large-scale dataset. Second, we evaluate the accuracy and efficiency of SUCC by comparing it with DBSCAN [40], a classical clustering algorithm, and with HDBSCAN [50] and DBSCAN++ [51], which are leading algorithms in the field of clustering and are run on the large-scale dataset. HDBSCAN and SUCC are implemented in Python; DBSCAN comes from the sklearn library. We directly use the published results of DBSCAN++ but proportionally estimate its runtime from that of DBSCAN. All programs were run on a computer with an Intel(R) Core(TM) i7 dual-core 2.8 GHz CPU, 16 GB RAM, and the Windows 10 x64 Home operating system.

5.1. Data Preprocessing.
There are 10 real datasets in this article, as shown in Table 1, where C is the number of clusters, A is the number of features, n is the number of transactions in D, and size represents the dataset size. We first divide each dataset into two parts and then into three parts. Each of these partitions uses two methods: the first is by sequence and the second is by interval. That is, when dividing the data into two equal parts, one method (2A) puts the first half in the first partition and the second half in the second partition, while the other method (2B) puts the odd rows in the first partition and the even rows in the second partition. The three-part division has two similar methods (3A, 3B), as sketched below. Naturally, the original data is the large-scale data, and its partitions are the small-scale data.
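A minimal sketch of the two splitting schemes (the helper name `partition` is ours; `np.array_split` tolerates unequal part sizes):

```python
import numpy as np

def partition(data, parts=2, method="sequence"):
    """Sequence (2A/3A): contiguous blocks. Interval (2B/3B): row i goes
    to partition i mod parts, i.e., odd/even rows when parts == 2."""
    if method == "sequence":
        return np.array_split(data, parts)
    return [data[i::parts] for i in range(parts)]
```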

5.2. Performance Comparison.
We evaluate the performance via two widely used clustering scores: the adjusted Rand index (ARI) [52] and the adjusted mutual information score (AMI) [53], both computed against the ground truth. We set min_samples to 10 for all procedures throughout the experiments.
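Both scores are available in scikit-learn; a toy check with made-up labels shows they are invariant to permuting cluster label names:

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical ground truth
y_pred = [1, 1, 0, 0, 2, 2]   # same grouping, permuted label names

print(adjusted_rand_score(y_true, y_pred))         # ARI: 1.0
print(adjusted_mutual_info_score(y_true, y_pred))  # AMI: 1.0
```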
The comparison of ARI values is shown in Table 2. SUCC (3A or 3B) surpasses the compared algorithms on 6 datasets, and the average ARI value of SUCC (3A or 3B) over the 10 datasets is also the highest. Our analysis is that the distribution of those 6 datasets (such as Iris and Wine) is relatively uniform, and the cluster densities among the different divisions are similar, which benefits our algorithm. The results in Table 2 also show that DBSCAN++ performs better than HDBSCAN and DBSCAN with respect to the optimal parameter ε. In contrast, SUCC (2B or 3B) suffers more on PCA_MNIST, Mobile, Zoo, and Letters. The reason is that the distribution of the data divided by interval partition differs considerably from that of the raw data, which causes SUCC to underperform. Therefore, our algorithm works better under the guidance of professional background knowledge.
The AMI values of the four algorithms are shown in Table 3; the highest score in each row is in italics. We see that SUCC performs better than DBSCAN++ on 7 out of 10 datasets, better than HDBSCAN on 9 out of 10 datasets, and better than DBSCAN on all datasets. The AMI results are basically consistent with the ARI results, except on the dataset Letters, where HDBSCAN performs best. An explanation for HDBSCAN's highest score on Letters is that the parameter min_samples = 10 is not optimal for SUCC, DBSCAN, and DBSCAN++ on Letters, whereas HDBSCAN does not need this parameter. Table 4 shows the runtime (milliseconds) and standard errors for each algorithm and dataset. The runtime of SUCC (2A) is similar to that of SUCC (2B), and the runtime of SUCC (3A) is similar to that of SUCC (3B), so we omit the runtimes of SUCC (2B) and SUCC (3B). The following conclusions can be drawn from Table 4. First, the runtime of SUCC on the three-part partitions is higher than on the two-part partitions for every dataset. This is intuitive, because SUCC needs to merge more cluster centers from the partitions with 3 parts than with 2. Second, SUCC spends less runtime clustering than the compared algorithms because it deals only with the cluster cores of each division, while the compared algorithms run on the raw dataset. In addition, DBSCAN++ unsurprisingly runs faster than DBSCAN and HDBSCAN on every dataset, because DBSCAN++ only computes density estimates for a subset of the original dataset. Finally, SUCC performs best when the distribution of the dataset is particularly uniform or the data distribution is consistent with the data partition method. The clustering result is also greatly affected by the value of p on some datasets, such as Libras. Therefore, we declare that the result of SUCC is more reliable under the guidance of experts.

Conclusions
In this study, we focus on multiscale clustering and propose an approach, SUCC, that generates the clusters of the large-scale data by merging only the cluster cores of the small-scale data, not by mining the raw dataset. In general, the set of cluster cores of the small-scale data is much smaller than the original data, so SUCC has a smaller time overhead. The experimental results show that SUCC is efficient and effective on datasets whose distribution is uniform, and SUCC can perform even better under expert guidance in practice.

Data Availability
The data underlying the results presented in the study are available within the manuscript.

Conflicts of Interest
The authors declare that they have no conflicts of interest.