Improved Density Based Spatial Clustering of Applications of Noise Clustering Algorithm for Knowledge Discovery in Spatial Data

Data mining and its subfield, spatial data mining, offer many techniques for understanding relationships between data objects. Collections of data objects with spatial features are called spatial databases. These relationships can be used for prediction and trend detection between spatial and nonspatial objects for social and scientific purposes. Huge data sets may be collected from different sources such as satellite images, X-rays, medical images, traffic cameras, and GIS systems. The primary purpose of this paper is to handle this large amount of data and to establish relationships within it in a well-defined manner with reliable results. The paper describes a complete process for understanding how spatial data differs from other kinds of data and how it is refined to obtain useful results and trends for geographic information systems and the spatial data mining process. A new, improved clustering algorithm is designed, because clustering plays an indispensable role in spatial data mining. Clustering methods are useful in various fields such as GIS (Geographic Information System), GPS (Global Positioning System), weather forecasting, air traffic control, water treatment, area selection, cost estimation, planning of rural and urban areas, remote sensing, and VLSI design. This paper presents a study of various clustering methods and algorithms and an improved version of DBSCAN called IDBSCAN (Improved Density Based Spatial Clustering of Applications of Noise). The algorithm adds several important attributes that are responsible for generating better clusters from existing data sets in comparison with other methods.


Introduction
There are many techniques and algorithms for discovering meaningful and useful results in large spatial databases [1]. Clustering is one of the major data mining methods for spatial databases, and it has recently become the most popular research method in the spatial data mining (SPDM) area. A spatial database contains different objects with similar attributes and properties, and these properties make it possible to group similar objects together, which is the basis of clustering. Clustering is thus the process of dividing large data sets into groups according to their similar properties.
Every SPDM algorithm has its own advantages and disadvantages. Some algorithms require predefined attribute values, and some are applicable only to certain types of data sets, that is, not to data of arbitrary shape [2]. Other methods are static in nature and cannot handle changes in dynamic databases, and some are not applicable to dense regions or in the presence of physical obstacles. We have studied the available clustering methods together with their advantages and limitations in application. Our data sources may be tabular files, shape files, comma-separated values (csv) files, rgs files, dbf files, xls files, png files, image files, and so forth. Careful selection of attributes produces more refined clusters, and the resulting clusters give the desired results with better efficiency and improved space and time complexity. Figure 1 shows different databases and the clusters obtained by applying different methods. These are taken from the SEQUOIA 2000 benchmark databases. After detailed study we found that no algorithm other than DBSCAN yet gives better results in the field of clustering. Our area of interest is therefore to design and develop a new density based algorithm with better performance. Our improved algorithm is designed to work better under adverse conditions and with arbitrary selection of databases, so that the size, shape, and amount of data do not change the accuracy of cluster creation. Clustering is used in various scientific and social areas [2] such as association of business rules, health data, medical imaging, seismology, land treatment, water treatment, cost analysis, and many more.
Spatial data mining works on spatial data. It is the discovery of interesting similarities of characteristics and patterns that may exist in large spatial data sets. Spatial clustering is the key concept for obtaining trends and clusters according to the nature of the given data sets. As discussed above, our main objective is to design an improved DBSCAN algorithm [3] for spatial data sets. In DBSCAN, the density at a point is obtained by counting the number of points in a region of specified radius around that point. Points whose density exceeds a certain threshold form the clusters. The major issues in DBSCAN are the selection of clustering attributes, the detection of noise at different densities, and large differences in the values of border objects on opposite sides of the same cluster. Every point is visited at least once, and it may be visited multiple times if it is a candidate for different clusters. This paper is intended to give a complete working process of spatial data mining with new ideas for addressing the drawbacks of previous work, that is, DBSCAN. Data collection [4] is a very important and demanding process in spatial data mining and knowledge discovery, but with the efforts of government agencies, scientific institutions, and the private sector it is possible to collect huge data sets of spatial features. Multidimensional data [5], moving objects, and dynamic data selection require new and advanced methods of mining and knowledge discovery. To handle such challenges and research activities, spatial data mining has developed into a strong tool together with the concept of geovisualization.
Significance of Work. DBSCAN is improved as IDBSCAN with the capacity to recognize irregular shapes, including concave and nested shapes and satellite images. The IDBSCAN algorithm reduces computation time, as shown in Figure 9, and it is insensitive to large amounts of noise.
Mathematical Problems in Engineering
A significant point is that the user does not need domain knowledge of the input, that is, of the number of clusters to be generated.
The primary motive of this paper is to design a new and efficient system and algorithms for spatial data mining research, geocomputation, and map analysis. Spatial analysis and mining include various steps of knowledge discovery such as data selection, data collection, data cleaning, preprocessing, clustering, classification, and transformation, with the help of known and unknown results of knowledge discovery. Various software packages [5, 6] such as GRASS, ERDAS, WEKA, ArcGIS, DIVA-GIS, and MapCalc are available for experiments.
The rest of this paper is organized as follows. Section 2 summarizes the available clustering methods and algorithms with their strengths and limitations. Section 3 explains the process and meaning of the DBSCAN algorithm. Section 4 discusses the limitations of the existing DBSCAN algorithm. Section 5 presents our proposed improved algorithm (IDBSCAN), which is the novelty of this article. Section 6 analyzes the IDBSCAN algorithm. Section 7 presents the results. Finally, Section 8 discusses the conclusion and future scope.

Related Work and Its Overview
In this section we review current and previous research work in spatial data mining and knowledge discovery [4]. As discussed above, clustering plays a key role in understanding spatial data and applying it in real applications, so we focus on the meaning and methods of clustering in spatial data sets. Recent work in the database community includes density based methods, hierarchical methods, partition based methods, grid based methods, and constraint based methods [3,7]. A brief description of each method is given here with its strengths and limitations.

Density Based Methods.
Methods of this kind consider clusters as dense regions of objects that are separated by regions of lower density in the data space. Density based methods are well suited to arbitrarily shaped clusters, but the selection of attributes and the selection of clusters by the algorithms are more complex. They are able to merge two clusters that are sufficiently close to each other.
Density biased sampling, DBSCAN (Density Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points to Identify Clustering Structure), DENCLUE (DENsity CLUstEring), and so forth are examples of this kind of method.
Density based clustering is the main subject of this paper, so it is discussed in detail in the following sections.

Hierarchical Based Methods.
Hierarchical based methods arrange the data in a tree-like structure. They are classified into agglomerative and divisive hierarchical clustering, depending on whether the decomposition is formed in a bottom-up or top-down manner. Hierarchical methods can also recognize arbitrarily shaped clusters and handle outliers or noise except under some special conditions, but they do not work well when individual clusters have special characteristics, and they are time consuming for high dimensional data.

Partitioning Methods.
These methods divide the objects to be clustered into $k$ partitions, where each partition represents a cluster and $k$ is a given parameter. Such algorithms form the clusters so as to optimize an objective criterion, typically a similarity function with distance as the major parameter [1].
Partitioning methods include the following common algorithms: $k$-means, $k$-medoids, and CLARANS (Clustering Large Applications based upon RANdomized Search).
Partitioning methods such as $k$-means and $k$-medoids generate good clustering results and are easy to implement, but the selection of $k$ is arbitrary, so there is no guarantee of clustering quality, and the desired number of clusters is required in advance, which is not realistic. Handling outliers is also a big problem for such methods. A major drawback is that they are not applicable to large databases.
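The dependence of partitioning methods on the parameter $k$ and on the initial centers can be seen in a minimal sketch of the $k$-means iteration. This is an illustrative sketch only, not code from the paper; the function name `kmeans`, the sample points, and the choice of initial centroids are assumptions made for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's iteration: assign points to the nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well separated groups, k = 2 fixed in advance.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Every point is forced into one of the $k$ clusters, which is why outliers distort the centroids and why a poor choice of $k$ or of the starting centroids can give poor partitions.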

Grid Based Methods.
Grid based methods quantize the data space into a finite number of cells that form a grid structure on which all clustering operations are performed. These methods are fast and independent of the number of data objects, but they depend on the number of cells in each dimension of the quantized space. This category contains the following well-known algorithms: STING (STatistical INformation Grid), WaveCluster, CLIQUE (CLustering In QUEst), and STING+. An experimental result is shown in Figure 2.
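The idea of working on cells rather than on individual objects can be sketched as follows. This is a simplified illustration and not an implementation of STING, WaveCluster, or CLIQUE; the function `dense_cells`, the cell size, and the density threshold are assumptions chosen for the example.

```python
from collections import Counter

def dense_cells(points, cell_size, min_count):
    """Map each 2D point to a grid cell and return the cells holding at least min_count points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: c for cell, c in counts.items() if c >= min_count}

# Hypothetical data: three points fall in cell (0, 0), two in cell (5, 5), one in cell (9, 0).
points = [(0.1, 0.2), (0.4, 0.3), (0.2, 0.9), (5.1, 5.2), (5.3, 5.4), (9.9, 0.1)]
cells = dense_cells(points, cell_size=1.0, min_count=2)
print(cells)  # {(0, 0): 3, (5, 5): 2}
```

After this single pass over the data, later steps only need to examine the occupied cells, which is why the cost of grid based methods is governed by the number of cells rather than by the number of objects.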
These methods automatically find subspaces of the highest dimensionality and are insensitive to the order of records, but the accuracy of the clustering result may be degraded at certain points. Major applications of such methods include military deployment, situation awareness, and so forth. Typical results of grid based methods are shown in Figure 2.

Constraint Based Methods.
In the previous paragraphs, we saw that there are many algorithms and methods for applying clustering in real life and for various social purposes. Unfortunately, most of these algorithms cannot take real life constraints such as physical obstacles into account. It was therefore realized that a method is needed that can handle clustering in the presence of physical obstacles. Such methods are application specific and are used in special cases.
Constraint based clustering in large databases introduces a novel concept called microclustering and sharing to scale up the algorithm. Other constraints discussed in this category include universal and existential constraints as well as averaging and summation constraints.
COD-CLARANS (Clustering with Obstructed Distance based on CLARANS) is the first clustering algorithm that solves the problem of clustering with obstacle entities (COE).
Constraint based clustering is not well suited to general use because of the NP-hard nature of the problems and because there is no guarantee of accurate results when the number of points is very large. Handling outliers is also a big problem for such methods.

Spectral Clustering.
Spectral clustering is a modern clustering approach. It is mainly used in graph and Laplacian based applications and relies on standard concepts of linear algebra. When constructing similarity graphs, the goal is to model the local neighborhood relationships between the data points, which is entirely different from $k$-means and other methods. The main tools for understanding spectral clustering are graph Laplacian matrices. First compute the unnormalized Laplacian $L$ from the given graph and then determine the eigenvectors $u_1, u_2, \ldots, u_k$ of $L$. The same procedure is used for the normalized Laplacian and its eigenvectors. This property is useful because of the way the graph Laplacian changes with the graph. The output comes in the form of clusters. The consistency of the normalized Laplacian is much better than that of the unnormalized one, so the normalized Laplacian should be preferred, followed by the $k$-neighborhood method rather than the Sigma neighborhood method to select clusters and distances between data points. An important reason for the success of spectral clustering is that it makes no strong assumption in advance about the form of the clusters. Spectral clustering can solve problems such as intertwined spirals, and it can be used for large data sets provided the similarity graph is sparse. It can be used as a black box method, which is a key concept in various clustering and scientific methods.
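For reference, the Laplacians mentioned above have standard definitions, sketched here under the assumption of an undirected similarity graph on $n$ points with a symmetric, nonnegative weight matrix $W = (w_{ij})$. The symbols $W$, $G$, $L_{\mathrm{sym}}$, and $L_{\mathrm{rw}}$ are notation introduced for this illustration and do not appear in the original text.

```latex
\begin{aligned}
g_i &= \sum_{j=1}^{n} w_{ij}, \qquad G = \operatorname{diag}(g_1, \ldots, g_n) && \text{(degree matrix)} \\
L &= G - W && \text{(unnormalized Laplacian)} \\
L_{\mathrm{sym}} &= G^{-1/2} L \, G^{-1/2} = I - G^{-1/2} W G^{-1/2} && \text{(normalized, symmetric)} \\
L_{\mathrm{rw}} &= G^{-1} L = I - G^{-1} W && \text{(normalized, random walk)}
\end{aligned}
```

In the usual formulation, the eigenvectors belonging to the $k$ smallest eigenvalues are stacked as columns of a matrix, and the rows of that matrix are then grouped with a simple method such as $k$-means.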

Basics of Density Based Clustering (DBSCAN)
Clustering, or cluster analysis [3,6,8], is a widely used method of data analysis whose function is to organize similar data items or objects into groups (clusters) so that items in a cluster are similar to each other and different from items in other clusters. As discussed in the previous sections, there are many different clustering methods. Density based methods are more effective and efficient for handling large spatial data, and in the field of spatial clustering no other algorithm matches the capabilities of DBSCAN. DBSCAN has been modified from time to time by various researchers and project agencies with respect to various parameters. A careful definition of the parameters gives better results. Selecting and applying parameters is a major task in DBSCAN; one such situation is shown here.
After studying many clustering algorithms [3,9], we decided to select and improve the DBSCAN algorithm. The reasons for selecting DBSCAN are its strengths, as follows:
(i) It is capable of discovering clusters with arbitrary shapes.
(ii) There is no need to predict the number of clusters in advance, and hence it is more realistic.
(iii) Greedy methods exist that can replace R*-tree based queries.
(iv) The selection and application of attributes is always open to improvement in time and space complexity.
(v) It is robust to outliers, and clusters can be merged with other clusters if they are similar.
We have tried to improve DBSCAN in the following directions. The first is to make DBSCAN handle spatial, nonspatial, and temporal data at the same time and distinguish them clearly. The second is to assign a certain density to each cluster so that dense and nondense regions can be formed accordingly. The third is the selection of a threshold value that is more realistic and understandable. Some concepts and definitions that are directly or indirectly related to DBSCAN are explained here:
(1) Cluster: given a database of $n$ data objects $D = \{o_1, o_2, \ldots, o_n\}$, the procedure of partitioning $D$ into smaller parts $C = \{C_1, C_2, \ldots, C_k\}$ that are similar by certain standards is called clustering; the $C_i$ are clusters, where $C_i \subseteq D$ ($i = 1, 2, 3, \ldots, k$).
(2) Neighborhood: a distance function (e.g., Manhattan distance or Euclidean distance) for any two points $p$ and $q$ is denoted by $\mathrm{dist}(p, q)$.
(4) Core object: a point $p$ is a core point if at least Minpts points are within distance $\varepsilon$ (Eps) of it, and those points are said to be directly reachable from $p$. In other words, a core object is a point whose neighborhood of a given radius (Eps) contains at least a minimum number (Minpts) of other points, as shown in Figure 4.
(5) Directly density reachable: an object $p$ is directly density reachable from an object $q$ if $p$ is within the Eps-neighborhood of $q$ and $q$ is a core object.
(6) Density reachable: a point $p$ is density reachable from $q$ if there is a chain $p_1, \ldots, p_m$ with $p_1 = q$ and $p_m = p$, where each $p_{i+1}$ is directly density reachable from $p_i$ with respect to Eps and Minpts, for $1 \le i < m$, $p_i \in D$, where $D$ denotes the database.
(7) Density connected: an object $p$ is density connected to an object $q$ with respect to Eps and Minpts if there is an object $o \in D$ such that both $p$ and $q$ are density reachable from $o$ with respect to Eps and Minpts.
(8) Density based clusters: a cluster $C$ is a nonempty subset of $D$ satisfying the following "maximality" and "connectivity" requirements: (i) $\forall p, q$: if $p \in C$ and $q$ is density reachable from $p$ with respect to Eps and Minpts, then $q \in C$. (ii) $\forall p, q \in C$: $p$ is density connected to $q$ with respect to Eps and Minpts.
(9) Border objects: an object is a border object if it is not a core object but is density reachable from a core object.
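The definitions of neighborhood, core object, and border object can be illustrated with a short sketch. It is not taken from the paper: the function names `region_query` and `classify` are hypothetical, Euclidean distance is assumed, and a point is counted as a member of its own Eps-neighborhood, which is a common convention but an assumption here.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not core, but inside the neighborhood of a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical data: a small square, one point attached to it, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify(points, eps=1.0, min_pts=3)
print(labels)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) has only two points in its neighborhood, so it is not a core point, but it lies within Eps of the core point (1, 1) and is therefore a border point; the point (8, 8) is reachable from no core point and is noise.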

Problems of Existing Approaches
In a nutshell, the problems can be summarized as follows [8,10]:
(i) Identification of proper clusters for different types of spatial data sets.
(ii) Deficiency of methods in predicting the similarity and number of clusters in advance when varied data sets are used.
(iii) Difficulty in increasing and decreasing the distance between clusters.
(iv) Identification of actual noise points and border objects.
(v) Inconsistent results when clusters of different densities are present.
(vi) Identification of adjacent clusters when the measurements of neighboring objects differ only slightly, while the values of border objects on opposite sides of the same cluster may differ greatly.
The situation becomes hard to control for data sets with arbitrary shapes, such as active and inactive volcanoes, polluted areas of Delhi city, and other GIS patterns, because these data sets produce adjacent and very close clusters with overlapping boundaries [6,7]. A result of DBSCAN is shown in Figure 4. Different results are also shown in Figure 1.

Proposed Solution with Development of New Algorithm (IDBSCAN)
Our research objective is to solve the problems listed in Section 4. We have decided to design and develop new methods, techniques, and algorithms that are efficient, effective, and robust to noise and outliers and that allow proper tuning of parameters. This can be achieved by considering only the important components of the existing algorithm and by careful selection of data sets. DBSCAN requires two important parameters [2,10,11]:
(i) Eps is the radius, expressed in the spatial attributes (latitude and longitude), that delimits the neighborhood area of a point.
(ii) Minpts is the minimum number of points that must exist in the Eps-neighborhood.
This method is highly dependent on the parameters provided by the user and is computationally expensive when the size of the input data is unbounded. Our proposed algorithm IDBSCAN requires five parameters: the neighbor list size; Eps1, the distance measure for spatial data objects; Eps2, the distance measure for nonspatial data objects; Minpts, the minimum number of points required within Eps1 and Eps2; and a threshold value.
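One plausible reading of the two distance parameters Eps1 and Eps2 is sketched below. The paper does not specify how the two distances are combined, so the representation of an object as `(x, y, value)`, the use of Euclidean distance for the spatial part, the absolute difference for the nonspatial part, and the function name `neighbors` are all assumptions made for illustration.

```python
import math

def neighbors(objects, i, eps1, eps2):
    """Indices j with spatial distance <= eps1 AND nonspatial distance <= eps2 from object i.

    Each object is assumed to be a tuple (x, y, value): two spatial coordinates
    and one nonspatial attribute. This combination rule is an assumption.
    """
    xi, yi, vi = objects[i]
    result = []
    for j, (x, y, v) in enumerate(objects):
        spatially_close = math.dist((xi, yi), (x, y)) <= eps1
        similar_value = abs(vi - v) <= eps2
        if spatially_close and similar_value:
            result.append(j)
    return result

# Hypothetical objects: object 2 is spatially close to object 0 but has a very different value.
objects = [(0.0, 0.0, 20.0), (0.5, 0.0, 21.0), (0.0, 0.5, 35.0), (5.0, 5.0, 20.0)]
near = neighbors(objects, 0, eps1=1.0, eps2=2.0)
print(near)  # [0, 1]
```

Under this reading, two objects that are close in space but differ strongly in their nonspatial attribute are not neighbors, which is one way close points with dissimilar properties could be kept apart.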
IDBSCAN works as follows. The algorithm begins with an arbitrary point $p$ from the database and retrieves all points that are density reachable from $p$ with respect to Eps and Minpts. The retrieval of density reachable objects is performed by iteratively collecting directly density reachable objects. If $p$ is a core point (i.e., $|N_{Eps}(p)| \ge$ Minpts), then $p$ and all points that are density reachable from it are collected in one cluster. If $p$ is a border point and no points are density reachable from $p$ as per the definition, then the algorithm moves on to the next point of the database. If $p$ is not a core point, then $p$ is provisionally considered an outlier and is later discarded as noise if it does not belong to any cluster. The algorithm terminates when no new points can be assigned to any cluster. These attributes are described in Figure 3 and in definitions 1 to 10.
The flowchart of the IDBSCAN algorithm is depicted in Figure 5, and the pseudocode is given in Algorithm 1.
A stack is needed to obtain density reachable objects from directly density reachable objects. If two clusters $C_1$ and $C_2$ are very close to each other, a point may belong to both $C_1$ and $C_2$. Such a point is considered a border point of both clusters, and the algorithm finally assigns it to the cluster that was discovered first. In this way the problem of outliers and border objects can be overcome.
Input:
(1) A set of objects in a spatial area, $D = \{o_1, o_2, \ldots, o_n\}$
(2) The neighbor list size
(3) Eps: radius for spatial and nonspatial data objects
(4) Minpts: the minimum number of points that must exist in the Eps-neighborhood
(5) $\Delta$: threshold value for inclusion in a cluster
Output: Clusters with their core objects and noise points, $C = \{C_1, C_2, \ldots, C_k\}$.
Method:
(1) Set cluster layer = 0;
(2) Initialize a loop for selecting objects from the given database
For i = 1 to n do // select an arbitrary object o_i and check whether it is visited
    If o_i does not belong to any cluster then // otherwise move forward to the next point
        X = region query(o_i, Eps); // process neighbors
        If sizeof(X) < Minpts then
            Mark o_i as noise
        Else
            Cluster layer = Cluster layer + 1; // increase the cluster number
            For j = 1 to sizeof(X) // set the cluster number for all points in X
                Mark point j of X with the current cluster layer
            End (of marking)
            Expand cluster: push() all objects of X onto the stack
            While (stack != empty()) // repeat while the stack is not empty
                current object = pop(); // pop the current object
                Y = process neighbors(current object, Eps1, Eps2) // spatial and nonspatial distances
                If sizeof(Y) ≥ Minpts then
                    For j = 1 to sizeof(Y)
                        If (point j of Y is not visited and not identified as noise and sizeofneighborpts ≥ $\Delta$) then
                            Add point j of Y to the current cluster
                            Push(point j of Y)
                        End if
                    End for
                End if
            End while
        End if
    End if
End for
End of algorithm
Return all points with cluster number and Eps-neighborhood.
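The role of the stack and the rule that a shared border point stays with the first cluster that reaches it can be sketched in runnable form. This is a simplified DBSCAN-style sketch, not the authors' C implementation of IDBSCAN: it uses a single Eps with Euclidean distance and omits the neighbor list size, the separate Eps1/Eps2 distances, and the threshold value. The names `cluster`, `NOISE`, and `UNVISITED` are hypothetical.

```python
import math

NOISE = -1
UNVISITED = 0

def cluster(points, eps, min_pts):
    """Density based clustering with an explicit stack; returns one label per point."""
    def region_query(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue                       # already clustered or marked as noise
        seeds = region_query(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later be relabeled as a border point
            continue
        cluster_id += 1                    # start a new cluster at this core point
        labels[i] = cluster_id
        stack = []
        for j in seeds:
            if labels[j] in (UNVISITED, NOISE):
                labels[j] = cluster_id
                stack.append(j)
        while stack:
            current = stack.pop()
            nbrs = region_query(current)
            if len(nbrs) < min_pts:
                continue                   # border point: keep it, but do not expand from it
            for j in nbrs:
                if labels[j] in (UNVISITED, NOISE):
                    labels[j] = cluster_id # the first cluster to reach a point keeps it
                    stack.append(j)
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8), (20, 20), (20, 21), (21, 20)]
labels = cluster(points, eps=1.0, min_pts=3)
print(labels)  # [1, 1, 1, 1, 1, -1, 2, 2, 2]
```

Points that already carry a cluster label are never relabeled, so a border point lying between two clusters is kept by whichever cluster is expanded first, as described in the text.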

Analysis of Performance of Algorithm
This section analyzes the performance and complexity of the newly designed algorithm. The main quantities used in the analysis are $n$, the number of objects in the database, and the size of the neighbor list. A well-structured and well-stored database gives better results, so query operations and data access play a major role in optimizing overall performance. Indexing and selection of data also need to be considered. It has been shown that the order in which data points are selected does not affect the time complexity [10].
A step by step analysis is given here:
(1) Initialization of the cluster layer executes only once, that is, $+1$.
(2) The loop counter is evaluated $n + 1$ times.
(3) The comparison step executes fewer than $n + 1$ times, and hence the total time is $n + 1$.
The well-known R-tree indexing technique and the order of queries also help to reduce the time required.
Table 1 is generated from the SPDM software DIVA-GIS in the .shp file format. A red point is used to represent a core point. Other details are as follows. The results with Eps set to 0.0045 and MinPts = 7 give 3700 clusters, and 488,763 out of the total of 710,148 points end up in a cluster.

Results and Discussion
Data collected from different research agencies are available; an example with spatial attributes is shown here [12, 13, 14] in table form.
The algorithm was implemented in the C language following the sequence of the pseudocode [15], and the input and output forms of the program are given as follows. The program is applicable to the data formats and file types described above. It was observed that, for large data sizes, the generation of results degrades in a logarithmic manner. Another screenshot, given for the estimation of performance, is shown in Figure 9.
As shown in Figure 9, the performance of IDBSCAN is better in terms of data units and run time. The computing time of IDBSCAN increases linearly with the number of points in the database. Figure 9 was generated for data of various sizes, and the results were computed on a general-purpose computer. Limitations and future aspects are discussed in the next section.
A logical comparison between DBSCAN and IDBSCAN may be summarized as follows [16]:
(1) DBSCAN is a density based approach, but it is weak in determining the density reachability of a point from others, whereas IDBSCAN is more accurate in its density reachability approach.
(2) DBSCAN can find arbitrarily shaped clusters, but it is weak at separating clusters that are very close (but not overlapping), whereas IDBSCAN gives better performance for close points with dissimilar properties and for clusters that are very close at the boundary, that is, at border points.
(3) Neither algorithm requires the number of clusters ($k$) in advance, but DBSCAN cannot cluster data sets with large differences in density well, whereas IDBSCAN can.
(6) The computational difference is shown in Figure 9.

Conclusion and Future Work
In this research paper, we introduced a new density based clustering algorithm, IDBSCAN, which is designed as a modification of the DBSCAN clustering algorithm. The study may be summarized as follows:
(ii) The algorithm can find a cluster that is completely surrounded by different and very close clusters of different densities.
(iii) The second modification is the use of five parameters instead of the two used in DBSCAN.
(iv) The effectiveness of the proposed algorithm was demonstrated using real and synthetic databases.
(v) The paper gives an opportunity to apply the clustering algorithm to new types of data and new application areas such as moving objects and trajectories, spatially embedded social networks, and geocoded multimedia and web based data.
The experimental results show that very large and dense data need higher computational power, so as a further extension the algorithm may be redesigned for parallel and multithreaded execution. The experimental results are consistent with our objective and with the improved algorithm. The selection of Eps and Minpts together with the threshold Δ could be made more intelligent and heuristically efficient.
There is still considerable scope for designing new spatial data mining algorithms for neighborhood objects and graphs.