A Novel Artificial Immune Algorithm for Spatial Clustering with Obstacle Constraint and Its Applications

An important component of a spatial clustering algorithm is the distance measure between sample points in object space. In this paper, the traditional Euclidean distance measure is replaced with an obstacle distance measure for spatial clustering under obstacle constraints. First, we present a path searching algorithm that approximates the obstacle distance between two points in the presence of obstacles and facilitators. Taking obstacle distance as the similarity metric, we then propose the artificial immune clustering with obstacle entity (AICOE) algorithm for clustering spatial point data in the presence of obstacles and facilitators. Finally, the paper presents a comparative analysis of the AICOE algorithm and classical clustering algorithms. Our clustering model based on the artificial immune system is also applied to a public facility location problem in order to establish the practical applicability of our approach. By using the clone selection principle and updating the cluster centers based on the elite antibodies, the AICOE algorithm is able to achieve the global optimum and a better clustering effect.


Introduction
Spatial clustering analysis is an important research problem in data mining and knowledge discovery, the aim of which is to group spatial data points into clusters. Based on the similarity or spatial proximity of spatial entities, the spatial dataset is divided into a series of meaningful clusters [1]. According to the clustering rule applied to the spatial data, clustering algorithms can be divided into partition-based [2,3], hierarchy-based [4,5], density-based [6], and grid-based [7] spatial clustering algorithms.
The distance measure between sample points in object space is an important component of a spatial clustering algorithm. The traditional clustering algorithms above assume that two spatial entities are directly reachable and use a variety of straight-line distance metrics to measure the degree of similarity between spatial entities. However, physical barriers often exist in real regions. If obstacles and facilitators are not considered during the clustering process, the clustering results are often unrealistic. Taking the simulated dataset in Figure 1(a) as an example, where the points represent the locations of consumers, the clustering result shown in Figure 1(b) is obtained when the rivers and the hill, acting as obstacles, are not considered. If the obstacles are taken into account but the bridges, acting as facilitators, are not, the clustering result in Figure 1(c) is obtained. When both obstacles and facilitators are considered, Figure 1(d) shows the more reasonable clustering pattern.
At present, only a few clustering algorithms consider obstacles and/or facilitators in the spatial clustering process. The COE-CLARANS algorithm [8] is the first spatial clustering algorithm with obstacle constraints in a spatial database, and it is an extension of a classic partitional clustering algorithm. It shares the limitations of the CLARANS algorithm [9], namely sensitivity to density variation and poor efficiency. DBCluC [10] extends the concepts of the DBSCAN algorithm [11], utilizing obstruction lines to fill the visible space of obstacles. However, it cannot discover clusters of different densities. DBRS+ is an extension of the DBRS algorithm [12] that considers the continuity in a neighborhood. The global parameters used by the DBRS+ algorithm make it suffer from the problem of uneven density. AUTOCLUST+ is a graph-based clustering algorithm, which is based on the AUTOCLUST clustering algorithm [13]. Because of the statistical indicators it uses, AUTOCLUST+ cannot deal with planar obstacles. Liu et al. presented an adaptive spatial clustering algorithm [14] for use in the presence of obstacles and facilitators, which has the same defect as the AUTOCLUST+ algorithm.
Recently, the artificial immune system (AIS), inspired by biological evolution, has provided a new idea for clustering analysis. Owing to its adaptability and self-organising behaviour, the artificial immune system has gradually become a research hotspot in the domain of smart computing [15-20]. Bereta and Burczyński performed clustering analysis by means of an effective and stable immune K-means algorithm for both unsupervised and supervised learning [21]. Gou et al. proposed the multielitist immune clonal quantum clustering algorithm by embedding a potential evolution formula into the affinity function calculation of multielitist immune clonal optimization and updating the cluster center based on the distance matrix [22]. Liu et al. put forward a novel immune clustering algorithm based on the clonal selection method and immunodominance theory [23].
In this paper, a path searching algorithm is first proposed that finds the approximate optimal path between two points among obstacles and thereby obtains the corresponding obstacle distance. It does not need preprocessing and can deal with both linear and planar obstacles. Based on the path searching algorithm, a spatial clustering algorithm is proposed for clustering spatial data in the presence of both obstacles and facilitators. A case study is also carried out to apply our method to the problem of public facility optimization.
The remainder of this paper is organized as follows. Section 2 first presents the path searching algorithm and then elaborates the details of the AICOE algorithm, including the analysis of population partition, the design of the affinity function, and the immune operators. Section 3 shows the experimental results. Section 4 presents the conclusions and main findings.

Obstacles Representation.
Physical obstacles in the real world can generally be divided into linear obstacles (e.g., river, highway) and planar obstacles (e.g., lake). Facilitators (e.g., bridge) are physical objects which strengthen straight-line reachability among objects. In processing geospatial data, the representation of the spatial entities needs to be determined first [14]. In this paper, the vector data structure is used to represent spatial data. Obstacle entities are approximated as polylines and polygons. A facilitator is abstracted as a vertex on an obstacle.
Relevant definitions are provided as follows.
is the number of ( ) }.
Definition 4 (direct reachability). For any two points $p$, $q$ in a two-dimensional space, $p$ is called directly reachable from $q$ if segment $pq$ does not intersect any obstacle; otherwise, $p$ is called indirectly reachable from $q$.
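As an illustration of Definition 4, the following is a minimal sketch of a direct-reachability test. It assumes that each obstacle is stored as a list of vertices together with a flag saying whether the vertex list is closed (a polygon for a planar obstacle) or open (a polyline for a linear obstacle). The function names and this `(vertices, closed)` representation are hypothetical and are not taken from the paper. Touching an obstacle is counted as intersecting it, and facilitators are ignored in this sketch.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]
Obstacle = Tuple[Sequence[Point], bool]  # (vertices, closed) -- assumed representation


def orientation(a: Point, b: Point, c: Point) -> float:
    """Cross product of (b - a) and (c - a): > 0 left turn, < 0 right turn, 0 collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def on_segment(a: Point, b: Point, c: Point) -> bool:
    """For c collinear with a and b: True if c lies inside the bounding box of segment ab."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p: Point, q: Point, a: Point, b: Point) -> bool:
    """True if segment pq and segment ab share at least one point."""
    d1, d2 = orientation(a, b, p), orientation(a, b, q)
    d3, d4 = orientation(p, q, a), orientation(p, q, b)
    if d1 * d2 < 0 and d3 * d4 < 0:  # proper crossing
        return True
    # Degenerate cases: an endpoint lies on the other segment.
    return ((d1 == 0 and on_segment(a, b, p)) or (d2 == 0 and on_segment(a, b, q))
            or (d3 == 0 and on_segment(p, q, a)) or (d4 == 0 and on_segment(p, q, b)))


def obstacle_edges(vertices: Sequence[Point], closed: bool) -> List[Segment]:
    """Consecutive vertex pairs, plus the closing edge for a polygon."""
    edges = list(zip(vertices, vertices[1:]))
    if closed and len(vertices) > 2:
        edges.append((vertices[-1], vertices[0]))
    return edges


def directly_reachable(p: Point, q: Point, obstacles: Sequence[Obstacle]) -> bool:
    """Definition 4: segment pq must not intersect any obstacle edge."""
    for vertices, closed in obstacles:
        for a, b in obstacle_edges(vertices, closed):
            if segments_intersect(p, q, a, b):
                return False
    return True
```

For example, with a river stored as the open polyline `[(0, -1), (0, 1)]`, the points `(-1, 0)` and `(1, 0)` are not directly reachable from each other, whereas `(-1, 2)` and `(1, 2)` are.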

The Obstacle Distance between the Spatial Entities.
Currently, distance calculation usually computes the Euclidean distance between two clustering points. When physical obstacles exist in the real space, obstacle constraints should be taken into account when computing the distance between two entities. The algorithm handles linear obstacles and planar obstacles separately. When traversing linear obstacles, facilitators are also taken into account for path construction. Figure 2(a) illustrates the construction of the approximate optimal path for a linear obstacle and gives a schematic view of Step 4 of the algorithm. When traversing planar obstacles, the path is generated by constructing the minimum convex hull. For no more than 100,000 two-dimensional data samples, the minimum convex hull can be computed within a few seconds [24]. Here the Graham algorithm is used to produce the minimum convex hull [25]. Figures 2(b), 2(c), and 2(d) illustrate the construction of the approximate optimal path for planar obstacles.
$CH(p, q, V)$ is the smallest convex hull that is constructed from the start point $p$ to the end point $q$ and contains all the points of the vertex set $V$. $path_{(cw)}(CH(p, q, V))$ denotes the path from the start point $p$ to the end point $q$ formed by the adjacent edges of $CH(p, q, V)$ in the clockwise direction; $path_{(ccw)}(CH(p, q, V))$ denotes the path from the start point $p$ to the end point $q$ formed by the adjacent edges of $CH(p, q, V)$ in the counterclockwise direction. path1 and path2 are the obstacle paths on the left-hand and right-hand sides of $\overrightarrow{pq}$, respectively. When new segments are added to path1 and path2, the start points of the added segments are denoted by $s_1$ and $s_2$, respectively, and the end points by $e_1$ and $e_2$.
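The two candidate detours around a planar obstacle can be sketched as follows. This is an illustration only: it builds the hull with Andrew's monotone chain, a sorting-based relative of the Graham scan, rather than the classical angular-sort Graham algorithm cited above, and the function names are hypothetical. It assumes that the start and end points are extreme points of the hull and raises an error otherwise.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of (a - o) and (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counterclockwise order (y axis up)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    upper: List[Point] = []
    for pt in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], pt) <= 0:
            lower.pop()
        lower.append(pt)
    for pt in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], pt) <= 0:
            upper.pop()
        upper.append(pt)
    return lower[:-1] + upper[:-1]


def path_length(path: Sequence[Point]) -> float:
    """Total Euclidean length of a polyline."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def hull_paths(start: Point, end: Point,
               vertices: Sequence[Point]) -> Tuple[List[Point], List[Point]]:
    """Return (clockwise path, counterclockwise path) from start to end along the hull."""
    hull = convex_hull([start, end] + list(vertices))
    if start not in hull or end not in hull:
        raise ValueError("start and end must be extreme points of the hull")
    n = len(hull)
    i = hull.index(start)
    ccw_walk = [hull[(i + step) % n] for step in range(n)]  # full loop beginning at start
    j = ccw_walk.index(end)
    ccw_path = ccw_walk[: j + 1]
    cw_path = [start] + ccw_walk[j:][::-1]  # same loop traversed in the other direction
    return cw_path, ccw_path
```

The shorter of the two returned paths, measured with `path_length`, would then serve as the approximate detour around that obstacle.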
$d_o(p, q)$ represents the obstacle distance between two spatial entities $p$ and $q$. If $p$ is directly reachable from $q$, $d_o(p, q)$ is the Euclidean distance between the two points, denoted by $d(p, q)$; if $p$ is indirectly reachable from $q$, a path is constructed to bypass the obstacles, with $p$ and $q$ taken as the start and end points, respectively.
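Written as a single piecewise expression, the definition above might read as follows. This is a sketch under the assumption that the obstacle distance of an indirectly reachable pair equals the length of the constructed bypass path. The symbols $d_o$ (obstacle distance), $d$ (Euclidean distance), $p$, $q$, and the path vertices $w_0, \ldots, w_m$ are notation chosen here for illustration.

```latex
d_o(p, q) =
\begin{cases}
  d(p, q), & \text{if } p \text{ is directly reachable from } q, \\[4pt]
  \displaystyle\sum_{i=0}^{m-1} d(w_i, w_{i+1}), & \text{otherwise, for the bypass path } (w_0 = p,\, w_1,\, \ldots,\, w_m = q).
\end{cases}
```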
The path searching algorithm for the approximate optimal path between two points among obstacles can be elaborated as follows.
Step 2. Find the obstacles that intersect $\overrightarrow{pq}$, represented in turn as $O_1, O_2, \ldots, O_n$, each of which is a linear or a planar obstacle, where $n$ is the number of these obstacles.

Computational Intelligence and Neuroscience
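Step 4. If the current obstacle $O_i$ is a linear obstacle, execute the following steps.
(i) Select the vertex $u$ of $O_i$ on the left-hand side of $\overrightarrow{pq}$ which has the smallest distance to $\overrightarrow{pq}$.
(ii) Select the vertex $v$ of $O_i$ on the right-hand side of $\overrightarrow{pq}$ which has the smallest distance to $\overrightarrow{pq}$.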
(v) Go to Step 6.
Step 5. If the current obstacle $O_i$ is a planar obstacle, there are the following two cases.
(II) If < , execute the following steps.

Spatial Clustering Algorithm with Obstacle Constraints Based on Artificial Immune System.
Computational intelligence techniques have been widely applied to data engineering research, including classification, clustering, and deviation or outlier detection [19]. The artificial immune system (AIS) is an intelligent method which mimics the natural biological function of the immune system. Owing to its promising performance in immune recognition and its capacity for immune learning and immune memory, AIS has gradually become an important branch of intelligent computing [26-29]. In order to overcome the sensitivity of traditional clustering algorithms to the initial value and their tendency to fall into local optima, while maintaining their advantage of fast convergence, a novel spatial clustering algorithm with obstacle constraints is proposed in this paper.

Affinity Function Design and Immune Operators.
In most cases, the similarity metric used in a clustering algorithm is a distance metric. The total within-cluster variance, or total mean-square quantization error (MSE) [30], is calculated as follows:
$$\mathrm{MSE} = \sum_{j=1}^{k} \sum_{v_i \in W_j} \| v_i - c_j \|^2 \qquad (1)$$
where $\| v_i - c_j \|$ denotes the similarity between sample point $v_i$ and cluster center $c_j$. The obstacle distance is used as the distance metric in this paper, so that obstacle constraints are taken into account during clustering. On this basis, the set of cluster centers $C = \{c_1, c_2, \ldots, c_k\}$ and the corresponding partition $W = \{W_1, W_2, \ldots, W_k\}$ are obtained by assigning each sample point to the cluster center nearest to it in obstacle distance. Bearing in mind the MSE in (1), we design the affinity function in (2), which represents the affinity of an antibody with an antigen. Let in-cluster $= \sum_{j=1}^{k} \sum_{v_i \in W_j} d_o(v_i, c_j)$; the affinity is then given by (2), where $\varepsilon_0$ is a small positive number that prevents an ill-conditioned value (i.e., a denominator equal to zero). fmeans denotes the average value of the population affinity and is calculated by (3) over the memory cell subset, which is a subset of the antibody population. The threshold value of immunosuppression is calculated by (4) from the affinity between pairs of antibodies.
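The role of the in-cluster distance and of the small positive constant can be illustrated with a short sketch. The reciprocal form `1 / (in_cluster + eps0)` used below is an assumption that is merely consistent with the description of a denominator that must not become zero; it is not the paper's Equation (2). The function names are hypothetical, and `math.dist` stands in for the obstacle distance, which can be passed in as any callable.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Distance = Callable[[Point, Point], float]


def assign_to_centers(samples: Sequence[Point], centers: Sequence[Point],
                      dist: Distance) -> Dict[int, List[int]]:
    """Partition sample indices by nearest center under the given distance."""
    clusters: Dict[int, List[int]] = {j: [] for j in range(len(centers))}
    for i, v in enumerate(samples):
        nearest = min(range(len(centers)), key=lambda j: dist(v, centers[j]))
        clusters[nearest].append(i)
    return clusters


def in_cluster_distance(samples: Sequence[Point], centers: Sequence[Point],
                        dist: Distance) -> float:
    """Sum over clusters of the distances from each member to its cluster center."""
    clusters = assign_to_centers(samples, centers, dist)
    return sum(dist(samples[i], centers[j])
               for j, members in clusters.items() for i in members)


def affinity(samples: Sequence[Point], centers: Sequence[Point],
             dist: Distance, eps0: float = 1e-9) -> float:
    """ASSUMED form: reciprocal of the in-cluster distance; eps0 avoids division by zero."""
    return 1.0 / (in_cluster_distance(samples, centers, dist) + eps0)
```

Under this assumed form, a set of centers that sits inside the natural groups of the data yields a smaller in-cluster distance and therefore a larger affinity than a set of centers placed between the groups.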
The antibody selection operations, cloning operations, and mutation operations of AICOE algorithm were defined in the literature [31].
Step 3. According to the affinities calculated in Step 2, the optimal antibody subset bstAS is composed of the top $m$ ($m \le n$) antibodies by affinity, where bstAS is a subset of the current antibody set and $n$ is the number of antibodies in that set. Add bstAS to the memory cell set.
Step 4. The next-generation antibody set is generated as follows.
(I) Obtain bstAS1 by performing the clone operation on bstAS.
(II) Obtain bstAS2 by performing the mutation operation on bstAS1. Add bstAS2 to the memory cell set.
(III) Apply the immunosuppression operation to the memory cell set. Calculate the immunosuppression threshold according to (4). For any two antibodies in the set, if the affinity between them is less than the threshold, randomly delete one of the two antibodies.
(IV) Randomly generate an antibody subset, denoted by rdmAS, to update the next-generation antibody set.
Step 5. Calculate fmeans of the current population by using (3). If the change in fmeans over a certain number of consecutive iterations does not exceed the stopping-criterion parameter, stop the algorithm; otherwise go to Step 2.
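The select, clone, mutate, suppress, and replenish cycle of the steps above can be sketched as follows. This is not the authors' implementation. It assumes that one antibody encodes $k$ candidate centers drawn from the sample points, that the affinity is the reciprocal of the summed sample-to-nearest-center distance, and that two antibodies are compared by the summed distance between their sorted centers. For reproducibility, suppression here keeps the higher-affinity antibody of a close pair instead of deleting one at random. All function and parameter names are hypothetical, and `math.dist` stands in for the obstacle distance.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Antibody = Tuple[Point, ...]  # assumption: one antibody = k candidate centers
Distance = Callable[[Point, Point], float]


def affinity(samples: Sequence[Point], antibody: Antibody, dist: Distance,
             eps0: float = 1e-9) -> float:
    """Assumed affinity: reciprocal of the summed sample-to-nearest-center distance."""
    in_cluster = sum(min(dist(v, c) for c in antibody) for v in samples)
    return 1.0 / (in_cluster + eps0)


def mutate(antibody: Antibody, samples: Sequence[Point], rate: float,
           rng: random.Random) -> Antibody:
    """Replace each center by a random sample point with probability `rate`."""
    return tuple(rng.choice(samples) if rng.random() < rate else c for c in antibody)


def separation(a: Antibody, b: Antibody, dist: Distance) -> float:
    """Assumed antibody-to-antibody measure: summed distance between sorted centers."""
    return sum(dist(x, y) for x, y in zip(sorted(a), sorted(b)))


def suppress(ranked: List[Antibody], threshold: float, dist: Distance) -> List[Antibody]:
    """Scan best-first and drop any antibody closer than `threshold` to a kept one."""
    kept: List[Antibody] = []
    for ab in ranked:
        if all(separation(ab, other, dist) >= threshold for other in kept):
            kept.append(ab)
    return kept


def immune_cluster(samples: List[Point], k: int, dist: Distance = math.dist,
                   pop_size: int = 20, n_best: int = 5, n_clones: int = 4,
                   n_random: int = 5, mutation_rate: float = 0.35,
                   inhibition: float = 0.05, eps: float = 1e-4,
                   patience: int = 5, max_iter: int = 200, seed: int = 0) -> Antibody:
    rng = random.Random(seed)

    def score(ab: Antibody) -> float:
        return affinity(samples, ab, dist)

    def random_antibody() -> Antibody:
        return tuple(rng.sample(samples, k))

    population = [random_antibody() for _ in range(pop_size)]
    memory: List[Antibody] = []
    prev_mean, stable = None, 0
    for _ in range(max_iter):
        best = sorted(population, key=score, reverse=True)[:n_best]       # select elites
        clones = [ab for ab in best for _ in range(n_clones)]             # clone
        mutated = [mutate(ab, samples, mutation_rate, rng) for ab in clones]
        ranked = sorted(memory + best + mutated, key=score, reverse=True)
        memory = suppress(ranked, inhibition, dist)[:pop_size - n_random]  # suppress
        population = memory + [random_antibody() for _ in range(n_random)]
        mean = sum(score(ab) for ab in memory) / len(memory)              # mean affinity
        stable = stable + 1 if prev_mean is not None and abs(mean - prev_mean) <= eps else 0
        if stable >= patience:  # mean affinity unchanged for `patience` iterations
            break
        prev_mean = mean
    return max(memory, key=score)
```

On two well-separated groups of points with `k=2`, the returned antibody places one center in each group. Replacing `math.dist` with an obstacle-aware distance function changes only the `dist` argument.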

Case Implementation and Results
This paper presents two sets of experiments to evaluate the effectiveness of the AICOE algorithm. The first experiment uses a set of simulated data generated with ArcGIS 9.3. Its results are compared with those of the K-means clustering algorithm [2,3]. The second experiment is a case study on Wuhu city, and its results are compared with those of the COE-CLARANS algorithm [8].
All algorithms are implemented in C# and executed on Pentium 4.3 HZ, 2 GB RAM computers. The main parameters of the algorithm are set as follows: mutation rate of 0.35, inhibition threshold of 0.05, and iterative stopping criterion parameter of 1.0e−4.

Simulation Experimental Results.
The classical K-means clustering algorithm has been widely used for its simplicity and feasibility. The AICOE algorithm uses the obstacle distance defined in this paper for clustering analysis, whereas the K-means algorithm uses the Euclidean distance as the similarity measure between samples. The simulated dataset of the first experiment is shown in Figure 3(a). When the cluster number $k = 6$, the clustering results of the K-means clustering algorithm and the AICOE algorithm are shown in Figures 3(b) and 3(c), respectively. The experimental results show that the clustering results of the AICOE algorithm, which considers obstacles and facilitators, are more reasonable than those of the K-means algorithm.

Study Area and Data.
In this test, the AICOE algorithm is applied to an urban spatial dataset of the city of Wuhu in China (Figure 4). This paper takes 994 residential communities as two-dimensional points, where each point is represented as $(x, y)$. In this case study, each residential community is treated as a cluster sample point, with its population as an attribute. The highways, rivers, and lakes in the territory are regarded as spatial obstacles, as defined in Definitions 1 and 2, respectively. Pedestrian bridges and underpasses on a highway and bridges over a water body serve as connected points, and the remaining vertices are unconnected points. A digital map of Wuhu, China, stored in ArcGIS 9.3 was used, and a program was written to generate the spatial cluster points automatically from the addresses of the residential communities. The purpose is to find suitable centers (medoids) and their corresponding clusters.

Clustering Algorithm Application and Contrastive Analysis.
The COE-CLARANS algorithm [8] and the AICOE algorithm are compared by simulation experiment. The AICOE algorithm uses the obstacle distance defined in this paper for clustering analysis. The clustering results of the COE-CLARANS algorithm and the AICOE algorithm are compared in Figure 5, and the same comparison with the clustering centers shown is given in Figure 6.
Given the coverage ranges of different types of public facilities, clustering simulations are carried out to generate 5, 10, and 15 subclasses, respectively. Because the Yangtze River is the main obstacle of Wuhu territory, the [...]
Figure 7 demonstrates that the COE-CLARANS algorithm is sensitive to the initial value, while the AICOE algorithm effectively avoids this flaw. Meanwhile, the AICOE algorithm reaches the global optimal solution in fewer iterations.
Table 1 shows the results of scalability experiments comparing the COE-CLARANS algorithm and the AICOE algorithm. The synthetic dataset in these experiments is generated from a Gaussian distribution. The size of the dataset varies from 25,000 to 100,000 points. The obstacles and facilitators are generated manually. The number of obstacles varies from 5 to 20, and each obstacle has 10 vertices. The number of facilitators is 20% of the number of obstacles. Table 1 illustrates that the AICOE algorithm is faster than the COE-CLARANS algorithm.
Comparing the COE-CLARANS algorithm and the AICOE algorithm for spatial clustering with physical constraints, the experimental results first show that the COE-CLARANS algorithm introduces grouping biases because of its microclustering approach, whereas the AICOE algorithm operates on all the data with less prior preprocessing. The quality of the clustering results achieved by the AICOE algorithm surpasses that of the COE-CLARANS algorithm. Next, the simulation results indicate that the AICOE algorithm overcomes the sensitivity of COE-CLARANS to the initial value. This sensitivity arises because the COE-CLARANS algorithm selects the optimum set of cluster representatives with a two-phase heuristic method. Last, the scalability experiments show that the COE-CLARANS algorithm, which is affected by the low efficiency of its preprocessing, runs slower than the AICOE algorithm.
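A synthetic workload with the sizes described above could be generated along the following lines. This is only a sketch: the paper states that its obstacles and facilitators were generated manually, whereas this example draws them at random. The function name, the coordinate extent, the step sizes of the random polylines, and the encoding of a facilitator as an `(obstacle index, vertex index)` pair are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def make_dataset(n_points: int, n_obstacles: int, vertices_per_obstacle: int = 10,
                 facilitator_ratio: float = 0.2, extent: float = 1000.0,
                 seed: int = 0) -> Tuple[List[Point], List[List[Point]], List[Tuple[int, int]]]:
    """Gaussian sample points, random polyline obstacles, and facilitator vertices."""
    rng = random.Random(seed)
    center, spread = extent / 2, extent / 6
    points = [(rng.gauss(center, spread), rng.gauss(center, spread))
              for _ in range(n_points)]

    obstacles: List[List[Point]] = []
    for _ in range(n_obstacles):
        x, y = rng.uniform(0, extent), rng.uniform(0, extent)
        polyline: List[Point] = []
        for _step in range(vertices_per_obstacle):
            polyline.append((x, y))
            x += rng.uniform(-20, 20)  # small sideways wobble
            y += rng.uniform(5, 40)    # steady advance, so the polyline resembles a river
        obstacles.append(polyline)

    # A facilitator is a vertex on an obstacle; their number is 20% of the obstacle count.
    n_facilitators = round(facilitator_ratio * n_obstacles)
    chosen = rng.sample(range(n_obstacles), n_facilitators)
    facilitators = [(i, rng.randrange(vertices_per_obstacle)) for i in chosen]
    return points, obstacles, facilitators
```

Calling `make_dataset(25000, 5)` gives the smallest configuration in the described range, and `make_dataset(100000, 20)` the largest.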

Conclusions
The artificial immune clustering with obstacle entity (AICOE) algorithm has been presented in this paper. Experiments on both synthetic and real-world datasets show that the AICOE algorithm has the following advantages. First, through the path searching algorithm, obstacles and facilitators can be effectively considered with less prior preprocessing than related algorithms (e.g., COE-CLARANS) require. Second, by embedding the obstacle distance metric into the affinity function calculation of immune clonal optimization and updating the cluster centers based on the elite antibodies, the AICOE algorithm effectively overcomes the shortcomings of the traditional method. The comparative experiments and the case study against classic clustering algorithms have demonstrated the rationality, performance, and practical applicability of the AICOE algorithm. Owing to the complexity of geographic data and the differences among data formats, current research on spatial clustering with obstacle constraints mainly targets clustering methods for two-dimensional spatial data points [8, 10, 12-14]. There are two directions for future work. One is to extend our approach by conducting comprehensive experiments on more complex databases from real applications. The other is to take nonspatial attributes into account for a comprehensive analysis of spatial databases.