Spatial cluster analysis is an important data mining task. Typical techniques include CLARANS, density-based and gravity-based clustering, and other algorithms designed for the traditional von Neumann computing architecture. This paper proposes a technique for spatial cluster analysis based on the sticker systems of DNA computing. We adopt the idea of the Bin-Packing Problem and design sticker programming algorithms accordingly. The proposed technique improves on the time complexity of these traditional techniques. When only the intracluster dissimilarity is taken into account, the time complexity is polynomial in the number of data points, which mitigates the difficulty posed by the NP-completeness of spatial cluster analysis. The new technique provides an alternative to traditional cluster analysis methods.

Spatial cluster analysis is a traditional problem in knowledge discovery from databases [

For example, Bouguila [

Adleman [

According to Păun et al. [

Inspired by the work of Alonso Sanches and Soma [

The rest of this paper is organized as follows: in Section

Let

Now we denote a partition of

To simplify the multiplicity of optimization, we often use the following variation of the above problems:

Next we only consider the cluster problem (

For the purpose of this paper, we will use a simplified version of the total energy as shown in the following equation:

In the case when the number of clusters

We now propose a

First we consider the case when

The

Then the final problem is

Next when

The

First we recall some standard operations of DNA computing as shown in [

The sticker model is based on the paradigm of Watson-Crick complementarity and was first proposed in [

The basic operations of the sticker model are merge, separate, set, and clear and are listed as follows [
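The four operations can be illustrated with a small software simulation. The sketch below assumes the usual sticker-model semantics: merge pours two tubes together, separate splits a tube according to whether a given bit is on, and set and clear switch a given bit on or off in every strand. The names `Strand`, `Tube`, `set_bit`, and `clear_bit`, the 0-based bit positions, and the representation of a memory complex as a tuple of bits are illustrative choices made here, not notation from the paper.

```python
from typing import List, Tuple

# Assumption: a memory complex is modelled as a tuple of bits, where 1 means
# a sticker is annealed at that position ("on") and 0 means it is bare ("off").
Strand = Tuple[int, ...]
Tube = List[Strand]  # a test tube is a multiset of memory complexes


def merge(n1: Tube, n2: Tube) -> Tube:
    """Pour two tubes together (multiset union)."""
    return n1 + n2


def separate(n: Tube, i: int) -> Tuple[Tube, Tube]:
    """Split a tube into (strands with bit i on, strands with bit i off)."""
    on = [s for s in n if s[i] == 1]
    off = [s for s in n if s[i] == 0]
    return on, off


def set_bit(n: Tube, i: int) -> Tube:
    """Turn bit i on in every strand of the tube."""
    return [s[:i] + (1,) + s[i + 1:] for s in n]


def clear_bit(n: Tube, i: int) -> Tube:
    """Turn bit i off in every strand of the tube."""
    return [s[:i] + (0,) + s[i + 1:] for s in n]
```

For instance, `separate([(0, 0, 1), (1, 0, 0)], 0)` places the second strand in the "on" tube and the first in the "off" tube. In the laboratory each operation acts on all strands in parallel, whereas this simulation visits them one at a time.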

Now we consider a test tube

We only give a DNA algorithm for

Now we consider solving the spatial clustering problem as described in Section

For two points

Now we define the dissimilarity matrix as
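As a concrete illustration of a dissimilarity matrix, the following sketch builds one from a list of points. It assumes Euclidean distance as the dissimilarity between two points, which is only one possible choice and may differ from the measure used in this paper. The function name `dissimilarity_matrix` is hypothetical.

```python
import math
from typing import List, Sequence


def dissimilarity_matrix(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the symmetric matrix of pairwise dissimilarities.

    Assumption: dissimilarity is the Euclidean distance between two points.
    """
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The matrix is symmetric with a zero diagonal.
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d
```

Only the upper triangle is computed explicitly; symmetry fills in the rest.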

For the partition

For an integer

Then we append

(a)

as the position numbers of

(b)

(c)

(d)

Now we present algorithms to implement the above procedures.

(a) Generation of all the possible

(b) Energy computation. The problem is to compute totals of energy for those

the counting number of each bin is stored in the following

(c) The third step is to eliminate infeasible partitions. This is done by checking the last

(d) The last step is to find the solution with the least energy. If

then we get the optimal solution.
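Steps (a) to (d) can be mimicked by a sequential brute-force search, which may help in checking small instances by hand. The sketch below is not the sticker algorithm itself. It makes several assumptions: the number of clusters `k` is fixed, each candidate is a bit string of length `n * k` in which bit `i * k + j` means point `i` lies in cluster `j`, a candidate is feasible when every point lies in exactly one cluster and no cluster is empty, and the energy is the sum of pairwise dissimilarities within each cluster. All function names are hypothetical.

```python
from itertools import combinations, product
from typing import List, Optional, Sequence, Tuple


def clusters_from_bits(bits: Sequence[int], n: int, k: int) -> List[List[int]]:
    """Decode a bit string: bit i*k + j set means point i is in cluster j."""
    return [[i for i in range(n) if bits[i * k + j]] for j in range(k)]


def energy(clusters: List[List[int]], d: Sequence[Sequence[float]]) -> float:
    """Assumed energy: sum of pairwise dissimilarities inside each cluster."""
    return sum(d[a][b] for c in clusters for a, b in combinations(c, 2))


def is_feasible(bits: Sequence[int], n: int, k: int) -> bool:
    """Every point in exactly one cluster, and no cluster left empty."""
    each_point_once = all(sum(bits[i * k:(i + 1) * k]) == 1 for i in range(n))
    no_empty = all(any(bits[i * k + j] for i in range(n)) for j in range(k))
    return each_point_once and no_empty


def best_partition(
    d: Sequence[Sequence[float]], k: int
) -> Optional[Tuple[float, List[List[int]]]]:
    n = len(d)
    best = None
    for bits in product((0, 1), repeat=n * k):   # (a) generate all candidates
        clusters = clusters_from_bits(bits, n, k)
        e = energy(clusters, d)                   # (b) compute the energy
        if not is_feasible(bits, n, k):           # (c) drop infeasible ones
            continue
        if best is None or e < best[0]:           # (d) keep the least energy
            best = (e, clusters)
    return best
```

For four points at positions 0, 1, 10, and 11 on a line with `k = 2`, the search returns the partition {0, 1}, {2, 3} with energy 2. The loop visits all `2 ** (n * k)` candidates one by one, so it is only practical for very small inputs; the DNA approach instead relies on processing all candidates in the tube in parallel.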

Coding structure of clusters.

In this section we consider cluster analysis when

Similar to the previous section, the

Now the coding is

Coding structure when

The clustering algorithm consists of four steps as shown in Algorithm

(a)

as the position numbers of

(b)

(c)

(d)

Now we present algorithms to implement the above procedures.

(a) Generation of all the possible

(b) Energy computation. The problem is to compute totals of energy for those

Hence

the counting number of each bin is stored in the following

(c) The next step is to find the solution with the least energy. If

then we get the optimal solution. The final number

(d) The final step is to count the number of clusters. It is stored in the last sticker while in the variable

In this paper we presented a new DNA-based technique for spatial cluster analysis. Two cases when the number of clusters

Finally, we point out that, to the best of the authors' knowledge, this is the first research on cluster analysis using sticker DNA systems. It provides an alternative solution for this traditional knowledge engineering problem, which is

This research is supported by the Natural Science Foundation of China (nos. 61170038 and 60873058), the Natural Science Foundation of Shandong Province (no. ZR2011FM001), and the Shandong Soft Science Major Project (no. 2010RKMA2005).