Spatial Cluster Analysis by the Bin-Packing Problem and DNA Computing Technique

Spatial cluster analysis is an important data mining task. Typical techniques include CLARANS, density- and gravity-based clustering, and other algorithms based on the traditional von Neumann computing architecture. The purpose of this paper is to propose a technique for spatial cluster analysis based on sticker systems of DNA computing. We adopt the Bin-Packing Problem idea and then design sticker-programming algorithms. The proposed technique has a favorable time complexity: when only the intracluster dissimilarity is taken into account, the complexity is polynomial in the number of data points, although spatial cluster analysis is NP-complete in nature. The new technique provides an alternative to traditional cluster analysis.


Introduction
Spatial cluster analysis is a traditional problem in knowledge discovery from databases [1]. It has wide applications, since increasingly large amounts of data obtained from satellite images, X-ray crystallography, or other automatic equipment are stored in spatial databases. The most classical spatial clustering technique is due to Ng and Han [2], who developed a variant of the PAM algorithm called CLARANS, while new techniques are continually proposed in the literature aiming to reduce the time complexity or to fit more complicated cluster shapes.
For example, Bouguila [3] proposed model-based methods for unsupervised discrete feature selection. Wang et al. [4] developed techniques to detect clusters with irregular boundaries by a minimum-spanning-tree-based clustering algorithm. Using an efficient implementation of the cut and cycle properties of minimum spanning trees, they obtain a performance better than O(n^2), where n is the number of data points. In another paper, Wang and Huang [5] developed a new density-based clustering framework via a level set approach, in which a valley-seeking method groups data points into their corresponding clusters.
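To make the MST idea concrete, here is a minimal sketch (the function names and the O(n^2) Prim construction are our illustration, not the algorithm of [4]): build a minimum spanning tree over the complete Euclidean graph and cut its k − 1 heaviest edges, leaving k connected components as clusters.

```python
import math

def mst_clusters(points, k):
    """Cluster points by cutting the k-1 heaviest edges of a minimum
    spanning tree (Prim's algorithm on the complete Euclidean graph)."""
    n = len(points)
    dist = lambda a, b: math.dist(a, b)
    # Prim's algorithm: grow the tree from vertex 0.
    in_tree = [False] * n
    best = [(math.inf, -1)] * n      # (cost to reach the tree, parent)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v] and dist(points[u], points[v]) < best[v][0]:
                best[v] = (dist(points[u], points[v]), u)
    # Cut the k-1 heaviest MST edges; the kept edges define the clusters.
    edges.sort()
    parent = list(range(n))          # union-find over the kept edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, u, v in edges[: n - k]:
        parent[find(u)] = find(v)
    roots = [find(i) for i in range(n)]
    ids = {r: j for j, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]   # labels relabeled as 0..k-1
```

Removing the heaviest inter-group edges is exactly the "cut property" intuition: edges inside a tight cluster are short, while the few long MST edges bridge clusters.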
Adleman [6] and Lipton [7] pioneered a new era of DNA computing in 1994 with experiments demonstrating that the tools of laboratory molecular biology could be used to solve computational problems. Based on Adleman's and Lipton's research, a number of applications of DNA computing to combinatorially complex problems such as factorization, graph theory, control, and nanostructures have emerged. Theoretical studies have also appeared, including DNA computers that are programmable, autonomous computing machines whose hardware consists of biological molecules; see [8] for details.
According to Pȃun et al. [8], common systems in DNA computing include the sticker system, the insertion-deletion system, the splicing system, and H systems. Among these, the sticker system has the ability to represent bits, similar to silicon computer memory. In a recent work, Alonso Sanches and Soma [9] proposed an algorithm based on the sticker model of DNA computing [10] to solve the Bin-Packing Problem (BPP), which belongs to the class NP-hard in the strong sense. The authors show that their proposed algorithms have time complexities bounded by O(q^2), in the first attempt to use DNA computing for the Bin-Packing Problem; here the integer q is the number of items to be put in the bins.
Inspired by the work of Alonso Sanches and Soma [9], we propose in this paper a new DNA computing approach for spatial cluster analysis based on the Bin-Packing Problem technique. The basic idea is to regard clusters as bins and allocate data points into them. To evaluate a clustering, we need to accumulate the dissimilarities within clusters, and the sticker system allows us to accomplish these tasks. We also show that, when only the intracluster dissimilarity is considered, our algorithm has a time complexity polynomial in the number of data points, even though cluster analysis is NP-complete. To our knowledge, the method presented here is new in cluster analysis.
The rest of this paper is organized as follows. In Section 2 we present the Bin-Packing Problem formulation of the spatial clustering problem. In Section 3 some basic facts on the sticker model are presented, together with implementations of some new operations. The following two sections are devoted to the coding of the problem and the clustering algorithms in the sticker system. Finally, a brief conclusion is drawn.

Formulation of the Problem
Let R^d be the real Euclidean space of dimension d. A subset Ω ⊂ R^d is called a spatial dataset with n points, Ω = {x_1, x_2, ..., x_n}, where x_i = (x_{i1}, x_{i2}, ..., x_{id}) ∈ R^d for each i = 1, ..., n. A clustering problem over Ω is to group the dataset Ω into k partitions, called clusters, such that the intracluster similarity is maximal and the intercluster similarity is minimal. In this sense, clustering is an optimization process on two levels: one a maximization and the other a minimization. Here the integer k indicates the number of clusters. Regarding k as a parameter, there are two kinds of clustering. The first kind is fixed-number clustering, where the number of clusters k is determined a priori. The second kind is flexible clustering, where k is chosen as one of the parameters so as to meet the two-level optimization problem. We denote a partition of Ω by C = (C_1, ..., C_k), with Ω = C_1 ∪ ⋯ ∪ C_k and C_j ⊆ Ω for 1 ≤ j ≤ k. If we define Asim(C_i) as the intracluster dissimilarity measure for C_i ∈ C and Simm(C_i, C_j) as the intercluster similarity measure for C_i, C_j ∈ C, then the two kinds of clustering problems are formulated as follows: for fixed k,

min_C Σ_{i=1}^{k} Asim(C_i)  and  min_C Σ_{i≠j} Simm(C_i, C_j),  (1)

and, for variable k,

min_{C,k} Σ_{i=1}^{k} Asim(C_i)  and  min_{C,k} Σ_{i≠j} Simm(C_i, C_j).  (2)

To simplify the multiplicity of optimization, we often use the following variation of the above problems, in which the two objectives are summed:

min_C [ Σ_{i=1}^{k} Asim(C_i) + Σ_{i≠j} Simm(C_i, C_j) ]  (k fixed),  (3)

min_{C,k} [ Σ_{i=1}^{k} Asim(C_i) + Σ_{i≠j} Simm(C_i, C_j) ]  (k variable).  (4)

In the sequel we only consider the cluster problem (1) or (3). In order to unite the two optimization formulas, we introduce the following total energy function:

E(C) = Σ_{i=1}^{k} Asim(C_i) + Σ_{i≠j} Simm(C_i, C_j).  (5)

For the purpose of this paper, we will use a simplified version of the total energy:

E(C) = Σ_{i=1}^{k} Asim(C_i).  (6)

In the case when the number of clusters k is a variable, the total energy is computed over the nonempty clusters and the optimized number k of clusters is the count of the nonempty bins:

E(C) = Σ_{C_i ≠ ∅} Asim(C_i),  k = #{i : C_i ≠ ∅}.  (7)

We now propose a Bin-Packing Problem (BPP) formulation of the clustering problem stated above. The classical one-dimensional BPP is given by a set of q items a_1, ..., a_q with respective weights w(a_i) = w_i ∈ (0, c], 1 ≤ i ≤ q. The aim is to allocate all items into bins of equal capacity c while using a minimum number of bins [9]. For clustering purposes we instead assume that there are k empty bins and allocate all n points into the bins with least energy. If we consider k as a variable, then the problem is to allocate the n points into n bins with least energy. The capacity restriction c is removed. For the two cases of clustering there are altogether k^n (n^n, resp.) combinations of allocation, and the best solution can be achieved by brute-force search.
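As a point of reference for the classical BPP, here is a minimal software sketch of the simple first-fit heuristic (a conventional program given only for intuition; the function name and example weights are ours, and this is not the DNA algorithm of [9]):

```python
def first_fit(weights, capacity):
    """First-fit heuristic for one-dimensional bin packing:
    place each item into the first open bin that still has room."""
    bins = []        # remaining free capacity of each open bin
    assignment = []  # bin index chosen for each item, in input order
    for w in weights:
        for j, free in enumerate(bins):
            if w <= free:
                bins[j] -= w
                assignment.append(j)
                break
        else:                          # no open bin fits: open a new one
            bins.append(capacity - w)
            assignment.append(len(bins) - 1)
    return assignment, len(bins)
```

First-fit is only a heuristic; finding the true minimum number of bins is NP-hard, which is exactly what motivates the massively parallel brute-force search of the sticker approach.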
First we consider the case when k is fixed. To solve the problem, we consider an array C of n integers, C = (c_1, ..., c_n) with c_i ∈ {1, ..., k}. The j-th bin (cluster) C_j is defined as C_j = {x_i : c_i = j, i = 1, ..., n} for j = 1, ..., k. We identify the allocation C with its corresponding partition; the energy function is therefore defined on the set {C} of all allocations. To guarantee that the bins are nonempty, we add the restriction #(C_j) > 0 for j = 1, ..., k, where #(C_j) denotes the cardinality of the set C_j.
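On small inputs, this fixed-k formulation can be checked directly by a brute-force sketch (a conventional program for intuition only; names are ours): enumerate all k^n allocation arrays, keep the feasible ones, and minimize the intracluster energy.

```python
import itertools
import math

def total_energy(points, labels, k):
    """E(C): sum over clusters of pairwise normalized dissimilarities
    d(u, v) / D, where D is the diameter of the dataset."""
    D = max(math.dist(u, v) for u in points for v in points) or 1.0
    E = 0.0
    for j in range(k):
        members = [u for u, c in zip(points, labels) if c == j]
        E += sum(math.dist(u, v) / D
                 for a, u in enumerate(members)
                 for b, v in enumerate(members) if a != b)
    return E

def brute_force_cluster(points, k):
    """Search all k**n allocations; keep feasible ones (no empty bin)
    and return the pair (least energy, best allocation)."""
    n = len(points)
    best = None
    for labels in itertools.product(range(k), repeat=n):
        if len(set(labels)) < k:      # infeasible: some bin is empty
            continue
        E = total_energy(points, labels, k)
        if best is None or E < best[0]:
            best = (E, labels)
    return best
```

The k^n-fold loop is what the sticker system replaces: all allocations are generated at once as memory complexes, and the energy evaluation runs over the whole tube in parallel.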

A Sticker DNA Model
First we recall the standard operations of DNA computing as given in [8]: merge, amplify, detect, separate, and append.

(i) merge: Given two tubes N_1 and N_2, pour them together into a single tube containing the strands of both.

(ii) amplify: Given a tube N, produce two copies of it.

(iii) detect: Given a tube N, return true if N contains at least one DNA strand, otherwise return false.

(iv) separate: Given a tube N and a word w, produce the two tubes +(N, w) and −(N, w) consisting of the strands that do and do not contain w as a substring, respectively.

(v) append: Given a tube N and a word w, append(N, w) affixes w at the end of each sequence in N.
The sticker model is based on the paradigm of Watson-Crick complementarity and was first proposed in [10]. The model uses two kinds of single-stranded DNA molecules: memory strands and sticker strands. A memory strand is n bases in length and contains k nonoverlapping substrands, each of which is m bases long, with n = km [8]. A sticker is m bases long and complementary to exactly one of the k substrands in the memory strand. A specific substrand of a memory strand is called a bit and is either on or off: if a sticker is annealed to its matching substrand on a memory strand, the substrand is said to be on; otherwise it is off. These partially double-stranded molecules are called memory complexes.
The basic operations of the sticker model are merge, separate, set, and clear, listed as follows [8]. Among these, merge is exactly the standard operation shown above.

(i) merge: as in the standard operation above.

(ii) set: N ← set(N_0, i). A new tube N is produced from N_0 by turning the i-th bit on.

(iii) clear: N ← clear(N_0, i). A new tube N is produced from N_0 by turning the i-th bit off.

(iv) separate: (N, i) → (+(N, i), −(N, i)). Tube N is split into a tube of complexes whose i-th bit is on and a tube of those whose i-th bit is off.

Now we consider a test tube N consisting of memory complexes. We define the length of N as its number of bits, that is, the number of substrands (stickers) it contains, denoted l(N). Each numerical value is represented by p-bit stickers, where p is a constant designed for the particular problem. For a p-bit sticker sequence s, the corresponding numerical value is denoted h(s). The substring of a memory complex from the (ip + 1)-th bit to the (ip + p)-th bit of N is denoted value(N, i), where i is an integer with 0 ≤ i ≤ l(N)/p − 1. Apart from the basic operations, we need additional operations designed for our setting and inspired by Alonso Sanches and Soma [9], among them:

(v) clearq: clearq(N, i). For each strand in the tube N, turn all bits off from the (ip + 1)-th bit to the (ip + p)-th bit.
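To make the semantics concrete, here is a toy software analogue (our illustration only; a real sticker machine performs each of these steps on all complexes in the tube simultaneously): a tube is a list of memory complexes, each a list of on/off bits.

```python
# Toy simulation of the sticker model: a "tube" is a list of memory
# complexes, each represented as a list of 0/1 bits (off/on).

def merge(n1, n2):
    """Pour two tubes together."""
    return n1 + n2

def separate(n, i):
    """Split tube n by the value of bit i: returns (+(n, i), -(n, i))."""
    on = [s for s in n if s[i] == 1]
    off = [s for s in n if s[i] == 0]
    return on, off

def set_bit(n, i):
    """Turn bit i on in every complex (anneal the i-th sticker)."""
    return [s[:i] + [1] + s[i + 1:] for s in n]

def clear_bit(n, i):
    """Turn bit i off in every complex (strip the i-th sticker)."""
    return [s[:i] + [0] + s[i + 1:] for s in n]

def clearq(n, i, p):
    """clearq(N, i): clear the i-th p-bit field, i.e. turn off the
    bits at 0-based positions i*p .. i*p + p - 1 in every complex."""
    out = []
    for s in n:
        t = s[:]
        for b in range(i * p, i * p + p):
            t[b] = 0
        out.append(t)
    return out
```

Note that the simulation indexes bits from 0, whereas the text above numbers them from 1; the field cleared by clearq is the same in both conventions.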
We give a DNA algorithm only for h and one further operation, as the remaining algorithms are presented in Alonso Sanches and Soma [9]. Suppose the binary digits of the integer v are b_{ip+1} b_{ip+2} ⋯ b_{ip+p} (see Algorithm 1).

Sticker Algorithms for Fixed 𝑘
Now we consider solving the spatial clustering problem described in Section 2, where the number of clusters k is fixed and Ω = {x_1, x_2, ..., x_n}, x_i = (x_{i1}, x_{i2}, ..., x_{id}) ∈ R^d for each i = 1, 2, ..., n. A partition of the dataset Ω is denoted by the integer array C = (c_1, ..., c_n) with c_i ∈ {1, ..., k}. For two points u, v we use d(u, v) to denote the Euclidean distance between them, and D = max_{u,v∈Ω} d(u, v) to denote the diameter of Ω. Let the dissimilarity measure of u, v ∈ Ω be ρ(u, v) = d(u, v)/D ∈ [0, 1]. We now convert the dissimilarity measure into a binary string of "0"s and "1"s. For a given acceptable error rate ε > 0 in measuring dissimilarity, divide the interval [0, 1] into 2^{p_1} subintervals of equal width 2^{−p_1} < ε. Then choose an integer p such that n^2 · 2^{p_1} < 2^p, so that p-bit strings can represent every quantity that arises, including the accumulated dissimilarities. For t ∈ [0, 1] let the corresponding code be code(t) = [2^{p_1} t], where the operator [·] gives the largest integer not exceeding its argument. We use a sticker system with stickers p bits in length, capable of representing the numbers 0, 1, ..., n and the values up to 2^p. The dissimilarity matrix is (ρ(x_i, x_j))_{i,j=1}^{n}, stored through the codes code(ρ(x_i, x_j)). For the partition C, the j-th bin (cluster) C_j is defined as C_j = {x_i : c_i = j, i = 1, ..., n} for j = 1, ..., k. A partition is called feasible if C_j ≠ ∅ for j = 1, 2, ..., k. The energy of a partition then takes the form E(C) = Σ_{j=1}^{k} e_j, where e_j = Σ_{u,v∈C_j, u≠v} ρ(u, v). For an integer a, we use seq(a) to represent the subsequence of p stickers corresponding to a; conversely, if s is such a sequence, h(s) denotes the numerical value it decodes to. Following the sticker model [9], a memory complex seq(C) is designed as the coding of C. We then append k + 1 values representing e_1, ..., e_k and E = E(C). Finally we append k further values to store the cardinalities of the clusters. The structure of the stickers for our problem is shown in Figure 1. The clustering algorithm consists of four steps, as shown in Algorithm 2.
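The quantization of dissimilarities into codes can be sketched in ordinary code (our illustration; the clamping of t = 1 into the top subinterval is our own choice, since the floor of 2^{p_1}·1 would otherwise need one extra code value):

```python
import math

def code(t, p1):
    """Quantize t in [0, 1] to one of 2**p1 equal subintervals:
    floor(2**p1 * t), clamped so t = 1.0 maps to the top subinterval."""
    return min(int((2 ** p1) * t), 2 ** p1 - 1)

def dissimilarity_codes(points, p1):
    """Matrix of quantized normalized dissimilarities d(u, v) / D,
    where D is the diameter of the dataset."""
    D = max(math.dist(u, v) for u in points for v in points) or 1.0
    n = len(points)
    return [[code(math.dist(points[i], points[j]) / D, p1)
             for j in range(n)] for i in range(n)]
```

Choosing p with n^2 · 2^{p_1} < 2^p guarantees that the sum of all n^2 quantized dissimilarities still fits in a single p-bit field, so the accumulated energy never overflows its sticker region.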

Sticker Algorithms for Variable 𝑘
In this section we consider cluster analysis when k is a variable. In this case a partition of the dataset Ω is denoted by an integer array C = (c_1, ..., c_n) with c_i ∈ {1, ..., n}. As in the previous section, the j-th bin (cluster) C_j is defined as C_j = {x_i : c_i = j, i = 1, ..., n} for j = 1, ..., n. Notice that C_j may be empty, and the final number of clusters is the count of the nonempty clusters. The energy of a partition takes the form E(C) = Σ_{j : C_j ≠ ∅} e_j with e_j = Σ_{u,v∈C_j, u≠v} ρ(u, v). The coding is again seq(C). We then append n + 1 numbers of p bits each, that is, (n + 1)p stickers, representing e_1, ..., e_n and E = E(C). Next we append n values to store the counting numbers of the n clusters. Finally we append one value to store the number of valid (nonempty) clusters. The structure of the stickers in this case is shown in Figure 2.
The clustering algorithm consists of four steps as shown in Algorithm 3.
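The variable-k search can likewise be sketched by brute force (a conventional program for intuition; names are ours). Note that under the purely intracluster energy used here, the all-singleton allocation attains E = 0, which is precisely why the full energy function of Section 2 also carries the intercluster term.

```python
import itertools
import math

def best_variable_k(points):
    """Enumerate all n**n allocations c in {0,...,n-1}**n, score each by
    the intracluster energy, and return (energy, labels, nonempty bins)."""
    n = len(points)
    D = max(math.dist(u, v) for u in points for v in points) or 1.0
    best = None
    for labels in itertools.product(range(n), repeat=n):
        # Sum normalized dissimilarities over ordered pairs sharing a bin.
        E = sum(math.dist(points[a], points[b]) / D
                for a in range(n) for b in range(n)
                if a != b and labels[a] == labels[b])
        k = len(set(labels))          # count of nonempty bins
        if best is None or E < best[0]:
            best = (E, labels, k)
    return best
```

The n^n-fold enumeration mirrors the tube of all memory complexes; the nonempty-bin count corresponds to the final "number of valid clusters" sticker.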

Conclusion
In this paper we presented a new DNA-based technique for spatial cluster analysis, considering both the case when the number of clusters k is predefined and the case when it is not determined. If we take the scale of the data n and the length in bits p of a sticker value as variables, then Algorithm 1 clearly has a time complexity of O(p). Among the four steps of Algorithm 2, the operator generate() has a time complexity of O(n) and the operator energy() has complexity O(n^2); the remaining two operators each have complexity O(n). Thus the total time complexity for a fixed number of clusters k is O(n^2). In the other case, when k is dynamic, the time complexities of the four steps become O(n^2), O(n^3), O(n), and O(n), so the total complexity is O(n^3). The reason why our complexity is worse than that of [9] (for a different problem, of course) is that the summation of the dissimilarities is time consuming. It would be interesting to reduce this complexity to O(n^2).
Finally, we point out that, to the authors' knowledge, this is the first research on cluster analysis with sticker DNA systems. It provides an alternative solution to this traditional knowledge engineering problem, which is not combinatorial in nature; since the applications of DNA computing have mainly concerned combinatorial problems, this is of particular interest.
Algorithm 3 (variable k) consists of the following steps. (a) generate: Generate multiple copies of all the n^n combinations as C; append 1, 2, ..., n as the position numbers of C; then append e_1, ..., e_n and E to store the energies. (b) energy: Compute the dissimilarities of the n possible clusters and store them in the energy values. (c) find: Find the best solution. (d) count: Count the number of clusters. We now present algorithms to implement these procedures.

Figure 2: Coding structure when k is changing.
Algorithm 2 (fixed k) consists of the following steps. (a) generate: Generate multiple copies of all the k^n combinations as C; append 1, 2, ..., n as the position numbers of C in order to handle the numerical computations; then append e_1, ..., e_k and E to store the energies. (b) energy: Compute the dissimilarities of the k clusters and store them in the energy values. (c) prune: Discard unfeasible partitions, that is, those containing empty clusters. (d) find: Find the best solution. The procedures are implemented as follows. First, all the k^n possible solutions are generated and k + 1 values are appended in order to store the energies. The energy computation then amounts to computing the totals e_j = Σ_{u,v∈C_j, u≠v} ρ(u, v), where C_j = {x_i | c_i = j}; the total energy is stored in E, and at the same time the counting number of each bin is stored in the following k stickers. The third step eliminates the unfeasible partitions by checking these last k stickers. The last step finds the best solution with least energy; if detect(N) = yes in the final step, then we obtain the optimal solution.
For variable k, the procedures are implemented analogously. First, all the n^n possible solutions are generated and n + 1 values are appended in order to store the energies. The energy computation computes the totals e_j = Σ_{u,v∈C_j, u≠v} ρ(u, v), where C_j = {x_i | c_i = j}; the total energy is stored in E, and at the same time the counting number of each bin is stored in the following n stickers. The next step finds the best solution with least energy; if detect(N) = yes, then we obtain the optimal solution. The final step counts the number of nonempty clusters; this count is stored in the last sticker and in the variable k.