Spatial Cluster Analysis by the Adleman-Lipton DNA Computing Model and Flexible Grids

Spatial cluster analysis is an important data-mining task. Typical techniques include CLARANS, density- and gravity-based clustering, and other algorithms based on the traditional von Neumann computing architecture. The purpose of this paper is to propose a technique for spatial cluster analysis based on DNA computing and a grid technique. We adopt the Adleman-Lipton model and design a flexible grid algorithm. Examples are given to show the effect of the algorithm. The new clustering technique provides an alternative to traditional cluster analysis.


Introduction
Deoxyribonucleic acid computing, or DNA computing for short, has recently attracted strong and wide interest. It is inspired by the similarity between the way DNA stores and manipulates information and the way a traditional Turing machine does. Although DNA computing is in a sense similar to evolutionary computing, the significant difference between them lies in the computing medium: biomolecules rather than transistor chips. It is this difference that makes DNA computing a promising field, with the ultimate goal of building DNA computers 1.
The essential work revealing the computing ability of DNA is Adleman's experiment 2, which demonstrated that the tools of laboratory molecular biology could be used to solve computational problems. Adleman also demonstrated the huge information storage capacity of DNA, which is contained in the sequence of nucleotide bases that hydrogen-bond in a complementary fashion to form double-stranded molecules from single-stranded oligonucleotides. Adleman's work was later generalized by Lipton 3 to the satisfiability problem. Based on Adleman and Lipton's research, a number of applications of DNA computing to combinatorially complex problems such as factorization, graph theory, control, and nanostructures have emerged 1. There have also appeared theoretical studies, including DNA computers that are programmable, autonomous computing machines whose hardware consists of biological molecules; see 4-7 for references.
Adleman and Lipton's original works include a basic computing model, often referred to as the Adleman-Lipton model. Later generalizations include the sticker model, the splicing model, and the insertion-deletion model 1. However, most applications in this area are restricted to problems of a combinatorial type, due to the searching nature of DNA computing. It remains a challenge to design applications of optimization type.
Spatial cluster analysis is a traditional problem in knowledge discovery from databases 8. It has wide applications since increasingly large amounts of data obtained from satellite images, X-ray crystallography, or other automatic equipment are stored in spatial databases. The most classical spatial clustering technique is due to Ng and Han 9, who developed a variant of the PAM algorithm called CLARANS, while new techniques are continuously proposed in the literature aiming to reduce time complexity or to fit more complicated cluster shapes.
For example, Bouguila 10 proposed model-based methods for unsupervised discrete feature selection. Wang et al. 11 developed techniques to detect clusters with irregular boundaries by a minimum spanning tree-based clustering algorithm. By using an efficient implementation of the cut and cycle properties of minimum spanning trees, they obtain a performance better than O(N²). In another paper, Wang and Huang 12 developed a new density-based clustering framework using a level set approach, in which data points are grouped into corresponding clusters by a valley-seeking method.
Although DNA computing and cluster analysis both receive much attention and are developing rapidly, combinations of these two important research areas have rarely appeared. To the authors' knowledge, the combination of DNA computing and cluster analysis is found only in a few studies, such as Bakar et al. 7.
Inspired by the research of Bakar et al. 7, this paper focuses on the joint study of DNA computing and cluster analysis. We propose a new grid-based clustering technique which can be solved by DNA computing. Unlike other approaches, it reduces the search space significantly. Finally, we present examples to show the details of our technique.

DNA Structures
Macromolecules of nucleic acids are composed of nucleotide building blocks. In DNA, the nucleotides are the purines adenine (A) and guanine (G) and the pyrimidines thymine (T) and cytosine (C). Single-stranded DNA molecules, or oligonucleotides, are formed by connecting nucleotides with phosphodiester bonds. Single strands of DNA can form a double-stranded molecule when the nucleotides hydrogen-bond to their Watson-Crick complements, A-T and G-C (Figure 1).
DNA stores information in nucleic acids and manipulates information via enzymes and interactions. A strand of DNA is encoded with four bases represented by the letters A, T, C, and G. Each strand has a 3′- and a 5′-end, and hence any single strand has a natural orientation. The cutting of certain strands of a DNA molecule is performed by restriction enzymes. These enzymes catalyse cutting operations at very specific DNA base sequences called recognition sites. After cutting, or some other operation, nucleotides at the left and right ends of a strand may be left unpaired with nucleotides from the opposite strand. In this case, the molecule is said to have sticky ends.
Here is an example illustrating the process by the enzyme EcoRI, as shown in Figure 2(b), where N represents an arbitrary deoxyribonucleotide. EcoRI acts only at six-base sequences of exactly this form; the effect is to cut the molecule into two pieces, as shown in Figure 2(c). There are over 100 different restriction enzymes, each of which cuts at its specific recognition site. A restriction enzyme cuts DNA into pieces with sticky ends. In turn, sticky ends will match and attach to the sticky ends of any other DNA that has been cut with the same enzyme. DNA ligase joins the matching sticky ends of DNA pieces from different sources that have been cut by the same restriction enzyme.

DNA-Computing Models
There are several types of DNA-computing models, among which the Adleman-Lipton model is the most traditional. This model focuses on hybridization between different DNA molecules as a basic step of computation. According to Adleman and Lipton's original works 2, 3, this traditional DNA-computing strategy is based on enumerating all candidate solutions and then using a selection process to choose the correct DNA. This technique requires that the size of the initial data pool increase exponentially with the number of variables in the calculation.
Apart from the Adleman-Lipton model, other DNA-computing models have appeared, such as the sticker model and the splicing model. The sticker model is based on a coding scheme called the DNA complex. A DNA complex is a partially double-stranded DNA strand. Usually a double-stranded piece represents a bit with value one, while a single-stranded piece represents zero. Each complex is constructed from two kinds of single-stranded DNA molecules, referred to as memory strands and sticker strands. A memory strand contains n nonoverlapping substrands, each of which is m bases long. Each sticker strand is m bases long and is complementary to exactly one of the n substrands in a memory strand.
The second model is the splicing model, proposed by Tom Head 6 based on formal language theory. A splicing system S = (A, I, B, C) consists of a finite alphabet A, a finite set I of initial strings in A* (the language over A), and finite sets B and C of triples (c, x, d) with c, d, x ∈ A*. Each such triple in B or C is called a pattern. For each such triple, the string cxd is called a site, and the string x is called a crossing. Patterns in B are called left patterns, and patterns in C are called right patterns. The language L = L(S) generated by S consists of the strings in I together with all strings that can be obtained by adjoining to L the strings ucxfq and pexdv whenever ucxdv and pexfq are in L and (c, x, d) and (e, x, f) are patterns of the same hand. A language L is a splicing language if there exists a splicing system S for which L = L(S).
The next model is the k-armed model, which is based on more complicated molecular structures with a three-dimensional DNA architecture. In 13 the authors pointed out that it is natural to use the armed model to represent the SAT problem in terms of a contact network framework, and they gave theoretical solutions to this NP-complete problem. Like the splicing model, the biological operations in the k-armed model include cleaving and connecting.

Operations of the Adleman-Lipton Model
The basic principle of DNA computing is to use the information encoded in sequences of nucleotides and to evolve them by breaking and making new bonds between them to reach the answer. The basic operations performed by enzymes are denaturing, replicating, merging, detecting, and so forth.
According to the DNA-computing models proposed by Adleman 2 and Lipton 3, there are several basic DNA operations. One important operation is hybridization, the main process in DNA computing that forms all possible solution strands, among which the right answer lies. Hybridization is done by mixing strands in tubes with the help of some enzymes.
Apart from hybridization, the basic operations available on DNA are mainly the following.

(i) Merge. N ← N₁ ∪ N₂. Given two tubes N₁ and N₂, combine their contents into a single tube N.

(ii) Amplify. Duplicate N into N₁ and N₂. Given a tube N, produce two copies of it.

(iii) Detect(N). Given a tube N, return "yes" if N contains at least one DNA strand, and "no" otherwise.

(iv) Separate or extract. N → +(N, w), −(N, w). Given a word w consisting of strings from Σ = {A, G, C, T} and a tube N, generate two tubes +(N, w) and −(N, w), which contain and do not contain the string w, respectively.

(v) Length-separate. N → (N, ≤ n). Given a tube N and an integer n, generate a tube containing the strands with length less than or equal to n.

(vi) Position-separate. Given a tube and a word, generate a tube containing the strands beginning (or ending) with that word.
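The tube operations above can be mimicked in software. The following is a minimal in-silico sketch, with a "tube" modeled as a Python list (multiset) of strings over {A, G, C, T}; all function names are our own, not from the paper.

```python
# In-silico sketch of the Adleman-Lipton tube operations (i)-(vi).
# A tube is a list of DNA strings; these helpers mirror the paper's
# merge, amplify, detect, separate, length- and position-separate.

def merge(n1, n2):
    """(i) Merge: union of two tubes."""
    return n1 + n2

def amplify(tube):
    """(ii) Amplify: produce two copies of the tube."""
    return list(tube), list(tube)

def detect(tube):
    """(iii) Detect: True iff the tube contains at least one strand."""
    return len(tube) > 0

def separate(tube, w):
    """(iv) Separate: +(N, w) and -(N, w) by containment of word w."""
    plus = [s for s in tube if w in s]
    minus = [s for s in tube if w not in s]
    return plus, minus

def length_separate(tube, n):
    """(v) Length-separate: strands of length <= n."""
    return [s for s in tube if len(s) <= n]

def position_separate(tube, w, end=False):
    """(vi) Position-separate: strands beginning (or ending) with w."""
    return [s for s in tube if (s.endswith(w) if end else s.startswith(w))]

tube = ["AGCT", "AGAG", "TTAG", "AG"]
plus, minus = separate(tube, "GC")
assert plus == ["AGCT"] and detect(plus)
assert length_separate(tube, 2) == ["AG"]
```

These list filters are only a stand-in: in the laboratory each operation is a biochemical procedure acting on all strands in parallel.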

Grid-Based Clustering
Grid-based clustering uses a multiresolution grid structure whose units, called cells, contain the data objects and act as the operands of the clustering process. Traditional approaches include STING, WaveCluster, and CLIQUE 8. The most common grids are regular hypercubic grids. This requires that the grid construction cover all of the data space with the same precision. A second method uses flexible grids, that is, multiresolution grids with hypercubic or hyperrectangular cells having randomly oriented borders 14.

A Flexible Grid Definition
Suppose that the data set is Ω = {x₁, . . ., x_N} ⊂ Rⁿ. It is bounded by a rectangle D₀ in Rⁿ. A grid is an undirected graph G = (V, E) where each node of V is called a cell and is represented by a quadruple v = (D, c, p, σ), where D is a polyhedron, c is the center point of D, p = |Ω ∩ D| is the number of points of Ω covered by the cell, and σ is the diameter of D. We always assume that a cell is nontrivial; that is, D has interior points in Rⁿ. For a cell v = (D, c, p, σ), its boundary is denoted by ∂D, which is the set of hyperplane pieces bounding the polyhedron. If S ∈ ∂D and S is part of a hyperplane H, then H is called a tangent plane of the cell. Two nodes vᵢ = (Dᵢ, cᵢ, pᵢ, σᵢ), i = 1, 2, are called adjacent if the two cells share a common tangent plane and D₁ ∩ D₂ ≠ ∅. For two adjacent cells, we define an edge between them. Hence, E = {uv : u and v are adjacent}. Figure 3 presents an illustration of tangent planes and adjacent cells.
To construct the graph G, we need two parameters p₀ and σ₀ indicating the minimum number of points to be considered and the minimum diameter. The graph is then constructed iteratively. We start with the first node v₁ = (D₀, c₁, N, σ₁), where D₀ is the original rectangle and c₁ is the center of D₀. At each step, any cell containing dense points (controlled by the threshold value p₀), or with large diameter (controlled by the threshold value σ₀), is split into two subcells by a hyperplane. A cell is sparse if it contains fewer points than p₀; it is called small if its diameter is less than σ₀. When we reach a sparse or small cell, we add it to the node set of the graph. This step continues until no cells are left to be split. The resulting graph is called a flexible grid. Algorithm 1 gives the graph construction process. We present an example showing a data set and the flexible grid generated by the above algorithm (Figure 4).
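The recursive splitting can be sketched as follows. The paper only says a cell is split "by a hyperplane" without fixing the rule, so this hedged sketch assumes an axis-aligned midpoint split along the longest side; `build_grid` and `diameter` are our own names.

```python
import math

# Hedged sketch of the flexible-grid construction (Algorithm 1):
# split any cell that is still dense (>= p0 points) and large
# (diameter >= s0); sparse or small cells become graph nodes.

def diameter(lo, hi):
    return math.dist(lo, hi)  # diagonal of the hyper-rectangle

def build_grid(points, lo, hi, p0, s0):
    """Return leaf cells (lo, hi, pts) that are sparse or small."""
    pts = [x for x in points if all(l <= c <= h for c, l, h in zip(x, lo, hi))]
    if len(pts) < p0 or diameter(lo, hi) < s0:
        return [(lo, hi, pts)]           # sparse or small: a graph node
    axis = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
    mid = (lo[axis] + hi[axis]) / 2      # assumed midpoint hyperplane
    hi1 = list(hi); hi1[axis] = mid
    lo2 = list(lo); lo2[axis] = mid
    return (build_grid(pts, lo, tuple(hi1), p0, s0) +
            build_grid(pts, tuple(lo2), hi, p0, s0))

data = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9)]
cells = build_grid(data, (0.0, 0.0), (1.0, 1.0), p0=2, s0=0.3)
assert sum(len(c[2]) for c in cells) == len(data)
```

Any other splitting hyperplane (e.g., through the densest direction) fits the same recursion; only the stopping thresholds p₀ and σ₀ come from the paper.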
Next we define the weights on edges. A weight on an edge is the dissimilarity of the adjacent nodes. Suppose that the Euclidean distance in Rⁿ is denoted by d(•, •). Hereafter, we always assume that a data point x ∈ D means x ∈ D ∩ Ω. Then for two nodes vᵢ = (Dᵢ, cᵢ, pᵢ, σᵢ), i = 1, 2, the weight is defined by (3.1).

Clustering Problems
Once the graph is constructed, the clustering problem is converted into grouping the nodes of the graph into clusters. Traditional techniques include hierarchical clustering 8. For the purpose of this paper, we give a different approach to this problem. First, it should be noted that nodes corresponding to sparse areas are outliers. Therefore, in order to reduce computational complexity, we first remove all sparse graph nodes together with their edges. We still use G = (V, E) to denote the resulting graph, where V is the set of vertices and E the set of weighted edges. An example is shown in Figure 4 with part of its edges. Now we consider the problem of weight computation. By the graph construction procedure, we know that any node corresponds to a cell with diameter no larger than σ₀. Therefore, the distance between cells can be approximated by d(c₁, c₂), and (3.1) becomes

3.2
In this way we can significantly reduce the computing time without losing much precision. We further define a parameter 0 < ω₀ ≤ ∞ and eliminate those edges with weight ω > ω₀. If ω₀ = ∞, no edges are eliminated. Now we use C = {V_q : q = 1, 2, . . ., k} to denote a clustering of the vertex set V of the graph G for threshold values p₀ and σ₀, and |V_q| to denote the number of its vertices. Define the energy of a clustering as follows:

3.3
Then the clustering problem is the minimization of the energy function. However, this optimization problem is hard to solve, so we present a variation in the following.

Path Clustering
Now we consider the graph G = (V, E) with weight matrix W. Assume that the number k of clusters is a positive integer. A Hamiltonian path L of G is a path that visits each vertex exactly once. We remove k − 1 nonadjacent edges from L and denote the result by L_k.
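The effect of removing k − 1 edges from a Hamiltonian path can be sketched directly: the path breaks into k subpaths, each taken as a cluster. As one natural rule (an assumption; the paper only requires k − 1 nonadjacent edges), this hedged sketch removes the k − 1 heaviest path edges; the helper name `path_clusters` is ours.

```python
# Hedged sketch of path clustering: given a Hamiltonian path (a
# vertex order) and edge weights, removing k-1 edges splits it into
# k clusters. We assume the k-1 heaviest path edges are removed.

def path_clusters(order, weight, k):
    edges = [(weight(order[i], order[i + 1]), i) for i in range(len(order) - 1)]
    cut = sorted(i for _, i in sorted(edges, reverse=True)[:k - 1])
    clusters, start = [], 0
    for i in cut:                        # split the order at each cut edge
        clusters.append(order[start:i + 1])
        start = i + 1
    clusters.append(order[start:])
    return clusters

# Toy example: points on a line, weight = distance between neighbors.
pts = {1: 0.0, 2: 0.2, 3: 0.3, 4: 5.0, 5: 5.1}
w = lambda u, v: abs(pts[u] - pts[v])
assert path_clusters([1, 2, 3, 4, 5], w, 2) == [[1, 2, 3], [4, 5]]
```

The large gap between points 3 and 4 carries the heaviest path edge, so it is the one cut for k = 2.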

Clustering by DNA Computing
In this section we consider clustering of the graph G = (V, E). We still use N to denote the number of nodes in V; this will not cause any confusion. Suppose that the number of clusters is k, determined a priori or defined in the process of clustering. The problem is to partition the vertex set V into k clusters. Suppose that the original data set Ω is bounded by a constant M/2 > 0, that is, ‖x‖ ≤ M/2 for x ∈ Ω, where ‖x‖ is the Euclidean norm of x = (a₁, . . ., a_n). Points in Ω are denoted by xᵢ and Ω = {x₁, . . ., x_N}. A point in the data set Ω will be denoted by the lower-case letter x.
For each point x ∈ Ω and a cluster C, define the distance between them as d(x, C). Clearly these distances are bounded by the constant M. For u, v ∈ V define the dissimilarity measure as ρ(u, v) = ω(u, v)/M. Clearly these dissimilarity measures lie in the interval [0, 1]. Now we convert the dissimilarity measures into integers. First we define an acceptable error rate ε > 0; this means that we do not distinguish measures whose difference is less than ε. We divide the interval [0, 1] into I subintervals of equal width I⁻¹ < ε. For z ∈ [0, 1], let its corresponding integer be s(z) = ⌊Iz⌋, where ⌊•⌋ denotes the largest integer not exceeding its argument. Hence the dissimilarity measure lies in the set {0, 1, . . ., I}.
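The quantization step is small enough to state in code. This sketch implements s(z) = ⌊Iz⌋; the choice of I so that 1/I < ε is left to the caller, and `quantize` is our own name.

```python
import math

# Sketch of the dissimilarity quantization: measures z in [0, 1] are
# mapped to integers s(z) = floor(I*z), where I is chosen so that
# 1/I < eps, the acceptable error rate.

def quantize(z, I):
    """Map z in [0, 1] to an integer in {0, ..., I}."""
    return math.floor(I * z)

# With I = 1000 (so differences below 0.001 are not distinguished):
assert quantize(0.5004, 1000) == 500
assert quantize(0.0, 1000) == 0
```

Two measures closer than 1/I may land on the same integer, which is exactly the intended loss of resolution.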

Now we define the weight matrix on the graph G by
Thus any clustering can be taken as a rearrangement of the vertices v₁, . . ., v_N. For example, the vertex set {{v₃, v₂, v₁}, {v₅, v₄}, {v₆, v₇, v₈, v₉}} with three clusters can be written as (4.2), where we use the Greek letter α as a separator between clusters. Therefore, we have k − 1 separators if we obtain k clusters. If we take the dissimilarity measures into account, then a clustering will be as in (4.3). We have thus converted the clustering problem into a permutation problem: any permutation of the set {1, 2, . . ., N, α, . . ., α} (the number of α's is k − 1) is a candidate solution. The string with minimum length is the optimal solution.
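The separator representation can be sketched as a simple round-trip between a list of clusters and a separator-delimited sequence. The token names (`"v3"`, `"a"` standing for α) are our own illustrative choices.

```python
# Sketch of the candidate-solution representation: a clustering is a
# permutation of vertex labels with k-1 copies of the separator "a"
# (standing for the paper's alpha).

def to_string(clusters):
    """Encode a list of clusters as a separator-delimited sequence."""
    out = []
    for i, c in enumerate(clusters):
        if i > 0:
            out.append("a")            # separator between clusters
        out.extend(f"v{j}" for j in c)
    return out

def to_clusters(seq):
    """Decode: split the sequence at each separator."""
    clusters, cur = [], []
    for tok in seq:
        if tok == "a":
            clusters.append(cur); cur = []
        else:
            cur.append(int(tok[1:]))
    clusters.append(cur)
    return clusters

s = to_string([[3, 2, 1], [5, 4], [6, 7, 8, 9]])
assert s.count("a") == 2                    # k-1 separators for k = 3
assert to_clusters(s) == [[3, 2, 1], [5, 4], [6, 7, 8, 9]]
```

Every permutation of the vertex tokens and the k − 1 separators decodes to some clustering, which is what makes the brute-force DNA search well defined.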

DNA Coding for Data
First we give the encoding of the dissimilarity measures, or the weights. To do this, we assume that 1 is coded by the string AG, and any integer i ∈ {1, . . ., I} is coded by seq(i). Next we present the encoding of a vertex v ∈ V. This is done using a fixed-length mer sequence, as in the following example (20-mer): seq(v) = 5′-TCTCT CTCTC TCTCT CTCTC TCTCT-3′.

4.5
The separator α is also coded as a single strand seq(α) of the same length as that of the points. This is done exactly as in the data-point coding, except that separators must be distinguishable from data points.

Encoding Scheme
Now we put everything together for a candidate solution (4.3). First we design a code for an edge uv ∈ E with weight w. This is done by linking the last half of seq(u), seq(w), and the first half of seq(v), as shown in Figure 6(a). We also need two special edges, called the left half-edge and the right half-edge; the left half-edge is a linking of seq(u) with seq(w). Finally, the code for a cluster C is a permutation of the following string, without changing the positions of the left half-edge and the right half-edge (N points and k − 1 separators altogether): (4.6)
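The half-edge concatenation can be sketched as string slicing. The concrete 20-mers below and the weight code (i repeats of "AG", as suggested by the coding of 1 as AG above) are illustrative assumptions, not the paper's exact sequences.

```python
# Hedged sketch of the edge-encoding scheme: the code for edge (u, v)
# with weight w links the last half of seq(u), seq(w), and the first
# half of seq(v).

def seq_weight(i):
    return "AG" * i                 # assumed: integer i as i copies of AG

def seq_edge(seq_u, w, seq_v):
    half = len(seq_u) // 2
    return seq_u[half:] + seq_weight(w) + seq_v[:half]

u = "TCTCTCTCTCTCTCTCTCTC"          # hypothetical 20-mer for vertex u
v = "GGATGCAACTGTGAAGTTCG"          # hypothetical 20-mer for vertex v
e = seq_edge(u, 3, v)
assert e == u[10:] + "AGAGAG" + v[:10]
assert len(e) == 10 + 6 + 10
```

Because adjacent edge codes end and begin with complementary vertex halves, splint strands can hybridize consecutive edges into one long candidate strand.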

DNA Program
Now we describe the biological operations for the clustering. First we put single strands into a tube T₀, including the complementary strands of all vertices, of the separator α, and of the integers 1, 2, . . ., I. These strands serve as splints. We also pour into T₀ the strands of all left half-edges, right half-edges, integers 1, 2, . . ., I, all edges, and all centroids. Then hybridization and ligation can be executed. As a result, all combinations representing clustering schemes are obtained in tube T₀.
Step and Procedure.

(1) Input(T₀). All DNA sequences and complementary DNA sequences are placed in an empty test tube T₀.

(2) Amplify(T₀). Mix all sequences in T₀ together and execute the ligation process. After the hybridization process, all possible combinations of DNA sequences appear in T₀.

(3) Select only those DNA strands which include at least one separator α from T₀ and keep them in an empty test tube T₁.

(4) Select all DNA strands that contain all N vertices v₁, . . ., v_N. In order to select acceptable solutions among these combinations, we need to eliminate those strands which do not contain a separator α and those which do not contain all the vertices v₁, . . ., v_N. Finally, by counting the number of separators α (giving k) in the strand with shortest length, we get the solution of the clustering problem. This final procedure can be implemented by direct observation, counting the separator sequences using a special microscope such as the atomic force microscope (AFM) to identify and count the marking sequences 7.
The DNA program of biological operations is shown in Algorithm 2.
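As a brute-force software stand-in for the hybridization and selection steps, the following hedged sketch enumerates candidate sequences (permutations of vertices plus k − 1 separators), keeps those with nonempty clusters, and takes the one with the smallest total intra-cluster dissimilarity; the scoring rule and all names are our own simplifications of the "shortest strand" criterion.

```python
from itertools import permutations

# In-silico sketch of Algorithm 2's selection steps. "a" stands for
# the separator alpha; the "length" of a strand is approximated by
# the sum of dissimilarities between consecutive vertices in a cluster.

def best_clustering(vertices, k, dist):
    tokens = list(vertices) + ["a"] * (k - 1)
    best, best_len = None, float("inf")
    for perm in set(permutations(tokens)):
        groups, cur, total = [], [], 0
        for t in perm:
            if t == "a":
                groups.append(cur); cur = []
            else:
                if cur:
                    total += dist(cur[-1], t)
                cur.append(t)
        groups.append(cur)
        if any(not g for g in groups):   # empty cluster: reject strand
            continue
        if total < best_len:
            best, best_len = groups, total
    return best

pts = {1: 0.0, 2: 0.1, 3: 5.0, 4: 5.1}
d = lambda u, v: abs(pts[u] - pts[v])
result = best_clustering([1, 2, 3, 4], 2, d)
assert sorted(map(sorted, result)) == [[1, 2], [3, 4]]
```

This exhaustive loop is exponential, which matches the O(2^N) simulation cost discussed later; the biological implementation performs the same search massively in parallel.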

Examples and Discussions
In this section, we present examples to illustrate the performance of our algorithm and show that our technique gives clusters more naturally.

Example One
First we present an example with 20 points to be clustered, discussed in 7. We construct a grid with interval 1 as shown in Figure 7 and take p₀ = 1 as the minimum number of points located in each grid cell. As a result, the induced graph is disconnected, with 10 subgraphs. The final result shows four clusters with 6 additional outliers. This differs from the result of 7, where four clusters with no outliers are obtained.

Now we change the grid interval to 2. Then we have only half the grid lines compared to the grid with interval 1. We take p₀ = 1 but use another parameter ω₀ = 1. This time we surprisingly obtain three clusters (Figure 8) with two outliers. This shows that the clustering method proposed here is sensitive to the construction of the grid and the parameters. Considering the various definitions and measurements of clustering, this is not too surprising.

Next we construct a grid similar to that of Figure 8, but starting the first horizontal and vertical lines not from coordinates (0, 2) and (2, 0). Equivalently, we can implement this case by moving the whole data set within the grid of Figure 8. With the same parameters as above, we obtain a third clustering result, as shown in Figure 9, where four clusters are generated without outliers.

It is interesting to note that the three different grids induce one common cluster (the left-bottom cluster). In fact, this common cluster is better organized than the other clusters. Hence clustering is sensitive to the construction of grids, especially for badly organized clusters.

Example Two
Now we consider another example, shown in Figure 10, with the data to be clustered and the graph constructed. For this example, we take p₀ = 2. For adjacent cells, if they share a common edge, we define their dissimilarity measure as 1; otherwise, if they share a common vertex, we define the dissimilarity measure as 1.4. There are 40 nontrivial cells.
Since for any two cells p₁p₂ ≤ 100, we take I = 100, and by (4.1) the weight matrix is a block matrix composed of W₁, W₂, W₃, and W₄, which are 20 × 20 matrices; W is symmetric. The matrices W₁, W₃, and W₄ are shown in Table 1. The weighted graph is shown in Figure 11.
By the technique of this paper, the solution of the clustering is a string. One such clustering string is as follows: (5.2)

Clustering of the Iris Data
In this section, we present another detailed example to illustrate the encoding of the DNA-based clustering technique proposed in the previous sections. The new example is the well-known Iris flower data set. The Iris flower data set was introduced by Sir Ronald Aylmer Fisher as an example of discriminant analysis 15. The data set consists of 50 samples from each of three species of Iris flowers, namely Iris setosa, Iris virginica, and Iris versicolor. Four features were measured from each sample: the length and the width of the sepal and the petal 15, 16.

Figure 11: The constructed graph with weights on edges.

Grids and Graph of the Problem
Now we use a matrix X ∈ R^{150×4} to denote the data set. Then X is located in a rectangle in R⁴:

6.1
Then the cells can be denoted by D = {D_pqrs}, with 20 × 16 × 30 × 15 = 144000 cells, which is a huge number. For this reason, we choose x₂ and x₄ among the four dimensions as the cluster feature variables, as shown in the first panel of Figure 12. The second panel of the same figure shows the other two dimensions, x₁ and x₂, which were studied in Qu et al. 16.
Next we design a flexible grid structure as shown in Figure 13. We choose the parameter p₀ = 0. The induced graph is shown in Figure 14. By (3.2) and direct computation, we obtain the dissimilarity matrices of the graph. Now we set the error rate ε = 0.001 and I = 1000, and cap the weight values at a maximum of 999. The weight matrices of G are shown in Tables 2 and 3.

For the two subgraphs in Figure 14, we can find Hamiltonian paths easily. For subgraph two, the path is given in (6.3). There exist many paths for subgraph one; one shortest path with indicator α along the edge v₁₆v₂₂, as shown in Figure 15(a), is given in (6.4). The detailed encoding scheme is shown in the Appendix. The final clustering result is shown in Figure 15(b). The number of points not correctly clustered is 7. Clearly this is much better than other methods such as CEPSO 16, where the error count is more than 20.

Discussions
Now we discuss the time and computational costs of the proposed technique. According to Algorithms 1 and 2, the computational costs consist of two main procedures, namely grid construction and biological operations, and we will show that the complexity is roughly linear.

First we examine Algorithm 1 for the construction of flexible grids. The time complexity is the total search count Σ_{t=1}^{T} |V_t|, where |V_t| is the cardinality of V_t and T is the final value of t at which V_t becomes empty. Clearly |V_t| ≤ 2^t. Now we estimate an upper bound on T. In the worst case, when V_t consists of 2^t cells and each cell has a data population larger than p₀, we have p₀ 2^T ≤ N (6.5). Hence we obtain the estimate T ≤ log₂(N/p₀), and the total time is bounded by a linear function of N.

Next we analyze the complexity of the DNA program proposed in Algorithm 2. The DNA program consists of three search procedures. Step 3 selects those strands which contain at least one marker α, which is an O(1) operation. Step 4 checks whether a sequence contains all the vertices, with a complexity of O(N_v), where N_v is the number of vertices and N_v ≤ N. The final Step 6 counts the number of α's and splits the sequence into clusters. For the shortest sequence containing all the vertices, the maximum number of markers α is N_v. Hence the total complexity of the DNA program is O(N), which is linear.
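The depth bound can be checked numerically. This small sketch (our own helper name) verifies that the worst-case stopping condition p₀ 2^T ≤ N yields T ≤ log₂(N/p₀).

```python
import math

# Numerical check of the depth bound: in the worst case each level
# doubles the number of cells, and splitting stops once a cell holds
# fewer than p0 points, so p0 * 2**T <= N gives T <= log2(N / p0).

def depth_bound(N, p0):
    return math.log2(N / p0)

N, p0 = 1000, 4
T = depth_bound(N, p0)
assert p0 * 2 ** math.floor(T) <= N          # depth floor(T) still allowed
assert p0 * 2 ** (math.floor(T) + 1) > N     # one more level would exceed N
```

With N = 1000 and p₀ = 4, the bound is about 7.97, so at most 7 full doubling levels occur before every cell is sparse.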
Finally, we make some comments about experiments. Following Adleman 2 and Pǎun et al. 1, when we design a DNA program for solving the problem, a test tube as used in the laboratory is considered a multiset of words (finite strings) over the alphabet {A, C, G, T}, with the basic operations proposed in 1, 2. These basic operations are standard and biologically implementable.
However, when we want to simulate the algorithm on personal computers, some tricky complexity lies behind the biological operations. Again we analyze the steps in Algorithm 2. Consider Step 2, restricted to generating only the sequences containing each vertex exactly once: the time complexity is O(2^{N_v}) = O(2^N). Examining the other steps, we find that the total simulation time is O(2^N).

Conclusion
In this paper we presented a new DNA-based technique for spatial cluster analysis. Two examples are given to show the effect of our algorithm. We do not need the number of clusters in advance. By a flexible grid method we can reduce the size of the search space significantly. This differs from other DNA-based applications, which enumerate all solutions and thus produce a large search space; the reduction is especially useful for large databases, even though DNA computing has massive parallelism. Also, by changing the grid and the corresponding parameters, we can obtain various clusterings.

Figure 3: Illustration of tangent plane and adjacent cells.

Figure 4: An example of a flexible grid with induced graph.

Figure 5: An example of path clustering with three clusters.

The first minimization of (3.4) is clustering for a fixed k, and the latter is clustering without fixing k. Path clustering is slightly different from distance-based clustering. The next example illustrates path clustering (Figure 5).
Figure 6: Designated DNA codings. (a) Coding for an edge uv: the coding for u, the coding for the weight w, and the coding for v. (b) DNA coding for the separator. (c) Coding for a centroid: the codings for u and v and for the similarity of u and v.

Figure 7: Data example one with the graph generated with grid interval 1; parameter p₀ = 1.

Figure 10: A data example with the graph generated. The numbers in circles are the values of p in the cells.

Figure 12: The Iris data figures for dimensions x₂–x₄ and x₁–x₂.

Figure 13: The Iris data figure for dimensions x₂–x₄, with grids and candidate graph nodes.

Figure 14: The induced graph with weights. (a) The number in a node is its ID, and the number on an edge is the weight value. (b) The number in a circled node is the population of the cell.

Figure 15: A short Hamiltonian path for the induced graph.
Algorithm 1 inputs: Ω = {x₁, x₂, . . ., x_N}, a data set of N points in Rⁿ; D: a hyper-rectangle containing Ω; p₀: population threshold value; σ₀: cell diameter threshold value.

Algorithm 2 (continued): (4) . . . all the vertices v₁, . . ., v_N in test tube T₁; put them in an empty test tube T₂. (5) Gel electrophoresis. Find the shortest DNA sequence in test tube T₂ and put it in an empty tube T₃; this is the solution of the clustering problem. (6) Count the number of separators α. Amplify and count the number of clusters in tube T₃.

Table 1: Sample data weight matrix (W₁, W₃, W₄).

Table 3: Dissimilarity weight matrix of subgraph II.