Structural Complexity of DNA Sequence

In modern bioinformatics, finding an efficient way to locate sequence fragments with biological functions is an important issue. This paper presents a structural approach based on context-free grammars extracted from original DNA or protein sequences. This approach is radically different from purely statistical methods. Furthermore, it is compared with a topological entropy-based method to examine the consistency and differences of the complexity results.


Introduction
DNA sequence analysis has become an important part of modern molecular biology. A DNA sequence is composed of four nucleotide bases, adenine (abbreviated A), cytosine (C), guanine (G), and thymine (T), in any order. With four different nucleotides, 2 nucleotides could code for a maximum of only $4^2 = 16$ amino acids, but 3 nucleotides could code for a maximum of $4^3 = 64$ amino acids. George Gamow was the first person to postulate that every three bases translate to a single amino acid; such a triplet is called a codon. Marshall Nirenberg and Heinrich J. Matthaei were the first to elucidate the nature of the genetic code. A short DNA sequence carries little genetic information, while a long one may carry much more, and swapping the positions of any two nucleotides may change the meaning of the genetic message.
Sequence arrangements can produce many different results, but only a few codons exist in living bodies. Some sequences do not contain any information; these are known as junk DNA. Finding an efficient way to relate a sequence fragment to its genetic function is also a challenging problem.
In recent papers, methods broadly fall into two categories: sequence complexity [1,2] and structural pattern analysis [3][4][5][6][7][8]. Koslicki [1] presented a method for computing sequence complexities. He redefined the topological entropy function so that the complexity value does not converge toward zero for much longer sequences. By separating a sequence into several segments, the method can determine which segments are exons or introns, and which are meaningful or meaningless. Hao et al. [7] gave a graphical representation of DNA sequences, from which rarely occurring subsequences can be found. R. Zhang and C. T. Zhang [4] used four-nucleotide-related functions to draw 3D curves for analyzing the occurrence probabilities of the four nucleotides. Liou et al. [9] gave a new idea for modeling the complexity of music rhythms; that paper translated text messages into computable values, so that computers can score music rhythms.
In this paper, we propose a new method for calculating sequence complexity that differs from traditional methods. It captures not only statistical values but also structural information. We replace the four nucleotides with the tree structures presented in [9] and use mathematical tools to calculate the complexity values of the sequences. We can then compare two sequences by their values and determine the dissimilarity between them. In the biomedical field, this technique can be used to prioritize candidate drugs for a new virus.

DNA Sequence Represented with Tree Structure
Our method applies properties of the Lindenmayer system [10][11][12] to compute complexities from a tree structure [9]; it is a different way of computing the complexities of sequences. First, we introduce the DNA tree and convert a DNA sequence to a tree structure. A DNA tree is a binary tree of which each subtree is also a DNA tree. Every tree node is either a terminal node or a node with two children (branches or descendants). The Lindenmayer system is a powerful rewriting system used to model the growth processes of plant development. We will introduce it in detail in Section 2.2. The Lindenmayer system uses an initial string and rewriting rules to construct beautiful graphs. Since it can construct a tree from rewriting rules, rewriting rules can also be extracted from a tree. In this section, we use tools to generate the rules from a tree.
We use 4 fixed tree representations for the nucleotide bases A, T, C, and G (see Figure 1). When we apply this method to an amino acid sequence, we can construct additional tree representations for the amino acids.
When we convert a sequence to a DNA tree, we replace every letter with its tree element step by step, and two consecutive trees can be combined into a bigger tree. Following these steps, a DNA sequence is converted to a DNA tree (see Figure 2).
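The conversion can be sketched in Python. This is only a minimal sketch under stated assumptions: the four base shapes in `BASE_TREES` are hypothetical stand-ins for the shapes of Figure 1, and combining neighbouring trees pairwise from left to right is one plausible reading of "two consecutive trees can be combined"; the names `LEAF`, `BASE_TREES`, `sequence_to_tree`, and `count_nodes` are illustrative.

```python
# A terminal node is the empty tuple; an internal node is a (left, right) pair.
LEAF = ()

# HYPOTHETICAL base shapes, chosen only so that the four bases differ.
# The actual shapes are those of Figure 1 and are not reproduced here.
BASE_TREES = {
    "A": (LEAF, LEAF),
    "T": ((LEAF, LEAF), LEAF),
    "C": (LEAF, (LEAF, LEAF)),
    "G": ((LEAF, LEAF), (LEAF, LEAF)),
}


def sequence_to_tree(sequence):
    """Replace each base by its tree, then merge neighbours until one tree remains."""
    trees = [BASE_TREES[base] for base in sequence.upper()]
    if not trees:
        raise ValueError("sequence must contain at least one base")
    while len(trees) > 1:
        # Assumed combination order: join trees 0+1, 2+3, ... under new roots.
        merged = [(trees[i], trees[i + 1]) for i in range(0, len(trees) - 1, 2)]
        if len(trees) % 2 == 1:
            merged.append(trees[-1])  # an unpaired last tree is carried over
        trees = merged
    return trees[0]


def count_nodes(tree):
    """Count terminal and nonterminal nodes of a DNA tree."""
    if tree == LEAF:
        return 1
    return 1 + sum(count_nodes(child) for child in tree)
```

With these assumed shapes, every node is either terminal or has exactly two children, which matches the definition of a DNA tree given above.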

Bracketed Strings for a DNA Sequence.
To compute the complexity of a DNA tree, we need rules for converting the tree to another structure. We use a stack-like structure, called a bracketed string, to represent the hierarchy of the DNA tree. A DNA tree can be converted to a unique bracketed string using the following symbols, and the string can be converted back to the original tree:
(i) a node symbol: the current location of tree nodes; it can be replaced by any word or be omitted;
(ii) +: the following string will express the right subtree;
(iii) −: the following string will express the left subtree;
(iv) [: this symbol is paired with ]; "[⋅ ⋅ ⋅]" denotes a subtree, where "⋅ ⋅ ⋅" indicates all the bracketed strings of its subtree;
(v) ]: see the description of [.
Following the previous symbols, Figure 3 shows an example of a bracketed string, and Figure 4 is the bracketed string of Figure 2. We can see that as the tree grows, the string becomes more redundant. Since we focus here only on DNA trees, we can simplify the bracketed string representation. First, our trees have only two subtrees. Second, the symbol marking the current node location is trivial. With these two characteristics, we may omit that symbol from the bracketed string and use only four symbols, {[, ], −, +}, to represent trees. In our case, "[⋅ ⋅ ⋅]" denotes a subtree, where "⋅ ⋅ ⋅" indicates all the bracketed strings of its subtrees; "−" indicates that the next "[⋅ ⋅ ⋅]" is the left subtree of the current node, and "+" indicates that it is the right subtree. Figure 5 is the simplified form of the bracketed string shown in Figure 4.
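As an illustration, the simplified four-symbol encoding and its inverse can be sketched as follows. This is a sketch under assumptions: trees are nested `(left, right)` tuples with the empty tuple as a terminal node, a terminal child is written as an empty bracket pair, and the ASCII hyphen `-` stands in for "−". The function names are illustrative, not taken from the paper.

```python
# A terminal node is the empty tuple; an internal node is a (left, right) pair.
LEAF = ()


def to_bracketed(tree):
    """Encode a DNA tree using only the symbols [, ], - and +."""
    if tree == LEAF:
        return ""  # the node symbol is omitted, so a terminal contributes nothing
    left, right = tree
    return "[-" + to_bracketed(left) + "][+" + to_bracketed(right) + "]"


def from_bracketed(text):
    """Decode a simplified bracketed string back into the original tree."""
    tree, position = _parse_node(text, 0)
    if position != len(text):
        raise ValueError("unexpected trailing symbols")
    return tree


def _parse_node(text, position):
    # Grammar: node := "" | "[-" node "][+" node "]"
    if not text.startswith("[-", position):
        return LEAF, position
    left, position = _parse_node(text, position + 2)
    if not text.startswith("][+", position):
        raise ValueError("expected the right subtree after the left subtree")
    right, position = _parse_node(text, position + 3)
    if not text.startswith("]", position):
        raise ValueError("unbalanced brackets")
    return (left, right), position + 1
```

Because every nonterminal node has exactly two subtrees, the string determines the tree uniquely, so encoding followed by decoding returns the original tree.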

DNA Sequence Represented with L-System.
Once we have the DNA tree and its bracketed string representation, we need rewriting rules for analyzing the tree structure. There are several types of rewriting mechanisms, such as Chomsky grammars and the Lindenmayer system (L-system for short). The largest difference between these two string rewriting mechanisms lies in the technique used to apply productions. In a Chomsky grammar productions are applied sequentially, while in an L-system they are applied in parallel. For our structure, the L-system is a better fit than a Chomsky grammar. The L-system was introduced by the biologist Lindenmayer in 1968 [13]. The central concept of the L-system is rewriting. In general, rewriting is a technique used to define complex objects by successively replacing parts of a simple initial object, using a set of rewriting rules or productions. In the next section, we will present how we apply the L-system to our DNA tree. The L-system is defined as follows.
Definition 1. L-system grammars are very similar to Chomsky grammars and are defined as a tuple [14]: $G = (V, \omega, P)$, where (i) $V = \{s_1, s_2, \ldots, s_n\}$ is an alphabet, (ii) $\omega$ (start, axiom, or initiator) is a string of symbols from $V$ defining the initial state of the system, (iii) $P$ is defined by a production map $P: V \to V^*$ with $s \to P(s)$ for each $s$ in $V$. The identity production $s \to s$ is assumed for any symbol without an explicit production. These symbols are called constants or terminals.
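The parallel application of productions can be illustrated with a short Python sketch. The example productions are Lindenmayer's well-known algae system (A → AB, B → A), used here only as an illustration and not as the rule set of this paper; the names `rewrite` and `ALGAE` are assumptions of the sketch.

```python
def rewrite(axiom, productions, steps):
    """Apply the production map to every symbol at once, `steps` times."""
    state = axiom
    for _ in range(steps):
        # Symbols without an explicit production are constants: s -> s.
        state = "".join(productions.get(symbol, symbol) for symbol in state)
    return state


# Illustrative productions (Lindenmayer's algae system), not taken from this paper.
ALGAE = {"A": "AB", "B": "A"}

if __name__ == "__main__":
    for step in range(5):
        print(step, rewrite("A", ALGAE, step))
```

Every symbol of the current string is replaced in the same step, which is the difference from the sequential derivations of a Chomsky grammar noted above.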

Rewriting Rules for DNA Sequences.
As discussed earlier, we want to generate the rules from DNA trees. In this section, we explain how we apply rewriting rules to those trees. We assign a distinct variable to each node. Since the technique described previously always generates two subtrees for each nonterminal node, every nonterminal node can be expressed in the following format: $P \to LR$, where $P$ denotes the current node, $L$ denotes its left subtree, and $R$ denotes its right subtree. We give an example in Figure 6. The left tree has three nodes and only the root is a nonterminal node, so it can be rewritten as a single rule of this form. The right tree has five nodes: its root has a left subtree and a right subtree; the left subtree is terminal, but the right one is not and has two terminal subtrees. This tree can therefore be rewritten as two rules, one for the root and one for its right subtree.
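A minimal Python sketch of this rule extraction follows. It assumes trees are nested `(left, right)` tuples with the empty tuple as a terminal node, and it names nodes `N0`, `N1`, ... in preorder; this naming scheme is hypothetical and chosen only to give every node a distinct variable.

```python
from itertools import count

# A terminal node is the empty tuple; an internal node is a (left, right) pair.
LEAF = ()


def extract_rules(tree):
    """Return one (parent, left, right) rule per nonterminal node, root first."""
    ids = count()
    rules = []

    def visit(node):
        name = f"N{next(ids)}"  # distinct variable for every node (assumed naming)
        if node != LEAF:
            slot = len(rules)
            rules.append(None)  # reserve the slot so parents precede children
            left_name = visit(node[0])
            right_name = visit(node[1])
            rules[slot] = (name, left_name, right_name)
        return name

    visit(tree)
    return rules


if __name__ == "__main__":
    three_node_tree = (LEAF, LEAF)
    five_node_tree = (LEAF, (LEAF, LEAF))
    for parent, left, right in extract_rules(five_node_tree):
        print(f"{parent} -> {left} {right}")
```

For a three-node tree the sketch yields a single rule, and for a five-node tree whose right subtree is nonterminal it yields two rules, in line with the example described for Figure 6.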

Rewriting Rules for Bracketed Strings.
Similarly, we can also use rewriting rules to generate bracketed strings. In the rewriting rules for DNA trees shown in Section 2.3, we write $P \to LR$ for a tree with left and right subtrees. Note that we call $L$ and $R$ the nonterminals. In this section, terminal nodes are separated from trees, and we use "null" to represent a terminal. Such a tree has a corresponding bracketed string of the form "[− ⋅ ⋅ ⋅][+ ⋅ ⋅ ⋅]", where "[− ⋅ ⋅ ⋅]" represents the left subtree, while "[+ ⋅ ⋅ ⋅]" represents the right subtree. Therefore, we can replace the rewriting rules with $P \to [-\cdots][+\cdots]$, where "⋅ ⋅ ⋅" is the rewriting rule for the bracketed string of each subtree. For the sake of readability, we replace the words such as " " and " ". In Figure 7, we show the rewriting rules for the bracketed string of the tree in Figure 3. As we can see, there are "nulls" in the rules. Those "nulls" do not have significant effects on our algorithm, so we simply ignore them. Now, the tree in Figure 3 can be described by new rewriting rules without the trivial nulls, as shown in Figure 8.
As the tree grows, the rewriting process may generate identical rules. Assume that we have a set of such rules. These rules generate exactly one bracketed string and, thus, exactly one DNA tree. All these rules form a rule set that represents a unique DNA tree. Some of the rules have the same structure, since they both have a right subtree and do not have a left subtree; the only difference is which nonterminal appears as the subtree. We will define two terms to express the similarity between two rewriting rules, and these terms can simplify complexity analysis. Liou et al. [9] gave two definitions to classify similar rewriting rules, as follows.

Homomorphism and Isomorphism.
Definition 2. Homomorphism in rewriting rules. We define that rewriting rule 1 and rewriting rule 2 are homomorphic to each other if and only if they have the same structure.
In detail, rewriting rule 1 and rewriting rule 2 in DNA trees either both have a subtree in each corresponding position or both lack it. Ignoring all nonterminals, if rule 1 and rule 2 generate the same bracketed string, then they are homomorphic by definition.
Among the rules above, four are homomorphic to each other, since they generate the same bracketed string; the remaining rule is not homomorphic to any of the other rules, and its bracketed string is [− ].
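Grouping rules by homomorphism can be sketched in Python as follows. The sketch assumes each rule is stored as `(name, left, right)` with `None` marking an ignored "null" subtree, and it uses a hypothetical rule set made up for illustration (the rule names are not those of the paper). The ASCII hyphen `-` stands in for "−".

```python
from collections import defaultdict


def structure(rule):
    """Bracketed string of a rule with all nonterminal names ignored."""
    _name, left, right = rule
    left_part = "[-]" if left is not None else ""
    right_part = "[+]" if right is not None else ""
    return left_part + right_part


def homomorphism_classes(rules):
    """Map each structure to the names of the rules that share it."""
    classes = defaultdict(list)
    for rule in rules:
        classes[structure(rule)].append(rule[0])
    return dict(classes)


# HYPOTHETICAL rule set, for illustration only.
EXAMPLE_RULES = [
    ("P", "A", "B"),   # both subtrees present
    ("A", None, "C"),  # right subtree only
    ("B", None, "D"),  # right subtree only
    ("C", "E", None),  # left subtree only
]

if __name__ == "__main__":
    for shape, names in homomorphism_classes(EXAMPLE_RULES).items():
        print(shape, names)
```

Two rules land in the same group exactly when they have subtrees in the same positions, which is the condition stated in Definition 2.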
Let us recall the DNA tree example in Figure 2; we will use it to clarify these definitions. We mark some nodes as shown in Figure 9; the trees rooted at A, B, C, and D are called tree A, tree B, tree C, and tree D, respectively. Tree A is isomorphic to tree C on depths 0 to 3, but they are not isomorphic on depth 4. Tree B is isomorphic to tree C on depths 0 to 2, but they are not isomorphic on depth 3. Tree D is not isomorphic to any other tree, nor is it homomorphic to any other tree.
After defining the similarity between rules by homomorphism and isomorphism, we can classify all the rules into different subsets, such that the rules within each subset share the same similarity relation. We now list all the rewriting rules of Figure 2 in Table 1, ignoring terminal rules such as "$P \to$ null" and replacing each rule's name with its class name (or class number). For example, we can assign the terminal rewriting rule to a class, "$C_3 \to$ null", and for a rule that links to two terminals, we can write "$C_2 \to C_3 C_3$"; here $C_3$ is the terminal class. After performing classification, we obtain not only a new rewriting rule set but also a context-free grammar, which can be converted to automata.
In Table 1

DNA Sequence Complexity
After we convert the DNA sequence to rewriting rules and classify those rules, we attempt to explore the redundancy in the tree, which will be the basis for building the cognitive map [15]. We compute the complexity of the tree that those classified rules represent. Since a classified rewriting rule set is also a context-free grammar, there are methods for computing the complexity of the rewriting rules, as follows.
Definition 4. Topological entropy of a context-free grammar. The topological entropy $K_0$ of a context-free grammar (CFG) can be evaluated by means of the following three procedures [16,17].
(1) For each variable $V_i$ with productions (in Greibach form) $V_i \to t_{i1} U_{i1}, t_{i2} U_{i2}, \ldots, t_{ik_i} U_{ik_i}$, where $\{t_{i1}, t_{i2}, \ldots, t_{ik_i}\}$ are terminals and $\{U_{i1}, U_{i2}, \ldots, U_{ik_i}\}$ are nonterminals, the formal algebraic expression for each variable is $V_i = \sum_{j=1}^{k_i} t_{ij} U_{ij}$.
(2) By replacing every terminal $t_{ij}$ with an auxiliary variable $z$, one obtains the generating function $V_i(z) = \sum_{n=1}^{\infty} N_i(n) z^n$, where $N_i(n)$ is the number of words of length $n$ descending from $V_i$.
(2) The generating function $V_i(z)$ of $V_i$ is given a new form. If $V_i$ does not have any nonterminal variables, we set $V_i(z) = 1$.
(3) After formulating the generating function $V_i(z)$, we intend to find the largest value of $z$, $z^{\max}$, at which $V_1(z)$ converges, where $V_1$ is the rule for the root node of the DNA tree. After obtaining the largest value, $z^{\max}$, of $V_1(z)$, we set $R = z^{\max}$, the radius of convergence of $V_1(z)$. We define the complexity of the DNA tree as $K_0 = -\ln R$. Now we can work through an example of the computation procedure for the complexity. According to our definition, the given values for the class parameters are listed in Table 3. There are five classes, so we obtain the formulas for $V_5(z)$, $V_4(z)$, $V_3(z)$, $V_2(z)$, and $V_1(z)$ successively.
Rearranging the previous equation for $V_1(z)$, we obtain a quadratic in $V_1(z)$. Solving it for $V_1(z)$, we obtain a closed-form formula that contains the term $\frac{32}{361}\left(2z^6 + 2z^5 + 5z^4\right)$. Finally, the radius of convergence, $R$, and the complexity, $K_0 = -\ln R$, can be obtained from this formula. However, computing $z^{\max}$ directly is difficult, so we use iterations and region tests to approximate the complexity; the details are as follows.
(1) Rewrite the generating function in iterative form, so that the value at iteration $m$, $V_i^{m}(z)$, is computed from the values at iteration $m-1$.
(2) Calculate the values from $V_i^{0}(z)$ to $V_i^{m}(z)$. When $V_i^{m-1}(z) = V_i^{m}(z)$ for all rules, we say that $V_i^{m}(z)$ has reached convergence, but $z$ is not yet the $z^{\max}$ we want. Here, we set $m = 1000$ for each iteration.
(3) Now we can test whether $V_i(z)$ is convergent or divergent at a given number $z$. We use binary search over the real numbers between 0 and 1; in every test, when $V_i(z)$ converges, we choose a bigger $z$ next time, and when $V_i(z)$ diverges, we choose a smaller $z$ next time.
Running more iterations yields a more precise radius.
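The iteration and region test can be sketched in Python. Since the redefined generating function is not reproduced here, the sketch substitutes a standard textbook grammar, V → aV | bV | a, whose generating function satisfies V(z) = 2zV(z) + z and has radius of convergence 1/2; this grammar, the tolerance, the blow-up threshold, and all function names are assumptions of the sketch, not part of the method itself.

```python
import math


def example_step(previous, z):
    """One iteration V^m(z) = z * (2 * V^(m-1)(z) + 1) for V -> aV | bV | a."""
    return z * (2.0 * previous + 1.0)


def converges(step, z, iterations=1000, tolerance=1e-9, blowup=1e12):
    """Region test: does the iteration settle within `iterations` steps at z?"""
    previous = 0.0
    for _ in range(iterations):
        current = step(previous, z)
        if abs(current) > blowup:
            return False  # clearly divergent, stop early
        if abs(current - previous) < tolerance:
            return True  # V^(m-1)(z) and V^m(z) agree up to the tolerance
        previous = current
    return False


def radius_of_convergence(step, search_steps=50, iterations=1000):
    """Binary search in [0, 1] for the largest z at which the iteration converges."""
    low, high = 0.0, 1.0
    for _ in range(search_steps):
        middle = (low + high) / 2.0
        if converges(step, middle, iterations):
            low = middle  # converged: try a bigger z next time
        else:
            high = middle  # diverged: try a smaller z next time
    return low


def complexity(radius):
    """K_0 = -ln R."""
    return -math.log(radius)


if __name__ == "__main__":
    radius = radius_of_convergence(example_step)
    print(radius, complexity(radius))
```

With 1000 iterations the estimate falls slightly below the exact radius 1/2, because values of z just under the radius converge too slowly to pass the test; raising `iterations` tightens the estimate, in line with the remark that more iterations give a more precise radius.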

Results
In 2011, Koslicki [1] gave an efficient way to compute the topological entropy of a DNA sequence. He used a fixed fragment length that depends on the subword size to compute the topological entropy of the sequence. For example, in Figure 10 (all DNA and amino acid data can be found on the NCBI website, http://www.ncbi.nlm.nih.gov/), the sequence length is 1027 characters, and there are three subword sizes, 2, 3, and 4, shown with blue, red, and green lines, respectively. A larger subword size requires a much larger fragment for complexity computation. The required fragment size grows exponentially with the subword size, while the length of the sequence does not, so overall it is not a good method for our purposes. We presented a new method called structural complexity in the previous sections, and there are several benefits of using our method instead of the Koslicki method, described as follows.
(2) Exchanging the positions of two different characters changes the value given by our method, whereas the Koslicki method calculates only statistical values without structural information.
The result is shown in the bottom chart of Figure 11; the test sequence repeats the same subword several times. For the blue line, all complexity values from topological entropy are equal within the region of repeated subwords. For the red line, the complexity values depend on the structure of the subword. When the fragments of the sequence differ from each other, our method evaluates to different values.
(3) Our method can also be applied to amino acid sequences. The Koslicki method depends on the alphabet size and the subword size. For example, in the basic calculation with substrings of length 2, since there are 20 standard amino acid types, it requires a minimum length of $20^2 + 2 - 1 = 401$, but amino acid strings are usually very short. Consequently, the Koslicki method sometimes cannot compute the complexity of an amino acid sequence efficiently. Figure 12 shows that the complexity of an amino acid sequence can also be calculated by our method.
We also ran experiments on a large amount of data, with fixed fragment size and with fixed method on the test sequences (see Figures 13 and 14). Here, we redefine the Koslicki method so that the fragment size is no longer dependent on the subword size; instead, a fixed-length fragment is applied, as in our method. This change makes the data easier to compare and removes the restriction of the exponentially growing fragment size. In Figure 13, we found that for larger fragments the complexity curve becomes smoother, because the fragment for each data point contains more information. We also note a common local peak in those figures; the simple sequence region is big enough that our fragment still contains the same simple sequence.
When we compare results from the same method, shown in Figure 14, the same situation is even more obvious. Thus, if we have many complexity values with different fragment sizes, we have the opportunity to restore the portion of the DNA.
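The length requirement can be made concrete with a small Python sketch. The minimum length follows the expression used above (alphabet size to the power of the subword size, plus the subword size, minus one); the entropy function reflects our reading of the subword-counting definition in [1] and should be treated as an assumption, as should the function names.

```python
import math


def min_fragment_length(alphabet_size, subword_size):
    """Shortest fragment that could contain every subword of the given size once."""
    return alphabet_size ** subword_size + subword_size - 1


def subword_entropy(fragment, subword_size, alphabet_size=4):
    """Assumed form: log (base alphabet size) of the distinct subword count, per symbol."""
    needed = min_fragment_length(alphabet_size, subword_size)
    if len(fragment) < needed:
        raise ValueError(f"fragment needs at least {needed} symbols")
    window = fragment[:needed]
    distinct = {
        window[start:start + subword_size]
        for start in range(needed - subword_size + 1)
    }
    return math.log(len(distinct), alphabet_size) / subword_size


if __name__ == "__main__":
    print(min_fragment_length(4, 2))   # DNA, subword size 2
    print(min_fragment_length(20, 2))  # amino acids, subword size 2
    print(subword_entropy("AACAGATCCGCTGGTTA", 2))
```

For DNA the requirement grows from 17 to 66 to 259 symbols as the subword size goes from 2 to 4, while for the 20-letter amino acid alphabet it already reaches 401 symbols at subword size 2.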

Application to Virus Sequences Database and Other Sequences.
Now we can apply our technique to Chinese word sequences. Togawa et al. [18] gave a complexity of Chinese words, but their study was based on the number of strokes, which is different from our method. Here we use Big5 encoding for our system. Since the number of Chinese words is larger than 10000, we cannot directly use words as the alphabet, so we need some conversion. We read each Chinese word as four hexadecimal letters so that we can replace the sequence with a tree representation and compute the complexity. In the biomedical field, we can create a virus comparison database. Once a new virus or prion has been found, it will be easy to select corresponding candidate drugs promptly, by cross-comparing complexities in the database. We focus on the most important pathogens of recent years, such as Escherichia coli O157:H7 (E. coli O157), Enterovirus 71 (EV71), Influenza A virus subtype H1N1 (H1N1), Influenza A virus subtype H5N1 (H5N1), and the severe acute respiratory syndrome (SARS) virus. In recent years, these pathogens have had a significant impact on, and posed a threat to, the human world. We test these pathogens and prions, listed in Table 4. Here we can see that none of the prion regions can be analyzed by the Koslicki method, but they can be analyzed by ours.
Finally, if any object can be written as a sequence, and there exists a tree representation for the alphabet of that sequence, we can compute the complexity of the object.
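The conversion of Chinese text to hexadecimal letters can be sketched with Python's built-in `big5` codec. The function name and the decision to reject characters that do not occupy two bytes are assumptions of this sketch; how each of the sixteen hexadecimal letters is then mapped to a tree is not specified here.

```python
HEX_LETTERS = "0123456789ABCDEF"


def to_hex_letters(text):
    """Turn each Big5-encoded character into four hexadecimal letters."""
    letters = []
    for character in text:
        encoded = character.encode("big5")  # raises UnicodeEncodeError if not in Big5
        if len(encoded) != 2:
            raise ValueError(f"{character!r} is not a two-byte Big5 character")
        letters.extend(encoded.hex().upper())
    return letters


if __name__ == "__main__":
    print(to_hex_letters("中文"))
```

The resulting sequence uses an alphabet of only 16 symbols, so a fixed tree representation per symbol becomes feasible in the same way as for the four nucleotides.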

Summary
In this paper, we give a method for computing the complexity of DNA sequences. Traditional methods focused on statistical data or simply explored the structure without producing a complexity value. In our method, we first transform the DNA sequence into a DNA tree using the tree representations. Then we transform the tree into context-free grammar format, so that it can be classified. Finally, we use a redefined generating function to find the complexity values. We thus obtain a complexity for DNA sequences that is not only statistical but also structural, and this technique can be used in many important applications.