Generative Power and Closure Properties of Watson-Crick Grammars

We define WK linear grammars, as an extension of WK regular grammars with linear grammar rules, and WK context-free grammars, and we investigate their computational power and closure properties. We show that WK linear grammars can generate some context-sensitive languages. Moreover, we demonstrate that the family of WK regular languages is a proper subset of the family of WK linear languages, but it is not comparable with the family of linear languages. We also establish that the family of Watson-Crick regular languages is closed under almost all of the main closure operations.


Introduction
DNA computing appears as a challenge to design new types of computing devices, which differ from their classical counterparts in a fundamental way, to solve a wide spectrum of computationally intractable problems. DNA (deoxyribonucleic acid) is a double-stranded chain of nucleotides, which differ by their chemical bases, adenine (A), guanine (G), cytosine (C), and thymine (T); the bases are paired as A-T and C-G according to Watson-Crick complementarity, as illustrated in Figure 1 [1]. Massive parallelism, another fundamental feature of DNA molecules, allows millions of cut-and-paste operations to be performed simultaneously on DNA strands until a complete set of new DNA strands is generated. These two features give high hope for the use of DNA molecules and DNA-based bio-operations to develop powerful computing paradigms and devices.
Since a DNA strand can be interpreted as a double-stranded sequence of symbols, the DNA replication and synthesis processes can be modeled using methods and techniques of formal language theory. Watson-Crick (WK) automata [2], one of the recent computational models abstracting the properties of DNA molecules, are finite automata with two reading heads, working on complete double-stranded sequences where characters on corresponding positions from the two strands of the input are related by a complementarity relation similar to the Watson-Crick complementarity of DNA nucleotides. The two strands of the input are separately read from left to right by heads controlled by a common state. Several variants have been introduced and studied in recent papers [3][4][5][6][7].
WK regular grammars [8], a grammar counterpart of WK automata, generate double-stranded strings related by a complementarity relation as in a WK automaton but use rules as in a regular grammar. The use of formal grammars in the study of biological and computational properties of DNA molecules is a new direction in the field of DNA computing: we can introduce powerful variants of WK grammars, such as WK linear, WK context-free, and WK regulated grammars, and use them to investigate the properties of DNA structures and in DNA applications such as food authentication and gene disease detection. In this paper, we introduce WK linear grammars and study their generative capacity in relation to Chomsky grammars.
Further, as a motivation, we show that synthesis processes, for instance, in DNA replication (Figure 2), can be simulated by derivations in WK grammars. The replication of DNA begins at the origin(s). The double strand is then separated by proteins, producing bubble-like shape(s). The synthesis of new strands using the parental strands as templates starts from the origins and proceeds in the 5′ to 3′ direction of both strands [1]. This synthesis process in general can be seen as a string generation. The enzymes responsible for the synthesis, DNA polymerases, cannot initiate the process by themselves but can only add nucleotides to an existing RNA chain. This chain is called a primer and is produced by the enzyme primase. From the grammar perspective, the primase can be interpreted as the start symbol $S$. After the primer has been connected to the parental strand, one of the synthesizing enzymes, called DNA polymerase III, continues to add nucleotides one by one to the primer, each complementary to the parental strand. The synthesis finishes with the replacement of the RNA primer by DNA nucleotides using the enzyme DNA polymerase I and joining with DNA ligase. Again, from the grammar perspective, DNA polymerase I and polymerase III act as production rules in the grammar; in particular, DNA ligase resembles a terminal production (see Figure 3).
The paper is organized as follows. In Section 2, we give some notions and definitions from the theories of formal languages and DNA computing needed in the sequel. In Section 3, we define WK grammars and the languages generated by these grammars. Section 4 is devoted to the study of the generative capacity of WK regular and linear grammars. In Section 5, we investigate the closure properties of WK grammars. Furthermore, we show the application of WK grammars in the analyses of DNA structures and programming language structures in Section 6. In conclusion, we discuss open problems and interesting future research topics related to WK grammars in Section 7.

Figure 3: The simulation of a synthesis process with a derivation.
Throughout this paper we use the following notation. The symbol ∈ denotes the membership of an element in a set and ∉ denotes its negation. The symbol ⊆ denotes inclusion while ⊂ denotes proper inclusion. The notations $\emptyset$, $|X|$, and $2^X$ denote the empty set, the cardinality of a set $X$, and the power set of $X$, respectively. When $\Sigma$ is an alphabet (a finite set of symbols), the set of all finite strings over $\Sigma$ is denoted by $\Sigma^*$, while $\Sigma^+$ denotes the set of all nonempty strings (we use $\lambda$ for the empty string). The length of a string $w \in \Sigma^*$ is denoted by $|w|$. A language is a subset $L \subseteq \Sigma^*$.
Next we recall some terms regarding the closure properties of languages. The union of two languages, $L_1 \cup L_2$, is the set of strings contained in $L_1$ or in $L_2$. The concatenation of two languages is obtained by joining a string from the first language with a string from the second: $L_1 L_2 = \{w_1 w_2 : w_1 \in L_1, w_2 \in L_2\}$. The Kleene-star closure $L^*$ is the set of all strings obtained by concatenating finitely many strings of $L$, including the empty string. The mirror image of a word $w = a_1 a_2 \cdots a_n$ is $w^R = a_n \cdots a_2 a_1$. For a language $L$, its mirror image is $L^R = \{w^R : w \in L\}$. A substitution is a mapping $s : \Sigma^* \to 2^{\Delta^*}$, where $\Delta$ is an alphabet, $s(\lambda) = \{\lambda\}$, and $s(w_1 w_2) = s(w_1) s(w_2)$ for $w_1, w_2 \in \Sigma^*$. The substitution of a language $L \subseteq \Sigma^*$, that is, $s(L)$, is the union of $s(w)$ for $w \in L$. A substitution $s$ is called finite if $|s(a)|$ is finite for each $a \in \Sigma$. A morphism is a substitution where $|s(a)| = 1$ for each $a \in \Sigma$.
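The operations recalled above can be tried out on small finite languages. The following is a minimal Python sketch, assuming languages are represented as Python sets of strings and the empty string `""` plays the role of the empty word; the function names are illustrative and not taken from the paper, and the Kleene star is truncated at a length bound because the full closure is infinite.

```python
from itertools import product


def union(l1, l2):
    """Strings that belong to l1 or to l2."""
    return l1 | l2


def concat(l1, l2):
    """All strings xy with x in l1 and y in l2."""
    return {x + y for x, y in product(l1, l2)}


def kleene_star(lang, max_len):
    """Strings of lang* whose length does not exceed max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        new = {x + y for x in frontier for y in lang
               if len(x + y) <= max_len} - result
        result |= new
        frontier = new
    return result


def mirror(lang):
    """Mirror image: every string of lang reversed."""
    return {w[::-1] for w in lang}


def substitute(s, lang):
    """Apply a finite substitution s (dict: symbol -> set of strings)."""
    out = set()
    for w in lang:
        images = {""}
        for symbol in w:
            images = concat(images, s[symbol])
        out |= images
    return out
```

A morphism is the special case in which every `s[symbol]` is a one-element set.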
A Chomsky grammar is defined by $G = (N, T, S, P)$ where $N$ is the set of nonterminal symbols, $T$ is the set of terminal symbols, and $N \cap T = \emptyset$. $S \in N$ is the start symbol, while $P \subseteq (N \cup T)^* N (N \cup T)^* \times (N \cup T)^*$ is the set of production rules. We write $\alpha \to \beta$ for a production rule $(\alpha, \beta) \in P$. We say that $x$ directly derives $y$, written $x \Rightarrow y$, when $x = x_1 \alpha x_2$ and $y = x_1 \beta x_2$ for some production rule $\alpha \to \beta \in P$. A grammar generates the language $L(G) = \{w \in T^* : S \Rightarrow^* w\}$, where $\Rightarrow^*$ is the reflexive and transitive closure of $\Rightarrow$. According to the forms of their production rules, grammars are classified as context-sensitive, context-free, linear, right-linear, and left-linear. Right-linear and left-linear grammars are called regular.
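The classification by rule form can be illustrated with a small checker. This is a sketch under stated assumptions: it uses the usual textbook rule shapes (context-free: $A \to u$; linear: at most one nonterminal on the right; right-linear or left-linear: that nonterminal is the last or first symbol), since the exact forms are not spelled out in the text above. Symbols are single characters, the function names are hypothetical, and context-sensitive rules are not checked.

```python
def rule_types(lhs, rhs, nonterminals):
    """Return the rule classes (textbook forms, an assumption) that lhs -> rhs fits."""
    types = set()
    positions = [i for i, x in enumerate(rhs) if x in nonterminals]
    if len(lhs) == 1 and lhs in nonterminals:
        types.add("context-free")
        if len(positions) <= 1:
            types.add("linear")
            if not positions or positions[0] == len(rhs) - 1:
                types.add("right-linear")
            if not positions or positions[0] == 0:
                types.add("left-linear")
    return types


def classify(rules, nonterminals):
    """Most restrictive class satisfied by every rule (rules must be nonempty)."""
    common = set.intersection(
        *(rule_types(lhs, rhs, nonterminals) for lhs, rhs in rules))
    if "right-linear" in common or "left-linear" in common:
        return "regular"
    if "linear" in common:
        return "linear"
    if "context-free" in common:
        return "context-free"
    return "other"
```

Note that a grammar counts as regular here only if all rules are right-linear or all are left-linear; mixing the two forms gives a linear grammar.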
The families of languages generated by regular, linear, context-free, and context-sensitive grammars are denoted by REG, LIN, CF, and CS, respectively. The family of recursively enumerable languages is denoted by RE, while the family of finite languages is denoted by FIN. The following relation holds [11].

Theorem 1 (Chomsky hierarchy). The following strict inclusions hold:
$$\mathrm{FIN} \subset \mathrm{REG} \subset \mathrm{LIN} \subset \mathrm{CF} \subset \mathrm{CS} \subset \mathrm{RE}.$$
We recall the definition of a finite automaton. A finite automaton (FA) is a quintuple $M = (Q, V, q_0, F, \delta)$, where $Q$ is the set of states, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states. Meanwhile, $V$ is an alphabet and $\delta : Q \times V \to 2^Q$ is called the transition function. The set (language) of all strings accepted by $M$ is denoted by $L(M)$. We denote the family of languages accepted by finite automata by FA. Then, FA = REG (see [11]).
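As a concrete reading of the transition function $\delta : Q \times V \to 2^Q$, the sketch below simulates a nondeterministic finite automaton by tracking the set of states reachable after each input symbol. The dictionary encoding of $\delta$ and the example automaton (strings over {a, b} ending in "ab") are illustrative assumptions, not taken from the paper.

```python
def fa_accepts(delta, start, finals, word):
    """Simulate an FA whose transition function maps (state, symbol) to a set of states."""
    current = {start}
    for symbol in word:
        # Follow every possible transition from every currently reachable state.
        current = {q2 for q in current
                   for q2 in delta.get((q, symbol), set())}
    return bool(current & finals)


# Hypothetical example: strings over {a, b} that end in "ab".
DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
FINALS = {"q2"}
```

Calling `fa_accepts(DELTA, "q0", FINALS, word)` returns whether `word` is accepted by this example automaton.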
Next, we cite some basic definitions and results of Watson-Crick automata.
The key feature of WK automata is the symmetric relation $\rho$ on an alphabet $V$; that is, $\rho \subseteq V \times V$. In this paper, for simplicity, we use the form $\langle u/v \rangle$ to denote the elements $(u, v)$ of the set of all pairs of strings $V \times V$ (which we choose to write as $[V/V]$), and, instead of $V^* \times V^*$, we write $\langle V^*/V^* \rangle$.
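The Watson-Crick domain is the set of well-formed double-stranded strings (molecules)
$$\mathrm{WK}_\rho(V) = [V/V]_\rho^*, \quad \text{where } [V/V]_\rho = \{[a/b] : a, b \in V,\ (a, b) \in \rho\}.$$
$\mathrm{WK}^+_\rho(V)$ has the same meaning without including $\lambda$. We write $[a_1/b_1][a_2/b_2] \cdots [a_n/b_n] \in \mathrm{WK}_\rho(V)$ as $[u/v]$, where $u = a_1 a_2 \cdots a_n$ and $v = b_1 b_2 \cdots b_n$. Note that when the elements in the upper strand are complementary to the elements of the lower strand and both strands have the same length, $\langle u/v \rangle = [u/v]$.
A Watson-Crick finite automaton (WKFA) is a 6-tuple $M = (Q, V, q_0, F, \delta, \rho)$, where $Q$, $V$, $q_0$, and $F$ are the same as in a FA. Meanwhile, the transition function $\delta$ is $\delta : Q \times \langle V^*/V^* \rangle \to 2^Q$, where $\delta(q, \langle u/v \rangle)$ is nonempty only for finitely many triples $(q, u, v) \in Q \times V^* \times V^*$. Similar to FA, we can write the relation $q_2 \in \delta(q_1, \langle u/v \rangle)$ as a rewriting rule in grammars; that is, $q_1 \langle u/v \rangle \to \langle u/v \rangle q_2$. We denote the reflexive and transitive closure of $\to$ by $\to^*$.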
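To make the two-head reading concrete, the following Python sketch first checks that a double strand is well formed (equal length, position-wise complementary) and then searches over configurations (state, upper head position, lower head position). The encoding of transitions as triples `(u, v, next_state)`, the function names, and the example automaton are assumptions made for illustration; the example uses the identity relation on {a, b, c} and is intended to accept strings of the form a^n b^n c^n with n ≥ 1, which no finite automaton can do.

```python
def in_wk_domain(upper, lower, rho):
    """Well-formed double strand: same length, complementary at each position."""
    return len(upper) == len(lower) and all(
        (a, b) in rho for a, b in zip(upper, lower))


def wkfa_accepts(transitions, start, finals, rho, upper, lower):
    """Search over configurations (state, i, j) of a WK finite automaton."""
    if not in_wk_domain(upper, lower, rho):
        return False
    stack = [(start, 0, 0)]
    seen = set()
    while stack:
        config = stack.pop()
        if config in seen:
            continue
        seen.add(config)
        state, i, j = config
        if state in finals and i == len(upper) and j == len(lower):
            return True
        for u, v, nxt in transitions.get(state, []):
            # Each head reads its own strand independently from left to right.
            if upper.startswith(u, i) and lower.startswith(v, j):
                stack.append((nxt, i + len(u), j + len(v)))
    return False


# Hypothetical example automaton with the identity complementarity relation.
RHO = {(x, x) for x in "abc"}
TRANSITIONS = {
    "q0": [("a", "", "q0"), ("b", "a", "q1")],   # upper reads a's, then b's against lower a's
    "q1": [("b", "a", "q1"), ("c", "b", "q2")],  # upper b's matched with lower a's
    "q2": [("c", "b", "q2"), ("", "c", "q3")],   # upper c's matched with lower b's
    "q3": [("", "c", "q3")],                     # lower head finishes the c's
}
FINALS = {"q3"}
```

The lower head lags behind the upper head, so the number of b's read above is compared with the number of a's read below, and likewise for c's and b's.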
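The language accepted by a WKFA $M$ is
$$L(M) = \{u : [u/v] \in \mathrm{WK}_\rho(V) \text{ and } q_0 [u/v] \to^* [u/v]\, q \text{ for some } q \in F\},$$
where $\mathrm{WK}_\rho(V)$ is the Watson-Crick domain. The family of languages accepted by WKFAs is denoted by WKFA. It is shown in [10,12] that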

Definitions
In this section we slightly modify the definition of Watson-Crick regular grammars introduced in [8] in order to extend the concept to linear grammars and context-free grammars.
Remark 5. We use the common term "Watson-Crick grammars" to refer to any type of WK grammar.
Definition 6. The language generated by a WK grammar $G = (N, T, \rho, S, P)$ is defined as
$$L(G) = \{u : [u/v] \in \mathrm{WK}_\rho(T) \text{ and } S \Rightarrow^* [u/v]\}.$$
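A small enumeration can illustrate how a WK grammar generates a language: derive double strands, keep only the well-formed ones, and collect their upper strands. The sketch below is written under assumptions that should be read as such: it takes WK regular rules to have the forms $A \to \langle u/v \rangle B$ and $A \to \langle u/v \rangle$, encodes them as triples `(u, v, next_nonterminal_or_None)`, and uses a hypothetical example grammar over {a, b, c} with the identity relation, intended to generate a^n b^n c^n for n ≥ 1.

```python
def well_formed(upper, lower, rho):
    """Both strands have equal length and are complementary position by position."""
    return len(upper) == len(lower) and all(
        (a, b) in rho for a, b in zip(upper, lower))


def generate(rules, start, rho, max_len):
    """Upper strands (length <= max_len) of well-formed derivable double strands."""
    language = set()
    stack = [(start, "", "")]
    seen = set()
    while stack:
        config = stack.pop()
        if config in seen:
            continue
        seen.add(config)
        nonterminal, upper, lower = config
        for u, v, nxt in rules[nonterminal]:
            up, lo = upper + u, lower + v
            if len(up) > max_len or len(lo) > max_len:
                continue  # prune derivations that exceed the length bound
            if nxt is None:
                # Terminal rule: the derivation ends, so check well-formedness.
                if well_formed(up, lo, rho):
                    language.add(up)
            else:
                stack.append((nxt, up, lo))
    return language


# Hypothetical WK regular grammar; "" stands for the empty string.
RHO = {(x, x) for x in "abc"}
RULES = {
    "S": [("a", "", "S"), ("b", "a", "A")],
    "A": [("b", "a", "A"), ("c", "b", "B")],
    "B": [("c", "b", "B"), ("", "c", "C"), ("", "c", None)],
    "C": [("", "c", "C"), ("", "c", None)],
}
```

In this example every derivation yields an upper strand a^i b^j c^k and a lower strand a^j b^k c^l, so the identity relation forces i = j = k = l.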

Generative Power of Watson-Crick Grammars
In this section, we establish results regarding the computational power of WK grammars.

A Normal Form for Watson-Crick Linear Grammars.
Next, we define the 1-normal form for WK linear grammars and show that, for every WK linear grammar $G$, there is an equivalent WK linear grammar $G'$ in the normal form; that is, $L(G) = L(G')$.
Definition 7. A linear WK grammar $G = (N, T, \rho, S, P)$ is said to be in the 1-normal form if each rule in $P$ is of the form where . . .
where   1 , 1 ≤  ≤  1 , and where  > 1 or  > 1. Without loss of generality, we assume that  ≥ . Then, we define the following sequence of right-linear production rules: where    , 1 ≤  ≤  − 1, are new nonterminals. We construct a WK linear grammar $G' = (N \cup N', T, \rho, S, P \cup P')$, where $P'$ consists of productions defined above for each  → ⟨ Then, it is not difficult to see that, in every derivation, productions of the form (22) and (24) in $G$ can be replaced by the sequences of productions (23) and (25) in $G'$ and vice versa. Thus, $L(G) = L(G')$.

The Generative
In general, we have the derivation Thus,  1 generates the language ( Then, we have the following derivation for ,  ≥ 1:  (35)
The following example shows that some WK linear languages cannot be generated by WK regular grammars.
Lemma 14. The following language is not a WK regular language:
Proof. The language  1 can be generated by the following WK linear grammar  4 = ({, , }, {, }, {(, ), (, )}, , ), where  consists of the rules: It is not difficult to see that where 2 ≤  ≤ 3.
Next, we show that  1 ∉ WKREG. Suppose, for contradiction, that  1 can be generated by a WK regular grammar   = (, {, }, , , ). Without loss of generality, we assume that   is in 1-normal form. Then, for each rule  → V in , we have  ∈  and Let  =       be a string in  1 such that  > ||. Then, the double-stranded sequence [      /      ] is generated by the grammar   .
Case 1. In any derivation for this string, first  can occur in the upper (or lower) strand if   has already been generated in the upper (or lower) strand. Thus, we obtain two possible successful derivations: where  ≤ . In the latter derivation in (40), we cannot control the number of occurrences of ; that is, the derivation may not be successful. In the former derivation in (40), using the second strand, we can generate   : Equation (41) is continued by generating 's in the first strand and we can use the second strand to control their number. Consider and  is related to . Since 2 ≤  ≤ 3, generally,  is not the same as  for all derivations.
Case 2. We can control the number of 's after 's by using the second strand for 's before 's. In this case, the number of 's cannot be related to the number of 's:
In both cases, we cannot control the number of 's and the number of 's after 's at the same time using WK regular rules.
Since strings       are palindromes for even 's, the language {  :  ∈ {, } * } is not in WKREG; that is, we have the following.

Hierarchy of the Families of Watson-Crick Languages.
Combining the results above, we obtain the following theorem.
Theorem 16. The relations in Figure 4 hold; the dotted lines denote incomparability of the language families and the arrows denote proper inclusions of the lower families into the upper families, while the dotted arrows denote inclusions.

Closure Properties
In this section, we establish results regarding the closure properties of WK grammars. The families of WK languages are shown to be higher in the hierarchy than their respective Chomsky language families; thus it is interesting to see how WK grammars behave with respect to the closure properties known for the Chomsky languages. Moreover, studying the closure properties of WK grammars helps ensure the safety and correctness of the results obtained when performing operations on the sets of DNA molecules generated by WK grammars.
Then it is not difficult to see that $L(G) = L_1 \cup L_2$.
is closed under finite substitution and homomorphism.
For each rule  ∈ , we construct where  ≥ 0.
With the lemmas provided above in this subsection, the next theorem follows.
Theorem 30. The family of Watson-Crick context-free languages is closed under union, concatenation, Kleene-star, and homomorphism.
Closure of WKCF under complement and intersection depends on the generative capacity of WKCF. The nonclosure of the family of context-free (CF) languages under intersection was shown with the famous example in which the intersection of two CF languages is a language that cannot be generated by any CF grammar. If one can provide examples of languages that cannot be generated by WK context-free grammars, then the nonclosure of WKCF may be provable in the same way.
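For reference, the classical witness for the nonclosure of CF under intersection, presumably the example alluded to above, can be written as follows (the symbols $L_1$, $L_2$ here are local to this illustration):

```latex
\[
  L_1 = \{a^n b^n c^m : n, m \ge 1\}, \qquad
  L_2 = \{a^m b^n c^n : n, m \ge 1\},
\]
\[
  L_1, L_2 \in \mathrm{CF}, \qquad
  L_1 \cap L_2 = \{a^n b^n c^n : n \ge 1\} \notin \mathrm{CF}.
\]
```

Since WK grammars can generate some non-context-free languages, a witness of this kind does not by itself settle the question for WKCF; a language outside WKCF would be needed.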

Applications of Watson-Crick Grammars
In this section, we consider two examples of the applications of Watson-Crick grammars in the analyses of DNA structures and programming language structures. The analysis of DNA strings provides useful information: for instance, finding a specific pattern in a DNA string and identifying the repeats of a pattern are very important for detecting mutation. One of the diseases caused by mutation is Huntington disease, which results from a trinucleotide repeat disorder [13][14][15]. It has been discovered that the number of repeats of the trinucleotide CAG/CTG in the DNA of a patient with Huntington disease is abnormal. The repeats are also useful for finding the origin of replication of microorganisms [16].
In this section, we show that Watson-Crick grammars can be used for analyzing the repeats in DNA strings. We use the DNA of a breed of pig, Sus scrofa breed mixed chromosome 1, Sscrofa10.2, provided by the National Center for Biotechnology Information (NCBI) database (NCBI Reference Sequence: NC_010443.4) [17].
Consider a part of the upper strand of the DNA from the breed stated above with the length of 100 nucleotides:
cag cag ctgctg ctg ctg ctg cag ctg . (60)
In this example, the pattern ctg is repeated six times in the direct strand (upper strand) and two times in the reverse strand (lower strand), which can be seen as cag in the upper strand. Focusing on the repeats of the ctg pattern, a DNA string containing such a pattern can be expressed as
$$(\{a, c, g, t\}^* \, ctg \, \{a, c, g, t\}^*)^*. \quad (61)$$
We now construct a simple Watson-Crick regular grammar $G$ to generate the above language.
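The repeat analysis described here can be sketched with ordinary string operations. The following Python example is a hedged illustration rather than the paper's grammar: it assumes that the repeated pattern is `ctg` (so that its reverse complement on the other strand reads `cag` in the upper strand), uses only the fragment visible in (60) with spaces removed, and approximates expression (61) with a regular expression; all names are hypothetical.

```python
import re

# Watson-Crick complementarity on lowercase nucleotides.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(strand):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[x] for x in reversed(strand))


def count_repeats(upper, pattern):
    """Occurrences of pattern on the upper strand and on the lower strand.

    A lower-strand occurrence shows up in the upper strand as the
    reverse complement of the pattern.
    """
    direct = upper.count(pattern)
    reverse = upper.count(reverse_complement(pattern))
    return direct, reverse


# Regular-expression counterpart of expression (61), with "ctg" as the assumed pattern.
REPEAT_LANGUAGE = re.compile(r"(?:[acgt]*ctg[acgt]*)*")


def in_repeat_language(strand):
    return REPEAT_LANGUAGE.fullmatch(strand) is not None


# Only the part of the 100-nucleotide fragment that is visible in (60).
FRAGMENT = "cagcagctgctgctgctgctgcagctg"
```

`count_repeats(FRAGMENT, "ctg")` counts the pattern on both strands of the visible fragment only, so its result need not agree with counts stated for the full 100-nucleotide sequence.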

Figure 4: The hierarchy of WK and Chomsky language families.

DNA Structure Analysis.
Since Watson-Crick grammars are developed based on the structure and recombinant behavior of DNA molecules, they are well suited to the study of DNA-related problems.