Analysis of Similarity / Dissimilarity of DNA Sequences Based on Chaos Game Representation

Table 1: The coding sequences of the first exon of β-globin gene of different species.

| Species | Coding sequence |
| --- | --- |
| Human | ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG |
| Goat | ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG |
| Opossum | ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG |
| Gallus | ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGGAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG |
| Lemur | ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG |
| Mouse | ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG |
| Rabbit | ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC |
| Rat | ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG |
| Gorilla | ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG |

Table 2: Hurst exponent of the CGR-walk sequence $\{X_n\}$ of the nine species in Table 1.

| Species | $H(X_n^{RY})$ | $H(X_n^{MK})$ | $H(X_n^{WS})$ |
| --- | --- | --- | --- |
| Human | 0.445 | 0.7452 | 0.641 |
| Goat | 0.5024 | 0.7853 | 0.6894 |
| Opossum | 0.6536 | 0.6547 | 0.6292 |
| Gallus | 0.5075 | 0.7212 | 0.5756 |
| Lemur | 0.5016 | 0.7487 | 0.6753 |
| Mouse | 0.538 | 0.7094 | 0.8118 |
| Rabbit | 0.429 | 0.8099 | 0.615 |
| Rat | 0.5791 | 0.5237 | 0.7255 |
| Gorilla | 0.4698 | 0.7467 | 0.6302 |


Introduction
A DNA sequence is composed of four different nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Since the DNA molecule contains plentiful biological, physical, and chemical information, it has become very important to analyze DNA sequences statistically. The nucleotide data stored in GenBank now exceed hundreds of millions of bases, and the volume is growing rapidly. Therefore, biologists, physicists, mathematicians, and computer specialists have in recent years adopted different techniques to study DNA sequences, including statistical methods and mapping rules for the bases.
A great number of studies have proposed converting DNA sequences into digital sequences before downstream analysis. Many statistical methods, such as random walk, Lévy walk, entropy near method, root-mean-square fluctuation, wavelet transform, and Fourier transform [1–12], can be used as effective tools to process DNA sequences. The one-dimensional DNA walk was first proposed by Peng et al. [1]. Bai et al. [13] later discussed the representation of DNA primary sequences by the same walk. Meanwhile, some investigators proposed several kinds of graphical representation of DNA sequences from different perspectives. For example, the G-curve and H-curve were first proposed by Hamori and Ruskin in 1983 [14]. R. Zhang and C. T. Zhang [15] considered a representation of the DNA primary sequence termed the Z-curve. Several recent studies have outlined different kinds of graphical representation of DNA sequences based on 2D [16–21], 3D [22–25], 4D [26], 5D [27], and 6D [28] spaces. Here we stress the Chaos Game Representation (CGR), which was proposed as a scale-independent representation for genomic sequences by Jeffrey [3] in 1990. Gao and Xu [29] pointed out that the CGR-walk model can easily generate a model sequence and can be fitted reasonably well by a long-memory ARFIMA($p$, $d$, $q$) model. However, they treated the four bases equally and ignored the hidden chemical classification of nucleotides.
Motivated by the above work, we consider in this paper different classifications of the four bases according to their chemical structure and the strength of the hydrogen bond, that is, purine R = {A, G} and pyrimidine Y = {C, T}; amino group M = {A, C} and keto group K = {G, T}; weak H-bonds W = {A, T} and strong H-bonds S = {G, C}. We then give three kinds of mapping from the four bases A, C, G, and T to the continuous space and reconstruct CGR-walk sequences based on CGR coordinates. In this way we convert a DNA sequence into a random numeric sequence and select some numerical characterizations of that sequence as new invariants for the DNA sequence. As an application, we compare the similarity and dissimilarity of the first exon of β-globin gene sequences derived from nine species.

CGR-Walk Based on Three Kinds of Classification and Primary Sequences
The CGR space can be viewed as a continuous reference system in which every possible sequence of any length occupies a unique position. The position is determined by the four possible nucleotides, which are treated as the vertices of a binary square, so the representation is planar. Since a genetic sequence can be treated formally as a string composed of the four letters "A," "C," "G," and "T" (or "U"), the binary CGR vertices are assigned to the four nucleotides as A = (0, 0), G = (1, 1), C = (0, 1), T = (1, 0). The CGR coordinates are calculated iteratively by moving a pointer to half the distance between the previous position and the current binary representation. For example, if "G" is the next base, then a point is plotted halfway between the previous point and the "G" corner. The iterated function can be given by

$$\mathrm{CGR}_i = \mathrm{CGR}_{i-1} - 0.5\,(\mathrm{CGR}_{i-1} - g_i),$$

where $i = 1, \ldots, n_G$; $\mathrm{CGR}_0 = (0.5, 0.5)$; $g_i \in \{A, C, G, T\}$.
We take the first 6 bases of the sequence of human β-globin in Table 1 as an example and present the above procedure in Figure 1.
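The midpoint iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cgr_coordinates` and the dictionary `VERTICES` are hypothetical, while the vertex coordinates and the starting point (0.5, 0.5) are the ones stated in the text. The example input is the first 6 bases of the human sequence in Table 1.

```python
# Vertex assignment stated in the text: A=(0,0), G=(1,1), C=(0,1), T=(1,0).
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, vertices=VERTICES, start=(0.5, 0.5)):
    """Return the CGR points visited while reading `sequence` base by base."""
    x, y = start
    points = []
    for base in sequence.upper():
        vx, vy = vertices[base]
        # Move halfway from the previous point toward the vertex of this base.
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    # First 6 bases of the human sequence in Table 1.
    for base, point in zip("ATGGTG", cgr_coordinates("ATGGTG")):
        print(base, point)
```

Because every step halves the distance to a corner, each point encodes the whole prefix read so far, which is why distinct sequences land on distinct positions.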

The Newly Proposed CGR Space
The aforementioned work treats the four nucleic acid bases equally. In this paper, however, we take the chemical structures of the four nucleic acid bases into consideration and make adjustments to the classification based on the elements of the minor diagonal. In the CGR space proposed by Jeffrey, the elements of the minor diagonal are purine R = {A, G} and the leading diagonal elements are pyrimidine Y = {C, T}. Considering amino group M = {A, C} and keto group K = {G, T}, we get the second CGR space as shown in Figure 2. In the same way, according to the strength of the hydrogen bond, the bases can also be classified into weak H-bonds W = {A, T} and strong H-bonds S = {G, C}, so the third kind of CGR space is obtained in Figure 3.
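As a rough sketch of how the three spaces differ, the snippet below stores one vertex assignment per classification and runs the same midpoint iteration on each. Only the RY assignment is given explicitly in the text. The MK and WS layouts here are assumptions, chosen so that the two bases of each class share a diagonal; the actual corner positions in Figures 2 and 3 may differ. All names (`CGR_SPACES`, `CLASSES`, `shares_diagonal`, `cgr_points`) are hypothetical.

```python
# One vertex assignment per classification. In each space, the two bases of a
# chemical class sit on opposite corners of the same diagonal.
CGR_SPACES = {
    "RY": {"A": (0, 0), "G": (1, 1), "C": (0, 1), "T": (1, 0)},  # from the text
    "MK": {"A": (0, 0), "C": (1, 1), "G": (0, 1), "T": (1, 0)},  # assumed layout
    "WS": {"A": (0, 0), "T": (1, 1), "C": (0, 1), "G": (1, 0)},  # assumed layout
}

# The two classes of each classification, as defined in the text.
CLASSES = {"RY": ("AG", "CT"), "MK": ("AC", "GT"), "WS": ("AT", "GC")}


def shares_diagonal(p, q):
    """True if two corners of the unit square are diagonally opposite."""
    return p[0] != q[0] and p[1] != q[1]


def cgr_points(sequence, space):
    """CGR points of `sequence` in the chosen space, starting at (0.5, 0.5)."""
    vertices = CGR_SPACES[space]
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        vx, vy = vertices[base]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for name in CGR_SPACES:
        print(name, cgr_points("ATGGTG", name)[-1])
```

The same sequence therefore traces three different paths, one per classification, which is what makes three separate walk sequences possible.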

CGR-Walk Digital Sequence
Now we can obtain map relationships between DNA sequences and the CGR coordinates in a right-angled plane. For a DNA sequence, we define an equation as follows: [equation missing in source], where $x_n$ and $y_n$ are the x-coordinate and y-coordinate of the CGR, respectively. Then we can get a data sequence $\{X_n : n = 1, 2, \ldots, N\}$. In this way, we convert a DNA sequence into a random walk sequence under three different patterns. Consistent with the above three figures, we call them CGR-RY-, CGR-MK-, and CGR-WS-walk sequences, respectively.

Numerical Characterization of DNA Sequences
Researchers from computer science and mathematics have been attracted to the comparison of DNA sequences. As pointed out in references [13, 16–28], some related work has made progress. We can now represent a DNA sequence by a random numerical sequence based on the CGR-walk technique. Gao and Xu [29] also substantially corroborated the finding that long-range correlations are clearly present in the data. In this paper, we explore the tendency of a series of data by calculating the Hurst exponent [30]. Some work has also been done on the relation between long-range correlation and the Hurst exponent [31]. In order to numerically characterize a DNA sequence given by the CGR, we treat the Hurst exponent as an efficient invariant that is sensitive to this kind of graphical representation.
Because a DNA sequence can be regarded as an ordered set over the alphabet $\mathcal{N} = (A, C, G, T)$, we represent a DNA sequence as a finite set with $N$ elements, denoted as $[i] := \{1, 2, \ldots, N\}$. For any time series $\{u_i\}_{i=1}^{N}$, one can define several quantities as follows [30]:

(i) the partial mean
$$\langle u \rangle_n = \frac{1}{n} \sum_{i=1}^{n} u_i,$$
(ii) the partial difference
$$X(i, n) = \sum_{k=1}^{i} \left[ u_k - \langle u \rangle_n \right],$$
(iii) the difference
$$R(n) = \max_{1 \le i \le n} X(i, n) - \min_{1 \le i \le n} X(i, n),$$
(iv) and the standard deviation
$$S(n) = \left[ \frac{1}{n} \sum_{i=1}^{n} \left( u_i - \langle u \rangle_n \right)^2 \right]^{1/2}.$$

These quantities are found to obey the relation
$$R(n)/S(n) \sim (n/2)^{H},$$
where $H$ is called the Hurst exponent. We can therefore compute the Hurst exponent of the RY-, MK-, and WS-CGR-walk sequences and characterize the coding sequences of the first exon of β-globin gene of the nine species in Table 1. The results are listed in Table 2.
Besides, there are other numerical characterizations of random sequences, such as the mean, variance, and mean square deviation. Here we choose the mean square deviation of the CGR-walk sequence as follows: [equation (9) missing in source]. In (9), $j$ denotes the classification (RY-, MK-, or WS-sequence), and $\bar{X}^{j}$ is the mean [13]. We then present the mean square deviations of the three kinds of CGR-walk sequences $\{X_n\}$ in Table 3.
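The following is a minimal sketch of a rescaled-range (R/S) estimate of the Hurst exponent, assuming the standard R/S definitions of the four quantities named above: the mean of the first n values, the cumulative deviation from that mean, the range R(n) of the cumulative deviation, and the population standard deviation S(n). The exponent is taken as the least-squares slope of log(R(n)/S(n)) against log(n/2). The function names and the choice of prefix lengths are assumptions for illustration; the text does not say which values of n were used.

```python
import math


def rescaled_range(u):
    """R(n)/S(n) for the series u, where n = len(u)."""
    n = len(u)
    mean = sum(u) / n
    # Cumulative deviation from the mean, X(i, n) for i = 1..n.
    cumulative, running = [], 0.0
    for value in u:
        running += value - mean
        cumulative.append(running)
    spread = max(cumulative) - min(cumulative)                    # R(n)
    std = math.sqrt(sum((value - mean) ** 2 for value in u) / n)  # S(n)
    if std == 0.0:
        raise ValueError("constant series: S(n) is zero")
    return spread / std


def hurst_exponent(series, sizes):
    """Least-squares slope of log(R/S) versus log(n/2) over prefix lengths."""
    xs = [math.log(n / 2.0) for n in sizes]
    ys = [math.log(rescaled_range(series[:n])) for n in sizes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    # A strictly increasing series is maximally persistent, so H is close to 1.
    trend = list(range(1, 65))
    print(hurst_exponent(trend, [8, 16, 32, 64]))
```

For sequences as short as those in Table 1 (about 90 bases), only a few prefix lengths are available, so an estimate of this kind is sensitive to which values of n enter the fit.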

Similarity and Dissimilarity among the Coding Sequences of the First Exon of β-Globin Gene of Nine Different Species
Here we construct three-component vectors whose components are the values of the Hurst exponent, or of the mean square deviation, for the RY-, MK-, and WS-CGR-walk sequences. The analysis of similarity/dissimilarity among DNA sequences represented by the three-component vectors is based on the assumption that two DNA sequences are similar if the corresponding vectors point in the same direction in 3D space. Alternatively, we can investigate the similarity among the vectors by calculating the Euclidean distance between their end points: the smaller the Euclidean distance, the more similar the two corresponding DNA sequences. In Tables 4 and 5, we list the Euclidean distances between the 3-component vectors built from the Hurst exponents and from the mean square deviations, respectively. We observe that the smallest entry is always the human-gorilla pair. Furthermore, the largest entries are associated with the rows belonging to opossum (the most remote species from the remaining mammals) and gallus (the only nonmammalian representative). We believe that these results are not accidental, and they coincide with other results in [13, 16–28].
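The distance comparison can be illustrated with the Hurst exponents reported in Table 2. The sketch below builds one 3-component vector per species, ordered as (RY, MK, WS), and computes all pairwise Euclidean distances. The names `HURST`, `euclidean`, and `pairwise_distances` are hypothetical, and the snippet illustrates the calculation only; it does not reproduce Tables 4 and 5.

```python
import math
from itertools import combinations

# Hurst exponents from Table 2, ordered as (RY, MK, WS).
HURST = {
    "Human":   (0.445, 0.7452, 0.641),
    "Goat":    (0.5024, 0.7853, 0.6894),
    "Opossum": (0.6536, 0.6547, 0.6292),
    "Gallus":  (0.5075, 0.7212, 0.5756),
    "Lemur":   (0.5016, 0.7487, 0.6753),
    "Mouse":   (0.538, 0.7094, 0.8118),
    "Rabbit":  (0.429, 0.8099, 0.615),
    "Rat":     (0.5791, 0.5237, 0.7255),
    "Gorilla": (0.4698, 0.7467, 0.6302),
}


def euclidean(p, q):
    """Euclidean distance between the end points of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise_distances(vectors):
    """Distance for every unordered pair of species."""
    return {
        (a, b): euclidean(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


if __name__ == "__main__":
    distances = pairwise_distances(HURST)
    closest = min(distances, key=distances.get)
    print(closest, round(distances[closest], 4))
```

With these values the closest pair is human and gorilla, at a distance of about 0.027, in line with the observation in the text.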

Conclusion
DNA sequences play an important role in modern biological research because all the information on heredity and species evolution is contained in these macromolecules. How to gain more information from these DNA sequences is still a very challenging question. Description, comparison, and similarity analysis of DNA sequences still occupy important positions.
In this paper, we first construct three kinds of CGR spaces according to the elements of the minor diagonal, because the four bases can be classified into R-Y, M-K, and W-S according to their chemical structures. We then describe a DNA sequence by a CGR-walk and convert it to a digital sequence, and we outline some efficient invariants of DNA sequences. As an application, we compare the similarity/dissimilarity of exon-1 of β-globin genes for nine species. From the above tables, we can conclude that the results we obtained are consistent with known evolutionary facts. Therefore, the method proposed in the paper is visual and efficient.
On the one hand, our work can be treated as an effective application of CGR. On the other hand, our method is a valid supplement to graphical representation of DNA sequences. In comparison with other graphical representations of biological sequences, our approach has the following advantages.
(1) Our graphical representation based on CGR considers the chemical structure classification of the nucleotides and thus may provide more biological information.
(2) It provides a simpler way of viewing, sorting, and comparing various gene structures, even for longer DNA sequences.
(3) Our graph is more sensitive, so it can numerically characterize the DNA sequences in a more exact way.

Table 1: The coding sequences of the first exon of β-globin gene of different species.

Table 2: Hurst exponent of the CGR-walk sequence $\{X_n\}$ of the nine species in Table 1.

Table 3: Mean square deviations of the CGR-walk sequence $\{X_n\}$ of the nine species in Table 1.

Table 4: Similarity/dissimilarity table for the nine DNA sequences in Table 1 based on Euclidean distance between the 3-component vectors in Table 2.

Table 5: Similarity/dissimilarity table for the nine DNA sequences in Table 1 based on Euclidean distance between the 3-component vectors in Table 3.