On the Existence of Wavelet Symmetries in Archaea DNA

This paper deals with the complex unit roots representation of archaea DNA sequences and the analysis of symmetries in the wavelet coefficients of the digitized sequence. It is shown that, even for extremophile archaea, the distribution of nucleotides has to fulfill some (mathematical) constraints, in such a way that the wavelet coefficients are symmetrically distributed with respect to the nucleotide distribution.


Introduction
In some recent papers the existence of symmetries in nucleotide distribution has been studied for several living organisms [1][2][3][4][5][6], including mammals, fungi [1][2][3][4], and viruses [5,6]. These studies show that any (investigated) DNA sequence, when converted into a digital sequence, features a fractal shape of its DNA walk and an apparently random distribution. However, when the short wavelet transform maps the digital sequence into the space of wavelet coefficients and these coefficients are clustered, they are located along some symmetrical shapes.
One of the main tasks of this paper is to show that, although the distribution of nucleotides in any DNA sequence can be considered as randomly given, a comparison of a random sequence (and the corresponding random walk) with a DNA sequence (and walk) reveals some distinctions. The nucleotide distribution thus seems to behave like a random distribution subject to some constraints. These constraints (rules) are singled out in the following by showing the existence of a hidden geometry which underlies the structure of a DNA sequence.
In other words, nucleotides appear at first to be distributed randomly along any DNA sequence but, on closer analysis, they follow some (statistical) mathematical constraints which do not allow a given nucleotide to be arbitrarily followed by any of the remaining nucleotides.
It is interesting to notice that even the DNA of the primitive organisms which colonized the Earth billions of years ago, under extreme conditions of life, has to fulfill the same constraints as more evolved DNAs.
In order to achieve this goal some fundamental steps have to be taken into consideration and discussed.
(1) Since DNA is a sequence of symbols, a map of these symbols into numbers has to be defined. In the following we will consider the complex unit roots map, which has the advantage of being unitary and distributed along the unit circle.
(2) The indicator matrix is defined on the indicator map. This matrix is needed in order to draw the dot plot of the DNA sequence, and from this plot nucleotides seem to be randomly distributed. However, we will show by wavelet analysis that, although they look randomly distributed, they are not.
(3) The Ulam spiral adapted to DNA sequences is defined in order to single out some geometrical patterns.
(4) Random walks on DNA, or short DNA walks, show that the random walks look like fractals.
(5) The analysis of clusters of wavelet coefficients shows that DNA walks have to fulfill some geometrical constraints.
In all DNA sequences analyzed so far, for different kinds of living organisms, this geometrical symmetry [1][2][3][4][5][6] has been detected. In the following this analysis is extended to archaea, since they might be considered as being at the early stage of life, and their DNA is compared with that of more evolved microorganisms such as bacteria.
It will be shown that, in spite of the many similarities with random sequences, only the wavelet analysis makes it possible to single out some distinctions. In particular, the wavelet coefficients of all (analyzed) organisms tend to fulfill a minimum principle for the energy of the signal. The archaea, which often live in extreme environments, also have to fulfill the same geometrical rule as any other living organism.
Some previous papers have studied various sequences of DNA, such as leukemia tet variants, influenza viruses such as the A (H1N1) variant, mammals, and a fungus (see [1][2][3][14]), provided by the National Center for Biotechnology Information [18][19][20][21]. In all these papers it was observed that DNA has to fulfill not only some chemical steady state given by the chemical ligands but also some symmetrical distribution of nucleotides along the sequence. In other words, base pairs have to be placed exactly in some positions.
In accordance with previous results, it will be shown that, like any other living organisms, these elementary organisms also have DNA walks with a fractal shape and wavelet coefficients bounded on a short-range wavelet transform. In other words, anaerobic organisms, which should be understood as the most elementary at the first step of life, also have the same symmetries on wavelet coefficients as more evolved organisms, so that life has to fulfill some constrained distribution of nucleotides in order to give rise to an organism, even at the most elementary step.
In particular, in Section 2, some remarks about the analysed data are given. Section 3 deals with some elementary plots which can easily visualize the distribution of nucleotides. The Ulam spiral plot is also proposed for the first time, and a different distribution of weak/strong hydrogen bonds is observed. Section 4 provides some definitions of parameters of complexity. We will notice that all these parameters give rise to the same classification of organisms. Section 5 proposes a complex numerical representation of DNA chains and random walks, while in the final Section 6 the short wavelet transform is given in order to single out some symmetries at the lower order of the transform.

Materials and Methods
In the following we will take into consideration some complete genome sequences of DNA for a selection of archaea. Moreover, we will compare DNA sequences with artificial sequences of nucleotides randomly taken (see Section 4).

Archaea.
Archaea are a group of elementary single-cell microorganisms, having no cell nucleus or any other membrane-bound organelles within their cells. They are similar to bacteria, since they have the same size and shape (apart from a few exceptions) and a generally similar cell structure. However, the evolutionary history of archaea and their biochemistry show significant differences with regard to other forms of life. Therefore they are considered as members of a phylogenetic group distinct from bacteria and eukaryota. During their evolution, archaea have spread all over the Earth into a broad range of habitats [22,23], being one of the major contributions (20%) to Earth's biomass. The most peculiar feature of archaea is that they can live in environments with extreme life conditions (thus being considered as extremophiles [22,24]). Indeed, some archaea survive high temperatures, over 100 °C, while others can live in very cold habitats or in highly saline, acidic, or alkaline water. Nevertheless, some archaea live in mild conditions. It has also been recognized that the archaea may be the most ancient organisms on the Earth, so that archaea and eukaryotes probably diverged early from an ancestral colony of organisms.
We will see, in the following, that archaea DNA looks very close to random sequences, so that we can assume that the ancestral organisms were evolving by random permutations from a primitive assembly of nucleotides. Evolution can therefore be seen as a tendency toward a steady state far from randomness, and the DNA of bacteria (and of eukaryotes [1][2][3][4][5][6]), as a result of evolution, shows the existence of some hidden stability.

Correlation Plots
In this section we will consider some elementary plots from which it is possible to visualize the autocorrelation and the distribution law of nucleotides, and to measure some fundamental parameters by using frequency counts. Let $\mathcal{A} = \{A, C, G, T\}$ (1) be the alphabet of nucleotides. A DNA sequence is the finite symbolic sequence $S = \{x_h\}_{h=1,\dots,N}$, with $x_h$ being the nucleotide $x \in \mathcal{A}$ at the position $h$.
In general we can define an $\ell$-length alphabet as follows: let the $\ell$-length DNA word be defined by the $\ell$-combination of the 4 nucleotides (1). For each fixed length $\ell$ there are $4^\ell$ words; however, not all of them can be considered, from a biological point of view, as independent instances (see, e.g., Table 1). For this reason we define the $\ell$-length alphabet $\mathcal{A}_\ell$ as the set of $\ell$-length independent words $a_j$, with $|\cdots|$ denoting the cardinality of the set and $\ell \overset{\text{def}}{=} \operatorname{length}(a_j)$. For instance, with $\ell = 1$ the alphabet is $\mathcal{A}_1 = \mathcal{A} = \{A, C, G, T\}$, while with $\ell = 3$ the alphabet is given by the 20 amino acids, each amino acid being represented by a 3-length word of Table 1. Let $S_N$ be an N-length ordered sequence of nucleotides {A, C, G, T} and $\mathcal{A}_\ell$ the chosen alphabet; a DNA sequence of words is the finite symbolic sequence $\{x_h\}$, with $x_h$ being the word $x \in \mathcal{A}_\ell$ at the position $h$.

Indicator Matrix.
The 2D indicator function, based on the 1D definition given in [25], is the map $u : S \times S \to \{0, 1\}$ such that $u(x_h, x_k) = 1$ if $x_h = x_k$ and $u(x_h, x_k) = 0$ if $x_h \neq x_k$, with $u(x_h, x_k) = u(x_k, x_h)$ and $u(x_h, x_h) = 1$, where, for short, we have assumed $u_{hk} = u(x_h, x_k)$. According to (12), the indicator of an N-length sequence can be easily represented by the $N \times N$ sparse symmetric matrix of binary values {0, 1} which results from the indicator matrix (see also [3][4][5]). This square matrix can be plotted in 2 dimensions by putting a black dot where $u_{hk} = 1$ and a white spot where $u_{hk} = 0$ (Figure 1), thus giving rise to the two-dimensional dot plot, which is a special case of the recurrence plot [26].
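The construction of the dot plot can be sketched in a few lines of Python. This is only an illustration, assuming that the indicator takes the value 1 exactly when the same nucleotide occurs at positions h and k; the function names `indicator_matrix` and `dot_plot` are hypothetical and do not come from the paper.

```python
def indicator_matrix(seq):
    """Return the N x N binary matrix u with u[h][k] = 1 iff seq[h] == seq[k].

    Assumption: this equality test is the indicator function of the text.
    """
    n = len(seq)
    return [[1 if seq[h] == seq[k] else 0 for k in range(n)] for h in range(n)]


def dot_plot(u):
    """Render the matrix as text: '#' for a black dot (1), '.' for a white spot (0)."""
    return "\n".join("".join("#" if v else "." for v in row) for row in u)


u = indicator_matrix("ACGTACGA")
print(dot_plot(u))
```

By construction the matrix is symmetric with respect to the main diagonal, and a periodic sequence produces lines parallel to that diagonal.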
A simple generalization of this matrix can be considered for the alphabets $\mathcal{A}_\ell$, as follows. By choosing the 3-alphabet of amino acids, the 2D indicator function is the map which takes the value 1 when the amino acids at the positions $h$ and $k$ coincide and 0 otherwise. According to (12), the indicator on the 3-alphabet of amino acids of an N-length sequence can be easily represented by the $N \times N$ sparse symmetric matrix of binary values {0, 1}. With the graphical representation of this matrix we can also show the correlation of amino acids.

Test Sequences.
In the following, in order to single out the main features of biological sequences, we will compare the DNA sequence with some test sequences.
(2) A pseudoperiodic N-sequence of nucleotides with period $\pi$ is the direct sum of a given $\pi$-length pseudorandom sequence, such that $N = k\pi$ ($k \in \mathbb{N}$) and $R_i = R_{i+\pi}$. When π = 1 we have a pseudorandom sequence.
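A minimal sketch of how such test sequences might be generated is given below. It assumes a uniform pseudorandom choice over {A, C, G, T}, which the text does not specify; the function names and the `seed` parameter are illustrative only.

```python
import random

ALPHABET = "ACGT"


def pseudorandom_sequence(n, seed=None):
    """n nucleotides drawn uniformly at random (uniformity is an assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def pseudoperiodic_sequence(period, k, seed=None):
    """Repeat one pseudorandom block of length `period` k times, so N = k * period."""
    block = pseudorandom_sequence(period, seed)
    return block * k


s = pseudoperiodic_sequence(period=5, k=4, seed=1)
# every nucleotide equals the one found `period` positions later
assert all(s[i] == s[i + 5] for i in range(len(s) - 5))
```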
If we plot the indicator matrix of some bacteria and compare it with a pseudorandom and a periodic sequence, we can see that (Figure 1)
(1) the main diagonal is a symmetry axis for the plot;
(2) there are some motifs which are repeated at different scales, as in a fractal;
(3) periodicity is detected by lines parallel to the main diagonal (Figure 1(a2));
(4) empty spaces are more widely distributed than filled spaces, in the sense that the matrix $u_{hk}$ is a sparse matrix (having more 0's than 1's);
(5) there seem to be some square-like islands where black spots are more concentrated; these islands show the persistence of a nucleotide (Figures 1(a2) and 1(b1));
(6) the dot plot of archaea is very similar to the dot plot of a random sequence (Figures 1(a1) and 1(h3)).
Figure 9: Spiral distribution of the first 3752 nucleotides (A, C, G, T) for Acidianus hospitalis W1.
It can be noticed that the DNA sequences of a living organism resemble (Figure 1) random sequences, with some short-range influence, built on the same alphabet. This has been taken as an axiom of nucleotide distribution, so that DNA sequences are often considered as Markov chains [27]. However, there are some hidden rules in combining the nucleotides, and these rules lead, during the evolution, to a steady distribution. In fact, the more primitive the sequence is, the more randomly distributed the nucleotides are. It seems that, as a consequence of the evolution, nucleotides move from a disordered aggregation toward a more organized structure, shown by the growing islands in the dot plot. The biological evolution is such that self-organization might follow from random permutations of a primitive disordered sequence, so that the organization, that is, the complexity, is only the result of many arbitrary permutations of randomness. During this drive toward complexity, a DNA sequence becomes "less random" and loses some kind of energy.
From the graphical representation of the indicator matrix for bacteria and amino acids we can see a sparser matrix, but with some typical plots (Figure 2).

Spiral Plot.
In this section we consider a 2D distribution of nucleotides along an Ulam-like spiral [28], following the idea given by Ulam for the distribution of primes. In order to find some patterns in their distribution, nucleotides are arranged along a rectangular spiral. This is equivalent to mapping the 1D sequence of integers into a 2D sequence, so that the sequence distributed along the spiral looks like Figure 3. For each nucleotide we can draw a spiral containing the distribution of only that nucleotide. To each organism there correspond four plots, for A, C, G, T, respectively.
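The arrangement of a sequence along a rectangular spiral can be sketched as follows. The starting point, the initial direction (to the right) and the counterclockwise orientation are assumptions made for the illustration, since the explicit map is not reproduced here; `spiral_coordinates` and `spiral_points` are hypothetical names.

```python
def spiral_coordinates(n_points):
    """Integer coordinates of the first n_points cells of a square spiral.

    Assumed convention: start at (0, 0), move right first, turn counterclockwise.
    Leg lengths follow the pattern 1, 1, 2, 2, 3, 3, ...
    """
    if n_points <= 0:
        return []
    x = y = 0
    dx, dy = 1, 0
    coords = [(0, 0)]
    step = 1
    while len(coords) < n_points:
        for _ in range(2):                 # two legs share each step length
            for _ in range(step):
                if len(coords) == n_points:
                    return coords
                x, y = x + dx, y + dy
                coords.append((x, y))
            dx, dy = -dy, dx               # rotate the direction by 90 degrees
        step += 1
    return coords


def spiral_points(seq, nucleotide):
    """Cells of the spiral occupied by one nucleotide (one plot per nucleotide)."""
    coords = spiral_coordinates(len(seq))
    return [c for c, x in zip(coords, seq) if x == nucleotide]


points_a = spiral_points("ACGATTAG", "A")
```

Plotting `spiral_points(seq, x)` for each x in A, C, G, T gives the four plots associated with one organism.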
Let us first note that on a random sequence (Figure 4) the four distributions are equivalent.
By comparing the spirals of bacteria, random sequences, and archaea (Figures 4, 5, 6, 7, 8, 9, 10) we can see that there is a different distribution of each nucleotide. However, the more evolved organisms tend to have a higher percentage of weak hydrogen bonds (Figures 5, 6 and 7), so that we can assume the following.

Conjecture 1.
During the evolution, the distribution of nucleotides changes in such a way that strong hydrogen bonds tend to become weak.
It should be noticed that along these spirals there is a one-to-one map $\lambda$ between $\mathbb{N}$ and the points of the spiral (with integer coordinates) in $\mathbb{Z}^2$, $\lambda : \mathbb{N} \to \gamma \subset \mathbb{Z} \times \mathbb{Z}$ (26). This bijective map can also be considered between $\mathbb{N}$ and the complex space $\mathbb{C}$, so that each natural number corresponds to a complex number (with integer coefficients). Since these spirals seem to fill a finite region of the plane, we can evaluate the complexity of each curve by typical fractal measures.

Parameters of Complexity
In this section we define some parameters, based on frequency distribution, which can measure the complexity of a DNA by computing the complexity of its representation in the complex plane (for a more detailed analysis see [29] and references therein).
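Let $S_N$ be an N-length ordered sequence of nucleotides, and $p_x(h)$ be the probability to find the nucleotide $x$ at the position $h$, $1 \le h \le N$. According to (12) we define $a_h$, $c_h$, $g_h$, $t_h$ as the number of nucleotides A, C, G, T, respectively, in the h-length segment of $S_N$, so that $a_h + c_h + g_h + t_h = h$. The corresponding frequencies are $a_h/h$, $c_h/h$, $g_h/h$, $t_h/h$. We can assume that for large sequences the probabilities $p_x(h)$ are approximated by these frequencies.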
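As an illustration of the frequency count, the following sketch computes the cumulative number of each nucleotide in every initial segment of a sequence, together with the overall frequencies. The names are hypothetical, and the use of frequencies as estimates of the probabilities is the large-sequence assumption mentioned above.

```python
def cumulative_counts(seq):
    """For each h = 1..N, the counts of A, C, G, T in the first h nucleotides."""
    counts = {x: 0 for x in "ACGT"}
    out = []
    for x in seq:
        counts[x] += 1
        out.append(dict(counts))   # snapshot after reading position h
    return out


def frequencies(seq):
    """Relative frequency of each nucleotide over the whole sequence."""
    n = len(seq)
    return {x: seq.count(x) / n for x in "ACGT"}


counts = cumulative_counts("AACGTTAG")
# the four counts of the h-length segment always add up to h
assert all(sum(c.values()) == h for h, c in enumerate(counts, start=1))
```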

Randomness.
Since for a random sequence the frequencies of nucleotides coincide for large n, we can define the randomness index R, with σ being the variance, so that R = 1 for a random sequence and R = 0 for a nonrandom sequence. Over the first 10000 nucleotides we have the randomness values of Table 2.
14 Computational and Mathematical Methods in Medicine
However, if we compute the randomness index over the frequencies of amino acids in the $\mathcal{A}_3$ alphabet, then we can observe a different distribution of values. Over the first 30000 nucleotides, corresponding to 10000 amino acids, we have the randomness values of Table 3.
We can therefore observe that the increasing complexity of the words and alphabets shows a different randomness in each alphabet.

Fractal Dimension.
The fractal dimension is computed on the dot plot, by the box counting algorithm [34,35], as the average of the number p(n) of 1's in the randomly taken n × n minors of the N × N indicator matrix $u_{hk}$ or, equivalently, the number p(n) of black dots in the randomly taken n × n squares over the dot plot. The explicit computation enables us to compare the fractal dimension on the first 100-length segments of DNA chains, with an approximation up to $10^{-3}$ (see Table 5).
If we compare the fractal dimensions of the bacteria with those of pseudorandom and pseudoperiodic sequences, we can see that the fractal dimension of the nucleotide distribution ranges, for all variants, in the interval [1.28-1.30]. As expected, the more "random" sequences have a higher fractal dimension.

Entropy.
Another fundamental parameter, related to the information content of a sequence, which measures the heterogeneity of data, is the information entropy (or Shannon entropy) [36][37][38][39][40][41][42]. Based on the axiom that less information implies a larger uncertainty and, vice versa, that more information leads us to a more deterministic model, the entropy concept has recently offered some interesting interpretations of uncertainty in DNA. In fact, DNA, like any other signal, has been considered as a sequence of symbols carrying chemical-functional information.
The normalized Shannon entropy [39,40,42] is defined over the alphabet $\mathcal{A}_\ell$ in terms of the probabilities $p_x(n)$, which should be computed for large sequences. According to (32), (34), we will approximate its value by using the corresponding frequencies. However, the entropy is a parameter very similar to the complexity. In fact, it can be easily seen that (for the proof see [29]) the entropy H and the measure of complexity K differ by a factor. It follows that the entropy does not give any new information compared with the previous parameters.
As expected, the table of entropies also classifies bacteria and archaea in the same way (Table 6).
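A possible computation of the normalized entropy from observed frequencies is sketched below. The normalization by the logarithm of the alphabet size (so that the value lies between 0 and 1) is an assumption of this sketch and may differ from the exact definition used for Table 6; `normalized_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def normalized_entropy(words, alphabet_size):
    """Shannon entropy of the observed word frequencies, divided by log(alphabet_size).

    Assumption: probabilities are replaced by frequencies, which is
    reasonable only for large sequences.
    """
    counts = Counter(words)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(alphabet_size)


# equal frequencies of the four nucleotides give the largest value
h_uniform = normalized_entropy("ACGT" * 100, 4)
# a constant sequence carries no uncertainty
h_constant = normalized_entropy("A" * 100, 4)
```

The same function can be applied to a list of 3-length words with `alphabet_size=20` for the amino acid alphabet.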

Complex Root Representation of DNA Words
The complex (digital) representation of a DNA sequence of words is the map $\rho$ of the symbolic sequence of words into a set of complex numbers, defined so that each word of the alphabet $\mathcal{A}_\ell$ is mapped to a complex root of unity, with $i = \sqrt{-1}$ being the imaginary unit. The complex root representation of the sequence $S_N$ is the sequence $D(S_N)$ of complex numbers $\{y_h\}_{h=1,\dots,N}$ defined as $y_h = \rho(x_h)$. It follows that, independently of the alphabet, $|y_h| = 1$, all the complex roots of unity being located on the unit circle of the complex plane $\mathbb{C}^1$.
Therefore the complex representation of a DNA sequence is a sequence of complex numbers with $y_h$ given by (42). An n-length pseudorandom (white noise) complex sequence belonging to the unit circle can be defined directly by using some random exponents, with $r_n$, $s_n$ being random values in the set {0, N}.
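The root-of-unity map can be illustrated with a short sketch. The ordering of the words (here A, T, G, C, which sends A to 1, T to i, G to -1 and C to -i) is an assumption chosen to agree with the real and imaginary parts of the DNA walk used in this section; the function names are hypothetical.

```python
import cmath


def root_of_unity_map(alphabet):
    """Map the j-th word (j = 0, 1, ...) to exp(2*pi*i*j / len(alphabet))."""
    m = len(alphabet)
    return {word: cmath.exp(2j * cmath.pi * j / m) for j, word in enumerate(alphabet)}


def complex_representation(seq, rho):
    """Sequence of complex numbers y_h = rho(x_h)."""
    return [rho[x] for x in seq]


# assumed ordering of the nucleotides (not stated explicitly in the text)
rho = root_of_unity_map("ATGC")
y = complex_representation("GATTACA", rho)
# every value lies on the unit circle, up to floating-point rounding
assert all(abs(abs(v) - 1.0) < 1e-12 for v in y)
```

The same function applies to any alphabet, for instance a list of 20 amino acid symbols, and the values still lie on the unit circle.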
When $y_k = \rho(x_k)$, with $x_k \in \mathcal{A}$ and $x_k \in S_N$, we will properly call these walks DNA walks. When the $y_k$ are randomly generated we will call them random walks. Recalling the definition of frequencies, the DNA walk is the complex-valued signal $\{z_n\}_{n=0,\dots,N-1}$ with $z_n = (\Re[z_n], \Im[z_n]) = a_n - g_n + (t_n - c_n)\,i$, $z_n \in \mathbb{C}^1$, where the coefficients $a_n$, $g_n$, $t_n$, $c_n$ given by (12) fulfill the condition (31). If we compare the DNA walks (Figure 11), some primitive archaea such as h3 are very similar to a random walk (Figure 13). In particular, archaea seem to grow less than bacteria (with the exception of b2).
It is also interesting to notice that the random walks on amino acids (Figure 12) show that more evolved organisms have some "periodic" behavior, while the absolute value of the walks on archaea grows fast.
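The DNA walk can be sketched as the cumulative sum of the complex representation. The assignment A to 1, G to -1, T to i, C to -i is inferred from the expression a_n - g_n + (t_n - c_n)i given above and should be read as an assumption; `RHO` and `dna_walk` are hypothetical names.

```python
from itertools import accumulate

# assumed complex values, chosen so that the walk equals a - g + (t - c) i
RHO = {"A": 1 + 0j, "G": -1 + 0j, "T": 1j, "C": -1j}


def dna_walk(seq):
    """Cumulative sums of the complex representation of the sequence."""
    return list(accumulate(RHO[x] for x in seq))


def walk_from_counts(seq):
    """Same walk computed from the cumulative counts a, g, t, c."""
    out = []
    for n in range(1, len(seq) + 1):
        head = seq[:n]
        a, c, g, t = (head.count(x) for x in "ACGT")
        out.append(complex(a - g, t - c))
    return out


seq = "ACGTTTAG"
assert dna_walk(seq) == walk_from_counts(seq)
```

Replacing `seq` by a pseudorandom sequence gives the random walk used for comparison.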

Wavelet Analysis
Wavelet analysis is a powerful method extensively applied to the analysis of biological signals [12][43][44][45], aiming to single out the most significant parameters of complexity and heterogeneity in a time series and, in particular, in a DNA sequence. This method is based on the analysis of wavelet coefficients, which are obtained by the wavelet transform. We will consider in the following the Haar wavelet basis (see, e.g., [3,4,29]) made by the scaling functions and the Haar wavelets. The discrete Haar wavelet transform is the $N \times N$ matrix $W_N : K^N \subset \ell^2 \to K^N \subset \ell^2$ which maps the vector $Y$ into the vector of wavelet coefficients $\alpha$, $\beta$. The matrix $W_N$ can be easily computed by some recursive product [3,4,13,29,46]; the case $N = 4$, $M = 2$ is given in [3,4,29]. From (55) with $M = 2$, $N = 4$, the coefficients $\alpha$ and $\beta$ follow by explicit computation. Thus the first wavelet coefficient $\alpha$ represents the average value of the sequence and the other coefficients $\beta$ the finite differences. The wavelet coefficients $\beta$, also called detail coefficients, are strictly connected with the first-order properties of the discrete time series.
In the following we will consider the short wavelet transform, which consists in subdividing the DNA sequence into 4-length segments and applying the wavelet transform to each segment. As a result, from the $N = 2^M$-length complex vector $Y$, which is subdivided into $2^{M-2}$ segments, the 4-parameter short Haar wavelet transform gives the cluster of points in the 8-dimensional space $\mathbb{R}^4 \times \mathbb{R}^4$, that is, $(\alpha, \alpha^*), (\beta^0_0, \beta^{*\,0}_0), \dots$. This algorithm enables us to construct clusters of wavelet coefficients and to study the correlation between the real and imaginary coefficients of the DNA representation and DNA walk. It has been observed [3,4,29] that some symmetry arises from the plots of wavelet coefficients of DNA walks.
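A minimal sketch of the short Haar wavelet transform on 4-length segments follows. The normalization is an assumption: the coefficients are taken as the inner products of the piecewise-constant signal on the unit interval with the orthonormal Haar functions, so that the first coefficient is the segment mean, as stated above; the scaling factors used in the paper may differ. The names `haar4` and `short_haar_transform` are hypothetical.

```python
import math


def haar4(y):
    """Haar coefficients (alpha, beta00, beta10, beta11) of a 4-length segment.

    Assumed normalization: alpha is the mean of the segment, the betas are
    scaled differences between halves (beta00) and between neighbours.
    """
    y0, y1, y2, y3 = y
    alpha = (y0 + y1 + y2 + y3) / 4
    beta00 = (y0 + y1 - y2 - y3) / 4
    beta10 = math.sqrt(2) * (y0 - y1) / 4
    beta11 = math.sqrt(2) * (y2 - y3) / 4
    return alpha, beta00, beta10, beta11


def short_haar_transform(signal):
    """Apply haar4 to consecutive 4-length segments; a trailing remainder is dropped."""
    n = len(signal) - len(signal) % 4
    return [haar4(signal[i:i + 4]) for i in range(0, n, 4)]


# complex input: real and imaginary parts give the pairs (coefficient, coefficient*)
signal = [1, 1j, -1, -1j, 1, 1, 1j, 1j]
clusters = [[(c.real, c.imag) for c in coeffs]
            for coeffs in short_haar_transform(signal)]
```

Each segment thus yields four pairs of real numbers, that is, one point of the 8-dimensional cluster space.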

Cluster Analysis of the Wavelet Coefficients of the Complex DNA Representation.
Let us first compute the clusters of wavelet coefficients for the random sequence (48). As can be seen, the wavelet coefficients, both for the sequence and for its series, range in some discrete set of values (see Figure 13).
The cluster algorithm applied to the complex representation sequence shows that the values of the wavelet coefficients belong to some discrete finite sets ( Figure 14).
It should be noticed that this symmetry on detail coefficients is lost for wavelet transform on longer segments (Figures 15, 16 and 17).
It follows that DNA sequences have to be considered as Markov chains with short-range dependence; in other words, any nucleic acid is attached to the chain on the basis of a correlation with the previous nucleic acid. If we look for a dependence rule on the DNA nucleotides, this dependence might be summarized by a function such as $x_{n+1} = f(x_n)$, $(n = 1, \dots, N)$. (61)

Conclusions
In this paper archaea DNAs have been studied by focussing on the main parameters for complexity. It has been shown that the main indices for complexity and heterogeneity, such as entropy, fractal dimension, and complexity, do not differ much when we have to classify the complexity of the sequence. However, some DNA sequences look closer to random sequences than others, thus suggesting that the evolution involves a process of complexity reduction: the more evolved a sequence is, the farther it is from a random distribution. In any case, with these indices alone it seems to be impossible to distinguish between a random sequence and a DNA chain. By using the short wavelet transform instead, we have shown that on a short range (4 nucleotides) a DNA sequence shows some symmetries that slowly disappear when the length of the analysed segment is increased. Moreover, more evolved organisms have a more symmetrical distribution of wavelet coefficients.